Categories
Misc

Is it possible to vectorize (perform on batch) random cropping per element?

Nevermind, solved.

EDIT: Got it working; there were two bugs. First, I had mistakenly initialized batch_size twice during all my editing, so different parts of the code were using mismatched batch_size values. The second bug, which I still haven’t entirely fixed, is that the code fails if the batch size does not evenly divide the input dataset, even though I have it set with a take amount that IS divisible by the batch_size. I get the same error even if I set steps per epoch to 1 (so it should never be reaching the end of the batches). I can only assume it’s an error during graph construction, where it tries to define that last partial batch even though it will never be trained on. Hmm.

EDIT EDIT: Carefully following the size of my dataset throughout my pipeline, I discovered the source of the second issue, which is simply that I didn’t delete my cache files from when I previously had a larger take. The last thing I would still like to do is fix the code so it actually CAN handle variable-length batches, so I don’t have to worry about partial batches. However, from what I can see, tf.unstack along variable-length dimensions is straight up not supported, so this will require refactoring my computation to use some other method, like loops maybe. To be honest, though, it’s not worth my time to do so right now when I can just use drop_remainder=True and drop the last incomplete batch. In my real application there will be a lot of data per epoch, so losing 16 or so random examples from each epoch is rather minor.

So, I am making a project where I randomly crop images. In my data pipeline, I was trying to write code such that I could crop batches of my data at once, as the docs suggested that vectorizing my operations would reduce scheduling overhead.

However, I have run into some issues. If I use tf.image.random_crop, the problem is that the same random crop will be used on every image in the batch. I, however, want different random crops for every image. Moreover, since where I randomly crop an image will affect my labels, I need to track every random crop performed per image and adjust the label for that image.

I was able to write code that seems like it would work by using unstack, doing my operation per element, then restacking, like so:

images = tf.unstack(img, num=None, axis=0, name='unstack')
xshifts = []
yshifts = []
newimages = []
for image in images:
    if not is_valid:
        x = np.random.randint(0, width - dx + 1)
        y = np.random.randint(0, height - dy + 1)
    else:
        x = 0
        y = 0
    newimages.append(image[y:(y+dy), x:(x+dx), :])
    print(image[y:(y+dy), x:(x+dx), :])
    xshifts.append((float(x) / img.shape[2]) * img_real_size)
    yshifts.append((float(y) / img.shape[1]) * img_real_size)
images = tf.stack(newimages, 0)

But oh no! Whenever I use this code in a map function, it doesn’t work, because in unstack I set num=None, which requires it to infer how much to unstack from the actual batch size. But because TensorFlow decided that batches should have size None in the dataset specification, the code fails, since you can’t infer a size from None. If I patch the code to put in num=batch_size, it changes my dataset’s output signature to be hard coded to batch_size, which seems like it shouldn’t be a problem, except this happens:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input shape axis 0 must equal 2, got shape [1,448,448,3] [[{{node unstack}}]] 

Which is to say, it’s failing because instead of receiving the expected batch input with the appropriate batch_size (2 for testing), it’s receiving a single input image, and 2 does not equal 1. The documentation strongly implied to me that if I batch my dataset before mapping (which I do), then the map function should be receiving the entire batch and should therefore be vectorized. But is this not the case? I double checked my batch and dataset sizes to make sure that it isn’t just an error arising due to some smaller final batch.

To sum up: I want to crop my images uniquely per image and alter the labels as I do so. I also want to do this per batch, not just per image. The code I wrote that does this requires me to unstack my batch, but unstacking my batch can’t have num=None. However, TensorFlow batches have shape None, so the input of my method has shape None at specification time. And if I change unstack’s num argument to anything but None, it changes my output specification to that number, while the output signature of my method must ALSO have shape None at specification time. How can I get around this?

Or, if someone can figure out why my batched dataset, batched before my map function, is apparently feeding in single samples instead of full batches, that would also solve the mystery.
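Not part of the original post, but one possible workaround (a minimal sketch, untested against the poster’s pipeline) is to skip unstack/stack entirely and draw a different crop offset per image with batched gathers. It assumes images with static height and width of at least dy x dx; the function name and offset bookkeeping are illustrative only:

import tensorflow as tf

def random_crop_batch(images, dy, dx):
    # images: [batch, H, W, C]; batch may be None/dynamic. Returns [batch, dy, dx, C].
    shape = tf.shape(images)
    batch, height, width = shape[0], shape[1], shape[2]
    # One random top-left corner per image, drawn inside the graph.
    ys = tf.random.uniform([batch], 0, height - dy + 1, dtype=tf.int32)
    xs = tf.random.uniform([batch], 0, width - dx + 1, dtype=tf.int32)
    # Per-image row/column indices, gathered in a batched way instead of unstacking.
    row_idx = ys[:, None] + tf.range(dy)[None, :]             # [batch, dy]
    col_idx = xs[:, None] + tf.range(dx)[None, :]             # [batch, dx]
    rows = tf.gather(images, row_idx, axis=1, batch_dims=1)   # [batch, dy, W, C]
    crops = tf.gather(rows, col_idx, axis=2, batch_dims=1)    # [batch, dy, dx, C]
    return crops, ys, xs  # ys/xs can be used to shift the labels per image

Because everything uses tf.shape at runtime, this works with a None batch dimension and with partial final batches, so dataset.batch(batch_size).map(...) would not need drop_remainder=True.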

submitted by /u/Drinniol

Categories
Misc

Meet the Researcher: Marco Aldinucci, Convergence of HPC and AI to Fight Against COVID


‘Meet the Researcher’ is a series in which we spotlight different researchers in academia who use NVIDIA technologies to accelerate their work. 

This month we spotlight Marco Aldinucci, Full Professor at the University of Torino, Italy, whose research focuses on parallel programming models, languages, and tools.

Since March 2021, Marco Aldinucci has been the Director of the brand new “HPC Key Technologies and Tools” national lab at the Italian National Interuniversity Consortium for Informatics (CINI), which affiliates researchers interested in HPC and cloud from 35 Italian universities. 

He is the recipient of the HPC Advisory Council University Award in 2011, the NVIDIA Research Award in 2013, and the IBM Faculty Award in 2015. He has participated in over 30 EU and national research projects on parallel computing, attracting over 6M€ of research funds to the University of Torino. He is also the founder of HPC4AI, the competence center on HPC-AI convergence that federates four labs in the two universities of Torino.

What are your research areas of focus?

I like to define myself as a pure computer scientist with a natural inclination for multi-disciplinarity. Parallel and High-Performance Computing is useful when applied to other scientific domains, such as chemistry, geology, physics, mathematics, and medicine. It is, therefore, crucial for me to work with domain experts, and to do so while maintaining my ability to delve into performance issues regardless of a particular application. To me, Artificial Intelligence is also a class of applications.

When did you know that you wanted to be a researcher and wanted to pursue this field?

I was a curious child, and I would say that discovering new things is simply the condition that makes me feel most satisfied. I got my MSc and Ph.D. at the University of Pisa, home to Italy’s first Computer Science department, established in the late 1960s. The research group on Parallel Computing was strong. As a Ph.D. student, I deeply appreciated their approach of distilling and abstracting computational paradigms that are independent of the specific domain and, therefore, somehow universal. That is mesmerizing.

What motivated you to pursue your recent research area of focus in supercomputing and the fight against COVID?

People remember the importance of research only in times of need, but sometimes it’s too late. When COVID arrived, all of us researchers felt the moral duty to be the front-runners in investing our energy and our time well beyond regular working hours. When CLAIRE (“Confederation of Laboratories for Artificial Intelligence Research in Europe”) proposed I lead a task force of volunteer scientists to help develop tools to fight against COVID, I immediately accepted. It was the task force on the HPC plus AI-based classification of interstitial pneumonia. I presented the results in my talk at GTC ’21; they are pretty interesting.

Click here to watch the presentation from GTC ’21, “The Universal Cloud-HPC Pipeline for the AI-Assisted Explainable Diagnosis of COVID-19 Pneumonia”.

The CLAIRE-COVID19 universal pipeline, designed to compare different training algorithms to define a baseline for such techniques and to allow the community to quantitatively measure AI’s progress in the diagnosis of COVID-19 and similar diseases. [source]

What problems or challenges does your research address?

I am specifically interested in what I like to call “the modernization of HPC applications,” which is the convergence of HPC and AI, but also all the methodologies needed to build portable HPC applications that run on the compute continuum, from HPC to cloud to edge. The portability of applications and performance is a severe issue for traditional HPC programming models.

In the long term, writing scalable parallel programs that are efficient, portable, and correct must be no more onerous than writing sequential programs. To date, parallel programming has not embraced much more than low-level libraries, which often require the application’s architectural redesign. In the hierarchy of abstractions, they are only slightly above toggling absolute binary in the machine’s front panel. This approach cannot effectively scale to support the mainstream of software development where human productivity, total cost, and time to the solution are equally, if not more, important aspects. Modern AI toolkits and their cloud interfaces represent a tremendous modernization opportunity because they contaminate the small world of MPI and the batch jobs with new concepts: modular design, the composition of services, segregation of effects and data, multi-tenancy, rapid prototyping, massive GPU exploitation, interactive interfaces. After the integration with AI, HPC will not be what it used to be.

What challenges did you face during the research process, and how did you overcome them?

Technology is rapidly evolving, and keeping up with the new paradigms and innovative products that appear almost daily is the real challenge. Having a solid grounding in computer science and math is essential to being an HPC and AI researcher in this ever-changing world. Technology evolves every day, but revolutions are rare events. I like to tell my students: it doesn’t matter how many programming languages you know; what matters is how much effort is needed to learn the next one.

How is your work impacting the community?

I have always imagined the HPC realm as organized around three pillars: 1) infrastructures, 2) enabling technologies for computing, and 3) applications. In Italy, we were strong on infrastructures and applications, but excellence in enabling technologies for computing was scattered across different universities like leopard spots.

For this reason, we recently started a new national laboratory called “HPC Key Technologies and Tools” (HPC-KTT), of which I am the founding Director. HPC-KTT co-affiliates hundreds of researchers from 35 Italian universities to reach the critical mass needed to impact international research with our methods and tools. In the first year of activity, we secured EU research projects in competitive calls for a total cost of 95M€ (ADMIRE, ACROSS, TEXTAROSSA, EUPEX, The European Pilot). We have just started; more information can be found in:

M. Aldinucci et al, “The Italian research on HPC key technologies across EuroHPC,” in ACM Computing Frontiers, Virtual Conference, Italy, 2021. doi:10.1145/3457388.3458508

What are some of your proudest breakthroughs?

I routinely use both NVIDIA hardware and software. Most of the important research results I recently achieved in multi-disciplinary teams were obtained thanks to NVIDIA technologies capable of accelerating machine learning tasks. Among recent results, I can mention a couple of important papers that appeared in The Lancet and Medical Image Analysis, but also HPC4AI (High-Performance Computing for Artificial Intelligence), a new data center I started at my university. HPC4AI runs an OpenStack cloud with almost 5000 cores, 100 GPUs (V100/T4), and six different storage systems. HPC4AI is the living laboratory where researchers and students of the University of Torino learn how to build performant data-centric applications across the entire HPC-cloud stack, from bare metal configuration to algorithms to services.

What’s next for your research?

We are working on two new pieces of software: a next-generation Workflow Management System called StreamFlow and CAPIO (Cross-Application Programmable I/O), a system for fast data transfer between parallel applications with support for parallel in-transit data filtering. We can use them separately, but together, they express their most significant potential.

StreamFlow enables the design of workflows and pipelines that are portable across the cloud (on Kubernetes), HPC systems (via SLURM/PBS), or both. It adopts an open standard interface (CWL) to describe the data dependencies among workflow steps while keeping deployment instructions separate, making it possible to re-deploy the same containerized code onto different platforms. Using StreamFlow, we can run the CLAIRE COVID universal pipeline and QuantumESPRESSO almost anywhere, from a single NVIDIA DGX Station to the CINECA MARCONI100 supercomputer (11th in the TOP500 list, with NVIDIA GPUs and dual-rail Mellanox EDR InfiniBand), and across them. And they are quite different systems.

CAPIO, which is still under development, aims at efficiently (in parallel, in memory) moving data across different steps of the pipelines. The nice design feature of CAPIO is that it turns files into streams across applications without requiring code changes. It supports parallel and programmable in-transit data filtering. The essence of many AI pipelines is moving a lot of data around the system; we are embracing the file system interface to get compositionality and segregation. We do believe we will get performance as well.

Any advice for new researchers, especially to those who are inspired and motivated by your work? 

Contact us, we are hiring.

Categories
Misc

cuSOLVERMp v0.0.1 Now Available Through Early Access


Today, cuSOLVERMp version 0.0.1 is now available at no charge for members of the NVIDIA Developer Program.

Download Now

What’s New

  • Support for LU solver, with and without pivoting.
  • The Early Access release targets P9 + IBM’s Spectrum MPI

About cuSOLVERMp

cuSOLVERMp provides a distributed-memory multi-node and multi-GPU solution for solving systems of linear equations at scale! In the future, it will also solve eigenvalue and singular value problems.

Future releases will be hosted in the HPC SDK and will provide additional functionality and support for x86_64 + OpenMPI.

Learn more:

GTC 2021: S31754 Recent Developments in NVIDIA Math Libraries

GTC 2021: S31286 A Deep Dive into the Latest HPC Software

GTC 2021: CWES1098 Tensor Core-Accelerated Math Libraries for Dense and Sparse Linear Algebra in AI and HPC

Blog post coming soon!

Categories
Misc

Enabling Predictive Maintenance Using Root Cause Analysis, NLP, and NVIDIA Morpheus


Background

Predictive maintenance is used for early fault detection, diagnosis, and prediction when maintenance is needed in various industries including oil and gas, manufacturing, and transportation. Equipment is continuously monitored to measure things like sound, vibration, and temperature to alert and report potential issues. To accomplish this in computers, the first step is to determine the root cause of any type of failure or error. The current industry-standard practice uses complex rulesets to continuously monitor specific components, but such systems typically only alert on previously observed faults. In addition, these regular expressions (regex) rulesets do not scale. As data becomes more voluminous and heterogeneous, maintaining these rulesets presents a neverending catch-up task. Since they only alert on what has been seen in the past, they cannot detect new root causes with patterns that were unknown to analysts before.

The approach

To create a more proactive approach for predictive maintenance, we’ve implemented a solution that uses Natural Language Processing (NLP) to monitor and interpret kernel logs. The RAPIDS CLX team collaborated with the NVIDIA Enterprise Experience (NVEX) team to test and run a proof-of-concept (POC) to evaluate this NLP-based solution. The project seeks to:

  • Drastically reduce the time spent manually analyzing kernel logs of NVIDIA DGX systems by pinpointing important lines in the vast amount of logs,
  • Probabilistically classify sequences, giving the team the capability to fine-tune a threshold to decide whether a line in the log is a root cause or not.
Figure 1: Workflow using NVIDIA DGX systems for log parsing in a predictive maintenance use case.

A complete example of a root cause workflow can be found in the RAPIDS CLX GitHub repository. For final deployment past the POC, the team is using NVIDIA Morpheus, an open AI framework for developers to implement cybersecurity-specific inference pipelines. Morpheus provides a simple interface for security developers and data scientists to create and deploy end-to-end pipelines that address cybersecurity, information security, and general log-based pipelines. It is built on a number of other pieces of technology, including RAPIDS, Triton, TensorRT, Streamz, CLX, and more.

The POC is outlined as follows:

  • The first step identifies root causes that caused past failures. NVEX provides a dataset that contains lines in kernel logs that have been marked as a root cause to date.  
  • Next, the problem is framed as a classification problem by sorting the logs into two groups: ordinary and root cause. Ordinary log lines are labeled as 0 and root cause lines as 1.

We fine-tuned a pre-trained BERT model from HuggingFace to perform classification. More information about the BERT model can be found in the original paper. The code block below shows how the pre-trained model called “bert-base-uncased” is loaded for sequence classification.

seq_classifier = SequenceClassifier()
seq_classifier.init_model("bert-base-uncased")

We then fine-tuned this model by training it on our own dataset.

seq_classifier.train_model(X_train["log"], y_train, epochs=1)
Epoch: 100%|██████████| 1/1 [25:29

High validation accuracy (close to one) implies most of the predictions are aligned with the original labels.

Evaluation

Once the training was completed, we ran inference on a separate set of logs (the test set).

seq_classifier.evaluate_model(X_test["log"], y_test)
0.9992423076467732

Like validation accuracy, test set accuracy is also close to one, which means most of the predicted classes are the same as the original labels. We performed an inference run for classification with two goals:

  • Check the number of false positives. In our context, this means the number of lines in the kernel logs that are predicted to be a root cause but are not of interest.
  • Check the number of false negatives. In our context, this refers to the lines that are root causes but predicted to be ordinary.

Unlike the conventional evaluation of classification tasks, having a labelled test set does not translate into interpretable results as one of our main targets is to predict previously unseen root causes. The best way to understand how the model performs is to check the resulting confusion matrix.

Table 1: Confusion matrix for root cause analysis prediction

In our use case, the confusion matrix gives the following outputs:

  • TN (True Negatives): These are the ordinary lines that were not labeled as a root cause, and the model correctly marks 82668 of them.
  • FN (False Negatives): Zero false negatives mean the model does not mark any of the known root causes as ordinary.
  • NRC (New Root Causes): 65 new lines that were marked as ordinary are predicted to be root causes. These are the lines that would have been missed with the existing methods.
  • KRC (Known Root Causes): This is the number of lines correctly marked as root cause.
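As a rough sketch (not part of the original workflow), these four counts can be reproduced with scikit-learn, assuming y_true holds the existing labels (0 = ordinary, 1 = known root cause) and y_pred holds the model's predicted classes for the same log lines:

from sklearn.metrics import confusion_matrix

# Binary confusion matrix with the label order fixed to [ordinary, root cause].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TN  (ordinary lines predicted ordinary):          ", tn)
print("NRC (unlabeled lines newly flagged as root cause):", fp)
print("FN  (known root causes predicted ordinary):       ", fn)
print("KRC (known root causes correctly flagged):        ", tp)

In this framing, the "false positives" are exactly the new root cause candidates (NRC) described above.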

NVEX analysts have reviewed our predictions and noticed some interesting logs that were not marked as a root cause of issues with the conventional methods. With regex-based methods, such new issues might have cost a significant amount of person-hours to triage, develop, and harden.

Applying our solution to more use cases

In the next phase, we plan to position similar solutions in NVIDIA platforms to alert users of potential problems or execute corrective actions with Morpheus. By building on the success of root cause analysis here, we seek to extend this into a predictive maintenance task by continuous monitoring of the logs. This use case is certainly not limited to DGX systems. For example, telecommunication infrastructure equipment, including radio, core, and transmission devices, generate a diverse set of logs. Their outages may result in loss of service and severe fines. Identifying the root cause of outages imposes a significant cost, both in terms of dollars spent and person-hours. We believe all systems that generate text-based logs, especially the ones that run mission-critical applications, would benefit from such NLP based predictive maintenance solutions immensely as it would reduce the mean time to resolution.

Categories
Misc

Dive into the Future of Streaming with NVIDIA CloudXR


Recently, at GTC21, the NVIDIA CloudXR team ran a Connect with Experts session about the CloudXR SDK. We shared how CloudXR can deliver limitless virtual and augmented reality over networks (including 5G) to low cost, low-powered headsets and devices, all while maintaining the high-quality experience traditionally reserved for high-end headsets that are plugged into high-performance computers.

Q&A session

At the end of this session, we hosted a Q&A with our panel of professional visualization solution architects and received a large number of questions from our audience. VR and AR director David Weinstein and senior manager Greg Jones from the NVIDIA CloudXR team provided answers to the top questions:

How do I get started with CloudXR?

Apply for CloudXR through the NVIDIA DevZone. If you have any questions about your application status, contact cloudxr-outreach@nvidia.com.

What did you announce at GTC?

There were three key CloudXR announcements at GTC. You can get more information about each by clicking the post links.

How are you addressing instances of running XR with large crowds such as convention centers or large public places?

The number of users at a single physical location is gated by the wireless spectrum capacity at that given location.

Do you need separate apps on both the client and the server?

The CloudXR SDK provides sample CloudXR clients (including source code) for a variety of client devices. The server side of CloudXR gets installed as a SteamVR plug-in and can stream all OpenVR applications.

Can I use CloudXR if I do not have high-end hardware?

CloudXR will run with a variety of hardware. For the server side, all VR-Ready GPUs from the Pascal and later architectures are supported. For the client side, CloudXR has been tested with HTC Vive, HTC Vive Pro, HTC Focus Plus, Oculus Quest, Oculus Quest 2, Valve Index, and HoloLens2.

Can the server be shared for multiple simultaneous clients or is this one server per one client only?

Currently, we only support one server per one client device. In a virtualized environment this means one virtual machine per one client.

Is connectivity (server to network to client) bidirectional?

Yes, the connectivity is bidirectional. The pose information and controller input data is streamed from the client to the server; frames, audio and haptics are streamed from the server to the client.

CloudXR configurations include options for cloud, server, desktop, laptop, mobile, and VR headset.
Figure 1. CloudXR configurations

What type of applications run with NVIDIA CloudXR?

OpenVR applications run with CloudXR.

More information

To learn more, visit the CloudXR page where there are plenty of videos, blog posts, webinars, and more to help you get started. Did you miss GTC21? The AR/VR sessions are available for free through NVIDIA On-Demand.

Categories
Misc

Run State of the Art NLP Workloads at Scale with RAPIDS, HuggingFace, and Dask


This post was originally published on the RAPIDS AI Blog.

TLDR: Learn how to use RAPIDS, HuggingFace, and Dask for high-performance NLP. See how to build end-to-end NLP pipelines in a fast and scalable way on GPUs. This covers feature engineering, deep learning inference, and post-inference processing.

Introduction

Modern natural language processing (NLP) mixes modeling, feature engineering, and general text processing. Deep learning NLP models can provide fantastic performance for tasks like named-entity recognition (NER), sentiment classification, and text summarization. However, end-to-end workflow pipelines with these models often struggle with performance at scale, especially when the pipelines involve extensive pre- and post-inference processing.

In our previous blog post, we covered how RAPIDS accelerates string processing and feature engineering. This post explains how to leverage RAPIDS for feature engineering and string processing, HuggingFace for deep learning inference, and Dask for scaling out for end-to-end acceleration on GPUs.

An NLP pipeline often involves the following steps:

  • Pre-processing
  • Tokenization
  • Inference
  • Post Inference Processing
Figure 1: NLP workflow using RAPIDS and HuggingFace.

Pre-Processing:

Pre-Processing for NLP pipelines involves general data ingestion, filtration, and general reformatting. With the RAPIDS ecosystem, each piece of the workflow is accelerated on GPUs. Check out our recent blog where we showcased these capabilities in more detail.
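As a rough, hypothetical sketch of what this kind of GPU-accelerated pre-processing can look like (the file path, column name, and filter are made up for illustration), Dask-cuDF keeps the dataframe work on GPUs and scales it across workers:

import dask_cudf

# Read the raw text into a GPU-backed, partitioned dataframe.
reviews = dask_cudf.read_parquet("reviews/*.parquet")   # hypothetical input files
# String operations dispatch to cuDF and run on the GPU.
reviews["text"] = reviews["text"].str.lower()
reviews = reviews[reviews["text"].str.contains("product")]  # example filter
reviews = reviews.persist()  # materialize the cleaned partitions on the workers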

Once we have pre-processed our data, we need to tokenize it so that the appropriate machine learning model can ingest it.

Subword Tokenization:

Tokenization is the process of breaking down the text into standard units that a machine can understand. It is a fundamental step across NLP methods from traditional like CountVectorizer to advanced deep learning methods like Transformers.

One approach to tokenization is breaking a sentence into words. For example, the sentence, “I love apples” can be broken down into, “I,” “love,” “apples”. But this delimiter based tokenization runs into problems like:

  • Needing a large vocabulary as you will need to store all words in the dictionary.
  • Uncertainty of combined words like “check-in,” i.e., what exactly constitutes a word, is often ambiguous.
  • Some languages don’t segment by spaces.

To solve these problems, we use subword tokenization. Subword tokenization is a recent strategy from machine translation that breaks into subword units, strings of characters like “ing,” “any,” “place.” For example, the word “anyplace” can be broken down into “any” and “place,” so you don’t need an entry for each word in your vocabulary.

When BERT(Bidirectional Encoder Representations from Transformers) was released in 2018, it included a new subword algorithm called WordPiece. This tokenization is used to create input for NLP DL models like BERT, Electra, DistilBert, and more.

GPU Subword Tokenization

We first introduced the GPU BERT subword tokenizer in a previous blog as part of CLX for cybersecurity applications. Since then, we migrated the implementation into RAPIDS cuDF and exposed it as a string function, subword tokenization, making it easier to use in typical DataFrame workflows.

This tokenizer takes a series of strings and returns tokenized cupy arrays: 

Example of using: cudf.str.subword_tokenize
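The line above stands in for the embedded snippet in the original post. The call looks roughly like the sketch below; it follows the older cuDF API for this function (later superseded by cudf.core.subword_tokenizer.SubwordTokenizer), and the vocabulary hash file path and parameter values are placeholders:

import cudf

ser = cudf.Series(["this is the", "best book"])
# 'voc_hash.txt' is a BERT vocabulary converted into cuDF's hashed vocabulary format.
tokens, attention_masks, metadata = ser.str.subword_tokenize(
    "voc_hash.txt",
    max_length=8,
    stride=8,
    do_lower=True,
    do_truncate=False,
)
# The three outputs are flat cupy arrays that can be reshaped and passed
# straight into a PyTorch or TensorFlow model without leaving the GPU.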

Advantages of cuDF’s GPU subword Tokenizer:

The advantages of using cudf.str.subword_tokenize include:

  • The tokenizer itself is up to 483x faster than HuggingFace’s fast Rust tokenizer, BertTokenizerFast.batch_encode_plus.
  • Tokens are extracted and kept in GPU memory and then used in subsequent tensors, all without leaving GPUs and avoiding expensive CPU copies.

Once our inputs are tokenized using the subword tokenizer, they can be fed into NLP DL models like BERT for inference.

HuggingFace Overview:

HuggingFace provides access to several pre-trained transformer model architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with over 32+ pre-trained models in 100+ languages. In our workflow, we used BERT and DistilBERT from HuggingFace to do named entity recognition.

Example of NER in action from https://huggingface.co/dslim/bert-base-NER

Combining RAPIDS, HuggingFace, and Dask:

This section covers how we put RAPIDS, HuggingFace, and Dask together to achieve 5x better performance than the leading Apache Spark and OpenNLP for TPCx-BB query 27 equivalent pipeline at the 10TB scale factor with 136 V100 GPUs while using a near state of the art NER model. We expect to see even better results with A100 as A100’s BERT inference speed is up to 6x faster than V100’s.

In this workflow, we are given 26 Million synthetic reviews, and the task is to find the competitor company names in the product reviews for a given product. We then return the review id, product id, competitor company name, and the related sentence from the online review. To get a competitor’s name, we need to do NER on the reviews and find all the tokens in the review labeled as an organization.

Our previous implementation relied on spaCy for NER, but spaCy currently needs its inputs on the CPU and thus was slow, as it required a copy to CPU memory and back to GPU memory. With the new cudf.str.subword_tokenize, we can go from a cuDF string series to subword tensors without leaving the GPU, unlocking many new SOTA language models.

In this task, we experimented with two of HuggingFace’s models for NER fine-tuned on CoNLL 2003 (English): a BERT-based model and a DistilBERT-based model.

Research by Zhu, Mengdi et al. (2019) showed that BERT-based model architectures achieve near state-of-the-art performance, significantly improving on existing public NER toolkits like spaCy, NLTK, and StanfordNER.

For example, the bert-base model on average across datasets achieves a 13.63% better F1 score than spaCy, so not only did we get faster, we also reached near state-of-the-art performance.

Check out the workflow code here.

Conclusion:

This workflow is just one example of leveraging GPUs for end-to-end acceleration of natural language processing. With cudf.str.subword_tokenize, most NLP tasks, such as question answering, text classification, summarization, translation, and token classification, are now within reach for end-to-end acceleration leveraging RAPIDS and HuggingFace. Stay tuned for more examples and, in the meantime, try out RAPIDS in your NLP work on Google Colab or blazingsql notebooks, see our documentation docs page, and if you see something missing, we welcome feature requests on GitHub!

Categories
Misc

tfp.distributions.Normal scale parameter

Hi, what does the scale parameter mean in tfp.distributions.Normal | TensorFlow Probability ?

How can I change the parameter based on my input model? I have 4 inputs to predict 1 output. In one example I saw, there was 1 input predicting 1 output, and the scale parameter was set to 1.
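Not from the original thread, but as a hedged illustration: in tfp.distributions.Normal(loc, scale), loc is the mean and scale is the standard deviation (so scale=1 just fixes the noise level). If you want the scale to depend on your 4 inputs, one common pattern (layer sizes below are arbitrary) is to have the network output both parameters:

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# scale is the standard deviation of the Gaussian; it must be positive.
standard_normal = tfd.Normal(loc=0.0, scale=1.0)

# Sketch: predict both loc and scale from 4 input features.
inputs = tf.keras.Input(shape=(4,))
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)
params = tf.keras.layers.Dense(2)(hidden)  # [loc, unconstrained scale]
outputs = tfp.layers.DistributionLambda(
    lambda t: tfd.Normal(loc=t[..., :1],
                         scale=1e-3 + tf.math.softplus(t[..., 1:])))(params)
model = tf.keras.Model(inputs, outputs)
# Train by maximizing the likelihood of the observed target under the predicted Normal.
model.compile(optimizer="adam", loss=lambda y, dist: -dist.log_prob(y))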

submitted by /u/Filippo9559

Categories
Misc

TensorFlow V1

Hello, I had an older TensorFlow model written in V1 of TensorFlow. It has been quite a while, and I have forgotten how the model works and how to print the summary of the model in TensorFlow v1. What is the syntax to print the summary? I tried model.summary(), but it returns an error.

These are the lines that the program runs to initiate training.

for i in range(epochs):
    traind_scores = []
    ii = 0
    epoch_loss = []
    while (ii + batch_size) <= len(X_train):
        X_batch = X_train[ii:ii+batch_size]
        y_batch = y_train[ii:ii+batch_size]

        o, c, _ = session.run([outputs, loss, trained_optimizer],
                              feed_dict={inputs: X_batch, targets: y_batch})

        epoch_loss.append(c)
        traind_scores.append(o)
        ii += batch_size
    print('Epoch {}/{}'.format(i, epochs), ' Current loss: {}'.format(np.mean(epoch_loss)))

can someone tell me how to get the model summary?
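Not from the original thread: if the model was built from raw TF1 ops (no tf.keras), there is no model.summary() to call. One rough workaround is to list the graph's trainable variables after the graph has been built, for example:

import numpy as np
import tensorflow as tf  # for TF2, use tf.compat.v1 and disable eager execution

# Print each trainable variable in the default graph with its shape and size.
total = 0
for v in tf.trainable_variables():
    n = int(np.prod(v.shape.as_list()))
    total += n
    print(v.name, v.shape, n)
print("total trainable parameters:", total)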

submitted by /u/Fun-Programmer4564

Categories
Misc

How to Build a Winning Deep Learning Powered Recommender System-Part 3


Recommender systems (RecSys) have become a key component in many online services, such as e-commerce, social media, news service, or online video streaming. However with the growth in importance, the growth in scale of industry datasets, and more sophisticated models, the bar has been raised for computational resources required for recommendation systems. 

To meet the computational demands for large-scale DL recommender systems, NVIDIA introduced Merlin – a Framework for Deep Recommender Systems. NVIDIA teams have now won two RecSys competitions in a row: the ACM RecSys Challenge 2020 and, more recently, the WSDM WebTour 21 Challenge organized by Booking.com. The Booking.com challenge focused on predicting the last city destination for a traveler’s trip given their previous booking history within the trip. NVIDIA’s interdisciplinary team included colleagues from NVIDIA’s KGMON (Kaggle Grandmasters), NVIDIA’s RAPIDS (Data Science), and NVIDIA’s Merlin (Recommender Systems) teams, who collaborated on the winning solution.

This post is the third of a three-part series that gives an overview of the NVIDIA team’s first-place solution for the booking.com  challenge focused on predicting the last city destination for a traveler trip given their previous booking history within the trip. The first post gives an overview of recommender system concepts. The second post discusses deep learning for recommender systems. This third post discusses the winning solution, the steps involved, and also what made a difference in the outcome.  Specifically, this blog post explains the booking.com RecSys challenge competition goals, exploratory data analysis, feature preprocessing and extraction, the algorithms used, model training and validation.

The Booking.com Challenge Problem Overview 

Many of the booking.com users go on trips which include more than one destination and booking.com recommends a next destination. For instance, if a user was making hotel reservations for Amsterdam, Rotterdam, and Copenhagen then booking.com would immediately suggest popular cities for extending their trip such as Stockholm, Oslo, or Berlin.

The image shows city options for extending a user's trip.
Figure 1: Booking.com suggests popular options for extending a user’s trip as they make their booking.

In this booking.com example for a trip to Italy, the sequence of cities shown in the blue route is more likely than the red route. Similarly, if the sequence of cities so far is Venice, Florence, Rome, then the next destination is more likely to be Palermo than Milan.

The image shows city sequences for Italy are more likely to be in a logical order based on distance and direction.
Figure 2: The sequence of cities shown in the blue route is more likely than the red route.

The goal of this challenge was to use a dataset based on millions of real anonymized booking.com hotel reservations to come up with a strategy for making the best real-time recommendation for the last destination of each trip. Specifically, the goal was to predict (and recommend) the final city (city_id) of each trip (utrip_id). The quality of the predictions was evaluated on the top four recommended cities for each trip using the Top-4 Accuracy metric (4 representing the four suggestion slots on the Booking.com website). When the true city was one of the top 4 suggestions, the prediction was considered correct.

The image shows a neural network predicting the final destination given a sequence of destinations.
Figure 3: The goal was to predict the final destination of each trip given a dataset of real hotel reservations.

More than 800 participants signed up for the contest, grouped into 40 competing teams. NVIDIA’s interdisciplinary team was represented by Benedikt Schifferer, Chris Deotte, Jean-François Puget, Gabriel de Souza Pereira Moreira, Gilberto Titericz, Jiwei Liu, and Ronay Ak. The winning NVIDIA team achieved Accuracy@4 of 0.5939, using a blend of Transformers, GRUs, and feed-forward multi-layer perceptrons.

The Competition Process

RecSys or Kaggle competitions work by asking users or teams to provide solutions to well-defined problems. Competitors download the training and test files, train models on the labeled training file, generate predictions on the test file, and then upload a prediction file as a submission. After you submit your solution, you get a ranking. At the end of the competition, the top scores are announced as winners.

A general data science and competition tip is to set up a fast experimentation pipeline on GPUs, where you train, improve the features and model, and then validate repeatedly. The NVIDIA team used a fast experimentation pipeline on GPUs consisting of preprocessing and feature engineering with RAPIDS cuDF, a library for GPU-accelerated dataframe transformations, combined with TensorFlow and PyTorch for deep learning. The RAPIDS suite of open-source software libraries, built on CUDA, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs, while still using familiar interfaces like the Pandas and Scikit-Learn APIs.

The image shows a RAPIDS software stack with end-to-end data preparation model training and visualization.
Figure 4: End-to-End Data science pipeline with GPUs and RAPIDS.

Exploratory Data Analysis

Exploratory data analysis (EDA) is performed before, during, and after feature engineering to understand the dataset better.  EDA uses data visualization, statistics, and queries to find important variables, interesting relations among the variables, anomalies, patterns, and insights.

The training dataset consists of a csv file with 1.5 million anonymized hotel reservations, based on real data, with the following features:

  • User_id – User ID
  • Check-in – Reservation check-in date
  • Checkout – Reservation check-out date
  • Affiliate_id – An anonymized ID of affiliate channels where the booker came from (e.g. direct, some third-party referrals, paid search engine, etc.)
  • Device_class – desktop/mobile
  • Booker_country – Country from which the reservation was made (anonymized)
  • Hotel_country – Country of the hotel (anonymized)
  • City_id – city_id of the hotel’s city (anonymized)
  • Utrip_id – Unique identification of user’s trip (a group of multi-destination bookings within the same trip)

Each reservation is a part of a customer’s trip (identified by utrip_id) which includes at least 4 consecutive reservations.  The evaluation dataset is constructed similarly, however the city_id of the final reservation of each trip is concealed and requires a prediction.

The sequence of user trip reservations can be obtained by sorting on the user_id and check-in date. Below, we read in the train and test data from the csv files using cuDF, sort on the userid, check-in date to obtain the sequence of trip reservations for a user.  A count on the sorted dataset reveals 269k trips.

train = cudf.read_csv('../00_Data/booking_train_set.csv').sort_values(by=['user_id','checkin'])
test = cudf.read_csv('../00_Data/booking_test_set.csv').sort_values(by=['user_id','checkin'])


print(train.shape, test.shape)
train.head()

Visualizing the data can help with preprocessing and feature selection by revealing trends in the data. Histograms or bar charts help visualize the distribution of a feature. For example, the count vs. city id frequency chart below shows that the distribution of the city reservation frequency was long-tailed, as one would expect – some cities are much more popular for tourism and business than others. The models therefore need to be kept from focusing too much on very unpopular cities, whose long tail of rare city ids is shown in Figure 5.

Figure 5: Distribution of frequency of city_id in train and test dataset. Around 10,000 city ids appeared only once in the dataset.

Feature Pre-Processing, Selection, and Generation

Feature engineering and feature selection is an iterative process that starts with engineering new features, then training a model, and then evaluating the model predictions against the target labels. The goal is to determine which features improve the model’s prediction accuracy. You repeat this process, along with hyperparameter tuning, until you are satisfied with the model’s accuracy.

The diagram shows data discovery consisting of feature extraction, model training, testing, and tuning.
Figure 6: Machine learning is an iterative process involving feature engineering, training, testing, and tuning.

Framing this problem under the recommender systems taxonomy, the cities are the items we want to recommend for a trip. The trip is analogous to a session in the session-based recommendation task (see part 2 of this series), which is generally a sequence of user interactions – city hotel reservations in this case.

The image shows the trip features and the last city being used to learn trip and city embeddings which are used by a trained model to infer similar cities.
Figure 7: The recommender model learns the trip and city embeddings based on the trip features and last city (label) which are then used by the trained model to infer similar next cities.

Feature generation creates new features using knowledge about the problem and data. Feature columns can be combined, subtracted, counted, aggregated, and transformed to create new features to describe the user session (the Trip) and the cities. The NVIDIA team created the following additional features from the original 9:

  • Trip context date, time features: day-of-week, week-of-year, month, weekend, season, stay length (checkout – check-in), days since the last booking (check-in – previous checkout).
  • Trip context sequence features: the first city in the trip, lagged (previous 5) cities and countries from the trip.
  • Trip context statistics: trip length (number of reservations), trip duration (days), reservation order (in ascending and descending orders).
  • Past user trip statistics: number of user’s past reservations, number of user’s past trips, number of user’s past visited cities, number of user’s past visited countries.
  • Geographic seasonal city popularity: features based on the conditional probabilities of a city c from a country co, being visited at a month m or at a week-of-year w, as follows: P(c | m), P(c | m, co), P(c | w), P(c | w, co).

The dataset with 1.5M bookings is relatively small in comparison to other recommendation datasets. Techniques were explored to increase the training data by data augmentation and the team discovered that doubling the dataset with reversed trips improved the model’s accuracy. A trip is an ordered sequence of cities and although there are many permutations to visit a set of cities, there is a logical ordering implied by distances between cities and available transportation connections. These characteristics are commutative. For example, a trip of Boston->New York->Princeton->Philadelphia->Washington DC can be booked in reverse order, but not many people would book in a random order like Boston->Princeton->Washington DC->New York->Philadelphia.
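As a rough illustration (not the team’s actual code), the reversed-trip augmentation could look like the pandas-style sketch below, with column names following the dataset description above:

import pandas as pd

# train: one row per reservation, already sorted by (user_id, checkin).
rev = train.copy()
# Reverse the order of reservations inside each trip.
rev["rev_order"] = rev.groupby("utrip_id").cumcount(ascending=False)
rev = rev.sort_values(["utrip_id", "rev_order"]).drop(columns="rev_order")
# Give the reversed copies new trip ids so they count as separate trips.
rev["utrip_id"] = rev["utrip_id"].astype(str) + "_rev"
train_augmented = pd.concat([train, rev], ignore_index=True)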

Machine Learning Algorithms Used by the Winning Team

Ensemble methods combine multiple machine learning algorithms to obtain a better model.  For the winning solution, the final submission was a simple ensemble (weighted average) of the predictions from models of three different neural architectures: Multilayer perceptron with Session-based Matrix Factorization net (MLP-SMF), Gated Recurrent Unit with MultiStage Session-based Matrix Factorization net (GRU-MS-SMF), and XLNet (Transformer) with Session-based Matrix Factorization net (XLNet-SMF). As the cardinality of the label (city) is not large, all models treated the recommendation as a multi-class classification problem, by using softmax cross-entropy loss function. 

In deep learning, the last layer of a neural network used for classification can often be interpreted as a logistic regression. In this context, one can see a deep learning algorithm as multiple feature learning stages, which then pass their features into a logistic regression that classifies an input. The softmax function is a generalization of logistic regression often used to normalize the output of a neural network to a probability distribution over predicted output classes for multi-class classification.

The image shows trip embeddings and city embeddings from a DNN feeding into a session matrix factorization layer.
Figure 8: The final submission was an ensemble of the predictions from models of three different neural architectures.

Session-based Matrix Factorization Layer 

A shared component among the three different neural architectures was a Session-based Matrix Factorization (SMF) layer, which learns a linear mapping between the item (city) embeddings and the session (trip) embeddings to generate recommendations (the scores (logits)  for cities)  by a dot product operation. 

The image shows trip embeddings and city embeddings from a DNN feeding into a session matrix factorization layer.
Figure 9: A shared component among the three architectures was a Session-based Matrix Factorization layer.

This design was inspired by the MF-based Collaborative Filtering (see part 1), which learns latent factors for users and items by performing a dot product of their embeddings to predict the relevance of an item for a user.

The image shows 3 matrices, a sparse trip city interaction matrix as the product of two dense matrices, trip and city factor matrices.
Figure 10: Matrix factorization factors a sparse user item interaction matrix R (u-by-i) into a u-by-f matrix (U) and a f-by-i matrix (I). In this case the user is the trip session and the item is the city.

The large majority of users had only a single trip available in the dataset. So, instead of trying to model the user preference, the last layer of the network is used to represent the session (trip) embedding. Then, the dot product is computed between the session embedding s and the set I of item embeddings, to model the relevance probability distribution r of each item (city) being the next one for that session (trip), as r = softmax(s · I).
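As a loose numpy illustration (not the team’s code), the scoring step of this head amounts to a dot product against every city embedding followed by a softmax:

import numpy as np

def smf_scores(s, item_embeddings):
    # s: [d] trip (session) embedding; item_embeddings: [num_cities, d] city embedding table.
    logits = item_embeddings @ s          # dot product with every city embedding
    logits -= logits.max()                # numerical stability
    p = np.exp(logits)
    return p / p.sum()                    # r = softmax(s · I), a distribution over cities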

MLP with Session-based Matrix Factorization head (MLP-SMF)

The MLP with Session-based Matrix Factorization head (MLP-SMF) uses feedforward and embedding layers, as seen in Figure 11.

The image shows the MLP-SMF model architecture.
Figure 11: MLP-SMF model architecture.

Categorical input features are fed through an embedding layer and continuous input features are individually projected via a linear layer to embeddings, followed by batch normalization and ReLU non-linear activation. All embedding dimensions are made equal. The embeddings of continuous input features and categorical input features, except the lag features, are combined via summation. The output is concatenated with the embeddings of the 5 last cities and countries (lag features).

The embedding tables for the city lags are shared, and similarly for hotel country lags. The lag embeddings are concatenated, but the model should still be able to learn the sequential patterns of cities by the order of lag features, i.e., city lag1’s embedding vector is always in the same position of the concatenated vector. The concatenated vector is fed through 3 feed-forward layers with batch normalization, PReLU activation function and dropout, to form the session (trip) embedding. It is used by the Session-based Matrix Factorization head to produce the scores for all cities. 

GRU with MultiStage Session-based Matrix Factorization head (GRU-MS-SMF)

The GRU with MultiStage Session-based Matrix Factorization head uses a GRU cell for embedding the historical user actions (previous 5 visited cities), similar to GRU4Rec.  (see part 2 for more information on GRUs and session based recommenders).  Missing values (sequences with less than 5 length) are padded with 0s. The last GRU hidden state is concatenated with the embeddings of the other categorical input features, as shown in Figure 12. 

The image shows the GRU-MS-SMF model architecture.
Figure 12: GRU-MS-SMF model architecture.

The embedding tables for the previous cities are shared. The model uses only categorical input features, including some numerical features modeled as embeddings such as trip length and reservation order. The concatenated embeddings are fed through a MultiStage Session-based Matrix Factorization head. The first stage in the MS-SMF is a softmax head over all items, which is used to select the top-50 cities for the next stage. In the second stage, the Session-based Matrix Factorization head is applied using only the top-50 cities of the first stage and two representations of the trip embeddings (the outputs of the last and second-to-last MLP layers, after the concatenation), resulting in two additional heads. The final output is a weighted sum of all three heads, with trainable weights. The multi-stage head works as a 2-stage ranking problem. The first head ranks all items and the other two heads can focus on the reranking of the top-50 items from the first stage. This approach can potentially scale to large item catalogs, i.e., in the order of millions. This dataset did not require such scalability, but this multi-stage design might be effective for deployment in production.

XLNet with Session-based Matrix Factorization head (XLNet-SMF)

The XLNet with Session-based Matrix Factorization head (XLNet-SMF) uses a Transformer architecture named XLNet, originally proposed for the permutation-based language modeling task in Natural Language Processing (NLP). In this case, the sequence of items in the session (trip) are modeled instead of the sequence of word tokens (see part 2 for more information on transformers and session based recommenders). 

The image shows the XLNet-SMF model architecture.
Figure 13: XLNet-SMF model architecture.

The XLNet training task was adapted for Masked Language Modeling (also known as Cloze task), like proposed by BERT4Rec for sequential recommendation. In that approach, for each training step, a proportion of all items are masked from the input sequence (i.e., replaced with a single trainable embedding), and then the original ids of the masked items are predicted using other items of the sequence, from both left and right sides. When a masked item is not the last one, this approach allows the usage of privileged information of the future reservations in the trip during training. Therefore, during inference, only the last item of the sequence is masked, to match the sequential recommendation task and to not leak future information of the trip.

The image shows that during masked language sequence model training, the model is allowed to use items on the right (future information) for predictions. During evaluation, the last item of the sequence is masked to prevent future information leak.
Figure 14: During masked language sequence model training, the model is allowed to use items on the right (future information) for predictions. During evaluation, the last item of the sequence is masked to prevent future information leak.

For this network, each example is a trip, represented as a sequence of its reservations. The reservation embedding is generated by concatenating the features and projecting using MLP layers. The sequence of reservation embeddings is fed to the XLNet Transformer stacked blocks, in which the output of each block is the input of the next block. The final transformer block output generates the trip embedding.

Finally, the Matrix Factorization head (dot product between the trip embedding and city embeddings) is used to produce a probability distribution over the cities for the masked items in the sequence.

Evaluating the Model

To evaluate a model’s accuracy, you test the model’s predictions, in this case the final city (city_id) of each trip,  against the labeled outcome. To do this, you split the training dataset, which has labeled data, train the model on part of the data, and evaluate the predictions with the rest. For this competition, the evaluation metric was Precision@4. Precision at 4 corresponds to the proportion of top-scoring results that are relevant, in this case scoring as correct if the top-4 recommendations include the last city.   
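As a small sketch (variable names are illustrative, not the competition’s scoring code), Precision@4 can be computed as the fraction of trips whose true last city appears among the four recommendations:

import numpy as np

def precision_at_4(preds_top4, true_last_city):
    # preds_top4: [n_trips, 4] recommended city ids; true_last_city: [n_trips] held-out final cities.
    hits = (preds_top4 == true_last_city[:, None]).any(axis=1)
    return hits.mean()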

Training and Hyperparameter Tuning

Unlike most competitions, this competition allowed only two submissions (predictions on the unlabeled test file).  Because it was key to select the best features, hyperparameters and models for the final submission, it was crucial to design a good validation set. For this reason the team used k-fold cross validation, a method that maximizes the training dataset for training, tuning and evaluating a model. With cross-validation, the data is randomly split into k partitions (folds). Each fold is used one time as the validation dataset, while the rest  (Out-Of-Fold – OOF) are used for training. Models are trained using the OOF training sets and evaluated with the validation sets, resulting in k model accuracy measurements. 

The image shows the K-fold cross-validation process. The data is split into k folds (partitions). Each fold is used one time as the validation dataset, while the rest (Out-Of-Fold - OOF) are used for training.
Figure 15.  With k-fold cross-validation, the data is split into k folds (partitions). Each fold is used one time as the validation dataset, while the rest (Out-Of-Fold – OOF) are used for training.

The NVIDIA team used 5-fold cross validation, using both the train data (from train.csv) and the test data (test.csv). This is unusual, since the test.csv data does not have the final city, but it does have a sequence of cities. For each fold, the fold’s train data was used for evaluation, and data (both train and test) from the other folds (Out-Of-Fold – OOF) was used to train the model. The full OOF dataset was used, predicting the last city given the previous ones. The Cross-Validation (CV) score was the average Precision@4 over the five folds.
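A hypothetical sketch of that fold assignment (variable names are illustrative):

import numpy as np
from sklearn.model_selection import KFold

# all_trips: numpy array of unique utrip_id values from train.csv and test.csv combined.
# labeled_trips: set of utrip_id values coming from train.csv (their final city is known).
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (oof_idx, fold_idx) in enumerate(kf.split(all_trips)):
    eval_trips = [t for t in all_trips[fold_idx] if t in labeled_trips]  # evaluated on their last city
    training_trips = all_trips[oof_idx]  # OOF trips, labeled or not, used for training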

The image shows for each fold, the train set fold was used for evaluation and both the train and test set Out-Of-Folds for training.
Figure 16: For each fold, the train set fold was used for evaluation and both the train and test set Out-Of-Folds for training. 

Hyperparameter optimization tunes the properties of the model that can be set for training, for example the learning rate, learning rate decay, or batch size, to find the most accurate model possible. The choice of optimal hyperparameters is a balance between underfitting and overfitting: the model should fit how the training data behaves while remaining general enough to make accurate predictions on unseen data. When working with neural networks, the choice of hyperparameters can make the difference between poor and superior predictive performance. To find out more about how the NVIDIA team tuned the model hyperparameters, see the appendix of this whitepaper: Using Deep Learning to Win the Booking.com WSDM WebTour21 Challenge on Sequential Recommendations.
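
As one hedged illustration of how such tuning can be organized (not the team's actual procedure, which is described in the whitepaper appendix), a simple random search over a small space might look like this; cross_validation_score is a hypothetical helper that returns the 5-fold CV Precision@4 for a given configuration, and the search space values are made up.

import random

search_space = {                      # illustrative values only
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "batch_size": [128, 256, 512],
    "dropout": [0.1, 0.2, 0.3],
}

best_score, best_params = -1.0, None
for trial in range(20):
    params = {name: random.choice(values) for name, values in search_space.items()}
    score = cross_validation_score(params)  # hypothetical: returns CV Precision@4
    if score > best_score:
        best_score, best_params = score, params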

Ensembling

Ensembling is a proven approach to improving the accuracy of models by combining their predictions, which improves generalization. K-fold cross-validation and bagging techniques were used to ensemble models from the three architectures, as shown in Figure 17.

Figure 17: Ensemble Algorithm.

In general, the greater the diversity of the models’ predictions, the more ensembling can potentially improve the final scores. In this case, the correlation of the predicted city scores between each pair of the three architectures was around 80%, so ensembling delivered a meaningful improvement in the final CV score.
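
A sketch of the bagging and ensembling step, assuming each bagged model produces an (n_trips, n_cities) matrix of city scores for the test trips; the scores are averaged within each architecture and then blended across the three architectures. The equal weighting and the variable names are assumptions, not the team's exact recipe.

import numpy as np

def bag_average(score_matrices):
    # score_matrices: list of (n_trips, n_cities) arrays, one per bagged model
    return np.mean(score_matrices, axis=0)

# Hypothetical per-architecture lists of bagged predictions
ensemble_scores = np.mean(
    [bag_average(mlp_smf_bags),
     bag_average(gru_ms_smf_bags),
     bag_average(xlnet_smf_bags)],
    axis=0,
)
top4_cities = np.argsort(-ensemble_scores, axis=1)[:, :4]  # final top-4 recommendations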

Final Model Predictions Submission 

The final step (this competition allowed only two submissions per team) was to submit the team’s top four city predictions for each trip in the test set in a CSV file named submission.csv with the following columns:

utrip_id, city_id_1, city_id_2, city_id_3, city_id_4 
1000031_1, 8655, 8652, 4323, 4332
Figure 16. After feature engineering, training, and tuning, the final step is to submit predictions on the test data.
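
A minimal sketch of writing that file with pandas, assuming utrip_ids and top4_cities (an (n_trips, 4) array of predicted city ids) come from the inference step above:

import pandas as pd

submission = pd.DataFrame({
    "utrip_id": utrip_ids,
    "city_id_1": top4_cities[:, 0],
    "city_id_2": top4_cities[:, 1],
    "city_id_3": top4_cities[:, 2],
    "city_id_4": top4_cities[:, 3],
})
submission.to_csv("submission.csv", index=False)  # the file submitted for scoring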

The NVIDIA team’s final leaderboard result was 0.5939 for Precision@4, 2.8% better than the second-place solution.

To clarify the contribution of each architecture and of the ensembling algorithm to the final result, the table below shows the cross-validation Precision@4 results by architecture, both individually and after bagging ensembling, as well as the final ensemble of the three architectures combined. XLNet-SMF was the most accurate single model. The MLP-SMF architecture achieved a CV score comparable to the other two architectures, which explicitly model the sequences using GRU and XLNet (Transformer). Because the MLP-SMF model was lightweight, it was much faster to train than the other architectures, which sped up its experimentation and improvement cycle.

Figure 18. Final results by architecture and for the final ensemble – MLP-SMF uses 8 bags, GRU-MS-SMF uses 7 bags and XLNet-SMF uses 5 bags.

Summary

In this blog, we walked through how NVIDIA’s team won the Booking.com WSDM WebTour 21 Challenge. We went over the domain problem; the winning techniques in EDA, feature preprocessing, and feature generation; the DL models; and the validation approach used to improve predictions. The NVIDIA team designed three different deep learning architectures based on MLP, GRU, and Transformer building blocks. Some techniques, such as the Session-based Matrix Factorization head and data augmentation with reversed trips, improved performance for all models. The diversity of the model architectures led to significant accuracy improvements when ensembling the model predictions. We hope this post, the solution, and the links below are useful to others interested in building recommendation systems.

Additional Resources:

Categories
Misc

NVIDIA Releases Updates to CUDA-X AI Software

NVIDIA CUDA-X AI is a deep learning software stack for researchers and software developers to build high performance GPU-accelerated applications for conversational AI, recommendation systems and computer vision.

Learn what’s new in the latest releases of the CUDA-X AI tools and libraries. For more information on NVIDIA’s developer tools, join live webinars, training, and Connect with the Experts sessions now on GTC On-Demand.

Refer to each package’s release notes in documentation for additional information.

NVIDIA Jarvis Open Beta 

At GTC, NVIDIA announced major new capabilities for Jarvis, the fully accelerated conversational AI framework. These include highly accurate automatic speech recognition, real-time machine translation for multiple languages, and text-to-speech capabilities to create expressive conversational AI agents.

 Highlights include:

  • Speech recognition model trained on thousands of hours of audio, with greater than 90% accuracy
  • Real-time machine translation for five languages, running in under 100 ms per sentence
  • Expressive TTS that delivers 30x higher throughput with FastPitch+HiFiGAN vs Tacotron2+WaveGlow

NVIDIA also announced BotMaker Early Access, which enables enterprises to easily integrate skills and deploy them as a bot on embedded and data center platforms, both offline and online.

Triton Inference Server 2.9

At GTC, NVIDIA announced Triton Inference Server 2.9. Triton is an open source inference serving software that maximizes performance and simplifies production deployment at scale. Release updates include: 

  • Model Navigator (alpha), a new tool in Triton which automatically converts TensorFlow and PyTorch models to a TensorRT plan, validates accuracy, and sets up the deployment environment
  • Model Analyzer will now automatically determine optimal batch size and model instances to maximize performance, based on latency or throughput requirements
  • Support for OpenVINO backend (beta) for high performance inferencing on CPU, Windows Triton build (alpha), and integration with MLOps platforms: Seldon and Allegro

TensorRT 8.0

At GTC, NVIDIA announced TensorRT 8.0, the latest version of the high-performance deep learning inference SDK. This version includes:

  • Quantization Aware Training to achieve FP32-level accuracy with INT8 precision
  • Sparsity support on Ampere GPUs delivers up to 50% higher throughput
  • Up to 2x faster inference for transformer-based networks like BERT with new compiler optimizations

TensorRT 8.0 will be freely available to members of NVIDIA Developer Program in Q2, 2021.

NVIDIA NeMo 1.0 RC

NVIDIA NeMo is an open-source toolkit for developing state-of-the-art conversational AI models. 

Highlights include:

  • ASR collection: Added new state-of-the-art model architectures, CitriNet and Conformer-CTC. Also used the Mozilla Common Voice dataset and the AIshell-2 corpus to add speech recognition support for multiple languages, including Mandarin, Spanish, German, French, Italian, Russian, Polish, and Catalan.
  • NLP collection: Added ten neural machine translation language models supporting bidirectional translation between English and Spanish, Russian, Mandarin, German and French
  • TTS collection: Added support for HiFiGan, MelGan, GlowTTS, UniGlow, and SqueezeWave model architectures and pre-trained models. 

This release includes 60 additional highly-accurate models. Learn more from NeMo collections in NGC.

NVIDIA Maxine

Maxine provides an accelerated SDK with state-of-the-art AI features for building virtual collaboration and content creation applications. At GTC, we announced AI Face Codec, a novel AI-based method from NVIDIA Research that compresses videos for video conferencing by rendering human faces, delivering up to a 10x reduction in bandwidth compared with H.264.

Maxine is available now to members of the NVIDIA Developer Program. Get Started with NVIDIA Maxine.

NGC Updates (Includes Framework Updates)

The NGC catalog is a hub of GPU-optimized containers, pre-trained models, SDKs and Helm charts designed to accelerate end-to-end AI workflows. Updates include:

  • Deep Learning Frameworks
    • 21.04 containers for TensorFlow, PyTorch (v.24) and Apache MXNet (v.1.8)
    • Includes support for CUDA 11.3, cuDNN 8.2, Dali 1.0 and Ubuntu 20.04
  • Brand new UI enables users to navigate, find and download content faster than before with features such as improved search and filtering, tagged content, and direct links to all documentation on the home page.
  • TLT 3.0 provides a unified command line tool to launch commands, supports multiple Docker setups, and integrates with the DeepStream and Jarvis application frameworks.
  • Magnum IO container unifies key NVIDIA technologies such as NCCL, NVSHMEM, UCX, and GDS in a single package allowing developers to build applications and run in a data center equipped with GPUs, storage and high-performance switching fabric.
  • New and Updated Partner Software
    • Matlab: The latest release highlights simplified workflows for the development of deep learning, autonomous systems, and automotive solutions.
    • Brightics AI Accelerator: Samsung SDS’ simple, fast, and automated machine learning platform.
    • Determined AI Helm Chart: An open source deep learning training platform.
  • Plexus Satellite Container: Provides a rich set of tools for setting up and managing an isolated networked Kubernetes cluster on the Core Scientific Plexus software stack.

cuDNN 8.2 GA

The NVIDIA CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for accelerating training and inference applications. This version includes:

  • BFloat16 support for CNNs on NVIDIA Ampere architecture GPUs
  • Speed up CNNs by fusing convolution operators, point-wise operations, and runtime reductions
  • Faster out-of-box performance with new dynamic kernel selection infrastructure
  • Up to 2X higher RNN performance with new optimizations and heuristics

DALI 1.0 GA

The NVIDIA Data Loading Library (DALI) is an open-source GPU-accelerated library for fast pre-processing of images, videos, and audio to accelerate deep learning workflows. This version includes:

  • New Functional API for simpler pipeline creation and ease-of-use
  • Easy integration with Triton Inference Server with DALI Backend
  • New GPU-accelerated operators for image, video and audio processing