Categories
Offsites

Identifying Disfluencies in Natural Speech

People don’t write in the same way that they speak. Written language is controlled and deliberate, whereas transcripts of spontaneous speech (like interviews) are hard to read because speech is disorganized and less fluent. One aspect that makes speech transcripts particularly difficult to read is disfluency, which includes self-corrections, repetitions, and filled pauses (e.g., words like “umm”, and “you know”). Following is an example of a spoken sentence with disfluencies from the LDC CALLHOME corpus:

But that’s it’s not, it’s not, it’s, uh, it’s a word play on what you just said.

It takes some time to understand this sentence — the listener must filter out the extraneous words and resolve all of the nots. Removing the disfluencies makes the sentence much easier to read and understand:

But it’s a word play on what you just said.

While people generally don’t even notice disfluencies in day-to-day conversation, early foundational work in computational linguistics demonstrated how common they are. In 1994, using the Switchboard corpus, Elizabeth Shriberg demonstrated that there is a 50% probability for a sentence of 10–13 words to include a disfluency and that the probability increases with sentence length.

The proportion of sentences from the Switchboard dataset with at least one disfluency, plotted against sentence length measured in non-disfluent (i.e., fluent) tokens. The longer a sentence gets, the more likely it is to contain a disfluency.

In “Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection”, we present research findings on how to “clean up” transcripts of spoken text. Using labeled data, we created machine learning (ML) models that identify disfluencies in human speech; once identified, the extra words can be removed to make transcripts and captions more readable. This also improves the performance of natural language processing (NLP) algorithms that work on speech transcripts. Our work places special priority on ensuring that these models are able to run on mobile devices, so that we can protect user privacy and preserve performance in scenarios with low connectivity.

Base Model Overview
At the core of our base model is a pre-trained BERTBASE encoder with 108.9 million parameters. We use the standard per-token classifier configuration, with a binary classification head being fed by the sequence encodings for each token.
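
A minimal sketch of this per-token classifier configuration, using the Hugging Face transformers library (the class name and label convention are illustrative assumptions, not the released model code):

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class DisfluencyTagger(nn.Module):
    """Per-token binary classifier on top of a pre-trained BERT encoder (illustrative sketch)."""
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        # Binary classification head fed by the sequence encoding of each token.
        self.classifier = nn.Linear(self.encoder.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # [batch, seq_len, 2] logits

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = DisfluencyTagger()
batch = tokenizer(["but that's it's not it's a word play"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
predictions = logits.argmax(dim=-1)  # 1 = token is part of a disfluency (assumed convention)
```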

Illustration of how tokens in text become numerical embeddings, which then lead to output labels.


We refined the BERT encoder by continuing its pretraining on comments from the Pushshift Reddit dataset from 2019. Reddit comments are not speech data, but they are more informal and conversational than the Wikipedia and book data used in standard BERT pretraining. This trains the encoder to better understand informal language, but may run the risk of internalizing some of the biases inherent in the data. For our particular use case, however, the model only captures the syntax or overall form of the text, not its content, which avoids potential issues related to semantic-level biases in the data.

We fine-tune our model for disfluency classification on hand-labeled corpora, such as the Switchboard corpus mentioned above. Hyperparameters (batch size, learning rate, number of training epochs, etc.) were optimized using Vizier.

We also produce a range of “small” models for use on mobile devices using a knowledge distillation technique known as “self training”. Our best small model is based on the Small-vocab BERT variant with 3.1 million parameters. This smaller model achieves comparable results to our baseline at 1% the size (in MiB). You can read more about how we achieved this model miniaturization in our 2021 Interspeech paper.

Streaming
Some of the latest use cases for automatic speech transcription include automated live captioning, such as that produced by the Android “Live Captions” feature, which automatically transcribes spoken language in audio being played on the device. For disfluency removal to be useful in improving the readability of captions in this setting, it must happen quickly and in a stable manner. That is, the model should not change its past predictions as it sees new words in the transcript.

We call this live token-by-token processing streaming. Accurate streaming is difficult because of temporal dependencies; most disfluencies are only recognizable later. For example, a repetition does not actually become a repetition until the second time the word or phrase is said.

To investigate whether our disfluency detection model is effective in streaming applications, we split the utterances in our training set into prefix segments, where only the first N tokens of the utterance were provided at training time, for all values of N up to the full length of the utterance. We simulated a stream of spoken text by feeding these prefixes to the model and measured performance with several metrics that capture accuracy, stability, and latency, including streaming F1, time to detection (TTD), edit overhead (EO), and average wait time (AWT). We experimented with look-ahead windows of either one or two tokens, allowing the model to “peek” ahead at additional tokens for which it is not required to produce a prediction. In essence, we’re asking the model to “wait” for one or two more tokens of evidence before making a decision.
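
The sketch below illustrates this prefix-plus-look-ahead setup (the `predict` function and the metric below are placeholders capturing the intuition; the precise streaming F1, TTD, EO, and AWT definitions are in the paper):

```python
def prefix_segments(tokens, look_ahead=1):
    """Simulate a token stream: at step N the model may peek at `look_ahead`
    extra tokens but only commits predictions for the first N (illustrative)."""
    for n in range(1, len(tokens) + 1):
        yield tokens[:min(n + look_ahead, len(tokens))], n

def edit_overhead(tokens, predict, look_ahead=1):
    """Rough sketch of a stability measure: how often already-committed labels flip
    as more of the utterance arrives. `predict` returns one label per visible token."""
    committed, edits, steps = [], 0, 0
    for visible, n in prefix_segments(tokens, look_ahead):
        labels = predict(visible)[:n]
        edits += sum(a != b for a, b in zip(labels, committed))
        committed = labels
        steps += 1
    return edits / max(steps, 1)
```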

While adding this fixed look-ahead did improve the stability and streaming F1 scores in many contexts, we found that in some cases the label was already clear even without looking ahead to the next token and the model did not necessarily benefit from waiting. Other times, waiting for just one extra token was sufficient. We hypothesized that the model itself could learn when it should wait for more context. Our solution was a modified model architecture that includes a “wait” classification head that decides when the model has seen enough evidence to trust the disfluency classification head.

Diagram showing how the model labels input tokens as they arrive. The BERT embedding layers feed into two separate classification heads, which are combined for the output.


We constructed a training loss function that is a weighted sum of three factors:

  1. The traditional cross-entropy loss for the disfluency classification head
  2. A cross-entropy term that only considers up to the first token with a “wait” classification
  3. A latency penalty that discourages the model from waiting too long to make a prediction
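
A sketch of how such a combined objective might look in PyTorch; the weights and the exact form of the latency penalty are illustrative assumptions rather than the paper's definitions:

```python
import torch
import torch.nn.functional as F

def streaming_loss(disfluency_logits, wait_logits, disfluency_labels, wait_labels,
                   w1=1.0, w2=1.0, w3=0.1):
    # 1. Standard cross-entropy for the disfluency head over all tokens.
    ce_all = F.cross_entropy(disfluency_logits.transpose(1, 2), disfluency_labels)

    # 2. Cross-entropy restricted to tokens up to the first "wait" decision.
    first_wait = torch.argmax((wait_labels == 1).int(), dim=1)
    positions = torch.arange(disfluency_labels.size(1), device=disfluency_labels.device)
    mask = (positions.unsqueeze(0) <= first_wait.unsqueeze(1)).float()
    ce_tok = F.cross_entropy(disfluency_logits.transpose(1, 2), disfluency_labels, reduction="none")
    ce_prefix = (ce_tok * mask).sum() / mask.sum().clamp(min=1.0)

    # 3. Latency penalty: discourage the model from choosing "wait" too often.
    latency = F.softmax(wait_logits, dim=-1)[..., 1].mean()

    return w1 * ce_all + w2 * ce_prefix + w3 * latency
```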

We evaluated this streaming model as well as the standard baseline with no look-ahead and with both 1- and 2-token look-ahead values:

Graph of the streaming F1 score versus the average wait time in tokens. Three data points indicate F1 scores above 0.82 across multiple wait times. The proposed streaming model achieves near top performance with much shorter wait times than the fixed look-ahead models.

The streaming model achieved a better streaming F1 score than both a standard baseline with no look-ahead and a model with a look-ahead of 1. It performed nearly as well as the variant with a fixed look-ahead of 2, but with much less waiting. On average, the model waited for only 0.21 tokens of context.

Internationalization
Our best outcomes so far have been with English transcripts. This is mostly due to resourcing issues: while there are a number of relatively large labeled conversational datasets that include disfluencies in English, other languages often have very few such datasets available. So, to make disfluency detection models available beyond English, we need a way to build models that does not require finding and labeling hundreds of thousands of utterances in each target language. A promising solution is to leverage multi-language versions of BERT to transfer what a model has learned about English disfluencies to other languages in order to achieve similar performance with much less data. This is an area of active research, but we do have some promising results to outline here.

As a first effort to validate this approach, we added labels to about 10,000 lines of dialogue from the German CALLHOME dataset. We then started with the Geotrend English and German Bilingual BERT model (extracted from Multilingual BERT) and fine-tuned it with approximately 77,000 disfluency-labeled English Switchboard examples and 1.3 million examples of self-labeled transcripts from the Fisher Corpus. Then, we did further fine-tuning with about 7,500 in-house–labeled examples from the German CALLHOME dataset.

Diagram illustrating the flow of labeled data and self-trained output in our best multilingual training setup. By training on both English and German data we are able to improve performance via transfer learning.

Our results indicate that fine-tuning on a large English corpus can produce acceptable precision using zero-shot transfer to similar languages like German, but at least a modest amount of German labels was needed to improve recall from less than 60% to greater than 80%. Two-stage fine-tuning of an English-German bilingual model produced the highest precision and overall F1 score.

Approach | Precision | Recall | F1
German BERTBASE model fine-tuned on 7,300 human-labeled German CALLHOME examples | 89.1% | 81.3% | 85.0
Same as above, but with additional 7,500 self-labeled German CALLHOME examples | 91.5% | 83.3% | 87.2
English/German Bilingual BERTBASE model fine-tuned on English Switchboard+Fisher, evaluated on German CALLHOME (zero-shot language transfer) | 87.2% | 59.1% | 70.4
Same as above, but subsequently fine-tuned with 14,800 German CALLHOME (human- and self-labeled) examples | 95.5% | 82.6% | 88.6

Conclusion
Cleaning up disfluencies from transcripts can improve not just their readability for people, but also the performance of other models that consume transcripts. We demonstrate effective methods for identifying disfluencies and expand our disfluency model to resource-constrained environments, new languages, and more interactive use cases.

Acknowledgements
Thank you to Vicky Zayats, Johann Rocholl, Angelica Chen, Noah Murad, Dirk Padfield, and Preeti Mohan for writing the code, running the experiments, and composing the papers discussed here. We also thank our technical product manager Aaron Schneider, Bobby Tran from the Cerebra Data Ops team, and Chetan Gupta from Speech Data Ops for their support in obtaining additional data labels.

Categories
Misc

Three Wheeling: Startup Faction Develops Affordable Tri-Wheel AVs on NVIDIA DRIVE

Some things are easy as A, B, C. But when it comes to autonomous vehicles, the key may be in one, two, three. Faction, a Bay Area-based startup and NVIDIA Inception member, is preparing to debut its business-to-business autonomous delivery service, with three-wheel production electric vehicles purpose-built for driverless operation, streamlining time to market.


Categories
Misc

Upcoming Event: Recommender Systems Summit 2022

Join us to hear featured speakers from Netflix, Twitter, Weights & Biases, Coveo, and more discuss challenges building, training, optimizing, and deploying production-ready recommender systems.

Categories
Offsites

Minerva: Solving Quantitative Reasoning Problems with Language Models

Language models have demonstrated remarkable performance on a variety of natural language tasks — indeed, a general lesson from many works, including BERT, GPT-3, Gopher, and PaLM, has been that neural networks trained on diverse data at large scale in an unsupervised way can perform well on a variety of tasks.

Quantitative reasoning is one area in which language models still fall far short of human-level performance. Solving mathematical and scientific questions requires a combination of skills, including correctly parsing a question with natural language and mathematical notation, recalling relevant formulas and constants, and generating step-by-step solutions involving numerical calculations and symbolic manipulation. Due to these challenges, it is often believed that solving quantitative reasoning problems using machine learning will require significant advancements in model architecture and training techniques, granting models access to external tools such as Python interpreters, or possibly a more profound paradigm shift.

In “Solving Quantitative Reasoning Problems With Language Models” (to be released soon on the arXiv), we present Minerva, a language model capable of solving mathematical and scientific questions using step-by-step reasoning. We show that by focusing on collecting training data that is relevant for quantitative reasoning problems, training models at scale, and employing best-in-class inference techniques, we achieve significant performance gains on a variety of difficult quantitative reasoning tasks. Minerva solves such problems by generating solutions that include numerical calculations and symbolic manipulation without relying on external tools such as a calculator. The model parses and answers mathematical questions using a mix of natural language and mathematical notation. Minerva combines several techniques, including few-shot prompting, chain of thought or scratchpad prompting, and majority voting, to achieve state-of-the-art performance on STEM reasoning tasks. You can explore Minerva’s output with our interactive sample explorer!

Solving a multi-step problem: A question from the MATH dataset and Minerva’s solution. The model writes down a line equation, simplifies it, substitutes a variable, and solves for y.

A Model Built for Multi-step Quantitative Reasoning
To promote quantitative reasoning, Minerva builds on the Pathways Language Model (PaLM), with further training on a 118GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats. Standard text cleaning procedures often remove symbols and formatting that are essential to the semantic meaning of mathematical expressions. By maintaining this information in the training data, the model learns to converse using standard mathematical notation.

Example questions from the Joint Entrance Examination Main Math 2020 exam, taken each year by almost 2M Indian high-school students who intend to study engineering and similar fields (left), and the National Math Exam in Poland (May 2022), taken by approximately 270K high-school students every year (right).
A dataset for quantitative reasoning: Careful data processing preserves mathematical information, allowing the model to learn mathematics at a higher level.

Minerva also incorporates recent prompting and evaluation techniques to better solve mathematical questions. These include chain of thought or scratchpad prompting — where Minerva is prompted with several step-by-step solutions to existing questions before being presented with a new question — and majority voting. Like most language models, Minerva assigns probabilities to different possible outputs. When answering a question, rather than taking the single solution Minerva scores as most likely, multiple solutions are generated by sampling stochastically from all possible outputs. These solutions are different (e.g., the steps are not identical), but often arrive at the same final answer. Minerva uses majority voting on these sampled solutions, taking the most common result as the conclusive final answer.
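
A condensed sketch of majority voting over sampled solutions (the `generate` sampling function and the answer-extraction heuristic are placeholders, not Minerva's actual code):

```python
import re
from collections import Counter

def extract_final_answer(solution: str):
    """Very rough heuristic: take the last number in the solution text (illustrative only)."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)
    return numbers[-1] if numbers else None

def majority_vote(question, generate, num_samples=64):
    """Sample multiple step-by-step solutions and return the most common final answer."""
    answers = []
    for _ in range(num_samples):
        solution = generate(question, temperature=0.6)  # stochastic sampling
        answer = extract_final_answer(solution)
        if answer is not None:
            answers.append(answer)
    return Counter(answers).most_common(1)[0][0] if answers else None
```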

Majority voting: Minerva generates multiple solutions to each question and chooses the most common answer as the solution, improving performance significantly.

Evaluation on STEM Benchmarks
To test Minerva’s quantitative reasoning abilities we evaluated the model on STEM benchmarks ranging in difficulty from grade school level problems to graduate level coursework.

  • MATH: High school math competition level problems
  • MMLU-STEM: A subset of the Massive Multitask Language Understanding benchmark focused on STEM, covering topics such as engineering, chemistry, math, and physics at high school and college level.
  • GSM8k: Grade school level math problems involving basic arithmetic operations that should all be solvable by a talented middle school student.

We also evaluated Minerva on OCWCourses, a collection of college and graduate level problems covering a variety of STEM topics such as solid state chemistry, astronomy, differential equations, and special relativity that we collected from MIT OpenCourseWare.

In all cases, Minerva obtains state-of-the-art results, sometimes by a wide margin.

Evaluation results on MATH and MMLU-STEM, which include high school and college level questions covering a range of STEM topics.
Model | MATH | MMLU-STEM | OCWCourses | GSM8k
Minerva | 50.3% | 75% | 30.8% | 78.5%
Published state of the art | 6.9% | 55% | – | 74.4%
Minerva 540B significantly improves state-of-the-art performance on STEM evaluation datasets.

What Minerva Gets Wrong
Minerva still makes its fair share of mistakes. To better identify areas where the model can be improved, we analyzed a sample of questions the model gets wrong, and found that most mistakes are easily interpretable. About half are calculation mistakes, and the other half are reasoning errors, where the solution steps do not follow a logical chain of thought.

It is also possible for the model to arrive at a correct final answer but with faulty reasoning. We call such cases “false positives”, as they erroneously count toward a model’s overall performance score. In our analysis, we find that the rate of false positives is relatively low (Minerva 62B produces less than 8% false positives on MATH).

Below are a couple of example mistakes the model makes.

Calculation mistake: The model incorrectly cancels the square root on both sides of the equation.
Reasoning mistake: The model computes the number of free throws at the fourth practice, but then uses this number as the final answer for the first practice.

Limitations
Our approach to quantitative reasoning is not grounded in formal mathematics. Minerva parses questions and generates answers using a mix of natural language and LaTeX mathematical expressions, with no explicit underlying mathematical structure. This approach has an important limitation, in that the model’s answers cannot be automatically verified. Even when the final answer is known and can be verified, the model can arrive at a correct final answer using incorrect reasoning steps, which cannot be automatically detected. This limitation is not present in formal methods for theorem proving (e.g., see Coq, Isabelle, HOL, Lean, Metamath, and Mizar). On the other hand, an advantage of the informal approach is that it can be applied to a highly diverse set of problems which may not lend themselves to formalization.

Future Directions
While machine learning models have become impressive tools in many scientific disciplines, they are often narrowly scoped to solve specific tasks. We hope that general models capable of solving quantitative reasoning problems will help push the frontiers of science and education. Models capable of quantitative reasoning have many potential applications, including serving as useful aids for researchers, and enabling new learning opportunities for students. We present Minerva as a small step in this direction. To see more samples from Minerva, such as the one below, please visit the interactive sample explorer!

Solving a problem using calculus and trigonometry: A question from the MATH dataset asking for the speed of a particle in circular motion. Minerva finds a correct step-by-step solution. In the process, Minerva computes a time derivative and applies a trigonometric identity.

Acknowledgements
Minerva was a collaborative effort that spanned multiple teams in Google Research. We would like to thank our coauthors Aitor Lewkowycz, Ambrose Slone, Anders Andreassen, Behnam Neyshabur, Cem Anil, David Dohan, Henryk Michalewski, Imanol Schlag, Theo Gutman-Solo, Vedant Misra, Vinay Ramasesh, and Yuhuai Wu, as well as our collaborators Erik Zelikman and Yasaman Razeghi. Minerva builds upon the work of many others at Google, and we would like to thank the PaLM team, the T5X team, the Flaxformer team, and the JAX team for their efforts. We thank Tom Small for designing the animation in this post. We would also like to especially thank Vedant Misra for developing the Minerva sample explorer.

Categories
Misc

The Gaming Evolution Will Be Televised: GFN Thursday Levels Up the Living Room Experience on New Samsung TVs and More

Turn the TV on. GeForce NOW is leveling up gaming in the living room. The Samsung Gaming Hub launched today, delivering GeForce NOW natively on 2022 Samsung Smart TVs. Plus, the SHIELD Software Experience Upgrade 9.1 is now rolling out to all NVIDIA SHIELD TVs, delivering new gaming features that improve GeForce NOW.


Categories
Misc

Is there someone I can ask for help with importing a .tflite file in Android Studio?

submitted by /u/matticrisp

Categories
Misc

The Metaverse Goes Industrial: Siemens, NVIDIA Extend Partnership to Bring Digital Twins within Easy Reach

Silicon Valley magic met Wednesday with 175 years of industrial technology leadership as Siemens CEO Roland Busch and NVIDIA Founder and CEO Jensen Huang shared their vision for an “industrial metaverse” at the launch of the Siemens Xcelerator business platform in Munich. “When we combine the real and digital worlds we can achieve new levels…”


Categories
Misc

NVIDIA, Partners Show Leading AI Performance and Versatility in MLPerf

NVIDIA and its partners continued to provide the best overall AI training performance and the most submissions across all benchmarks with 90% of all entries coming from the ecosystem, according to MLPerf benchmarks released today. The NVIDIA AI platform covered all eight benchmarks in the MLPerf Training 2.0 round, highlighting its leading versatility.


Categories
Misc

The Full Stack Optimization Powering NVIDIA MLPerf Training v2.0 Performance

Learn about the full-stack optimizations that enabled the NVIDIA platform to deliver even more performance in MLPerf Training v2.0.

MLPerf benchmarks are developed by a consortium of AI leaders across industry, academia, and research labs, with the aim of providing standardized, fair, and useful measures of deep learning performance. MLPerf training focuses on measuring time to train a range of commonly used neural networks for the following tasks:

  • Natural language processing
  • Speech recognition
  • Recommender systems
  • Biomedical image segmentation
  • Object detection
  • Image classification
  • Reinforcement learning

Lower training times are important to speed time to deployment, minimizing the total cost of ownership and maximizing return on investment.

However, just as important as a platform’s performance is its versatility. The ability to train every model, as well as provide infrastructure fungibility to run all AI workloads from training to inference, is critical to allowing organizations to maximize return on their infrastructure investments.

The NVIDIA platform, with full-stack innovation and a rich developer and application ecosystem, continues to be the only one to submit results on all eight MLPerf Training tests, as well as to submit on all MLPerf Inference and MLPerf high-performance computing (HPC) tests.

In this post, you will learn about the methods that NVIDIA deployed across the entire stack to deliver more performance in MLPerf v2.0.

Full stack improvements

NVIDIA MLPerf v2.0 submissions are based on the proven A100 Tensor Core GPU and the NVIDIA DGX A100 system, as well as the NVIDIA DGX SuperPOD reference architecture. Many partners also submitted results using the A100 Tensor Core GPU.

Through continued innovation across the entire stack, including system software, libraries, and algorithms, NVIDIA has yet again delivered performance improvements compared to prior submissions using the same A100 Tensor Core GPU.

Compared to NVIDIA MLPerf v0.7 submissions, which marked the first A100 Tensor Core GPU submissions, results showed gains of up to 2.1x on a per-chip basis, and 5.7x for max-scale training (Table 1).

Benchmark | v2.0 Max-Scale Time to Train (min) (vs. v1.1) (vs. v0.7) | v2.0 Per-Accelerator Time to Train (min) (vs. v1.1) (vs. v0.7)
Recommendation (DLRM) | 0.59 (1.08x) (5.66x) | 12.78 (A100) (1.07x) (2.08x)
Natural Language Processing (BERT) | 0.21 (1.10x) (4.48x) | 126.95 (A100) (1.26x) (2.69x)
Image Classification (ResNet-50 v1.5) | 0.32 (1.09x) (2.38x) | 217.82 (A100) (1.07x) (1.46x)
Speech Recognition (RNN-T) | 2.15 (1.10x) (NA*) | 230.07 (A100) (1.19x) (NA*)
Image Segmentation (3D U-Net) | 1.22 (1.13x) (NA*) | 170.23 (A100) (1.20x) (NA*)
Object Detection – Lightweight (RetinaNet) | 4.25 (NA**) (NA**) | 675.18 (A100) (NA**) (NA**)
Object Detection – Heavyweight (Mask R-CNN) | 3.09 (1.05x) (3.39x) | 327.34 (A100) (1.13x) (2.01x)
Reinforcement Learning (MiniGo) | 16.23 (0.95x) (1.05x) | 2045.37 (A100) (1.04x) (1.17x)
Table 1. NVIDIA MLPerf v2.0 training times

MLPerf v1.1 submission details:
Per-Accelerator: BERT: 1.1-2066 | DLRM: 1.1-2064 | Mask R-CNN: 1.1-2066 | Resnet50 v1.5: 1.1-2065 | RNN-T: 1.1-2066 | 3D U-Net: 1.1-2065 | MiniGo: 1.1-2067
Max-Scale: BERT: 1.1-2083 | DLRM: 1.1-2073 | Mask R-CNN: 1.1-2076 | Resnet50 v1.5: 1.1-2082 | SSD: 1.1-2070 | RNN-T: 1.1-2080 | 3D U-Net: 1.1-2077 | MiniGo: 1.1-2081

MLPerf v2.0 submission details:
Per-Accelerator: BERT: 2.0-2070 | DLRM: 2.0-2068 | Mask R-CNN: 2.0-2070 | Resnet50 v1.5: 2.0-2069 | RetinaNet: 2.0-2091 | RNN-T: 2.0-2066 | 3D U-Net: 2.0-2060 | MiniGo: 2.0-2059
Max-Scale: BERT: 2.0-2106 | DLRM: 2.0-2098 | Mask R-CNN: 2.0-2099 | Resnet50 v1.5: 2.0-2107 | RetinaNet: 2.0-2103 | RNN-T: 2.0-2104 | 3D U-Net: 2.0-2100 | MiniGo: 2.0-2105

Per-Accelerator performance for A100 is computed using 8xA100 server time-to-train multiplied by 8.
(*) 3D U-Net and RNN-T were not part of MLPerf v0.7.
(**) RetinaNet was not part of either MLPerf v0.7 or v1.1.
MLPerf name and logo are trademarks. For more information, see www.mlperf.org.

The following sections highlight some of the work done to achieve these improvements.

BERT

The latest NVIDIA BERT submission took advantage of the following optimizations:  

  • Sequence packing 
  • Fusion of fully connected and GELU layers 

Sequence packing

In previous rounds, the overhead of the padding required to fill out each batch had already been reduced by introducing an unpadding optimization. Unpadding, however, results in dynamically sized buffers, as the total number of tokens is no longer fixed.

This is not an issue when CUDA graphs are not needed, such as when large batch sizes are used. However, for small batch sizes, where CUDA graphs are used to reduce CPU overhead, dynamically sized buffers would require a separate graph for each possible size. To take advantage of CUDA graphs efficiently while minimizing the overhead of padding, NVIDIA used the concept of sequence packing this round.

In MLPerf Bidirectional Encoder Representations from Transformers (BERT), a training sample is restricted to 512 tokens, but it often contains fewer. As the training sequences have different lengths, it is possible to fit more than one sequence within a 512-token sample.

Sequence packing requires that the length distribution of the training set sequences be known in advance. The sequences can be merged into packed samples such that none of the merged samples exceed a length of 512 tokens.

NVIDIA used a similar packing algorithm that another submitter used in MLPerf v1.1. Adopting an algorithm from a different submitter was made possible by the high degree of general-purpose programmability of GPUs.

To strike a good balance between implementation complexity and performance, up to three sequences were packed into each sample. This results in each training sample containing a varying number of sequences; a batch of three samples, for example, can contain anywhere from three to nine sequences.
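
A simplified greedy version of this idea is sketched below; the actual submission uses a more careful packing strategy derived from the known sequence-length histogram:

```python
def pack_sequences(lengths, max_tokens=512, max_seqs_per_pack=3):
    """First-fit-decreasing packing of sequences into samples of at most 512 tokens,
    with at most three sequences per packed sample (illustrative sketch).
    Returns a list of packs, each a list of original sequence indices."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs, pack_lens = [], []
    for idx in order:
        for p, used in enumerate(pack_lens):
            if used + lengths[idx] <= max_tokens and len(packs[p]) < max_seqs_per_pack:
                packs[p].append(idx)
                pack_lens[p] += lengths[idx]
                break
        else:  # no existing pack fits; open a new one
            packs.append([idx])
            pack_lens.append(lengths[idx])
    return packs

print(pack_sequences([120, 200, 180, 500]))  # -> [[3], [1, 2, 0]]
```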

CUDA graphs require buffer sizes to be fixed across time for each graph. The varying number of total sequences was handled by creating a separate graph for each possible number of sequences within a batch.

For large-scale training, we used a batch size of two per chip. This translates into five to seven separate graphs, which is much less than would be required with the unpadding optimization mentioned at the beginning.
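
A condensed sketch of this bucketing pattern with PyTorch's CUDA graph API is shown below; the model, buffer shapes, and bucket keys are placeholders, and the real submission manages considerably more state per captured graph:

```python
import torch

graphs, static_in, static_out = {}, {}, {}

def capture_graph_for_bucket(model, num_sequences, batch_size=2, max_tokens=512):
    """Capture one CUDA graph per possible number of packed sequences in a batch."""
    x = torch.zeros(batch_size, max_tokens, dtype=torch.long, device="cuda")
    # Warm up on a side stream before capture (standard PyTorch CUDA-graph recipe).
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        model(x)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out[num_sequences] = model(x)
    graphs[num_sequences], static_in[num_sequences] = g, x

def run_step(num_sequences, input_ids):
    static_in[num_sequences].copy_(input_ids)  # refill the fixed-size buffer
    graphs[num_sequences].replay()             # replay the pre-captured graph
    return static_out[num_sequences]
```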

Overall, this technique improved the results for large-scale runs by 10% and 33% for 4096-GPU and 1024-GPU scenarios, respectively.

Fusion of fully connected and GELU layers

BERT uses a Gaussian error linear unit (GELU) activation function that follows a fully connected layer. In prior submissions, the GELU activation function was implemented as a single kernel. This approach required additional memory transactions for both input reads and output writes.

In this round, NVIDIA implemented the fusion of a fully connected layer (a matrix multiply operation) with the GELU activation function. This eliminated the need for a large number of memory read and write operations, yielding a 2-4% increase in overall throughput – larger gains are observed for larger per-chip batch sizes.

In general, it is more efficient to fuse activation math into the end of matrix multiply operations, which means fusing the GELU activation function into different fully connected layers (Figure 1).

An example of matrix multiply operations, in which dgrad and wgrad correspond to the data and weight gradient computations for fully connected layers, respectively. The diagram does not show matmul operations with no fusions (such as wgrad_FC1).
Figure 1. Left: Pattern of operations in BERT, middle: Fusion graph in forward pass, right: Fusion graph in backward pass. Each box represents a single kernel.

Deep learning recommendation models

The latest NVIDIA deep learning recommendation models (DLRM) submission again leverages NVIDIA Merlin HugeCTR, an optimized open-source deep neural network training framework for recommender systems.

Kernel fusions

Multilayer perceptrons (MLP) represent a key building block for DLRM. To reduce the number of trips to global memory, fusions of elementwise kernels and general matrix multiply (GEMM) kernels have been widely employed.

The NVIDIA cuBLAS library has recently introduced a new fusion type: GEMM with DReLU (fusing ReLU gradient computation with matrix multiply operations in the backward pass). HugeCTR takes advantage of this new fusion type to enhance the performance of MLPs.

Improved overlap of computation and communication

Increasing GPU utilization is important to provide the highest performance.

In this latest submission, NVIDIA notably improved the overlap of computation and communication in the evaluation of hybrid embeddings to improve GPU utilization. Specifically, the execution of the dense network in iteration i is overlapped with the execution of the embedding in iteration i+1 through pipelining, increasing GPU utilization.

This overlapping is possible because there are no inter-iteration dependencies in the evaluation phase.

Additionally, several of the key kernels in the forward/backward hierarchical all-to-all operations for hybrid embedding were also optimized.

Figure 2. Overlapping execution of the embedding and dense network over three iterations

ResNet-50

For ResNet-50, we employed the following optimizations to boost performance:  

  • Better max-scale training configuration
  • Faster cuDNN kernels

Better max-scale training configuration

When a model is trained at large scale, if the number of training images in the dataset is not an integer multiple of the global batch size, extra data is added to the last iteration of the epoch to keep the batch size consistent across iterations. This wasted computation can be avoided if an integer multiple of the global batch size falls close to the dataset size. This is especially important for larger-scale training, where the global batch size is relatively large.

In this round of MLPerf, we concluded that using 527 nodes with a global batch size of 67,456 significantly reduced wasted computation, resulting in a performance boost of 3.5% compared to NVIDIA’s ResNet-50 submission in MLPerf v1.1.
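
As a back-of-the-envelope check (assuming the standard ImageNet-1k training set of 1,281,167 images and a per-GPU batch size of 16, which is an illustrative assumption):

```python
dataset_size = 1_281_167          # ImageNet-1k training images (assumed)
gpus = 527 * 8                    # 527 nodes x 8 A100 GPUs = 4,216 GPUs
global_batch = 67_456             # 4,216 GPUs x 16 images per GPU
steps_per_epoch = -(-dataset_size // global_batch)            # ceil division -> 19
padded_samples = steps_per_epoch * global_batch - dataset_size
print(steps_per_epoch, padded_samples)                        # 19 steps, only 497 padded samples
```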

Faster cuDNN kernels

For the ResNet-50 submission, NVIDIA significantly improved the kernels selected by cuDNN. This includes both better kernel choices for the given layer sizes and optimized kernel implementations for different tile sizes.

With these optimized kernels, we observed over 4% throughput improvement in the large-scale configurations compared to MLPerf v1.1.

RetinaNet

The NVIDIA RetinaNet submission takes advantage of several software optimizations, including:  

  • Channel last memory format and automatic mixed precision
  • Using fusion for speed-up
  • Optimizing loss block
  • Asynchronous scoring
  • CUDA Graphs

Channel last memory format and automatic mixed precision

To avoid memory reorganization and to effectively increase peak performance, NVIDIA used the PyTorch channel last memory format (NHWC instead of NCHW) and the PyTorch automatic mixed precision (AMP).
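
Both features are standard PyTorch APIs; a minimal sketch (with a toy model standing in for RetinaNet) looks like this:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.ReLU(), torch.nn.Conv2d(64, 3, 3, padding=1)
).cuda().to(memory_format=torch.channels_last)       # NHWC ("channels last") layout

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()                 # loss scaling for mixed precision

images = torch.randn(8, 3, 224, 224, device="cuda").to(memory_format=torch.channels_last)
targets = torch.randn(8, 3, 224, 224, device="cuda")

with torch.cuda.amp.autocast():                      # run the forward pass in mixed precision
    loss = torch.nn.functional.mse_loss(model(images), targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```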

Using fusion for speed-up

For the RetinaNet submission, NVIDIA took advantage of several fusion opportunities. The cuDNN runtime fusion through the Apex library was used to fuse CONV-bias-ReLU and CONV-bias patterns, and PyTorch NVFuser was used to fuse element-wise operations, such as scale-bias-ReLU and scale-bias-add-ReLU.

The cuDNN runtime fusion Python interface can be found in the Apex repository (import ConvBiasReLU or ConvBias from apex.contrib.conv_bias_relu).

Optimizing loss block

The RetinaNet loss-related calculations are separated into two stages: ground truth data preprocessing and the actual loss calculation.

As the ground truth data preprocessing is not dependent on the model output, part of the ground truth data processing was offloaded to DALI with custom functions, enabling it to be performed asynchronously, improving system resource utilization. The remainder of the preprocessing was reimplemented and then merged into the model graph to avoid jitter.

For the loss calculation, an optimized focal loss implementation was used, which can be found in the Apex library.
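
For reference, an unoptimized focal loss looks roughly like the following; the Apex version is a fused, optimized implementation of the same idea:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Reference sigmoid focal loss for binary targets (not the fused Apex kernel)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```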

Asynchronous scoring

The RetinaNet submission guidelines require that evaluation (inference and scoring) be performed after each training epoch. The scoring time overhead is significant due to the large number of images and bounding boxes in the OpenImages validation dataset, as well as the sequential implementation of the scoring code.

To mitigate the scoring overhead, particularly in large-scale execution, asynchronous scoring was implemented so that the next training epoch masks the previous epoch's scoring procedure.

Figure 3. Execution of asynchronous COCO scoring during the MLPerf RetinaNet evaluation phase
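
Conceptually, the scheme looks like the sketch below, where scoring of epoch N runs in a worker process while epoch N+1 trains (the training, inference, and scoring functions are placeholders passed in by the caller):

```python
from concurrent.futures import ProcessPoolExecutor

def train_with_async_scoring(num_epochs, train_one_epoch, run_inference, score, target_map=0.34):
    """Overlap (slow) evaluation scoring of the previous epoch with the next training epoch."""
    with ProcessPoolExecutor(max_workers=1) as pool:
        pending = None
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
            predictions = run_inference(epoch)
            if pending is not None and pending.result() >= target_map:
                return True                            # a previous epoch already reached the target
            pending = pool.submit(score, predictions)  # scoring overlaps the next training epoch
        return pending is not None and pending.result() >= target_map
```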

CUDA Graphs

CUDA Graphs were used extensively in the NVIDIA RetinaNet submission. The model's forward and backward passes were graph captured, as were portions of the ground truth preprocessing; the latter required code adaptation to fit CUDA graph constraints.

For more information, see Accelerating PyTorch with CUDA Graphs.

Mask R-CNN

The NVIDIA Mask R-CNN submissions utilized several techniques to improve performance:  

  • Bottleneck block optimizations
  • RPN head fusion
  • Evaluation
  • Top-K

Bottleneck block optimizations

The ResNet backbone is built as a stack of bottleneck blocks, each composed of three sequential convolutions. Each convolution is followed by a batch-norm and a ReLU. The batch-norm modules have four parameters, and some math is required to compute a couple of intermediate terms in the forward method.

As the batch-norms are frozen, the parameters never change, meaning that the intermediate terms do not change either. To save time, these intermediate terms were computed just one time.
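
Since the statistics and affine parameters of a frozen batch-norm never change, its per-channel scale and bias can be folded into constants once. A minimal sketch of the idea (not the Apex bottleneck-block code):

```python
import torch

def precompute_frozen_bn_terms(bn: torch.nn.BatchNorm2d):
    """Fold a frozen batch-norm into a per-channel scale and bias:
    y = gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + bias."""
    with torch.no_grad():
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        bias = bn.bias - bn.running_mean * scale
    return scale, bias

bn = torch.nn.BatchNorm2d(64).eval()
scale, bias = precompute_frozen_bn_terms(bn)
x = torch.randn(2, 64, 8, 8)
assert torch.allclose(x * scale.view(1, -1, 1, 1) + bias.view(1, -1, 1, 1), bn(x), atol=1e-6)
```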

Backpropagation of ReLU involves creating and applying a mask. In earlier versions of the code, this mask was stored in half (FP16) precision. In this round, the DReLU mask is represented as a Boolean rather than FP16, reducing memory transactions.

During back propagation, data gradients and weight gradients were computed for each of the three convolution layers. NVIDIA found empirically that while the data gradient GPU kernels were launched with a sufficient number of CTAs to fully use the GPUs, the weight gradient kernels were launched with far fewer CTAs.

One optimization that was implemented was to launch the data gradient kernels first, and then launch all three weight gradient kernels on separate streams so that they ran concurrently. This reduced total computation time for weight gradients.

These optimizations are available for PyTorch users in the bottleneck block module in Apex.

RPN head fusion

A new Apex module, which fuses convolution, bias, and ReLU, was implemented, as discussed in the RetinaNet section. This module was used in Mask R-CNN as well to fuse the forward propagation of some of the layers in the RPN head block.

Evaluation

Evaluation, on average, takes almost as much time as training. Evaluation is done asynchronously on dedicated nodes, but the results are shared with the training nodes through a blocking broadcast.

To minimize any wait time for evaluation results, the training nodes wait a certain number of steps before they start blocking on the evaluation broadcast. The learning rate curve has two inflection points, and it is extremely unlikely that the model will have converged before passing the last one, so the check for evaluation results is deferred as long as possible, until training has passed the last learning rate inflection point.

Top-K

In earlier versions of PyTorch, the number of CTAs launched by the top-k kernel was proportional to the per-GPU batch size. This yielded poor performance when the batch size equaled 1, the batch size that is always used for NVIDIA max-scale runs.

This issue was addressed in previous rounds with a two-stage top-k method, which was implemented in Python, but this solution did not generalize well. Work on a more general solution was already underway.

In this round, NVIDIA worked with the PyTorch team to ensure that the new top-k implementation that yielded far better performance for a batch size of 1 made it into PyTorch. When this was complete, the prior two-stage top-k implementation was replaced with the new PyTorch module.

3D U-Net

3D U-Net has multiple large layers with an input channel count of 32. For wgrad kernels, using the default 64x256x64 tile size meant significant tile-size quantization loss.

Thanks to the introduction of new 32x256x32 wgrad kernels in cuDNN, this tile-size quantization loss was avoided. This resulted in a speedup of over 5% on a single node in MLPerf v2.0 compared to MLPerf v1.1.

RNN-T

The preprocessing step of Recurrent Neural Network Transducer (RNN-T) is relatively intensive. Thanks to DALI, most of the preprocessing overhead can be pipelined and hidden under the main training loop.

However, because the size of the input data can vary, the internal memory buffers needed to be reallocated after the initial iterations, increasing the length of the warm-up phase.

DALI has recently switched to a memory-pool based allocator, where the pool is managed using the cuMem API. This significantly reduces the overhead of allocating new buffers, yielding a much faster warm-up process in training.

Conclusion

Thanks to optimizations across the stack, the NVIDIA platform was yet again able to boost performance in MLPerf Training v2.0 using the proven NVIDIA A100 Tensor Core GPU and NVIDIA DGX A100 platforms.

NVIDIA continues to be the only platform to submit results across the entire MLPerf benchmarking suite, including MLPerf Training, MLPerf Inference, and MLPerf HPC. This showcases the performance and versatility of the entire platform, which is crucial as modern AI becomes pervasive across every computing domain.

In addition to providing the software used for NVIDIA MLPerf submissions in the MLPerf repository, dozens of additional models optimized for NVIDIA GPUs are also available on the NGC hub.

The NVIDIA platform is also ubiquitous, providing customers with the choice of where to run models. NVIDIA A100 is available from all major server makers and cloud service providers, allowing you to deploy on-premises, in the cloud, in a hybrid environment, or at the edge.

Categories
Misc

Siemens and NVIDIA to Enable Industrial Metaverse

Siemens, a leader in industrial automation and software, infrastructure, building technology and transportation, and NVIDIA, a pioneer in accelerated graphics and artificial intelligence (AI), today announced an expansion of their partnership to enable the industrial metaverse and increase the use of AI-driven digital twin technology that will help bring industrial automation to a new level.