Torch-TensorRT is a PyTorch integration for TensorRT inference optimizations on NVIDIA GPUs. With just one line of code, it can speed up inference by up to 6x.
I’m excited about Torch-TensorRT, the new integration of PyTorch with NVIDIA TensorRT, which accelerates inference with one line of code. PyTorch is a leading deep learning framework today, with millions of users worldwide. TensorRT is an SDK for high-performance deep learning inference across GPU-accelerated platforms, running in data center, embedded, and automotive devices. This integration gives PyTorch users extremely high inference performance through a simplified workflow when using TensorRT.
Figure 1. PyTorch models can be compiled with Torch-TensorRT on various NVIDIA platforms
What is Torch-TensorRT
Torch-TensorRT is an integration for PyTorch that leverages inference optimizations of TensorRT on NVIDIA GPUs. With just one line of code, it provides a simple API that gives up to 6x performance speedup on NVIDIA GPUs.
This integration takes advantage of TensorRT optimizations, such as FP16 and INT8 reduced precision, while offering a fallback to native PyTorch when TensorRT does not support the model subgraphs.
How Torch-TensorRT works
Torch-TensorRT acts as an extension to TorchScript. It optimizes and executes compatible subgraphs, letting PyTorch execute the remaining graph. Torch-TensorRT uses PyTorch’s comprehensive and flexible feature set to parse the model and apply optimizations to the TensorRT-compatible portions of the graph.
After compilation, using the optimized graph is like running a TorchScript module and the user gets the better performance of TensorRT. The Torch-TensorRT compiler’s architecture consists of three phases for compatible subgraphs:
Lowering the TorchScript module
Conversion
Execution
Lowering the TorchScript module
In the first phase, Torch-TensorRT lowers the TorchScript module, simplifying implementations of common operations to representations that map more directly to TensorRT. It is important to note that this lowering pass does not affect the functionality of the graph itself.
Figure 2. Parsing and transforming TorchScript’s graph
Conversion
In the conversion phase, Torch-TensorRT automatically identifies TensorRT-compatible subgraphs and translates them to TensorRT operations:
Nodes with static values are evaluated and mapped to constants.
Nodes that describe tensor computations are converted to one or more TensorRT layers.
The remaining nodes continue to run in TorchScript, forming a hybrid graph that is returned as a standard TorchScript module.
Figure 3. Mapping Torch’s ops to TensorRT ops for the fully connected layer
The modified module is returned to you with the TensorRT engine embedded, which means that the whole model—PyTorch code, model weights, and TensorRT engines—is portable in a single package.
Figure 4. Transforming the Conv2d layer into TensorRT engine while log_sigmoid falls back to TorchScript JIT
Execution
When you execute your compiled module, Torch-TensorRT sets up the engine live and ready for execution. When you execute this modified TorchScript module, the TorchScript interpreter calls the TensorRT engine and passes all the inputs. The engine runs and pushes the results back to the interpreter as if it was a normal TorchScript module.
Figure 5. Runtime execution of PyTorch and TensorRT ops
Torch-TensorRT features
Torch-TensorRT introduces the following features: support for INT8 and sparsity.
Support for INT8
Torch-TensorRT extends the support for lower precision inference through two techniques:
Post-training quantization (PTQ)
Quantization-aware training (QAT)
For PTQ, TensorRT uses a calibration step that executes the model with sample data from the target domain. It tracks the activations in FP32 to calibrate a mapping to INT8 that minimizes the information loss between FP32 and INT8 inference. TensorRT applications require you to write a calibrator class that provides sample data to the TensorRT calibrator.
Torch-TensorRT uses existing infrastructure in PyTorch to make implementing calibrators easier. LibTorch provides a DataLoader and Dataset API, which streamlines preprocessing and batching input data. These APIs are exposed through C++ and Python interfaces, making it easier for you to use PTQ. For more information, see Post Training Quantization (PTQ).
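As a rough sketch of how this looks with the Torch-TensorRT Python API (class and argument names follow the published PTQ documentation and may differ between releases; model, calib_dataloader, the cache file, and the input shape are placeholders for your own objects):

import torch
import torch_tensorrt

# calib_dataloader is assumed to be a torch.utils.data.DataLoader that yields
# representative samples from the target domain.
calibrator = torch_tensorrt.ptq.DataLoaderCalibrator(
    calib_dataloader,
    cache_file="./calibration.cache",
    use_cache=False,
    algo_type=torch_tensorrt.ptq.CalibrationAlgo.ENTROPY_CALIBRATION_2,
    device=torch.device("cuda:0"),
)

# model is assumed to be a TorchScript module; the input shape is illustrative.
trt_int8_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.int8},
    calibrator=calibrator,
)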
For QAT, TensorRT introduced new APIs: QuantizeLayer and DequantizeLayer, which map the quantization-related ops in PyTorch to TensorRT. Operations like aten::fake_quantize_per_*_affine are converted into QuantizeLayer + DequantizeLayer by Torch-TensorRT internally. For more information about optimizing models trained with PyTorch’s QAT technique using Torch-TensorRT, see Deploying Quantization Aware Trained models in INT8 using Torch-TensorRT.
Sparsity
The NVIDIA Ampere architecture introduces third-generation Tensor Cores on NVIDIA A100 GPUs that exploit fine-grained sparsity in network weights. They offer the maximum throughput of dense math without sacrificing the accuracy of the matrix multiply-accumulate operations at the heart of deep learning.
TensorRT supports registering and executing some sparse layers of deep learning models on these Tensor Cores.
Torch-TensorRT extends this support for convolution and fully connected layers.
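As an illustrative sketch only (the sparse_weights flag and its exact behavior are an assumption based on the Torch-TensorRT compile options and may vary by release), enabling sparse kernels at compile time could look like this:

import torch
import torch_tensorrt

# model is assumed to have weights already pruned to the 2:4 structured-sparsity
# pattern (for example, with NVIDIA ASP) before compilation; shape is illustrative.
trt_sparse_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},
    sparse_weights=True,  # assumption: lets TensorRT select sparse tactics
)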
Example: Throughput comparison for image classification
In this post, you perform inference through an image classification model called EfficientNet and calculate the throughputs when the model is exported and optimized by PyTorch, TorchScript JIT, and Torch-TensorRT. For more information, see the end-to-end example notebook on the Torch-TensorRT GitHub repository.
Installation and prerequisites
To follow these steps, you need the following resources:
A Linux machine with an NVIDIA GPU, compute capability 7.0 or later
Docker installed, 19.03 or later
A Docker container with PyTorch, Torch-TensorRT, and all dependencies pulled from the NGC Catalog
Now that you have a live bash terminal in the Docker container, launch an instance of JupyterLab to run the Python code. Launch JupyterLab on port 8888 and set the token to TensorRT. Keep the IP address of your system handy to access JupyterLab’s graphical user interface on the browser.
Navigate to this IP address on your browser with port 8888. If you are running this example on a local system, navigate to localhost:8888.
After you connect to JupyterLab’s graphical user interface on the browser, you can create a new Jupyter notebook. Start by installing timm, a PyTorch library containing pretrained computer vision models, weights, and scripts. Pull the EfficientNet-b0 model from this library.
pip install timm
Import the relevant libraries and create a PyTorch nn.Module object for EfficientNet-b0.
import torch
import torch_tensorrt
import timm
import time
import numpy as np
import torch.backends.cudnn as cudnn
torch.hub._validate_not_a_forked_repo=lambda a,b,c: True
efficientnet_b0 = timm.create_model('efficientnet_b0',pretrained=True)
You get predictions from this model by passing a tensor of random floating-point numbers to the forward method of this efficientnet_b0 object.
model = efficientnet_b0.eval().to("cuda")
detections_batch = model(torch.randn(128, 3, 224, 224).to("cuda"))
detections_batch.shape
This returns a tensor of shape [128, 1000], corresponding to 128 samples and 1,000 classes.
To benchmark this model through both PyTorch JIT and Torch-TensorRT AOT compilation methods, write a simple benchmark utility function:
cudnn.benchmark = True

def benchmark(model, input_shape=(1024, 3, 512, 512), dtype='fp32', nwarmup=50, nruns=1000):
    input_data = torch.randn(input_shape)
    input_data = input_data.to("cuda")
    if dtype == 'fp16':
        input_data = input_data.half()
    print("Warm up ...")
    with torch.no_grad():
        for _ in range(nwarmup):
            features = model(input_data)
    torch.cuda.synchronize()
    print("Start timing ...")
    timings = []
    with torch.no_grad():
        for i in range(1, nruns + 1):
            start_time = time.time()
            pred_loc = model(input_data)
            torch.cuda.synchronize()
            end_time = time.time()
            timings.append(end_time - start_time)
            if i % 10 == 0:
                print('Iteration %d/%d, avg batch time %.2f ms' % (i, nruns, np.mean(timings) * 1000))
    print("Input shape:", input_data.size())
    print('Average throughput: %.2f images/second' % (input_shape[0] / np.mean(timings)))
You are now ready to perform inference on this model.
Inference using PyTorch and TorchScript
First, take the PyTorch model as it is and calculate the average throughput for a batch size of 1:
model = efficientnet_b0.eval().to("cuda")
benchmark(model, input_shape=(1, 3, 224, 224), nruns=100)
The same steps can be repeated with the TorchScript JIT module and then with the Torch-TensorRT compiled module.
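Here is a minimal sketch of both steps. It follows the pattern in the example notebook, though the exact torch_tensorrt.compile arguments may differ between Torch-TensorRT versions:

# Trace the model with TorchScript and benchmark the JIT module
traced_model = torch.jit.trace(model, torch.randn((1, 3, 224, 224)).to("cuda"))
benchmark(traced_model, input_shape=(1, 3, 224, 224), nruns=100)

# Compile with Torch-TensorRT in FP16 and benchmark the optimized module
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224), dtype=torch.half)],
    enabled_precisions={torch.half},  # run the TensorRT engine in FP16
)
benchmark(trt_model, input_shape=(1, 3, 224, 224), dtype="fp16", nruns=100)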
Here are the results that I’ve achieved on an NVIDIA A100 GPU with a batch size of 1.
Figure 6. Comparing throughput of native PyTorch with Torch-TensorRT on an NVIDIA A100 GPU with batch size 1
Summary
With just one line of code for optimization, Torch-TensorRT accelerates the model performance up to 6x. It ensures the highest performance with NVIDIA GPUs while maintaining the ease and flexibility of PyTorch.
Interested in trying it on your model? Download Torch-TensorRT from the PyTorch NGC container to accelerate PyTorch inference with TensorRT optimizations, and no code changes.
TensorRT 8.2 optimizes HuggingFace T5 and GPT-2 models. With TensorRT-accelerated GPT-2 and T5, you can generate excellent human-like texts and build real-time translation, summarization, and other online NLP applications within strict latency requirements.
The transformer architecture has wholly transformed (pun intended) the domain of natural language processing (NLP). Over the recent years, many novel network architectures have been built on the transformer building blocks: BERT, GPT, and T5, to name a few. With increasing variety, the size of these models has also rapidly increased.
While larger neural language models generally yield better results, deploying them for production poses serious challenges, especially for online applications where a few tens of ms of extra latency can negatively affect the user experience significantly.
With the latest TensorRT 8.2, we optimized T5 and GPT-2 models for real-time inference. You can turn the T5 or GPT-2 models into a TensorRT engine, and then use this engine as a plug-in replacement for the original PyTorch model in the inference workflow. This optimization leads to a 3–6x reduction in latency compared to PyTorch GPU inference, and a 9–21x reduction compared to PyTorch CPU inference.
In this post, we give you a detailed walkthrough of how to achieve the same latency reduction, using our newly published example scripts and notebooks based on Hugging Face transformers for the tasks of open-ended text generation with GPT-2 and translation and summarization with T5.
Introduction to T5 and GPT-2
In this section, we briefly explain the T5 and GPT-2 models.
T5 for answering questions, summarization, translation, and classification
T5 or Text-To-Text Transfer Transformer is a recent architecture created by Google. It reframes all natural language processing (NLP) tasks into a unified text-to-text format where the input and output are always text strings. T5’s architecture enables applying the same model, loss function, and hyperparameters to any NLP task such as machine translation, document summarization, question answering, and classification tasks such as sentiment analysis.
The T5 model was inspired by the fact that transfer learning has produced state-of-the-art results in NLP. The principle behind transfer learning is that a model pretrained on abundantly available unlabeled data with self-supervised tasks can be fine-tuned for specific tasks on smaller task-specific labeled datasets. These models have proven to deliver better results than models trained from scratch on task-specific datasets.
Based on the concept of transfer learning, Google proposed the T5 model in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. In this paper, they also introduced the Colossal Clean Crawled Corpus (C4) dataset. The T5 model, pretrained on this dataset, achieves state-of-the-art results on many downstream NLP tasks. Published pretrained T5 models range up to 3B and 11B parameters.
GPT-2 for generating excellent human-like texts
Generative Pre-Trained Transformer 2 (GPT-2) is an auto-regressive unsupervised language model originally proposed by OpenAI. It is built from the transformer decoder blocks and trained on very large text corpora to predict the next word in a paragraph. It generates excellent human-like texts. Larger GPT-2 models, with the largest reaching 1.5B parameters, generally write better, more coherent texts.
Deploying T5 and GPT-2 with TensorRT
With TensorRT 8.2, we optimize the T5 and GPT-2 models by building and using a TensorRT engine as a drop-in replacement for the original PyTorch model. We walk you through the scripts and Jupyter notebooks, which are based on Hugging Face transformers, and highlight the important bits. For more information, see the example scripts and notebooks for a detailed step-by-step execution guide.
Setting up
The most convenient way to get started is by using a Docker container, which provides an isolated, self-contained, and reproducible environment for the experiments.
These commands start the Docker container and JupyterLab. Open the JupyterLab interface in your web browser:
http://<host-ip-address>:8888/lab/
In JupyterLab, to open a terminal window, choose File, New, Terminal. Compile and install the TensorRT OSS package:
cd $TRT_OSSPATH
mkdir -p build && cd build
cmake .. -DTRT_LIB_DIR=$TRT_LIBPATH -DTRT_OUT_DIR=`pwd`/out
make -j$(nproc)
Now you are ready to proceed with experimenting with the models. In the following sequence, we demonstrate the steps for the T5 model. The following code blocks are not meant to be copy-paste runnable but rather walk you through the process. For reproduction purposes, see the notebooks on the GitHub repository.
At a high level, optimizing a Hugging Face T5 and GPT-2 model with TensorRT for deployment is a three-step process:
Download models from the HuggingFace model zoo.
Convert the model to an optimized TensorRT execution engine.
Carry out inference with the TensorRT engine.
Use the generated engine as a plug-in replacement for the original PyTorch model in the HuggingFace inference workflow.
Download models from the HuggingFace model zoo
First, download the original Hugging Face PyTorch T5 model from HuggingFace model hub, together with its associated tokenizer.
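For example, using the transformers library (t5-small is used here as a stand-in for whichever published T5 checkpoint you want to optimize):

from transformers import T5ForConditionalGeneration, T5Tokenizer

t5_variant = "t5-small"  # placeholder; any published T5 checkpoint works
tokenizer = T5Tokenizer.from_pretrained(t5_variant)
t5_model = T5ForConditionalGeneration.from_pretrained(t5_variant)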
You can then employ this model for various NLP tasks, for example, translating from English to German:
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
# Generate sequence for an input
outputs = t5_model.to('cuda:0').generate(inputs.input_ids.to('cuda:0'))
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
TensorRT 8.2 supports GPT-2 up to the “xl” version (1.5B parameters) and T5 up to 11B parameters, which are publicly available on the HuggingFace model zoo. Larger models can also be supported subject to GPU memory availability.
Converting the model to an optimized TensorRT execution engine
Before converting the model to a TensorRT engine, you convert the PyTorch model to an intermediate universal format. ONNX is an open format for machine learning and deep learning models. It enables you to convert deep learning and machine-learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single unified format.
Converting to ONNX
For the T5 model, convert the encoder and decoder separately using a utility function.
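The repository provides dedicated helper utilities for this step; the following is only a rough, generic sketch of the same idea for the encoder using plain torch.onnx.export (the wrapper class, file name, and argument choices are illustrative and not the repository’s utility, which also handles the decoder, past key/value inputs, and generation-specific details):

import torch

class EncoderWrapper(torch.nn.Module):
    # Thin wrapper so the exported graph returns a plain tensor
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, input_ids):
        return self.encoder(input_ids=input_ids).last_hidden_state

dummy_ids = torch.ones(1, 16, dtype=torch.long)  # placeholder input
torch.onnx.export(
    EncoderWrapper(t5_model.encoder.eval()),
    (dummy_ids,),
    "t5_encoder.onnx",
    input_names=["input_ids"],
    output_names=["hidden_states"],
    dynamic_axes={"input_ids": {0: "batch", 1: "sequence"},
                  "hidden_states": {0: "batch", 1: "sequence"}},
    opset_version=13,
)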
Now you are ready to parse the T5 ONNX encoder and decoder and convert them to optimized TensorRT engines. As TensorRT carries out many optimizations, such as fusing operations, eliminating transpose operations, and kernel auto-tuning to find the best performing kernel on a target GPU architecture, this conversion process might take a while.
Similarly, for the GPT-2 model, you can follow the same process to generate a TensorRT engine. The optimized TensorRT engines can be used as a plug-in replacement for the original PyTorch models in the HuggingFace inference workflow.
TensorRT transformer optimization specifics
Transformer-based models are a stack of either transformer encoder or decoder blocks. Encoder (decoder) blocks have the same architecture and number of parameters. T5 consists of stacks of transformer encoders and decoders, while GPT-2 is composed of only transformer decoder blocks (Figure 1).
Figure 1a. T5 architecture
Figure 1b. GPT-2 architecture
Each transformer block, also known as the self-attention block, consists of three projections using fully connected layers to project the input into three different subspaces, termed query (Q), key (K), and value (V). These matrices are then transposed, with Qᵀ and Kᵀ being used to compute the normalized dot-product attention values, before being combined with Vᵀ to produce the final output (Figure 2).
Figure 2. Self-attention block
TensorRT optimizes the self-attention block by pointwise layer fusion:
Reduction is fused with power ops (for LayerNorm and residual-add layer).
Scale is fused with softmax.
GEMM is fused with ReLU/GELU activations.
Additionally, TensorRT also optimizes the network for inference:
Eliminating transpose ops.
Fusing the three KQV projections into a single GEMM.
When FP16 mode is specified, controlling layer-wise precisions to preserve accuracy while running the most compute-intensive ops in FP16.
TensorRT vs. PyTorch CPU and GPU benchmarks
With the optimizations carried out by TensorRT, we’re seeing up to 3–6x speedup over PyTorch GPU inference and up to 9–21x speedup over PyTorch CPU inference.
Figure 3 shows the inference results for the T5-3B model at batch size 1 for translating a short phrase from English to German. The TensorRT engine on an A100 GPU provides a 21x reduction in latency compared to PyTorch running on a dual-socket Intel Platinum 8380 CPU.
Figure 3. T5-3B model inference comparison. TensorRT on A100 GPU provides a 21x smaller latency compared to PyTorch CPU inference.
CPU: Intel Platinum 8380, 2 sockets. GPU: NVIDIA A100 PCI Express 80GB. Software: PyTorch 1.9, TensorRT 8.2.0 EA. Task: “Translate English to German: that is good.”
Conclusion
In this post, we walked you through converting the Hugging Face PyTorch T5 and GPT-2 models to an optimized TensorRT engine for inference. The TensorRT inference engine is used as a drop-in replacement for the original HuggingFace T5 and GPT-2 PyTorch models and provides up to 21x CPU inference speedup. To achieve this speedup for your model, get started today with TensorRT 8.2.
I have a project in which I will make a model that classifies Knee MRIs from the OAI dataset. My university provided me the dataset.
The classification is about knee osteoarthritis. The model must assign a grade (from 0 to 4) to each MRI.
I am facing a problem as I am not able to find the labels in the MRI files that were given to me. Is the label somewhere in the metadata or the header of the DICOM files (MRIs) and I cannot find it, or did my professor forget to send me an extra file containing the labels?
In this round of MLPerf training v1.1, optimizations across the entire stack, including hardware, system software, libraries, and algorithms, continue to boost NVIDIA MLPerf training performance.
Five months have passed since v1.0, so it is time for another round of the MLPerf training benchmark. In this v1.1 edition, optimization across the entire hardware and software stack delivers continued improvement across the benchmarking suite for submissions based on the NVIDIA platform. This improvement is observed consistently at all scales, from single machines all the way to industrial supercomputers such as NVIDIA Selene, consisting of 560 NVIDIA DGX A100 systems, and the Microsoft Azure NDm A100 v4 cluster, consisting of 768 A100-based nodes.
Increasingly, organizations are using the MLPerf benchmarks to guide their AI infrastructure strategies. MLPerf (part of MLCommons) is a global consortium of AI leaders from academia, research labs, and industry whose mission is to build fair and useful benchmarks that provide unbiased evaluations of training and inference performance for hardware, software, and services—all conducted under prescribed conditions. To stay on the cutting edge of industry trends, MLPerf continues to evolve, holding new tests at regular intervals and adding new workloads that represent the state of the art in AI.
As in the previous rounds of the MLPerf benchmark, this post provides a technical deep dive into the optimization work underlying NVIDIA’s industry-leading performance. For more information about previous rounds, see the following posts:
By continuing to disclose and elaborate on these technical details, NVIDIA demonstrates a strong commitment to open and fair community-driven benchmarking standards and practices for the advancement of AI for the public good.
Optimization across the entire stack
With the building blocks still centered around the now well-established NVIDIA A100 GPU, the NVIDIA DGX A100 platform and the NVIDIA SuperPod reference architecture, optimizations across the entire stack, especially on the system software, libraries and algorithmic fronts, have led to continuing performance improvements of NVIDIA-based platforms in MLPerf v1.1.
Compared to our own MLPerf v0.7 submissions 1 year ago, we observed up to 2.1x improvement on a chip-to-chip basis and up to 5.3x for max-scale training, as shown in Table 1.
Table 1. NVIDIA MLPerf v1.1 speedups over MLPerf v0.7 (table not reproduced here). Per-accelerator performance for A100 is computed using the NVIDIA 8x A100 server time-to-train multiplied by 8 (**). U-Net and RNN-T were not part of MLPerf v0.7. The MLPerf name and logo are trademarks. For more information, see www.mlperf.org.
The next sections cover some highlights.
CUDA Graphs
In MLPerf v1.0, we extensively used CUDA Graphs for most of the benchmarks. CUDA Graphs launches several kernels as a single executable unit, accelerating throughput by minimizing communication with the CPU. But the scope of each graph was only a portion of one full iteration, which processes a single minibatch. As a result, only part of an iteration was captured, as each iteration was broken down into multiple CUDA graphs.
In MLPerf v1.1, we used CUDA Graphs to capture an entire iteration into a single graph for multiple benchmarks, further minimizing communication with the CPU during training and improving the performance at scale. This was implemented for both the PyTorch and MXNet benchmarks, resulting in up to 6% performance gains in the ResNet-50 and BERT workloads.
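As a simplified PyTorch sketch of whole-iteration capture (the tiny model, optimizer, and static buffers below are placeholders and this is not the MLPerf submission code), the pattern looks like this:

import torch

model = torch.nn.Linear(128, 10).cuda()          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()
static_input = torch.randn(32, 128, device="cuda")
static_target = torch.randint(0, 10, (32,), device="cuda")

# Warmup on a side stream so workspaces and gradients are allocated first
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        loss_fn(model(static_input), static_target).backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward, backward, and optimizer step as one graph
g = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = loss_fn(model(static_input), static_target)
    static_loss.backward()
    optimizer.step()

# Replay: copy a new batch into the static buffers, then relaunch the graph
static_input.copy_(torch.randn(32, 128, device="cuda"))
g.replay()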
NCCL
NCCL, part of NVIDIA Magnum IO technologies, is the library that optimizes inter-GPU communication for your server topology. A key feature in NCCL that was added earlier this year was support for CUDA Graphs. This enabled us to capture the entire iteration as a single graph, as described in the previous section.
Previously, NCCL copied all the weights from the graph and performed an all-reduce function, which sums all the weights. The updated weights are then written back to the graph. This required multiple copies of the data.
We have now introduced user buffer registration, where pointers are used by NCCL collectives to avoid copying the data back and forth when used alongside the Scalable Hierarchical Aggregation And Reduction Protocol (SHARP), also part of NVIDIA Magnum IO. In the presence of CUDA Graphs and SHARP, we observed about 2% of end-to-end additional speedup.
NCCL has also implemented fusing scaling ops (multiplying by a scalar) into the communication kernels to reduce data copies, resulting in up to an additional ~3% end-to-end savings in communication-heavy networks like BERT.
Fine-grained overlapping
In this round, we have strongly leveraged the capabilities of GPU hardware that enables the fine-grained overlap of independent computation blocks with each other across multiple cores, as well as increased overlap of communication and computation. This improved the performance, especially of max-scale training, up to 10% on Mask R-CNN and 27% on DLRM.
For the recommender systems benchmark (DLRM) in particular, we made use of the capabilities of software and hardware to use GPU resources efficiently by overlapping multiple operations:
Overlapping embedding index computations with the all-reduce collective of the previous iteration
Overlapping data gradient and weight gradient computations
Increased overlapping of math and other multi-GPU collectives, such as all-to-all
For 3D-UNet, spatial parallelism performance is improved by more efficient scheduling of math and communication kernels for increased overlap of the two.
For Mask R-CNN, we have implemented overlapping of loss computation for mask head, bounding-box head, and RPN-head for improved GPU utilization at scale.
We have significantly improved multi-GPU group batch norm (GBN) performance through more efficient memory copies (vectorization) and better overlap of communication and math within the kernel. This enables scaling the workloads to more GPUs, resulting in more than 10% savings of max-scale training for some computer vision benchmarks such as ResNet50 and SSD and 5% savings for 3D-UNet.
Kernel fusion and optimization
Finally, for the first time in this MLPerf round, we have introduced the fusion of bias gradient reduction into matrix multiplication kernels (fusing of two operations). This results in up to 3% performance improvements.
Model optimization specifics
In this section, we dive into the optimization work on each of the workloads.
BERT
Fusing bias gradient reduction into matrix multiplication in backward pass
The cuBLAS library recently introduced a new type of fusion: fusing bias gradient computation and weight gradient computation in the same kernel.
In this round, we used this cuBLAS feature to fuse these two operations in the backward pass. We also fused bias addition and matrix multiplication in the forward pass. Figure 1 shows the fused operations for the forward pass and backward pass.
Figure 1. Fusing other operations with matrix multiplication in the forward pass (left) and backward pass (right)
Improved fused multihead attention
In the previous round, we implemented fusing of the multihead attention module. This module used parallelism across the num_sequences and num_heads variables. This means that there are a total of num_sequences*num_heads thread blocks that are scheduled simultaneously on different streaming multiprocessors (SMs) on the GPU. num_heads is 16 in the MLPerf BERT model, and when num_sequences is smaller than 6, there are not enough thread blocks to fill the GPU, limiting the parallelism.
In this round, we improved these kernels by introducing slicing across the sequence dimension for batched matrix multiplications required for attention computation, which helped increase the parallelism proportionately. This optimization resulted in an ~8% end-to-end speedup for max-scale training scenarios, where the per-chip batch size is small.
Full iteration graph capture with CUDA Graphs
As mentioned in the previous section on CUDA Graphs, in this round, we captured the full iteration of BERT into a single CUDA graph. This was possible due to CUDA Graphs support in the NCCL communications library, as well as the PyTorch framework. This resulted in ~3% end-to-end savings due to reduced CPU latency and jitter at scale. On top of that, we also leveraged NCCL user-buffer preregistration feature when CUDA Graphs is used, resulting in another ~2% of end-to-end performance improvement.
Setting model parameter buffers to point to a contiguous flat buffer
BERT uses a distributed optimizer to speed up optimization steps. For best all-gather performance, the intermediate buffers used in a distributed optimizer for weight parameters should all be part of a single contiguous flat buffer. This way, instead of running several all-gather functions on small separate tensors, we can better use GPU interconnects by running all-gather for one large message.
On the other hand, PyTorch by default allocates separate buffers for each of the model’s parameter tensors to be used during the forward pass. This requires an extra “unflattening” step, as shown in Figure 2, between the end of an iteration and the beginning of the next one.
In this MLPerf round, we used a single contiguous buffer where each parameter tensor is placed next to each other as part of one big buffer. This removes the need for the extra unflattening step, as shown in Figure 2. This optimization results in ~4% end-to-end performance savings for BERT at max-scale configurations, where the cost of optimizer and parameter copies is most pronounced.
Figure 2. How the parameter tensors are represented in memory between several different steps of an iteration, before and after the optimization
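As a minimal illustration of the idea (not the actual distributed-optimizer code), a single flat buffer can be allocated once and each parameter re-pointed at a view into it:

import torch

def flatten_params(params):
    # Place every parameter tensor in one contiguous buffer so a single
    # all-gather can cover all of them at once.
    params = list(params)
    flat = torch.zeros(sum(p.numel() for p in params),
                       dtype=params[0].dtype, device=params[0].device)
    offset = 0
    for p in params:
        n = p.numel()
        flat[offset:offset + n].copy_(p.data.reshape(-1))
        # Re-point the parameter at a view into the flat buffer
        p.data = flat[offset:offset + n].view_as(p)
        offset += n
    return flat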
DLRM
HugeCTR, a dedicated training framework for recommendation systems and part of NVIDIA Merlin, continues to power the NVIDIA DLRM submission.
Hybrid embedding index precomputing
In the previous MLPerf round, we implemented hybrid embedding to reduce the communication between GPUs.
Even though the hybrid embedding implemented in HugeCTR significantly reduces the communication traffic, it requires indices to be calculated to determine where to read and distribute the embedding vectors stored on each GPU. The index calculation only relies on the input data, which is prefetched onto the GPU a few iterations ahead. Therefore, in HugeCTR, index precomputing is leveraged as an optimization to hide the cost of computing the indices under the communication kernels of the previous iteration.
Sharing the same spirit as index precomputing in training iterations, the hybrid embedding indices for evaluation can be computed and cached when the evaluation is performed for the first time. They can be reused for the remaining evaluations, which completely removes the cost of computing the indices for subsequent evaluations.
Better overlapping between communication and computation
In DLRM, to facilitate model-parallel training, two all-to-all collectives are needed in the forward and backward phases, respectively. In addition, there is an all-reduce collective at the end of the training iteration for the data-parallel part of the model. Overlapping computation with these communication collectives is key to achieving high GPU utilization and high training throughput. A few optimizations have been made to enable better overlap. Figure 3 shows a simplified timeline for one training iteration.
In the forward propagation phase, the bottom MLP is performed while the forward all-to-all kernel is waiting for the data to arrive. In the backward propagation phase, all-reduce and all-to-all are overlapped to increase the utilization of the network. The index precomputation is also scheduled to overlap with these two communication collectives to use the idle resources on the GPU, maximizing the training throughput.
Figure 3. A simplified timeline view of a training iteration, illustrating fine-grained overlap of compute and communication operations.
Asynchronous weight gradients computing
The data gradient computation and weight gradient computation of an MLP are two independent branches of computations that share the same input. Unlike the data gradients, weight gradients are not needed until the gradient all-reduce. Thanks to the flexibility of the GPU in scheduling kernels, these two computations are performed in parallel in HugeCTR, maximizing the utilization of the GPU.
Better fusions
Kernel fusion is an effective way to reduce trips to memory, improving the GPU utilization. Many fusion patterns have been leveraged in DLRM to achieve better performance previously. For example, the data gradient calculation, ReLU backward operation and bias gradient calculation can be fused together in HugeCTR through cuBLAS. Such a cross-layer fusion pattern leaves the last bias gradient computation unfused.
In this round, the GEMM and bias gradient fusion supported in cuBLAS is leveraged to fuse the bias gradient computation into the weight gradient computation for the last layer of an MLP.
Another fusion example is weight conversion fusion. To support mixed-precision training, the FP32 master weights must be casted into FP16 weights during training. As an optimization, this precision casting is fused with the SGD optimizer in HugeCTR. Whenever an FP32 master weight is updated, it writes out an FP16 version of the updated weight into the memory, eliminating the need for a separate kernel for doing the conversion.
Mask R-CNN
We made several optimizations to Mask R-CNN in this round, described in the following sections.
Using NHWC layout for all the convolutional layers
The ResNet-50 backbone has used the NHWC layout for a long time, but the rest of the model used NCHW up until now.
This round we were able to switch the FPN module (which immediately follows the ResNet-50 backbone) to NHWC. Running FPN in NHWC means we can transpose the outputs instead of the inputs, which is more efficient because the inputs are much larger than the outputs. This change boosted the performance by 4-5% for the max-scale configurations.
Using dedicated nodes exclusively for evaluation for multinode scenario
Although evaluation overlaps with training, Mask R-CNN evaluation is a resource-intensive process. Inevitably, training performance suffers slightly when evaluation is running simultaneously. For max-scale configurations, evaluation takes almost as long as training. Having evaluation constantly running in the background significantly affects the training performance.
One way to overcome this issue is to use a separate set of nodes for evaluation, that is, one set of nodes does training and a smaller set of nodes does evaluation. Implementing this change for max-scale configurations boosted the end-to-end performance by 12%.
Using multithreaded COCO evaluation
The COCO evaluation function consumes most of the evaluation time and is run separately on the bounding box and segmentation mask results. A couple of rounds ago we overlapped these two evaluation calls by running them in multiple processes.
This round, we enabled multithreaded processing with OpenMP for the COCO evaluation loop. This is an optional feature in the NVIDIA version of the COCO API software. The evaluation loop can be parallelized by providing an optional argument that specifies the desired number of threads. This optimization improves evaluation speed by about 10%, but only the last evaluation is exposed, so the effect on end-to-end time is much smaller, about 0.5%.
Two-stage top-K computation with quadrupled occupancy for small batch-size runs
We make a couple of top-K calls in Mask R-CNN that take a long time due to low occupancy. The number of cooperative thread arrays, or CTAs (thread blocks) launched by the top-K kernel is proportional to the per-GPU batch size. Max-scale configurations use a per-GPU batch size of 1, which results in only five CTAs being launched. Each CTA is assigned one SM, while the A100 has more than 100 SMs, suggesting low utilization of the GPU.
To alleviate this, we implemented a two-stage approach:
In the first stage, we split the input into four equal parts and then we do top-K on each part with a single call.
In the second stage, we concatenate the four temporary results and take top-K of that.
This yields the same result as before but runs more than 3x faster because we are now launching 20 CTAs instead of 5 in the first stage. Dividing the input further makes the first stage faster, but also makes the second stage slower.
Splitting the input eight ways instead of four means that 40 CTAs are launched instead of 20 in the first stage. The first stage completes in half the time but, unfortunately, the second stage becomes so much slower that overall performance is better with a four-way split. Implementing a four-way split for the max-scale configuration resulted in a 3-4% performance boost.
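The following PyTorch snippet illustrates the two-stage idea for the top-K values (the real kernels also track indices; the four-way split mirrors the configuration described above):

import torch

def two_stage_topk_values(scores, k, num_splits=4):
    # Stage 1: run top-k on each split; more independent chunks mean more
    # thread blocks and better GPU occupancy at tiny batch sizes.
    partial = torch.cat(
        [chunk.topk(min(k, chunk.shape[-1])).values
         for chunk in scores.chunk(num_splits, dim=-1)],
        dim=-1,
    )
    # Stage 2: top-k over the concatenated partial results
    return partial.topk(min(k, partial.shape[-1])).values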
Overlapping loss computations for mask head, bounding-box head, and RPN-head
Most of the GPU kernels launched by Mask R-CNN suffer from low occupancy when the batch size is small. One way to alleviate this is to overlap execution of as many kernels as possible to take advantage of the GPU resources that would otherwise go idle.
Some of the loss calculations can be done simultaneously. This is true for mask-head loss, bounding-box loss, and RPN-head loss, so we place each of these three loss calculations on different CUDA streams so that they can be executed simultaneously. This boosted the performance by about 5% for max-scale configurations.
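Conceptually, the stream placement looks like the following sketch, where fake_loss stands in for the actual head loss computations:

import torch

def fake_loss(x):
    # Stand-in for a head's loss computation (mask, box, or RPN)
    return (x * x).mean()

mask_in = torch.randn(1024, 256, device="cuda")
box_in = torch.randn(1024, 256, device="cuda")
rpn_in = torch.randn(1024, 256, device="cuda")

s_mask, s_box, s_rpn = torch.cuda.Stream(), torch.cuda.Stream(), torch.cuda.Stream()
for s in (s_mask, s_box, s_rpn):
    # Each side stream waits for work already queued on the default stream
    s.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s_mask):
    mask_loss = fake_loss(mask_in)
with torch.cuda.stream(s_box):
    box_loss = fake_loss(box_in)
with torch.cuda.stream(s_rpn):
    rpn_loss = fake_loss(rpn_in)

# Join the side streams before combining the losses on the default stream
for s in (s_mask, s_box, s_rpn):
    torch.cuda.current_stream().wait_stream(s)
total_loss = mask_loss + box_loss + rpn_loss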
3D-UNet
Vectorized concatenate and split operation
3D-UNet uses concatenate operations to concatenate the decoder and encoder activations. This results in device-to-device copies for activation tensors in the forward and backward passes. We optimized these copies by using vectorized loads and stores, doing a 4x wider read/write operation. This speeds up the concat and split operators by over 2.4x, giving an end-to-end speedup of 4.7% on single node configuration and 1.3% on max-scale configuration.
Efficient spatial-parallel convolutions
In MLPerf v1.0, we introduced spatial-parallel convolution, where we split the input activation across multiple GPUs (eight, to be precise). The implementation of spatial-parallel convolution enabled us to hide the halo exchanges behind the convolutions.
In MLPerf v1.1, we optimized the scheduling of communication and convolution operations so that we get a much better overlap between the launched communication and convolution kernels. While this makes sure that the halo exchanges are not exposed, it also helps reduce the jitter significantly. This optimized scheduling improved scores in max-scale configuration by over 25%.
Figure 4. Spatial-parallel convolution: Halo Transfer completely hidden behind convolutions
Spatial-parallel loss calculations
3D-UNet uses the DICE loss and softmax cross-entropy loss as its loss functions. The DICE loss is defined as the following formula:

L_DICE = 1 − (2 · Σ_i p_i·g_i) / (Σ_i p_i + Σ_i g_i)

In this formula, p_i and g_i represent pairs of corresponding pixel values of the prediction and ground truth, respectively.
In the max-scale configuration, because a single GPU works on just a slice of the image, each GPU holds only a slice of p and g. To optimize the loss calculation, we calculated the partial terms on each GPU independently and exchanged the partial terms among all the GPUs in the group through NVLink. These partial terms were then combined to form the DICE loss results. This sped up the loss calculation by more than 4x, improving the max-scale score by 7%.
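A minimal sketch of this pattern (assuming torch.distributed is already initialized for the spatial-parallel group; this is an illustration, not the MLPerf implementation):

import torch
import torch.distributed as dist

def spatial_parallel_dice_loss(pred_slice, target_slice, eps=1e-6):
    # Each GPU computes partial sums over its own spatial slice
    partial = torch.stack([
        (pred_slice * target_slice).sum(),
        pred_slice.sum(),
        target_slice.sum(),
    ])
    # Combine the partial terms across all GPUs in the spatial-parallel group
    dist.all_reduce(partial)
    intersection, pred_sum, target_sum = partial
    # eps is a small constant added in this sketch for numerical stability
    return 1.0 - (2.0 * intersection + eps) / (pred_sum + target_sum + eps)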
Better configurations
We increased the global batch size to be a factor of dataset size. The DALI data loader library enables us to use the same shard to train for different epochs. This enables us to reduce the time it takes to cache the dataset in the GPU significantly.
Because each GPU loads much fewer images, the Bounding Box cache in DALI warms up much more quickly too. This optimization reduced the startup time significantly and resulted in 20% speedup over MLPerf v1.0.
Data-parallel async evaluation
As training gets faster, scaling evaluation to hide behind training becomes challenging. In MLPerf v1.1, the inferences on a single image were sharded across the GPUs to improve the evaluation scaling. The results of inferences were then all-gathered to form the final output. This enables the entire evaluation phase to be hidden behind training iterations.
Faster group instance norm
The multi-GPU Instancenorm kernel was improved significantly by parallelizing the inter-GPU communication of multiple channel-blocks and reducing the DRAM time of the kernel by vectorizing memory reads and writes. This resulted in throughput improvement of over 5% in the max-scale configuration.
ResNet-50
End-to-end CUDA graphs
For ResNet-50, as the benchmark scales out to >256 nodes, the per-GPU batch size reduces to a very small value, where the iteration time is only ~8-10ms. At these extremely small iteration times, it is critical to ensure there are no gaps in GPU execution arising from dependencies running on the CPU.
For MLPerf v1.1, we reduced jitter at scale by using end-to-end CUDA Graphs to capture an entire iteration across the forward pass, backward pass, optimizer, and Horovod/NCCL gradient all-reduce as a single graph. The use of CUDA Graphs provides a 6% performance benefit at max-scale training.
GBN
As the scale increases for ResNet50 and the local batch sizes decrease, to achieve the fastest possible convergence, we use the GBN technique. For every BatchNorm layer, the mean and variance is all-reduced among a group of GPUs.
For MLPerf v1.1, the performance of GBN within a single DGX node was significantly improved by parallelizing the inter GPU communication of multiple channel-blocks and reducing the DRAM time of the kernel by vectorizing memory reads and writes. This provided a 10% performance benefit at scale.
SSD
On-GPU image caching
Image networks make heavy use of image cropping and resizing to capture features that represent richer statistics of the dataset and to improve the generalization capability of the model.
In the previous MLPerf rounds, SSD used the NVIDIA Data Loading Library (DALI) image decoding features to decode only the cropped region of the JPG image. This feature avoids wasting time decoding the entire image, especially if the crop is small.
However, this means that this cropped image is only used one time as the image is not cached in memory. Future uses of the original image will likely have different cropped regions, meaning that the original image will be decoded each time it is used. This behavior leads to jitter across GPUs as the decoding cost varies greatly on the size of the region required by the crop. This is particularly pronounced for scale-out scenarios.
For this round, we took advantage of the 80-GB memory capacity available to each NVIDIA A100 80-GB GPU by using another DALI feature that decodes the entire image and caches it in memory. This enables future uses of the same image to avoid the decoding cost and instead pick the cropped region directly from memory. Doing this is cheaper than decoding the cropped region each time and has much less run-to-run and device-to-device variation in execution time.
Overall, this optimization resulted in 2% end-to-end performance improvement in our single node configuration and ~5% improvement in our efficient-scale configuration, which is in between single-node and max-scale in terms of scale.
GBN for SSD
SSD also took advantage of the GBN improvements implemented in ResNet-50, which gave a ~4% E2E improvement in our max-scale configuration.
RNN-T
More optimized apex transducer
The transducer module in apex has been further optimized to improve the training throughput. Two optimizations have been added to the transducer joint and transducer loss module, respectively.
Transducer joint, ReLU, and dropout are three consecutive memory-bound operations in RNN-T. As an optimization, ReLU and dropout have been fused with the transducer joint in the apex.contrib.transducer.TransducerJoint module in apex, effectively cutting the trips to the memory.
The backward propagation of the transducer loss is a memory-intensive operation. An optimization of vectorizing the loads and stores in the backward operation has been added to apex.contrib.transducer.TransducerLoss in apex, improving the memory bandwidth utilization of the kernel.
More data preprocessing on GPU
Preprocessing the next batch on the CPU while the GPU is busy with a forward and backward pass can hide data preprocessing time. This is ideal. However, when data preprocessing is compute-intensive, the preprocessing time on the CPU might get exposed.
DALI can help unload the CPU by computing parts of the preprocessing on GPU, leveraging the massively parallel processing nature of the GPU. In this submission, the silence trimming operation was moved to GPU, improving the training throughput.
Conclusion
Building upon the well-established and proven NVIDIA A100 GPU and NVIDIA DGX A100 platforms, optimization across the stacks continues to deliver performance improvements across the board for NVIDIA platform-based submissions in this round of MLPerf v1.1 training benchmark.
It is worth reiterating that the NVIDIA platform has been the only solution to make submissions across all workloads in the MLPerf benchmarking suite, demonstrating both industry-leading performance and versatility.
All software used for NVIDIA submissions is available from the MLPerf repository, to enable you to reproduce our benchmark results. We constantly add these cutting-edge MLPerf improvements into our deep learning frameworks containers available on NGC, our software hub for GPU-optimized applications.
Use unsupervised AI and time series modeling to create microtargeted models for every user and account, both the humans and the machine/account combinations running on your network.
Traditional approaches to finding and stopping threats are no longer effective enough. One reason is that the number of ways an attacker can enter a system and do damage has proliferated as the interconnections between apps and systems have proliferated.
Applying AI to the problem seems like a natural choice but this in some sense broadens the data problem. A typical user may interact with 100 or more apps while doing their job, and integrations between apps means that there may be tens of thousands of interconnections and permissions shared across those 100 apps. If you have 10,000 users, you’d need 10,000 models as a beginning.
The good news is that NVIDIA Morpheus addresses this problem. NVIDIA recently announced an update to Morpheus, an end-to-end tool to apply data science to cybersecurity problems.
While most apps and systems will create logs, the variety, volume, and velocity of these logs means that much of the response possible is “closing the barn door after the horse has left.” Identifying credential breaches and the damage done can take weeks if you’re lucky, months if you’re average.
With any number of users beyond “modest” or “very modest,” traditional rule-based systems to create warnings are insufficient. A person who knew how another person or system typically behaves could notice something fishy almost immediately when that user or system started doing something that was unusual.
Every account has a digital fingerprint: a typical set of things it does or doesn’t do in a specific sequence in time. This problem is no longer addressed by just strong passwords that reset periodically, a table of rules, and periodic drop-sized spot checks of logs from the ocean of log data.
The problem is understanding every user’s day-to-day, moment-by-moment work. This is a data science problem.
Model ensemble for multiple methods
10,000 models are daunting enough. But if we’re committed to approaching the cybersecurity problem like the serious data science problem it is, one model isn’t enough. The state of the art in the most critical data-science problems is ensembling multiple models.
A model ensemble is where models are combined in some way to give better predictions than a single model could. The “wisdom of the crowd” turns out to be just as true for a crowd of machine learning methods all trying to predict the same thing.
In the case of identifying the digital fingerprint of a malevolent attack, Morpheus takes two different models and uses them to alert human analysts to possible serious danger. One method is only a few years old and the other is several hundred years old:
Because attacks seek to hide their behavior by mimicking a given account’s behavior, autoencoders test how typical a given user’s behavior is as a flat snapshot.
Because an attack is temporal, Fourier transforms are used to understand the typical behavior over time.
Method 1: Autoencoders
In the specific example that is enabled with Morpheus, an autoencoder is trained on AWS CloudTrail data. The CloudTrail logs are nested JSON objects that can be transformed into tabular data. The data fields can vary widely across time and users. This requires the flexibility that neural net methods provide and the preprocessing speed of RAPIDS, a part of the Morpheus platform. The particular neural net method that Morpheus deploys in this use case is an autoencoder.
At a high level, an autoencoder is a type of neural network that tries to extract noise from a given datum and reconstruct that datum in an approximated form without that noise while being as true as possible to the actual datum.
For example, think of a photograph with scratches over the surface. A good autoencoder reproduces the underlying picture without the scratches. A well-trained autoencoder, one that knows its domain well, has low “loss” or error as it reconstructs a given datum.
In this case, you take a given user’s typical behavior, take out the “noise” of slight variation, and reproduce that digital fingerprint. Each encoding event has a loss or error associated with it, like any statistics problem.
To deploy this solution, update the pretrained model that comes with Morpheus with a period of typical, attack-free data for each user/service and machine/service interaction. Move these models to the NVIDIA Triton Inference Server layer of Morpheus.
What may be surprising is that the actual reconstruction is discarded and only the loss number is preserved. A user-defined threshold is set to flag accounts for review by a human. The default option is a classic Z-score: is the loss four standard deviations higher than the average loss for this user?
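In code form, the default flagging rule amounts to a simple Z-score test on the reconstruction loss (a sketch for illustration, not the Morpheus source):

import numpy as np

def flag_account(current_loss, historical_losses, z_threshold=4.0):
    # Flag the account when the reconstruction loss sits z_threshold standard
    # deviations above that user's historical average loss.
    mean = np.mean(historical_losses)
    std = np.std(historical_losses)
    return (current_loss - mean) / (std + 1e-9) > z_threshold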
Figure 2. The combinatorial explosion of the modern enterprise and its security requirements
Method 2: Fast Fourier Transforms
Figure 3. Two timepoints of anomalous activity captured by the Morpheus framework that likely would go undetected without time series analysis like FFT
Fast Fourier transforms (FFTs) distill the essential behavior of a wave under the data noise. Fourier analysis was developed in the late 1700s and has continued to be invaluable to the applied mathematical analysis of fields as diverse as finance, traffic engineering, economics, and in this case, cybersecurity.
A given time series is decomposable into various components, showing regular seasonal, weekly, and hourly variations along with a trend. Decomposing a time series enables analysts to understand whether something that goes up and down a lot over time is actually growing despite constant oscillations. They can also understand whether the time series is of interest to the cybersecurity use case, and whether there is something truly unusual going on beyond the normal ebb and flow of traffic.
Machine application activity tends to oscillate over time, and attacker activities can be difficult to detect among the periodic noise in data with just a volumetric alert. To find subtle anomalies inside periodic data, you transform the data from the time domain to the frequency domain using FFT. You then reconstruct the signal back to the time domain (with iFFT) but use only the top 90% of frequencies. A large difference between the original signal and the reconstructed signal indicates the times at which the machine’s activity is unusual and potentially compromised by malicious human activity.
Morpheus applies FFTs by learning what a normal period or periods of activity looks like for a given user/service and machine/service system interaction. After this, GPUs perform decomposition quickly and apply a rolling Z-score to the transformed data to flag periods that are anomalous. For reference, CuPy FFT decompositions are as much as 120x faster than comparable operations done through NumPy. For more information, see FFT Speedtest comparing Tensorflow, PyTorch, CuPy, PyFFTW and NumPy.
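The following NumPy sketch illustrates the idea on a single activity series; Morpheus performs the equivalent on the GPU with CuPy, whose FFT API mirrors NumPy’s. The way the “top 90% of frequencies” cutoff is computed here is an assumption for illustration:

import numpy as np

def fft_anomaly_scores(activity, keep_fraction=0.9):
    # Move the per-account activity series into the frequency domain
    spectrum = np.fft.rfft(activity)
    magnitudes = np.abs(spectrum)
    # Zero out the weakest frequencies, keeping roughly the top 90%
    cutoff = np.quantile(magnitudes, 1.0 - keep_fraction)
    spectrum[magnitudes < cutoff] = 0
    # Reconstruct the signal; large residuals mark unusual time windows
    reconstructed = np.fft.irfft(spectrum, n=len(activity))
    return np.abs(np.asarray(activity, dtype=float) - reconstructed)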
Putting it all together
Morpheus is a tool to aid human analysts. This means that it is at its most useful when it sends the right amount of data to a person.
Figure 4. NVIDIA Morpheus workflow
Returning to the ensembling discussion from earlier, Morpheus uses a voting ensemble. Data flagged by both models with the most urgency is sent to the human security team. This enables a force multiplier for cybersecurity red teams by directing their valuable time to threats as they unfold in real time, rather than weeks or months later.
The cybersecurity data problem is like refining ore from silt: you begin with a huge amount of nothing and as you sift, refine, and assay, you get to something actually worth looking at. While we wouldn’t suggest that system intrusions are gold, we know that the time of analysts is.
Effective defense requires intelligence tools to aid traceability and prioritization. The ensembling of sophisticated methods that Morpheus deploys does just that. This means reduced financial, reputational, and operational risk for enterprises that deploy Morpheus.
Try it out
Morpheus ships with the code, data, and models for you to see how the use case works and to get a feel for how Morpheus would work for your enterprise. Using the earlier workflow, we observed a micro-F1 score of 1. In addition, across multiple experiments, we saw a near 0% rate of false attribution (machine compared to human).
Beyond the state-of-the-art data science and prebuilt models, Morpheus is designed to be a platform for cybersecurity data science. It seamlessly combines a suite of NVIDIA and Cyber Log Accelerator (CLX) technologies to make deployment easy and fast.
Keep in mind that these models, particularly the FFT model, cannot start totally cold and must be given some amount of data that is representative of a normal, attack-free stream of CloudTrail logs.
This is just the beginning of what Morpheus can do to stop the hacking specter haunting enterprises. It is easy to imagine that, in the near future, even more models are deployed simultaneously for even greater predictive accuracy. For access to the latest release of NVIDIA Morpheus, register for the expanded early access program.