DataBloom - Part 467

Misc

MLPerf v1.0 Training Benchmarks: Insights into a Record-Setting NVIDIA Performance

Post author By
Post date June 30, 2021
No Comments on MLPerf v1.0 Training Benchmarks: Insights into a Record-Setting NVIDIA Performance

MLPerf v1.0 showcases the continuous innovation that is happening in the AI domain. In the last two-and-a-half years since the first MLPerf training benchmark launched, NVIDIA performance has increased by nearly 7x. In this post, we describe some of the major optimizations that enabled such improvements.

MLPerf is an industry-wide AI consortium tasked with developing a suite of performance benchmarks that cover a range of leading AI workloads widely in use. The latest MLPerf v1.0 training round includes vision, language and recommender systems, and reinforcement learning tasks. It is continually evolving to reflect the state-of-the-art AI applications.

NVIDIA submitted MLPerf v1.0 training results for all eight benchmarks, as is our tradition. In fact, systems built upon the NVIDIA AI platform are the only commercially available systems to make submissions across the board.

Compared to our previous MLPerf v0.7 submissions, we improved up to 2.1x on a chip-to-chip basis and up to 3.4x at scale. We set 16 performance records with eight on a per-chip basis and eight at-scale training in the commercially available solutions category.

Benchmark	Max Scale Records (min) DGX SuperPod	Per Accelerator Records (min) A100
Recommendation (DLRM)	0.99	15.3
NLP (BERT)	0.32	169.2
Image classification (ResNet-50 v1.5)	0.4	219.0
Speech recognition – Recurrent (RNN-T)	2.75	309.6
Image Segmentation (3D U-Net)	3.00	229.1
Object detection lightweight (SSD)	0.48	66.5
Object detection heavyweight (Mask R-CNN)	3.95	400.2
Reinforcement Learning	15.53	2156.3

Table 1. NVIDIA MLPerf AI records. (*)

This is the second MLPerf training round featuring NVIDIA A100 GPUs. Our continual year-over-year improvement on the same hardware is a lively testament to the strength of the NVIDIA platform and commitment to continuous software improvement. As in previous MLPerf rounds, NVIDIA engineers developed a host of innovations to achieve these new levels of performance:

Extending CUDA Graphs across all benchmarks. Neural networks are traditionally launched as individual kernels from the CPU to execute on the GPU. In MLPerf v1.0, we launched the entire sequence of kernels as a graph on the GPU, minimizing communication with the CPU.
Using SHARP to double the effective interconnect bandwidth between nodes. SHARP offloads collective operations from the CPU and GPU to the network and eliminates the need for sending data multiple times between endpoints.

The CUDA graph and SHARP enhancements enabled us to increase our scale to a record number of 4096 GPUs used to solve a single AI network.

Spatial parallelism enabled us to split a single image across eight GPUs for massive image segmentation networks like 3D U-Net and to use more GPUs for higher throughput.
Among hardware improvements, new HBM2e GPU memory on the NVIDIA A100 GPU increased memory bandwidth by nearly 30% to 2 TBps.

This post provides insights into many of the optimizations used to deliver the outstanding scale and performance. Many of these improvements are available on NGC, which is the hub for NVIDIA GPU-optimized software. You can realize the benefits of these optimizations in your real-world applications, instead of just observing better benchmark scores from the sideline.

At-scale training

Large-scale training requires system hardware and software to be precisely tuned to work together and support the unique performance requirements that arise at scale. NVIDIA made major advances on both dimensions, which are now available for production use.

On the system side, the key building block of our at-scale training is the NVIDIA DGX SuperPOD. DGX SuperPOD is the culmination of years of expertise in HPC and AI data centers. It is based on the NVIDIA DGX A100 with the latest NVIDIA A100 Tensor Core GPU, third-generation NVIDIA NVLink, NVSwitch, and the NVIDIA ConnectX-6 VPI 200 Gbps HDR InfiniBand. These were combined to make Selene a top 5 supercomputer in the Top 500 supercomputer list, with the following components:

4480 NVIDIA A100 Tensor Core GPUs
560 NVIDIA DGX A100 systems
850 Mellanox 200G HDR InfiniBand switches

On the software side, the NGC container release v. 21.05 enhances and enables several capabilities:

Distributed optimizer support enhancement.
Improved communication efficiency with Mellanox HDR Infiniband and NCCL 2.9.9.
Added SHARP support. SHARP improves upon the performance of MPI and machine learning collective operations. SHARP support was added to NCCL to offload all-reduce collective operations into the network fabric, reducing the amount of data traversing between nodes.

Workloads

In this section, we dive into the optimizations for selected individual MLPerf workloads.

Recommendation (DLRM)

Recommendation is arguably the most pervasive AI workload in data centers today. The NVIDIA MLPerf DLRM submission was based on HugeCTR, a GPU-accelerated recommendation framework that is part of the NVIDIA Merlin open beta framework. The HugeCTR v3.1 beta release added the following optimizations:

Hybrid embedding

One of the major challenges in scaling DLRM to multiple nodes is the ~10x difference in per-GPU all-to-all bandwidth between NVLink and Infiniband. This makes the embedding exchange between nodes a significant bottleneck during training.

To counteract this, HugeCTR implemented hybrid embedding, a novel embedding design that deduplicates the categories in a batch before doing the embedding weight exchange in the forward pass. It also reduces the gradients locally before doing gradient exchange in the backward pass.

For efficient deduplication, the hybrid embedding maps the categories to frequent and infrequent embeddings based on the statistical access frequency of categories. The frequent embedding is implemented in a data-parallel fashion that takes away most of the replicated categories in a batch, reducing the embedding exchange traffic. Infrequent embedding follows the distributed model parallel-embedding paradigm. This enables DLRM to scale to multiple nodes with unprecedented efficiency.

HugeCTR hybrid embedding deduplicates the categories in a batch before doing the embedding weight exchange. — *Figure 1. Hybrid embedding in* *HugeCTR*

Optimized collectives

All-to-all and all-reduce collective latencies play a significant role in scaling efficiency. Multinode all-to-all throughput for small message sizes was limited by the Infiniband message rate. To mitigate this, HugeCTR implemented fused NVLink aggregation using hierarchical all-to-all for embedding exchange.

HugeCTR implemented fused NVLink aggregation using hierarchical all-to-all for embedding exchange. — *Figure 2. Collective optimizations in HugeCTR*

You can optimize internode all-to-all and all-reduce latencies further:

Directly using the native IB verbs API and SHARP to mitigate library overheads.
Graph-capturable, GPU-initiated communication to reduce launch overheads.
One-sided eager protocol instead of two-sided rendezvous protocol to reduce network hops.
Eliminating redundant message buffer copies using persistent communication buffers.
Reduced NIC-GPU synchronization latency using NIC atomics directly on GPU memory instead of indirection through CPU.

Intranode all-reduce is also optimized using a single-shot reduction algorithm as opposed to ring.

Frequent embedding all-reduce and MLP all-reduce are fused into a single all-reduce operation to save on exposed all-reduce latency.

Optimized data reader

Input pipeline plays a significant role in training performance. To achieve peak I/O throughput, HugeCTR implemented a fully asynchronous data reader using the Linux asynchronous I/O library (AIO). Because hybrid embedding requires the whole batch to be present on all the GPUs, direct host-to-device (H2D) for each GPU would make PCIe a bottleneck. So, the data is copied onto the GPUs using a hierarchical approach, by first doing a H2D over PCIe and then a broadcast over NVLink.

The HugeCTR data reader is fully asynchronous using Linux asynchronous I/O. — *Figure 3. HugeCTR multithread data reader*

Moreover, H2D traffic from data readers may interfere with internode all-to-all and all-reduce traffic over PCIe. So, HugeCTR implements intelligent data-reader scheduling to avoid such interference.

Overlapping MLP with embedding

Because the bottom MLP has no data dependencies with embedding, several components of the bottom MLP could be overlapped with embedding for efficient utilization of GPU resources.

Bottom MLP forward is overlapped with the embedding forward pass
Frequent embedding local weight update is overlapped with all-reduce
MLP weight update is overlapped with internode all-to-all

cuBLASLt GEMM fusions

HugeCTR implemented a fused, fully connected layer that made use of cublasLt GEMM fusions:

GEMM + Relu + bias fusion for MLP fprop
GEMM + dRelu + dBias fusion for MLP bprop

cuBLASLt GEMM fusions speed up forward and backward passes. — *Figure 4. Fused FC layer in HugeCTR using cuBLASLt GEMM fusions*

Whole-iteration CUDA graph

To reduce launch latencies and prevent PCIe interference between kernel launches, data-reader, and communication traffic, all DLRM compute and communication kernels are designed to be stream-capturable. The whole training iteration is captured into a single CUDA graph.

With the preceding optimizations, we scaled to multiple nodes and completed the DLRM training task in just under a minute on 14 DGX-A100 nodes. This is a 3.3x speedup compared to the previous v0.7 submission.

NLP (BERT)

BERT is arguably one of the most important workloads in the NLP domain today. In the MLPerf v1.0 round, we improved upon our v0.7 submission with the following optimizations:

Fused multihead attention

The size of the activation tensors inside the multihead attention block grows with the square of the sequence length. This results in increased memory footprint, as well as longer runtimes due to the accompanying memory access operations. We fused softmax, masking, and dropout operations into a single kernel both in forward pass and backward pass. By doing so, we avoided several memory access operations for large activation tensors inside the multihead attention block, which resulted in a meaningful performance boost.

For more information, see SelfMultiheadAttn in the NVIDIA Apex library.

Distributed LAMB

In this MLPerf round, we implemented distributed LAMB. In distributed LAMB, the gradients are first split across eight GPUs within each DGX-A100 node. This is followed by an all-reduce between the nodes in eight separate groups. After this operation, each GPU has one of eight chunks that constitute the all-reduced gradient tensor, and the LAMB optimizer is run on 1/8^th of the full gradient tensor.

When necessary, gradient norms are computed by computing local norms and performing an all-reduce operation. After the optimizer, an intranode all-gather operation is performed at each node, so that each GPU has the full updated parameter tensor. Execution is continued with the forward pass of the next iteration.

Distributed LAMB substantially improves performance both for single-node and multinode configurations. For more information, see DistributedFusedLAMB in the Apex library.

Synchronization-free training

There are cases where the GPU execution depends on some value that is stored or calculated on the CPU. An example is when a specific tensor has a varying size that depends on the computation for each iteration. Because tensor size information is kept on the CPU, there must be a synchronization between GPU and CPU to pass the tensor size information for proper buffer allocation.

Our solution was using a tensor with fixed size, but indicating which elements are valid using a separate Boolean mask. With this approach, no CPU-GPU synchronization was needed, as the tensor sizes are known. When a subsequent computation must know the real size of the tensor, as for an averaging operation, the elements of the Boolean mask can be summed on the GPU.

Even though this approach resulted in slightly more access to GPU memory, it is much faster than having CPU synchronization in the critical path. This optimization resulted in a significant performance boost for small local batch size, which is the case for our max-scale configuration. This is because CPU synchronizations can’t keep up with fast GPU execution for small batch sizes.

By using a fixed size tensor and a Boolean mask as described in the text, we are able to eliminate CPU synchronizations needed for dynamic-sized tensors. — *Figure 5. Synchronization-free training eliminates CPU synchronization*

Another source of CPU-GPU synchronization is the data that is kept on CPU, such as learning rate or potentially other optimizer states. We kept all the optimizer states on the GPU for distributed LAMB to achieve synchronization-free execution.

As a result of these optimizations, we eliminated all the synchronizations between CPU and GPU during a training cycle. The only synchronizations are the ones that happen at the evaluation points, to log the evaluation accuracy in a file in real time for every evaluation point.

CUDA Graphs in PyTorch

Traditionally, CPU launches each GPU kernel individually. In general, even though GPU kernels do more work for large batch sizes, CPU kernel launch work and related CPU overheads stay fixed, barring the variations in CPU scheduling. As a result, for small local batch sizes, CPU overhead can become a significant performance bottleneck. This is what happened in our max-scale BERT configuration in MLPerf.

On top of that, when CPU execution becomes a bottleneck, variations in CPU execution result in different runtimes across all GPUs for each iteration. This introduces a significant synchronization overhead when the workload is scaled to many GPUs (4096 GPUs in this case). Each GPU synchronizes every iteration for gradient reductions, and iteration time is determined by the slowest worker.

CUDA Graphs is a feature that enables launching an entire sequence of kernels at one time, eliminating CPU overheads between kernel executions. CUDA Graphs recently became available in PyTorch. By graph capturing the model, we eliminated CPU overhead and the accompanying synchronization overhead. The CUDA Graphs implementation resulted in a 1.7x performance boost just by itself for our max-scale BERT configuration.

One-shot all-reduce with SHARP

SHARP improved the performance of collectives significantly for BERT, especially for our max-scale configuration. End-to-end performance boost from SHARP is 17% for this BERT configuration.

Image classification (ResNet-50 v1.5)

ResNet-50 is the veteran among MLPerf workloads. In this edition of MLPerf, we continue to optimize ResNet with the following optimizations:

DALI optimizations

At large scales (>128 nodes) for ResNet-50, we reduced the local batch size per GPU to extremely small values. This often results in sub-20-ms iteration time. To reduce the overhead of the data pipeline, we introduced the input batch multiplier (IBM). DALI throughput is higher at large batch sizes than smaller batch sizes. To take advantage of this fact, we created super batches that are much larger than the local batch size. For each iteration, we then derived the needed samples from these super batches, increasing the DALI processing throughput and reducing the data pipeline overhead.

At these small iteration times, gapless and continuous execution is the key to perfect scaling. Pre-allocating DALI buffers through hints is another feature that we introduced to reduce the overhead of dynamic GPU memory allocation while exploring the dataset.

MXNet fused BN+ReLu and BN+Add+ReLu performance optimizations

For ResNet-50, batch norm (BN) is a significant portion of the network’s iteration time. We optimized the fused BN+ReLu and BN+Add+ReLu kernels in MXNet through vectorization, cache-friendly memory traversals, and reducing quantization.

Improved MXNet dependency engine improvements

The new MXNet dependency engine provides an asynchronous approach to scheduling work on the GPU, reducing the host (CPU) overhead and jitter such as overhead arising from MXNet and Horovord handshake.

In the new dependency engine, the operation updates the dependency as soon as the work is scheduled on the GPU, not when the work is finished. It is the subsequent operation that must perform the synchronization to ensure correctness. This is further enhanced by removing the need for synchronization and using cudaStreamWait events to manage dependencies.

Old MXNet dependency engine with significant CPU overhead and jitter such as overhead arising from MXNet and Horovord handshake. — *Figure 6. ResNet-50 and the old MXNet dependency engine*

cudaStreamWait Events manage dependencies and removes the need for synchronization. — *Figure 7. ResNet-50 and the new dependency engine,* cudaStreamWaitEvent

Image segmentation (3D U-Net)

U-Net3D is one of the two new workloads in this round of MLPerf training. We used the following optimizations:

Spatial parallelism

In 3D U-Net, because the sample number in the training dataset is relatively small, there is a fundamental limit to how much it can be scaled with naive data parallelism. To break that limit, we used spatial parallelism to split a single image across eight GPUs. At the end of the backward propagation, the gradients from each partition can be all-reduced as usual to get the resultant gradients, which can then be used to calculate the weight gradients.

The naive approach to implementing spatial parallel convolution is to transfer the halo information from the neighboring GPU before running the convolution. However, to increase efficiency, we implemented a different scheme, in which we transfer the halo from the neighboring GPU in parallel to running the main inner convolutions. The error term to this main convolution is calculated independently using the halo and added to get the result. By hiding the transfer costs, we saw much better scaling efficiency than with the naive approach.

spatial parallel convolution splits a single image across eight GPUs. — *Figure 8. Spatial parallel convolution with fprop*

For the backward propagation, similarly, the halos needed for the dgrad operation are transferred in parallel with the computation of the weight gradients and data gradients. The halos transferred for the data gradients are then reused for computing the correction terms for both weight and data gradients.

3D U-Net has a bottleneck region in the middle of the network with much smaller activation sizes, which are not suited for spatial parallelism. We used a hybrid approach where we used spatial parallelism only for the layers that benefit from it. We gathered the activations for the sample from all GPUs right before this bottleneck region and executed these layers serially on each GPU. We split the work among the GPUs again when cooperating became beneficial. This made sure that we made the best choice separately for each region of the network.

Hybrid approach: Spatial parallelism in 3D U-Net splits the base of the pyramid but serializes the tip of the pyramids. — *Figure 10. Application of spatial parallelism in 3D U-Net*

Asynchronous evaluation

Evaluation contributes a significant amount of time in the reference code. Because evaluation can be run concurrently with training, we assigned dedicated nodes for running just evaluation.

To hide the evaluation behind the training cycle entirely, we used spatial parallelism to speed up the validation step. In addition, as evaluation uses the same set of images, the images were loaded only one time and then cached in the GPU memory.

Because the evaluation doesn’t start until a third of the way through the training, the evaluation nodes have enough time to load, process, and cache the dataset, as well as initialize all required libraries.

At the end of the training cycle, training nodes use InfiniBand to transfer the model quickly to the evaluation nodes and continue running subsequent training iterations. The evaluation nodes run evaluation after the model parameters are transferred. At the end of the evaluation, the evaluation node communicates to the training nodes if the target accuracy is reached.

The number of evaluation nodes added are just enough to hide the entire evaluation cycle behind the training cycle.

Asynchronous evaluation runs concurrently with training on dedicated nodes.] — *Figure 11. Asynchronous evaluation schedule*

Data loader

We optimized the data loader in two ways: optimizing the augmentations and caching the dataset.

Augmentations: 3D U-Net requires heavy augmentation due to the small size of the dataset. One of the most expensive operations is something that we call “biased crop”. On contrary to a random crop, biased crop selects regions with a positive label with a given probability. This requires heavy computations of 3D-connected components on labels every time the expensive path is selected. To avoid calculating the connected components every time that the sample is loaded, the result is cached in the host and reused so that it is calculated only one time.

Data loading: As the training gets faster with the new features, the I/O starts to show up as the bottleneck. To alleviate this, we cached the entire image dataset in the GPU memory. This removes the PCIe and I/O from the critical data loader path. While the images are loaded from the large and high-bandwidth GPU memory, the labels are loaded from the CPU to perform augmentations.

Data caching on GPU memory removes the PCIe and I/O from the critical data loader path. — *Figure 12. Data caching on GPU memory*

Channels-Last layout support

Because the Channels-Last layout is more efficient for convolution kernels, native support for the Channels-Last format was added in MXNet. This avoids any additional transposes needed in the model to take advantage of highly efficient GPU kernels.

Apart from these optimizations, 3D U-Net benefited from the optimized BatchNorm + ReLu activation kernel. The BatchNorm kernel was run repeatedly with a BatchSize value of 1 to get the Instance-Norm functionality. The asynchronous dependency engine implemented in MXNet, CUDA Graphs, and SHARP also helped performance significantly.

With the array of optimizations made for 3D U-Net, we scaled to 100 DGX A100 nodes (800 GPUs), with training running on 80 nodes (640 GPUs) and evaluation running on 20 nodes (160 GPUs). The max-scale configuration of 100 nodes got over 9.7x speedup as compared to the single-node configuration.

Object detection (lightweight) (SSD)

This is the fourth time that the lightweight SSD has been featured in MLPerf. In this round, the evaluation schedule was changed to happen every fifth epoch, starting from the first. In previous rounds, the evaluation scheduled started from the 40th epoch. Even with the extra computational requirement, we sped up our submissions time by more than x1.6.

SSD consists of many smaller convolution layers. The benchmark was particularly affected by the improvements to the MXNet dependency engine, CUDA Graphs, and the enablement of SHARP, as discussed earlier.

More efficient configs

The training time of a deep learning model is a multivariable function. In its most basic form, the equation is as follows:

$Train; Time = Average; Iteration; Time; cdot; Number; of; Iterations$

The goal is to minimize $Train; Time$ where both it and $Average; Iteration; Time$ and $Number; of; Iterations$ are functions of the batch size.

$Average; Iteration$ is a monotonically non-decreasing function. Batch sizes are more computationally efficient, but they take more time per iteration.

On the other hand, $Number; of; Iterations$ , up to a certain batch size, is a monotonically nonincreasing function. Larger batch sizes require fewer iterations to converge because the model sees more images per iteration.

Batch size	Iterations per epoch (1)	Epochs to convergence (2)	Total iterations
1024	115	50	5750
2048	58	65	3770
3072	39	75	2925
4096	29	90	2610

Table 2. Number of iterations required for convergence using different batch sizes
(1) MS-COCO epoch size is 117266 images; (2) empirical value

Compared to the v0.7 submission where we used a batch size of 2048, the v1.0 batch size was 3072, which required 22% fewer iterations. Because the larger iteration was only 20% slower, the result was an 8% faster time to convergence.

In this example, going to a batch size of 4096 instead of 3072 would’ve resulted in a longer training time. The 11% fewer iterations didn’t make up for the extra 20% run time per iteration.

Optimized evaluation

Evaluation can be broken into three phases:

Inference: Using the trained model to make predictions against the validation dataset. Runs on the GPU.
Prep: [NEED DESCRIPTION]
Scoring: Evaluating the inference results against the ground truth. Runs asynchronously on the CPU.

The new evaluation in v1.0 adds eight validation cycles to the base submission. Worse, improvements to the epoch train time means that scoring needs to take less than 2 seconds or the training time of five epochs. Otherwise, it won’t be fully hidden and any training time improvements are pointless.

To improve inference time, we made sure that the inference graph was static. We improved the nonmaximum suppression implementation and moved the Boolean mask, used to filter negative detections, to outside the graph. Static graphs save memory reallocation time and make switching between training and inference contexts faster.

For scoring, we used nv-cocoapi, which is a C++ implementation of cocoapi and 60x times faster. For v1.0, we improved the nv-cocoapi performance by 2x with multithreaded results accumulation, faster indices sorting, and caching the ground truth data structures.

Object detection (heavyweight) (Mask R-CNN)

We optimized object detection with the following techniques:

CUDA Graphs in PyTorch

Deep learning frameworks use GPUs to accelerate computations, but a significant amount of code still runs on CPU cores. CPU codes process meta-data like tensor shapes to prepare arguments needed to launch GPU kernels. Processing metadata is a fixed cost while the cost of the computational work done by the GPUs is positively correlated with batch size. For large batch sizes, CPU overhead is a negligible percentage of total run time cost. At small batch sizes, CPU overhead can become larger than GPU run time. When that happens, GPUs go idle between kernel calls.

This issue can be identified on an Nsight Systems timeline plot. The plot below shows the “backbone” portion of Mask R-CNN with per-GPU batch size of 1 before graphing. The green portion shows CPU load while the blue portion shows GPU load. In this profile, you see that the CPU is maxed out at 100% load while GPU is idle most of the time. There is a lot of empty space between GPU kernels.

Nsight timeline plot of Mask R-CNN shows that the CPU is maxed out at 100% load while GPU is idle most of the time, and empty space between GPU kernels. — *Figure 13. Nsight timeline plot of Mask R-CNN*

CUDA graph is a tool that can automatically eliminate CPU overhead when tensor shapes are static. A complete graph of all kernel calls is captured during the first step. In subsequent steps, the entire graph is launched with a single operation, eliminating all the CPU overhead. PyTorch now has support for CUDA graph, we used this to speed up Mask R-CNN for MLPerf 1.0.

With CUDA graph, the entire graph is launched with a single op, eliminating all the CPU overhead.] — *Figure 14. CUDA graph optimization*

With graphing, we see that the GPU kernels are tightly packed and GPU utilization remains high. The graphed portion now runs in 6 ms instead of 31 ms, a speedup of 5x. We mostly just graphed the ResNet backbone, not the entire model. Even then, we saw >2x uplift for the entire benchmark just from graphing.

Removing synchronization points

There are many PyTorch modules that make the main process wait until the GPU has finished all previously launched kernels. This can be detrimental to performance, because it makes the CPU sit idle when it could be working on launching more kernels. The CPU can get ahead of the GPU in low overhead segments and start launching kernels from succeeding segments. As long as total CPU overhead is less than total GPU kernel time, the CPU never becomes the bottleneck, but this breaks when sync points are introduced. Also, model segments that have sync points cannot be graphed with CUDA graph, so removing syncs is important.

We did some of this work for MLPerf 1.0. For instance, torch.randperm was rewritten to use CUB instead of Thrust because the latter is a synchronous C++ template library. These improvements are available in the latest NGC container.

Removing all the syncs improved the uplift that we saw from CUDA Graphs from 1.6x to 2.5x.

Asynchronous evaluation

Our MLPerf 0.7 submission did asynchronous evaluation, but it wasn’t fast enough to keep up with training after optimizations. Evaluation took 18 seconds per epoch, and 4 seconds of that was fully exposed time. Without changes to the evaluation code, our at-scale submission would have clocked in about 100 seconds slower.

Of the three evaluation phases, inference and prep account for all the exposed time. To speed up inference, we cached the test images in GPU memory, as they never change. We moved the prep phase to a pool of background processes, as each sample in the test dataset can be processed independently. We scored segmentation masks and boxes simultaneously in two background processes. These optimizations reduced evaluation time to ~4 seconds per epoch.

Dataloader optimization

This component loads and augments images during training. In our MLPerf 0.7 submission, all data loading work was done by CPU cores. The old dataloader was not fast enough to keep up with training after optimizations. To remedy that, we developed a hybrid dataloader.

The hybrid dataloader decodes the images on CPU and then does image augmentation work on GPU using Torchvision. To hide the cost of dataloading completely, we moved the load next image call in the main training loop after the loss backward call. The CPUs are idle for several milliseconds after the loss backward call because of theCUDA Graph launch. This is more than enough time to decode the next image. After the GPUs finish back propagation, they sit idle while the optimizer does all-reduce on the gradients. During this idle time, the dataloader does image augmentation work.

Speech recognition (RNN-T)

Speech recognition with RNN-T is the other new workload in this round of MLPerf training. We used the following optimizations:

Apex transducer loss

RNN-T uses a special loss function that we call transducer loss function. The algorithm that computes the loss is iterative in nature. A naive implementation is often inefficient due to the irregular memory access pattern and the exposed long memory read latency.

To overcome this difficulty, we developed apex.contrib.transducer.TransducerLoss. It uses a diagonal-wave-front-like computing paradigm to exploit the parallelism in the algorithm. Shared memory and registers are used extensively to cache the data exchanged between iterations. The loss function also employs prefetch to hide the memory access latency.

Apex transducer joint

Another component that is often found in a transducer-type network is the transducer joint operation. To accelerate this operation, we developed apex.contrib.transducer.TransducerJoint. This Apex extension is not only faster than its native PyTorch counterpart, but also enables output packing, reducing the workload seen by following layers.

Figure 15 shows the packing operation by the Apex transducer joint. In the baseline joint operation, the paddings from the input sequences are carried over to the output, as the joint operation is oblivious to the input padding. In the Apex transducer joint operation, the paddings are removed at the output, reducing the size of the tensor fed to the following operations.

Apex transducer joint packing operation explained: the padding created in the baseline joint operation is removed, hence reducing the size of the tensor fed to the following operations. — *Figure 15. Apex transducer joint packing operation*

Sequence splitting

To reduce computations of LSTMs that are wasted on paddings, we split batch processing into two phases (Figure 16). In the first pass, all the samples in the minibatch up to certain time steps (enclosed by the black boxes) are evaluated. Half of the samples in the minibatch are completed in the first pass. The remaining time steps of the other half of the samples (enclosed by the red boxes) are evaluated in the second pass. The regions enclosed by blue boxes represent the savings from batch splitting.

The black dashed line in Figure 16 estimates the workload seen by the GPUs. Because the batch size is halved for the second pass, the workload seen by the GPU is roughly halved. In multi-GPU training, it is often the slowest GPU that limits the training throughput. The dashed line is obtained from the GPU with most workloads.

To mitigate this load imbalance, we employed a technique called presorting, where samples in a minibatch are sorted based on their sequence lengths. The longest and shortest sequences are placed on the same GPU to balance the workload. The intuition behind this is that GPUs with long sequences are likely to be the bottleneck. Therefore, short sequences should be placed on these GPUs as well to maximize the benefit of sequence splitting.

Batch splitting

RNN-T has an interesting network structure where the LSTMs deal with relatively small tensors, whereas the joint net takes much larger tensors. To enable LSTMs to run more efficiently with a large batch size while not exceeding the GPU memory capacity by having a huge tensor in the joint net, we employed a technique called batch splitting (Figure 17). We used a reasonably large batch size so that LSTMs achieved a decent GPU utilization. In contrast, joint net operates on a portion of the batch and loops through those subbatches one by one.

In Figure 17, a batch splitting factor of 2 is used. In this case, the batch sizes of the inputs to the LSTMs and the joint net are B and B/2, respectively. Because all the tensors generated by the joint net, except the gradients for the weights, are no longer needed after the backpropagation is completed, they can be released and create room for the next subbatch in the loop.

Batch splitting enables LSTMs to run more efficiently with a large batch size while not exceeding the GPU memory.] — *Figure 17. Batch splitting*

Batch evaluation with CUDA Graphs

Other than accelerating training, evaluation of RNN-T has also been scrutinized. The evaluation of RNN-T is iterative in nature and the evaluation of the predict network is performed step by step. Each sample in a batch might pick different code paths in the same time step, depending on the execution results. Because of these, a naive implementation leads to a low GPU utilization rate and a long evaluation time that is comparable to the training itself.

To overcome these difficulties, we performed two categories of optimizations in the evaluation. The first optimization performed evaluation in batch mode and take care of the different control flows in a batch with predicates. The second optimization graphed the main RNN-T evaluation loop, which consists of many short GPU kernels. We also used loop unrolling and overlapping CPU-GPU communication with GPU execution to amortize associated overheads. The optimized evaluation was more than 100x faster than the reference code for the single-node configuration, and more than 30x faster for the max-scale configuration.

Summary

MLPerf v1.0 showcases the continuous innovation happening in the AI domain. The NVIDIA AI platform delivers leadership performance with tight integration of hardware, data center technologies, and software to realize the full potential of AI.

In the last two-and-a-half years since the first MLPerf training benchmark launched, NVIDIA performance has increased by nearly 7x. The NVIDIA platform excels in both performance and usability, offering a single leadership platform from data center to edge to cloud.

All software used for NVIDIA submissions is available from the MLPerf repository, to enable you to reproduce our benchmark results. We constantly add these cutting-edge MLPerf improvements into our deep learning frameworks containers available on NGC, our software hub for GPU-optimized applications.

Misc

Taking it to the Street: NVIDIA DRIVE Ecosystem Brings AVs to Public Markets

Post author By
Post date June 30, 2021
No Comments on Taking it to the Street: NVIDIA DRIVE Ecosystem Brings AVs to Public Markets

Just like money, autonomous vehicles never sleep. And the companies developing them are working just as hard, rolling out transformative technology and growing into publicly traded entities. This ecosystem includes every aspect of the autonomous vehicle industry ― from sensors to software to mobility services ― all using the high-performance, energy-efficient NVIDIA DRIVE platform to Read article >

The post Taking it to the Street: NVIDIA DRIVE Ecosystem Brings AVs to Public Markets appeared first on The Official NVIDIA Blog.

Misc

Make Any Face Come to Life: NVIDIA’s Simon Yuen Talks Audio2Face

Post author By
Post date June 30, 2021
No Comments on Make Any Face Come to Life: NVIDIA’s Simon Yuen Talks Audio2Face

We all know about the applications for digital humans for films and video games, but at NVIDIA, Simon Yuen has discovered the vast need and potential for digital humans beyond the entertainment industry. Yuen spoke with NVIDIA AI Podcast host Noah Kravitz about how we’re getting to a point where the simulation of digital humans Read article >

The post Make Any Face Come to Life: NVIDIA’s Simon Yuen Talks Audio2Face appeared first on The Official NVIDIA Blog.

Misc

ConvLSTM3D

Post author By
Post date June 30, 2021
No Comments on ConvLSTM3D

I may be losing my mind, but the tf website lists a 3D convolutional LSTM for keras in tf-nightly, but I can’t seem to find it after installing or updating.

Did it get dropped at some point? I’m hoping I can add the LSTM before flattening the data.

submitted by /u/QuantumSorcerer
[visit reddit] [comments]

Misc

Can TensorFlow or other machine learning detect objects on your desktop?

Post author By
Post date June 30, 2021
No Comments on Can TensorFlow or other machine learning detect objects on your desktop?

For example, can you run an app that will look at your desktop and identify icons / mouse / live video / certain applications like it can do for real world objects in videos?

I don’t know how to Google this, I can’t find any real results for this question of using your live desktop in place of a video as you source.

Hope this is clear, thanks.

submitted by /u/Iamjohnmiller
[visit reddit] [comments]

Misc

Discover New CUDA 11.4 Features

Post author By
Post date June 29, 2021
No Comments on Discover New CUDA 11.4 Features

The new release consists of GPU-accelerated libraries, debugging and optimization tools, an updated C/C++ compiler, and a runtime library to build and deploy your application on major architectures.

NVIDIA announces our newest release of the CUDA development environment consisting of GPU-accelerated libraries, debugging and optimization tools, an updated C/C++ compiler, and a runtime library to build and deploy your application on major architectures including NVIDIA Ampere, x86, Arm server processors, and POWER. The latest release, CUDA 11.4, and its features are focused on enhancing the programming model, new language support, and performance of your CUDA applications.

Key features:

CUDA Programming model enhancements
- CUDA Graph launch performance
- Multi-process Service (MPS) features
- Asynchronous Programming model
Language support – CUDA
- C++ support enhancements
- Python support
Compiler enhancements
CUDA Driver enhancements
- GPUDirect RDMA package inclusion
- GPUDirect Storage package inclusion

CUDA 11.4 ships with R470 driver. The driver now includes GPUDirect RDMA, as well as GPUDirect Storage packages that streamline and enable you to leverage these technologies without the need for separate installation of additional packages. The driver also enables new MIG configurations for the recently launched NVIDIA A30 GPU to double the amount of memory per MIG slice. This results in optimal performance for various workloads on the A30 GPU, especially for AI Inference workloads.

Resources:

Learn More & Download Now

Misc

How to view MSE for individual predictions?

Post author By
Post date June 29, 2021
No Comments on How to view MSE for individual predictions?

I am using tensorflow for a regression problem and I would like to see the error for each prediction that the model makes during training. At the end of epoch it prints the MSE for that epoch but I would like to be able to print or view all the errors for each prediction. Is there a way to do that? I was thinking about adding some print statements or modifying the tensorflow source code in some way. Thanks!

submitted by /u/NameError-undefined
[visit reddit] [comments]

Misc

Hidden GEM: Canadian Weather Forecasts to Run on NVIDIA-Powered System

Post author By
Post date June 29, 2021
No Comments on Hidden GEM: Canadian Weather Forecasts to Run on NVIDIA-Powered System

The supercomputer behind Canada’s weather forecasts is getting an upgrade, adopting NVIDIA networking to support long-running, computationally intensive environmental models. Located in Quebec, the system runs a complex forecasting and data assimilation system known as GEM — the Global Environmental Multiscale model. The model processes information about temperature, air pressure and wind to produce both Read article >

The post Hidden GEM: Canadian Weather Forecasts to Run on NVIDIA-Powered System appeared first on The Official NVIDIA Blog.

Misc

Latest Nsight Compute 2021.2 Release Now Available for Download

Post author By
Post date June 29, 2021
No Comments on Latest Nsight Compute 2021.2 Release Now Available for Download

The new release helps identify more performance issues, and makes it easier to understand and fix them.

The new Nsight Compute 2021.2 release helps identify more performance issues, and makes it easier to understand and fix them.

Register Dependency Visualization

This latest release adds a new feature for register dependency visualization. It helps identify long dependency chains and inefficient register usage that can limit performance. The SASS view in the Source page has new columns that track all the potential writes for a register each time it is read. Columns show all dependencies for registers, predicates, uniform registers and uniform predicates.

Standalone Source Viewer

Developers have frequently requested this feature to allow the view of side-by-side assembly and correlated source code for CUDA kernels in the Source page without needing to collect a profile. Users can directly open .cubin files from disk in the GUI to see the code correlation. This feature helps users understand how their code is being translated into assembly by the compiler and can be used to identify compiler optimizations and inefficiencies.

Guided Analysis Improvements

Several other features have been added to improve the guided analysis experience within the GUI. These include highlighted focus metrics, report cross-links, increased rule visibility and documentation references. These all add to the built-in profile and optimization guided analysis that Nsight Compute provides to help users understand and fix performance bottlenecks.

OptiX 7 Resource Tracking

In addition to existing Optix API tracing, this release provides support for tracking OptiX objects in the Resources tool window. OptiX 7 users can now see the properties and lifetime for objects like OptixDeviceContext, OptixProgramGroup, OptixDenoiser and more. Understanding when objects are created, destroyed, and interacted with can reveal unexpected behaviours that may cause performance or correctness issues in an OptiX application.

Additional Improvements

There have been additional improvements to management of baseline reports, font settings, CLI filters, and a new Python interface for reading report data. There is also support for tracking the new memory alloc/free nodes in CUDA graphs. For full details, see the latest release notes.

Resources:

Learn More & Download Now
Documentation
Forums
GTC On-Demand Session: “CUDA is Evolving, and the Latest Developer Tools are Adapting to Keep Up”
GTC On-Demand Session: “Requests, Wavefronts, Sectors Metrics: Understanding and Optimizing Memory-Bound Kernels with Nsight Compute”
Demo Video: New Nsight Systems and Nsight Compute Highlights
Additional instructional videos and blog posts for more information.

Offsites

Quickly Training Game-Playing Agents with Machine Learning

Post author By
Post date June 29, 2021
No Comments on Quickly Training Game-Playing Agents with Machine Learning

Posted by Leopold Haller and Hernan Moraldo, Software Engineers, Google Research

In the last two decades, dramatic advances in compute and connectivity have allowed game developers to create works of ever-increasing scope and complexity. Simple linear levels have evolved into photorealistic open worlds, procedural algorithms have enabled games with unprecedented variety, and expanding internet access has transformed games into dynamic online services. Unfortunately, scope and complexity have grown more rapidly than the size of quality assurance teams or the capabilities of traditional automated testing. This poses a challenge to both product quality (such as delayed releases and post-launch patches) and developer quality of life.

Machine learning (ML) techniques offer a possible solution, as they have demonstrated the potential to profoundly impact game development flows – they can help designers balance their game and empower artists to produce high-quality assets in a fraction of the time traditionally required. Furthermore, they can be used to train challenging opponents that can compete at the highest levels of play. Yet some ML techniques can pose requirements that currently make them impractical for production game teams, including the design of game-specific network architectures, the development of expertise in implementing ML algorithms, or the generation of billions of frames of training data. Conversely, game developers operate in a setting that offers unique advantages to leverage ML techniques, such as direct access to the game source, an abundance of expert demonstrations, and the uniquely interactive nature of video games.

Today, we present a ML-based system that game developers can use to quickly and efficiently train game-testing agents, helping developers find serious bugs quickly while allowing human testers to focus on more complex and intricate problems. The resulting solution requires no ML expertise, works on many of the most popular game genres, and can train an ML policy, which generates game actions from game state, in less than an hour on a single game instance. We have also released an open source library that demonstrates a functional application of these techniques.

Supported genres include arcade, action/adventure, and racing games.

The Right Tool for the Right Job
The most elemental form of video game testing is to simply play the game. A lot. Many of the most serious bugs (such as crashes or falling out of the world) are easy to detect and fix; the challenge is finding them within the vast state space of a modern game. As such, we decided to focus on training a system that could “just play the game” at scale.

We found that the most effective way to do this was not to try to train a single, super effective agent that could play the entire game from end-to-end, but to provide developers with the ability to train an ensemble of game-testing agents, each of which could effectively accomplish tasks of a few minutes each, which game developers refer to as “gameplay loops”.

These core gameplay behaviors are often expensive to program through traditional means, but are much more efficient to train than a single end-to-end ML model. In practice, commercial games create longer loops by repeating and remixing core gameplay loops, which means that developers can test large stretches of gameplay by combining ML policies with a small amount of simple scripting.

Simulation-centric, Semantic API
One of the most fundamental challenges in applying ML to game development is bridging the chasm between the simulation-centric world of video games and the data-centric world of ML. Rather than ask developers to directly convert the game state into custom, low-level ML features (which would be too labor intensive) or attempting to learn from raw pixels (which would require too much data to train), our system provides developers with an idiomatic, game-developer friendly API that allows them to describe their game in terms of the essential state that a player observes and the semantic actions they can perform. All of this information is expressed via concepts that are familiar to game developers, such as entities, raycasts, 3D positions and rotations, buttons and joysticks.

As you can see in the example below, the API allows the specification of observations and actions in just a few lines of code.

Example actions and observations for a racing game.

From API to Neural NetworkThis high level, semantic API is not just easy to use but also allows the system to flexibly adapt to the specific game being developed – the specific combination of API building blocks employed by the game developer informs our choice of network architecture, since it provides information about the type of gaming scenario in which the system is deployed. Some examples of this include: handling action outputs differently depending on whether they represent a digital button or analog joystick, or using techniques from image processing to handle observations that result from an agent probing its environment with raycasts (similar to how autonomous vehicles probe their environment with LIDAR).

Our API is sufficiently general to allow modeling of many common control-schemes (the configuration of action outputs that control movement) in games, such as first-person games, third-person games with camera-relative controls, racing games, twin stick shooters, etc. Since 3D movement and aiming are often an integral aspect of gameplay in general, we create networks that automatically tend towards simple behaviors such as aiming, approach or avoidance in these games. The system accomplishes this by analyzing the game’s control scheme to create neural network layers that perform custom processing of observations and actions in that game. For example, positions and rotations of objects in the world are automatically translated into directions and distances from the point of view of the AI-controlled game entity. This transformation typically increases the speed of learning and helps the learned network generalize better.

An example neural network generated for a game with joystick controls and raycast inputs. Depending on the inputs (red) and the control scheme, the system generates custom pre- and post-processing layers (orange).

Learning From The Experts in Real Time
After generating a neural network architecture, the network needs to be trained to play the game using an appropriate choice of learning algorithm.

Reinforcement learning (RL), in which an ML policy is trained directly to maximize a reward, may seem like the obvious choice since they have been successfully used to train highly competent ML policies for games. However, RL algorithms tend to require more data than a single game instance can produce in a reasonable amount of time, and achieving good results in a new domain often requires hyperparameter tuning and strong ML domain knowledge.

Instead, we found that imitation learning (IL), which trains ML policies based by observing experts play the game, works well for our use case. Unlike RL, where the agent needs to discover a good policy on its own, IL only needs to recreate the behavior of a human expert. Since game developers and testers are experts in their own games, they can easily provide demonstrations of how to play the game.

We use an IL approach inspired by the DAgger algorithm, which allows us to take advantage of video games’ most compelling quality – interactivity. Thanks to the reductions in training time and data requirements enabled by our semantic API, training is effectively realtime, giving a developer the ability to fluidly switch between providing gameplay demonstrations and watching the system play. This results in a natural feedback loop, in which a developer iteratively provides corrections to a continuous stream of ML policies.

From the developer’s perspective, providing a demonstration or a correction to faulty behavior is as simple as picking up the controller and starting to play the game. Once they are done, they can put the controller down and watch the ML policy play. The result is a training experience that is real-time, interactive, highly experiential, and, very often, more than a little fun.

ML policy for an FPS game, trained with our system.

Conclusion
We present a system which combines a high-level semantic API with a DAgger-inspired interactive training flow that enables training of useful ML policies for video game testing in a wide variety of genres. We have released an open source library as a functional illustration of our system. No ML expertise is required and training of agents for test applications often takes less than an hour on a single developer machine. We hope that this work will help inspire the development of ML techniques that can be deployed in real-world game-development flows in ways that are accessible, effective, and fun to use.

Acknowledgements
We’d like to thank the core members of the project: Dexter Allen, Leopold Haller, Nathan Martz, Hernan Moraldo, Stewart Miles and Hina Sakazaki. Training algorithms are provided by TF Agents, and on-device inference by TF Lite. Special thanks to our research advisors, Olivier Bachem, Erik Frey, and Toby Pohlen, and to Eugene Brevdo, Jared Duke, Oscar Ramirez and Neal Wu who provided helpful guidance and support.