Categories
Misc

TIME Magazine Names NVIDIA Instant NeRF a Best Invention of 2022

TIME Magazine named NVIDIA Instant NeRF, a technology capable of transforming 2D images into 3D scenes, one of the Best Inventions of 2022.

“Before NVIDIA Instant NeRF, creating 3D scenes required specialized equipment, expertise, and lots of time and money. Now it just takes a few photos and a few minutes,” TIME writes in their release.

The 3D rendering tool was introduced at SIGGRAPH 2022, the world’s largest conference for computer graphics and interactive techniques. 

At SIGGRAPH, NVIDIA researchers Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller submitted their paper, Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. The innovative research quickly gained popularity, winning the SIGGRAPH 2022 Technical Papers Award.

With tens of thousands of downloads and over 9,700 stars on the Instant NeRF GitHub page, developers and 3D content creators are embracing the ability to create stunning 3D scenes with the tool.

What is a NeRF?

A Neural Radiance Field (NeRF) is a neural network capable of generating 3D images or scenes from a set of 2D images. Combining spatial location with volumetric rendering, the model uses the camera pose of each input image to reconstruct the 3D space of the scene.

NeRFs are computationally intensive and historically took many hours to render. Instant NeRF, however, gives users the power to render an image or scene quickly and accurately from a small number of images. You can generate a scene in seconds, and the longer the model trains, the more detail is rendered.

Video 1. Amalfi Coast created with Instant NeRF. Credit: Jonathan Stephens

Exploring NeRFs

To support developers adopting Instant NeRF, NVIDIA hosted an Instant NeRF Sweepstakes over the summer. The event encouraged contestants to explore their creativity with Instant NeRF for the chance to win a GeForce RTX 3090. The sweepstakes reached over 2.7 million people on Twitter.

Since the code release, a large community of creators, from AI researchers to photographers, has been making its own NeRFs and demonstrating what Instant NGP can do.

“NeRF is a way to freeze a moment in time that is more immersive than a photograph or a video. It’s a way to recreate a moment – the whole moment. NeRF is the natural extension of photogrammetry and the evolution of modern photography,” said Michael Rubloff of Franc Lucent, an early explorer in NeRF technology.

Video 2. NERF of Zeus. Credit: Hugues Bruyère

“Museums are great exploration fields for interactive and immersive experiences. Volumetric scenes like this one can also directly contribute to historical heritage conservation, with a modern and immersive twist,” said Hugues Bruyère, partner and chief of innovation at the Montreal-based creative studio Dpt.

The community has taken NeRFs to another level by expressing themselves and their art through this new form of photography. Some have even found ways to make NeRFs impactful in how they work.

Video 3. A NeRF scene of a woman meditating. Credit: Franc Lucent

Make your own NeRFs

This remarkable technology is available for anyone to try out! 

Check out the Getting Started with Instant NeRF post to learn how to set up the code and create your first NeRF, or skip right to the code and try it yourself by visiting the NVlabs Instant NeRF GitHub repository.

You can also see how various artists and professionals have used Instant NeRF in their projects.

Video 4. Using NeRF to scan mirror surfaces. Credit: Karen X. Cheng

Featured image: The Purdue Engineering Fountain. Credit: Jonathan Stephens

Categories
Misc

Give the Gift of Gaming With GeForce NOW Gift Cards

The holiday season is approaching, and GeForce NOW has everyone covered. This GFN Thursday brings an easy way to give the gift of gaming with GeForce NOW gift cards, for yourself or for a gamer in your life. Plus, stream 10 new games from the cloud this week, including the first story downloadable content (DLC)…

The post Give the Gift of Gaming With GeForce NOW Gift Cards appeared first on NVIDIA Blog.

Categories
Misc

Accelerate Enterprise Apps with Microsoft Azure Stack HCI and NVIDIA BlueField DPUs

As enterprises continue to shift workloads to the cloud, some applications need to remain on-premises to minimize latency and meet security, data sovereignty, and compliance requirements. Microsoft Azure Stack HCI is a hyperconverged infrastructure (HCI) stack delivered as an Azure service. Providing built-in security and manageability, Azure Stack HCI is ideally positioned to run production workloads and cloud-native apps in core and edge data centers.

The NVIDIA BlueField data processing unit (DPU) is an accelerated data center infrastructure platform that unleashes application performance and system efficiency. BlueField DPUs help cloud-minded enterprises overcome performance and scalability bottlenecks in modern IT environments by offloading, accelerating, and isolating software-defined infrastructure workloads.

Marking a major leap forward in performance and productivity, Microsoft has demonstrated a prototype of the Azure Stack HCI platform accelerated by NVIDIA BlueField-2 DPUs, delivering 12x CPU resource efficiency and 60% higher throughput. The DPU-accelerated HCI platform enables significant total cost of ownership (TCO) savings by requiring fewer servers, less power, and less space to run a given workload.

Performance and efficiency gains 

Azure Stack HCI is a software-defined platform that runs the Azure networking stack for connecting virtual machines (VMs) and containers together. Software-defined networks deliver rich functionality and great flexibility and enable enterprises to easily scale from a single on-premises data center to hybrid and multi-cloud environments.  

Despite the many benefits of software-defined networks, those that run exclusively on CPUs are resource constrained. They’re known for stealing away expensive cores that would otherwise be used for running business applications, taking a toll on performance and scalability. In addition, software-defined network (SDN) technologies have had a longstanding conflict with hardware-accelerated networking (namely SR-IOV). This forces cloud architects to prioritize one over the other, often at the cost of poor application performance or higher TCO. 

NVIDIA BlueField DPUs are designed to deliver the best of both worlds: the functionality and agility of SDNs with the performance and efficiency of hardware-accelerated networking (SR-IOV). BlueField offloads the entire SDN workload from the host CPU, freeing up CPU cores for line-of-business applications and creating major TCO savings.  

Running in the BlueField Arm processor, the SDN pipeline is mapped to the BlueField-accelerated programmable pipeline to increase throughput and packet processing performance and reduce latency. By offloading and accelerating functions of the Azure Stack HCI SDN on NVIDIA BlueField-2 DPUs, the Microsoft team is able to achieve massive performance and efficiency gains (Figure 1). 
 

Diagrams showing network throughput and CPU core savings. x86 CPU delivered 60 Gbps throughput at 8 CPU cores compared to 96 Gbps at zero CPU utilization delivered with BlueField-2. 
Figure 1. Accelerating the Azure Stack HCI on NVIDIA BlueField-2 DPUs results in network throughput gains (left) and CPU core savings (right) 

The test results indicate that BlueField-2 delivers line-rate, software-defined networking at 96 Gb/s with practically zero CPU utilization, compared to only 60 Gb/s achieved on the host using 8 CPU cores. Those 8 cores are now freed up to run business workloads, alongside 60% faster networking.

To show an apples-to-apples comparison, and because the x86 CPU was not able to reach ~100 Gb/s in the test, the number of CPU cores that would have been required to support 96 Gb/s was extrapolated. Figure 1 shows that 12 CPU cores would have been needed to sustain 96 Gb/s, which means BlueField-2 saves 12 CPU cores while delivering the same throughput.

This savings enables enterprises to design, deploy, and operate fewer servers to deliver the same business outcomes, or alternatively, achieve better outcomes based on the same number of servers. 

From a functional standpoint, the Microsoft Azure Stack HCI platform supports a wide range of SDN policies: VXLAN network encapsulation and decapsulation (encap/decap), access-control list (ACL), quality-of-service (QoS), IP-in-IP encapsulation, network address translation (NAT), IPsec encryption/decryption, and more. The traditional configuration of accelerated networking (SR-IOV) for a VM would require bypassing this rich set of policies for the benefit of achieving higher performance.  

The Azure Stack HCI solution prototype, now offloaded and accelerated on the BlueField DPU, delivers both advanced SDN policies without sacrificing performance and efficiency. See Evolving Networking with a DPU-Powered Edge for more details.

Summary 

While shifting workloads to the cloud, enterprises continue to prioritize on-premises and edge data centers to meet stringent performance and security requirements. Next-generation applications including machine learning and artificial intelligence increasingly rely on accelerated networking to realize their full potential. The Microsoft Azure Stack HCI on NVIDIA BlueField DPUs prototype delivers on the promise of hybrid cloud and accelerated networking for businesses at every scale. 

Learn more about NVIDIA BlueField DPUs

Categories
Misc

NVIDIA AI Turbocharges Industrial Research, Scientific Discovery in the Cloud on Rescale HPC-as-a-Service Platform

Just like many businesses, the world of industrial scientific computing has a data problem. Solving seemingly intractable challenges — from developing new energy sources and creating new modes of transportation, to addressing mission-critical issues such as driving operational efficiencies and improving customer support — requires massive amounts of high performance computing. Instead of having to…

The post NVIDIA AI Turbocharges Industrial Research, Scientific Discovery in the Cloud on Rescale HPC-as-a-Service Platform appeared first on NVIDIA Blog.

Categories
Misc

What Is Denoising?

Anyone who’s taken a photo with a digital camera is likely familiar with a “noisy” image: discolored spots that make the photo lose clarity and sharpness. Many photographers have tips and tricks to reduce noise in images, including fixing the settings on the camera lens or taking photos in different lighting. But it isn’t just…

The post What Is Denoising? appeared first on NVIDIA Blog.

Categories
Misc

NVIDIA Hopper, Ampere GPUs Sweep Benchmarks in AI Training

Two months after their debut sweeping MLPerf inference benchmarks, NVIDIA H100 Tensor Core GPUs set world records across enterprise AI workloads in the industry group’s latest tests of AI training. Together, the results show H100 is the best choice for users who demand utmost performance when creating and deploying advanced AI models. MLPerf is the…

The post NVIDIA Hopper, Ampere GPUs Sweep Benchmarks in AI Training appeared first on NVIDIA Blog.

Categories
Misc

Tuning AI Infrastructure Performance with MLPerf HPC v2.0 Benchmarks

As AI becomes increasingly capable and pervasive in high performance computing (HPC), MLPerf benchmarks have emerged as an invaluable tool. Developed by MLCommons, MLPerf benchmarks enable organizations to evaluate the performance of AI infrastructure across a set of important workloads traditionally performed on supercomputers. 

Peer-reviewed industry-standard benchmarks are a critical tool for evaluating HPC platforms, and NVIDIA believes access to reliable performance data will help guide HPC architects of the future in their design decisions. 

MLPerf HPC benchmarks measure training time and throughput for three types of high-performance simulations that have adopted machine learning techniques. 

Two figures, one for strong scaling the other for weak scaling. Strong scaling figure shows participation from five different submitters, NVIDIA Selene being fastest for all three benchmarks with Jülich being a close 2nd. Weak scaling figure shows three participants where NVIDIA Selene ranging from about 2x faster for Opencatalyst compared to Jülich to more than 3x faster on Deepcam.
Figure 1. MLPerf HPC v2.0 all submitted results

This post walks through the steps the NVIDIA MLPerf team took to optimize each benchmark and measurement to extract maximum performance, focusing on the optimizations made in MLPerf HPC v2.0 beyond those in MLPerf HPC v1.0.

CosmoFlow 

Each instance of the CosmoFlow training application benchmark loads ~8 TB of training data and ~1 TB of validation data, consisting of 512K training samples and 64K validation samples. Each sample has a 16 MB data file and a tiny 144-character label file. In total, there are over 1 million small files that need to be loaded onto node-local Nonvolatile Memory Express (NVMe) storage before training can begin.

In MLPerf HPC v1.0, this resulted in data staging taking a significant amount of time for both the strong-scale and weak-scale cases. For the weak-scale case, having many instances—each loading over 1 million files from the file system—puts additional strain on the shared disk system.

At the instance counts used for the weak-scale submission, this causes staging performance to degrade non-linearly with the number of instances. These problems were addressed in several ways, outlined below.

Data staging on NVMe

For a single strong-scale training instance, analysis of the NVIDIA MLPerf HPC v1.0 submission showed that only a small fraction of the maximum theoretical read bandwidth of the Selene Lustre file system was used. The same was true of the storage network interface cards (NICs) on the nodes when staging the input dataset.

During the staging phase of training, the allocated CPU resources are dedicated entirely to moving data from the shared file system to node-local NVMe storage. Increasing the number of threads dedicated to staging, and staging the training and validation data in parallel, reduced the staging time by ~75%. This equates to a ~4x speedup in staging and a 40% reduction in end-to-end time for the strong-scale scenario.
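The staging step is essentially a bulk copy from the shared file system to local NVMe, so it parallelizes naturally across threads. The following is a minimal sketch of the idea, with illustrative paths and thread count rather than the actual submission code:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def stage(files, src_root, dst_root, num_threads=64):
    """Copy dataset files from the shared file system to node-local NVMe in parallel."""
    src_root, dst_root = Path(src_root), Path(dst_root)

    def copy_one(rel_path):
        dst = dst_root / rel_path
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src_root / rel_path, dst)

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(copy_one, files))

# Training and validation data can be staged concurrently, for example by issuing
# two such calls from separate threads or by passing one combined file list.
```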

Data compression

Loading many tiny files is inherently inefficient. In the case of CosmoFlow, half of the more than 1 million files are tiny 144-byte label files. To further improve staging performance, each data file and its associated label were combined into one compressed file offline, ahead of time.

In parallel with the data being staged from disk, the files are uncompressed locally onto the compute node's NVMe storage. This reduces the number of files to be read from disk by 50% and the total data transferred from disk by ~85%, ultimately giving an additional 13% staging speedup for strong-scale scenarios and a 7% end-to-end improvement in overall training time for the strong-scale submission.

This approach achieved over 900 GB/s read bandwidth for data staging of a strong-scale scenario. 

Increasing effective bandwidth when running multiple instances 

For additional algorithmic details, refer to the DeepCam explanation from the 2021 MLPerf HPC submission, MLPerf HPC v1.0: Deep Dive into Optimizations Leading to Record-Setting NVIDIA Performance.

When running multiple instances at the same time, for weak scaling, every instance must stage a copy of the training and validation data on its local nodes. This year, the NVIDIA submission implemented the distributed staging mechanism for CosmoFlow. 

All of the nodes, regardless of which instance they are associated with, load a fraction of total data (1/N where N is the total number of nodes, which is 512 in this case) from the shared file system. Given the optimizations already discussed, this takes only a few seconds. 

Then, every node uses MPI_Allgather to distribute the data loaded from remote storage to the other nodes that need the data. This distribution takes place over the higher bandwidth InfiniBand Fabric. In other words, a large portion of data transfer that was previously happening over the storage network is offloaded to InfiniBand Fabric with this optimization. As a result of distributing staging, staging time scales linearly with the number of instances (at least up to 128 instances) for weak-scale scenarios. 
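A minimal mpi4py sketch of this distributed staging idea is shown below. The paths are hypothetical, and a real implementation would exchange data in chunks rather than gathering the whole dataset in memory; this only illustrates the load-1/N-then-allgather pattern described above:

```python
from pathlib import Path
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, num_nodes = comm.Get_rank(), comm.Get_size()

# Each node reads only its 1/N shard from the shared file system.
all_files = sorted(Path("/lustre/cosmoflow/train").glob("*.tar.gz"))  # hypothetical path
my_shard = [(f.name, f.read_bytes()) for f in all_files[rank::num_nodes]]

# Exchange shards over InfiniBand so that every node ends up with the full dataset.
# (A production implementation would stream fixed-size chunks instead.)
full_dataset = [item for shard in comm.allgather(my_shard) for item in shard]

for name, payload in full_dataset:
    (Path("/raid/staging") / name).write_bytes(payload)  # write to node-local NVMe
```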

For the v1.0 submission, 32 instances were run, each staging ~9 TB of data. This took 10.76 minutes for an effective bandwidth of ~460 GB/s.

For this year’s submission, 128 instances were run, each staging ~9 TB of data, with a total staging time of 6.7 minutes. This means staging input data for 4x as many instances took 1.6x less time, resulting in an effective bandwidth of ~2,900 GB/s, a 6.5x increase in effective bandwidth. Effective bandwidth assumes the amount of total data staged from the file system is the same as that of a non-distributed algorithm for a given number of instances.

Smaller instance sizes for weak-scale training

All the staging improvements enabled the size of the individual instances to be reduced for weak scaling (hence a larger number of parallel instances), which would not have been possible with the storage access bottlenecks that existed before the optimizations were implemented. In v1.0, 32 instances, each with 128 GPUs, caused the staging time to scale non-linearly. Increasing the number of instances caused a superlinear increase in staging time. 

Without the improvements to efficiently stage for many instances, the staging time would have continued to grow superlinearly with the number of instances, resulting in more time being spent for data staging than the actual training. 

With the optimizations described above, the number of instances was increased from 32 to 128 for the weak-scale submission, each instance using four nodes instead of the 16 nodes used in MLPerf HPC v1.0. In v2.0, staging completed in less time, while the number of models running simultaneously for the weak-scale submission increased by 4x.

CUDA graphs and graph capture

CUDA Graphs allows a whole sequence of kernels to be launched as a single graph, instead of launching each kernel individually from the CPU. This feature minimizes CPU involvement in each iteration, substantially improving performance by minimizing latencies—especially for strong-scaling scenarios.

CUDA graphs support was recently added to PyTorch. See Accelerating PyTorch with CUDA Graphs for more details. CUDA graphs support in PyTorch resulted in around a 15% end-to-end performance gain in CosmoFlow for the strong scaling scenario, which is most sensitive to latency and jitter. 
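For reference, the standard PyTorch capture-and-replay pattern looks roughly like the sketch below. The tiny model, optimizer, and data loader are stand-ins, not the CosmoFlow training code:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()        # stand-in for the real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

static_x = torch.randn(64, 1024, device="cuda")   # static buffers reused every step
static_y = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        optimizer.zero_grad(set_to_none=True)
        loss_fn(model(static_x), static_y).backward()
        optimizer.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training iteration into a graph
graph = torch.cuda.CUDAGraph()
optimizer.zero_grad(set_to_none=True)
with torch.cuda.graph(graph):
    static_loss = loss_fn(model(static_x), static_y)
    static_loss.backward()
    optimizer.step()

# Each training step now just refills the static buffers and replays the graph
for x, y in loader:                                # loader is a placeholder data source
    static_x.copy_(x)
    static_y.copy_(y)
    graph.replay()
```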

OpenCatalyst

Load balancing across GPUs

Data parallelism splits the global batch equally among the GPUs. However, data-parallel task partitioning, by default, does not consider load imbalance within the batch. In Open Catalyst, load imbalance exists between the samples in a batch because the number of atoms per molecule, and the number of edges and triplets in the graph obtained from each molecule, vary substantially (Figure 2).

This imbalance results in a large synchronization overhead in the multi-GPU setting. For the strong-scaling scenario, this results in 32% of the computation time being wasted. Lawrence Berkeley National Laboratory (LBNL) introduced an algorithm to balance the load across GPUs in MLPerf HPC v1.0, and this was adopted in the NVIDIA submission this round. 

This algorithm first preprocesses the training data to obtain the number of edges for each sample. In the sampling stage, each GPU is given the indices of the local samples and performs a global ALLgather to get the indices of global samples. 

Then the global samples are sorted by the number of edges and distributed across workers, so that each GPU processes as close to an equal number of edges as possible. This algorithm balances the workload well but introduces a large communication overhead especially as the application scales to more GPUs. This is the same algorithm used in the Open Catalyst submission from LBNL in v1.0.

NVIDIA also improved the sampling function in v2.0. The load-balancing sampler avoids global (inter-GPU) communication by fetching the indices of all the samples in the global batch to all workers at the beginning. As before, samples are sorted by the number of edges and partitioned into different buckets such that each bucket has approximately the same number of edges. Finally, each worker gets the bucket containing the indices of the samples that correspond to its global rank.

Diagram of load imbalance in Open Catalyst, showing the number of atoms, number of edges, and ratio of angles to edges all varying significantly across iterations.
Figure 2. Load imbalance in Open Catalyst, showing the number of atoms, number of edges, and ratio of angles to edges all varying significantly across iterations
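A simplified sketch of an edge-balanced bucketing scheme of this kind is shown below; it is a hypothetical helper that illustrates the idea, not the submission's sampler:

```python
def balanced_buckets(edge_counts, num_workers):
    """Partition sample indices into num_workers buckets with roughly equal total edge counts.

    edge_counts: list where edge_counts[i] is the number of edges in sample i.
    Returns one list of sample indices per worker (global rank).
    """
    # Consider the heaviest samples first
    order = sorted(range(len(edge_counts)), key=lambda i: edge_counts[i], reverse=True)

    buckets = [[] for _ in range(num_workers)]
    loads = [0] * num_workers
    for idx in order:
        w = loads.index(min(loads))      # greedily pick the lightest bucket so far
        buckets[w].append(idx)
        loads[w] += edge_counts[idx]
    return buckets

# Each worker then processes buckets[rank] for the current global batch.
```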

Kernel fusion using nvFuser and cuGraph-ops

There are more than 10K kernels in the original Open Catalyst model as downloaded from the MLCommons GitHub. nvFuser, the deep learning compiler for PyTorch, uses just-in-time (JIT) compilation to fuse multiple operations into a single kernel, decreasing both the number of kernel launches and the number of global memory transactions.

To achieve this, NVIDIA modified the model script to enable JIT in PyTorch. Optimized fused kernels were also implemented in cuGraph-ops that were exposed through the RAPIDS framework. With the help of nvFuser and cuGraph-ops, the total number of kernels can be reduced by more than 90%. 
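As a simple illustration of the approach (not code from the Open Catalyst model), scripting a chain of pointwise operations lets the JIT fuser compile them into a single kernel instead of several:

```python
import torch

@torch.jit.script
def scale_shift_gelu(x: torch.Tensor, scale: torch.Tensor, shift: torch.Tensor) -> torch.Tensor:
    # Several pointwise ops that eager PyTorch would launch as separate kernels;
    # under torch.jit.script the fuser can emit one fused CUDA kernel instead.
    y = x * scale + shift
    return y * 0.5 * (1.0 + torch.erf(y * 0.7071067811865476))

x = torch.randn(4096, 256, device="cuda")
scale = torch.randn(256, device="cuda")
shift = torch.randn(256, device="cuda")
out = scale_shift_gelu(x, scale, shift)
```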

Fusing small GEMMs to improve GPU utilization

In the original computation graph, there are many small general matrix multiplications (GEMMs) that are executed sequentially and cannot saturate the GPU. These small GEMM operations can be fused to reduce the number of kernels and improve GPU utilization. Three kinds of GEMM fusion were applied: packing, batching, and horizontal fusion, as explained below. Implementing them required only changes to the model script; a simplified sketch follows Figure 5.

Packing – Several linear layers share the same input. One large GEMM was used to replace a set of several small GEMMs. 

Diagram showing linear layers sharing input.
Figure 3. Linear layers sharing input

Batching – Several linear layers have no dependency on each other. These linear layers were bundled into batch operations to improve the degree of parallelism.

Diagram showing linear layers computed independently.
Figure 4. Linear layers computed independently

Horizontal fusion – The output reduction can be expressed as w1·o1 + w2·o2 + w3·o3 + w4·o4 + w5·o5, which maps directly onto a block matrix multiplication, so the terms can be packed together and computed as a single GEMM.

Diagram showing matrix multiplication reduction.
Figure 5. Matrix multiplication reduction
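The following rough PyTorch sketch illustrates the packing and batching ideas with toy shapes; it is not the actual model code:

```python
import torch

x = torch.randn(1024, 64, device="cuda")                   # shared input
weights = [torch.randn(64, 32, device="cuda") for _ in range(5)]

# Unfused: five small GEMMs launched sequentially
outs = [x @ w for w in weights]

# Packing: concatenate the weights and run one large GEMM, then split the result
packed_w = torch.cat(weights, dim=1)                        # [64, 5 * 32]
outs_packed = (x @ packed_w).split(32, dim=1)

# Batching: independent linear layers with independent inputs become one batched GEMM
xs = torch.randn(5, 1024, 64, device="cuda")                # one input per layer
ws = torch.stack(weights)                                   # [5, 64, 32]
outs_batched = torch.bmm(xs, ws)                            # single batched kernel
```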

Eliminating redundant computation on triplets

In the original computation graph, each edge feature is expanded to triplets, and then each triplet performs an elementwise multiplication. The number of triplets is about 30x the number of edges, which results in a large amount of redundant computation. To remove it, the elementwise multiplication is performed on the edge features first, and the result is then expanded to triplets.
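In tensor terms, the change looks roughly like the sketch below (hypothetical names, with E edges, T ≈ 30×E triplets, and C channels):

```python
import torch

E, T, C = 1_000, 30_000, 64
edge_feat = torch.randn(E, C, device="cuda")
edge_weight = torch.randn(E, C, device="cuda")
trip_to_edge = torch.randint(0, E, (T,), device="cuda")   # source edge of each triplet

# Before: expand to triplets first, then multiply elementwise over ~30x more rows
out_before = edge_feat[trip_to_edge] * edge_weight[trip_to_edge]

# After: multiply once per edge, then expand the already-reduced result to triplets
out_after = (edge_feat * edge_weight)[trip_to_edge]

assert torch.allclose(out_before, out_after)
```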

Pipeline optimization

An ALLReduce communication across all the workers is required before the loss stage to obtain the total number of atoms in the current global batch. Because the forward pass takes longer than the ALLReduce, the communication can be fully hidden behind it.

Figure 6 shows the training process timeline. The global batch is first loaded by multiple processes into CPU memory. Memcpy from CPU memory to GPU memory and ALLReduce (to get the number of atoms in the global batch) are overlapped with forward pass.

Diagram showing training process timeline.
Figure 6. Training process timeline
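With PyTorch distributed, this overlap can be expressed by issuing the all-reduce asynchronously before the forward pass. The sketch below uses placeholder batch, model, and loss objects and assumes the process group is already initialized; it is not the submission code:

```python
import torch
import torch.distributed as dist

# Number of atoms in the local batch, as a GPU tensor
num_atoms = torch.tensor([float(batch.num_atoms)], device="cuda")

# Launch the all-reduce without blocking; it proceeds while the forward pass runs
work = dist.all_reduce(num_atoms, op=dist.ReduceOp.SUM, async_op=True)

pred = model(batch)              # forward pass overlaps with the communication

work.wait()                      # the global atom count is only needed for the loss
loss = loss_fn(pred, target) / num_atoms
loss.backward()
```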

Data staging

The training data of the Open Catalyst benchmark is 300 GB, and one NVIDIA DGX A100 node has 2,048 GB of system memory and 256 CPU threads (128 threads per socket, with two sockets per node). As a result, the whole training dataset can be preloaded into CPU memory at the beginning, and there is no need to load the mini-batch from disk to CPU memory in every training step.

To accelerate the data preload, NVIDIA launched 256 processes, each loading 300/256 (~1.2) GB of the training dataset. The preload took about 10 to 15 seconds, which is negligible with respect to the end-to-end training time.

DeepCam

Loading data

Previously, the transparent in-memory data loader used background processes to cache data locally in dynamic random-access memory (DRAM). This caused a large overhead, so the loader was reimplemented to use threads instead.

Performance was previously limited by the Python Global Interpreter Lock (GIL). This time, the C++-based I/O helper class was optimized to release the GIL, allowing background loading to overlap with other CPU work. The same optimization was applied to the distributed data stager for the weak-scaling score, improving end-to-end performance by about 15%.

Full iteration CUDA graph capture

Compared to MLPerf HPC v1.0, the scope of CUDA graph capture was extended to the full iteration: forward and backward passes, optimizer, and learning rate scheduler step. For this purpose, the sync-free optimizers FusedMixedPrecisionLAMB and DistributedLAMB from the NVIDIA APEX package were employed for the weak-scaling and strong-scaling benchmarks, respectively.

Additionally, all DeepCAM learning rate schedulers were ported to GPU. By increasing the fraction of the computation that is executed inside the CUDA graph, performance variability across devices that stems from CPU execution variability is reduced. Scale-out performance improves as a result.

Distributed optimizer

For improving strong scaling performance, the DistributedLAMB optimizer was used. This optimizer is especially suited for small per-GPU local batch sizes and large scales, since optimizer cost is more pronounced in such settings. The performance gain at scale is about 3% end-to-end for DeepCAM. 

cuDNN kernel optimizations

DeepCAM features a large number of compute kernels with different performance characteristics. NVIDIA improved the performance of grouped convolutions in v1.0; in v2.0, the performance of pointwise convolutions, which are used together with the grouped convolutions to form depthwise-separable convolutions, was also improved.

MLPerf HPC v2.0 final results 

AI is changing how science is done with high performance computing. Each year, new and more accurate surrogate models are built and shown to vastly outpace physics-based simulations while remaining sufficiently accurate to be useful. Protein structure prediction has been revolutionized by this AI-based approach, with the advent of OpenFold, RoseTTAFold, and AlphaFold 2 bringing protein structure-based drug discovery within reach.

MLPerf HPC reflects the supercomputing industry’s need for an objective, peer-reviewed method to measure and compare AI training performance for use cases relevant to HPC. 

NVIDIA has made significant progress since the MLPerf HPC v1.0 submission in 2021. The Selene supercomputer shows that the NVIDIA A100 Tensor Core GPU and the NVIDIA DGX A100 SuperPOD, though nearly three years old, remain the best platform for AI training for HPC use cases and beyond.

For more information, see MLPerf HPC Benchmarks Show the Power of HPC+AI.

Categories
Misc

Leading MLPerf Training 2.1 with Full Stack Optimizations for AI

As AI becomes increasingly capable and pervasive, MLPerf benchmarks, developed by MLCommons, have emerged as an invaluable tool for organizations to evaluate the performance of AI infrastructure across a wide range of popular AI-based workloads.

MLPerf Training v2.1—the seventh iteration of this AI training-focused benchmark suite—tested performance across a breadth of popular AI use cases, including the following:

  • Image classification
  • Object detection
  • Medical imaging
  • Speech recognition
  • Natural language processing
  • Recommendation
  • Reinforcement learning

Many AI applications take advantage of multiple AI models deployed in a pipeline. This means that it is critical for an AI platform to be able to run the full range of models available today as well as provide both the performance and flexibility to support new model innovations.

The NVIDIA AI platform submitted results on all workloads in this round and it continues to be the only platform to have submitted results on all MLPerf Training workloads.

Diagram shows a user asking their phone to identify the type of flower in an image and the many AI models that may be used to perform this identification task across several domains: audio, vision, recommendation, and TTS.
Figure 1. Real-world AI example showing the use of many AI models

NVIDIA Hopper delivers a big performance boost

In this round, NVIDIA submitted its first MLPerf Training results using the new H100 Tensor Core GPU, demonstrating up to 6.7x higher performance compared to the first A100 Tensor Core GPU submission and up to 2.6x more performance compared to the latest A100 results.

Chart shows that the H100 delivers up to 6.7x more performance than the first A100 submission in MLPerf Training.
Figure 2. Performance improvements of the latest H100 and A100 submissions compared to the first A100 submission on each MLPerf Training workload

ResNet-50 v1.5: 8x NVIDIA 0.7-18, 8x NVIDIA 2.1-2060,  8x NVIDIA 2.1-2091 | BERT:  8x NVIDIA 0.7-19, 8x NVIDIA 2.1-2062, 8x NVIDIA 2.1-2091 | DLRM: 8x NVIDIA 0.7-17, 8x NVIDIA 2.1-2059, 8x NVIDIA 2.1-2091 | Mask R-CNN: 8x NVIDIA 0.7-19, 8x NVIDIA 2.1-2062, 8x NVIDIA 2.1-2091 | RetinaNet: 8x NVIDIA 2.0-2091, 8x NVIDIA 2.1-2061, 8x NVIDIA 2.1-2091 | RNN-T: 8x NVIDIA 1.0-1060, 8x NVIDIA 2.1-2061, 8x NVIDIA 2.1-2091 | Mini Go: 8x NVIDIA 0.7-20, 8x NVIDIA 2.1-2063, 8x NVIDIA 2.1-2091 | 3D U-Net: 8x NVIDIA 1.0-1059, 8x NVIDIA 2.1-2060, 8x NVIDIA 2.1-2091
First NVIDIA A100 Tensor Core GPU results normalized for throughput due to higher accuracy requirements introduced in MLPerf Training 2.0 where applicable. 
The MLPerf name and logo are trademarks. For more information, see www.mlperf.org.

In addition, in its fifth MLPerf Training round, the A100 continued to deliver excellent performance across the full suite of workloads, delivering up to 2.5x more performance than in its first submission as a result of extensive software optimizations.

This post offers a closer look at the work done by NVIDIA to deliver these results.

BERT

For this round of MLPerf, several optimizations were implemented for our BERT submission, including the use of the FP8 format and optimizations for FP8 operations, reduced CPU overhead, and sequence packing for small scales.

Integration with NVIDIA Transformer Engine

One of the key optimizations employed in our BERT submission in MLPerf Training v2.1 was the use of the NVIDIA Transformer Engine library. The library accelerates transformer models on NVIDIA GPUs and takes advantage of the FP8 data format supported by the NVIDIA Hopper fourth-generation Tensor Cores.

In BERT, FP8 inputs were used for the fully connected layers, as well as for the fused multihead attention kernel that implements multihead attention in a single kernel. Using the FP8 format improves memory access times by reducing the amount of data transferred between memory and the streaming multiprocessors (SMs) compared to the FP16 format.

Using the FP8 format for the inputs of matrix multiplications also takes advantage of the higher computational rates of FP8 format compared to the FP16 format on NVIDIA Hopper architecture GPUs. By taking advantage of the FP8 format, Transformer Engine accelerates the end-to-end training time by 37% compared to not using the Transformer Engine on the same hardware.

Transformer Engine abstracts away the FP8 tensor type from the user. As a result, the tensor format at the input and output of the encoder layers remains as FP16. The details of FP8 usage are handled by the Transformer Engine library inside the encoder layer.

Both E4M3 and E5M2 formats are employed for FP8, referred to as a hybrid recipe in Transformer Engine. For more information about FP8 format and recipes, see Using FP8 with Transformer Engine.
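In practice, enabling FP8 through Transformer Engine looks roughly like the sketch below (illustrative shapes; the exact API surface may vary between Transformer Engine releases):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Hybrid recipe: E4M3 for forward-pass tensors, E5M2 for gradients
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(8, 512, 1024, device="cuda", dtype=torch.float16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)         # inputs are cast to FP8 internally; the output stays FP16

y.sum().backward()       # backward GEMMs also run in FP8 according to the recipe
```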

FP8 general matrix multiply layers

The Transformer Engine library features custom fused kernel implementations to accelerate commonly used NLP and data transformation operations.

Figure 3 shows the FP8 implementation of the forward and backward passes for the Linear layer in PyTorch. The inputs to the GEMM layers are converted to FP8 using the Cast+Transpose (C+T) fused kernel provided by the Transformer Engine library and the GEMM outputs are saved in FP16 precision. FP8 GEMM layers result in a 29% improvement in end-to-end training time. In Figure 3, the C+T operations in gray are redundant and are shown for illustrative purposes only. FP8 GEMM layers use the cuBLAS library in the backend of the Transformer Engine library.

Diagrams of the three FP8 GEMM patterns used.
Figure 3. FP8 GEMM patterns employed in BERT through Transformer Engine

Higher-efficiency, fused multihead attention for FP8

In this round, we implemented a different version of fused multihead attention that is more efficient for the BERT use case, inspired by the FlashAttention algorithm.

This implementation does not write out the softmax output or dropout mask in the forward pass to be used in the backward pass. Instead, it recomputes the softmax output in the backward pass and uses the random number generator states directly from the forward pass to regenerate the dropout mask in the backward pass.

This approach is much more efficient, particularly when FP8 inputs and outputs are used due to reduced register pressure. It results in an 8% improvement in end-to-end time-to-train.

Minimize overhead using dataset packing

Previously, for small scales, we used an unpadding strategy to minimize the overhead that stems from varying sequence lengths and additional padding.

An alternative approach is to pack the sequences in a way such that they almost completely fill the batch matrix, making the additional padding negligible while keeping the buffer sizes static across iterations.

In our latest submission, we used a sequence packing algorithm to preprocess training data for small and medium-scale (64 GPUs or less) NVIDIA Hopper submissions. This is similar to the technique employed in previous rounds for larger scales with 1,024 GPUs and more.
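A greedy first-fit sketch of such an offline packing step is shown below; it is a hypothetical helper illustrating the idea, not the preprocessing code used in the submission:

```python
def pack_sequences(lengths, max_seq_len=512):
    """Greedily pack variable-length sequences into packs of at most max_seq_len tokens.

    lengths: list of sequence lengths.
    Returns a list of packs, each a list of sequence indices whose lengths fit the budget.
    """
    # Placing the longest sequences first tends to leave less padding per pack
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    packs, room = [], []               # room[i] = remaining token budget of pack i
    for idx in order:
        n = lengths[idx]
        for p, free in enumerate(room):
            if n <= free:              # first pack with enough space
                packs[p].append(idx)
                room[p] -= n
                break
        else:                          # no existing pack fits: open a new one
            packs.append([idx])
            room.append(max_seq_len - n)
    return packs
```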

Overlap CPU preprocessing with GPU operations to improve training time

Each training step in BERT involves preprocessing the input sequences (also known as mini-batches) on the CPU before copying them to the GPU.

In this round, an optimization was introduced to pipeline the forward pass execution of the current mini-batch with preprocessing of the next mini-batch. This optimization reduces idle GPU time, which is especially important as GPU execution gets faster. It resulted in a 2% improvement in end-to-end training time.

Diagram shows forward pass execution where the iteration is pipelined with the preprocessing of the next mini-batch.
Figure 4. Pipelining of data preprocessing and forward pass
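Conceptually, the pipelining amounts to a one-batch prefetch: while the GPU runs the forward pass on batch i, the CPU prepares batch i+1. A minimal sketch of the idea using a background thread follows; preprocess, raw_batches, model, and optimizer are placeholders, not the submission code:

```python
import queue
import threading

prefetch_q = queue.Queue(maxsize=1)          # holds at most one preprocessed mini-batch

def producer(raw_batches):
    for raw in raw_batches:
        batch = preprocess(raw)                                   # CPU-side preprocessing
        batch = {k: v.pin_memory() for k, v in batch.items()}     # enable async H2D copies
        prefetch_q.put(batch)
    prefetch_q.put(None)                                          # end-of-data sentinel

threading.Thread(target=producer, args=(raw_batches,), daemon=True).start()

while (batch := prefetch_q.get()) is not None:
    gpu_batch = {k: v.to("cuda", non_blocking=True) for k, v in batch.items()}
    loss = model(**gpu_batch)        # overlaps with preprocessing of the next mini-batch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
```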

New hyperparameter and batch sizes optimized for H100

With the new H100 Tensor Core GPUs based on the NVIDIA Hopper architecture, throughput scales extremely well with growing local batch sizes. As a result, we increased per-accelerator batch sizes and optimized the training hyperparameters accordingly.

ResNet-50

In this round of MLPerf, we extended the fusion of convolution and memory-bound operations beyond epilog fusions and improved the performance of the pooling operation.

Conv-BN fprop Fusion

The ResNet-50 model consists of a Conv->BN->ReLU->Conv->BN->ReLU pattern, which leaves Tensor Cores idle while the memory-bound normalization layers execute.

In MLPerf Training 2.1, BatchNorm was split into BatchNorm Stats calculation and BatchNorm Apply.

The programmability of NVIDIA GPUs enabled us to fuse the stats calculation in the epilog of the previous convolution, and fuse Apply in the mainloop of the next convolution.

For the weight gradient calculation, however, this means the input must be recomputed by fusing BatchNorm Apply and ReLU into the wgrad kernel. With new, high-performance kernels in cuDNN, this feature yielded a 4.2% end-to-end speedup at small scales.

Faster pooling operations

The ResNet-50 model employs maxPool and AvgPool operations in the stem and classifier blocks. By using the new graph API in cuDNN and taking advantage of the higher DRAM bandwidth in the NVIDIA H100 Tensor Core GPU, we sped up the pooling operations by over 3x. This resulted in a speedup of over 3% in MLPerf Training v2.1.

RetinaNet

The major optimization for RetinaNet in this round of MLPerf was improving score computation in the NVCOCO library to remove CPU bottlenecks for the max-scale submission. Additional optimizations include new fusions, extending the reach of CUDA Graphs for reducing CPU overheads, and using the DALI library to improve data preprocessing in the evaluation phase.  

NVCOCO: Accelerating scoring

As GPU execution gets faster, the portions of the code that execute on the CPU can bottleneck performance. This is especially true at large scales where the amount of work executed per GPU is smaller.

Currently, the mAP metric computation during the evaluation phase runs on the CPU, and it was the performance bottleneck in our previous max-scale submission.

In this MLPerf round, we optimized this evaluation computation to eliminate CPU bottlenecks and enable our GPU optimizations to shine. This has particularly helped with the max-scale submission.

The C++ extensions in NVIDIA cocoapi were also further optimized. For the mAP metric computation, we improved performance by 3x in this round and by 20x overall compared to the original cocoapi implementation. These optimizations mostly focus on file I/O, memory access, and load balancing.

We replaced pybind11 with native CPython PyModules as the interface between Python and C++. By reading JSON files directly on the C++ side and interacting with the CPython pointers of NumPy objects, we eliminated deep copies that might have existed before.

Also, loop transformations, such as loop fusion and loop reordering, have significantly improved cache locality and memory access efficiency for multithreading.

We added more parallel regions in OpenMP to exploit additional parallelism and adjusted tasking schedules to better load balance across the threads.

These optimizations in the metric computation have overall resulted in a ~60% end-to-end performance improvement at the 160-node scale. Removing CPU bottlenecks from the critical path also enabled us to increase our maximum scale to 256 nodes from 160 nodes. This yields an additional ~30% reduction in the total time-to-train, despite an increase in the number of epochs required to achieve the target accuracy.

The total end-to-end speedup from the COCO optimization is 2.3x. 

Extending CUDA graphs: Sync-free Adam optimizer

CUDA Graphs provides a mechanism to launch multiple GPU kernels without CPU intervention, mitigating CPU overheads.

In MLPerf Training v2.0, CUDA Graphs was used extensively in our RetinaNet submission. However, gradient scaling and the Adam optimizer step were left out of the region that was graph-captured, due to CPU-GPU synchronization in the optimizer implementation.

In this MLPerf Training v2.1 submission, the Adam optimizer was modified to achieve a sync-free operation. This enabled us to further extend the reach of CUDA graphs and reduce CPU overhead.

Additional cuDNN runtime fusion

In addition to the conv-bias-relu fusion used in the previous MLPerf submission, a conv-scale-bias-relu fusion was employed in the RetinaNet backbone by using cuDNN runtime fusion. This enabled us to avoid kernel launch latency and data movements, resulting in a 1.5% end-to-end speedup.

Using NVIDIA DALI during evaluation

The significant speedups achieved in the training passes have resulted in an increase in the proportion of the time spent on the evaluation stages.

NVIDIA DALI was previously employed during training but not during evaluation. To address the relatively slow evaluation iteration times, we used DALI to efficiently load and preprocess data.

Mask R-CNN

In this round of MLPerf, beyond improving the parallelization of different blocks of Mask R-CNN, we enabled the use of new kernel fusions and reduced the CPU overhead in training iterations.

Faster JSON interpreter

Switching from ujson to orjson reduced the loading time of the COCO 2017 annotations file by approximately 1.5 seconds.
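The change is essentially a drop-in swap of the JSON parser; a small sketch with a hypothetical file path:

```python
import orjson

# orjson parses bytes directly and is substantially faster than ujson or the
# standard json module on large files such as the COCO 2017 annotations.
with open("annotations/instances_train2017.json", "rb") as f:
    annotations = orjson.loads(f.read())
```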

Faster evaluation and NVCOCO optimizations

The NVCOCO improvements described for RetinaNet reduce evaluation time by approximately 2 seconds per epoch. Because only the last evaluation is exposed in the end-to-end time, this shortens the end-to-end time for all Mask R-CNN configurations by about 2 seconds.

The optimized NVCOCO library is a drop-in replacement, making the optimizations directly available to end users.

Vectorized batched ROI Align

Region of Interest (ROI) Align performs bilinear interpolation, which requires a fair bit of math work. As this work is the same for all channels, vectorizing across the channel dimension reduced the amount of work needed by about 4x.

The way launch configurations are calculated was also changed to avoid launching more CUDA threads than needed.

Combined, these efforts improved performance for ROI Align forward propagation by about 5x.

Exposing more parallelism in model code

Mask R-CNN, like most models, incorporates many sections of code that can be executed in parallel. For example, mask-head loss calculation involves calculating a loss for multiple proposals, where each proposal loss can be calculated independently.

We achieved a 3-5% speedup by identifying such sections that can be parallelized and placing them on separate CUDA streams.
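In PyTorch, this kind of parallelism is expressed with side CUDA streams. A minimal sketch follows; the head functions, features, and proposals are placeholders rather than the Mask R-CNN code:

```python
import torch

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()

# The side streams must wait for the work that produced the shared features
s1.wait_stream(torch.cuda.current_stream())
s2.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s1):
    loss_box = box_head_loss(features, proposals)     # one independent loss term
with torch.cuda.stream(s2):
    loss_mask = mask_head_loss(features, proposals)   # runs concurrently on s2

# Rejoin before consuming the results on the default stream
torch.cuda.current_stream().wait_stream(s1)
torch.cuda.current_stream().wait_stream(s2)
loss = loss_box + loss_mask
```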

Removing more CPU-GPU syncs

In sections of the code where CUDA Graphs is not employed, the GPU kernels are launched from the CPU code. CPU code primarily performs bookkeeping tasks, such as managing memory, keeping track of pointers and indices, and so on.

If the CPU code is not fast enough, the GPU kernel finishes and sits idle before the next kernel is launched. Improving CPU performance to the point where the CPU portion of the code runs faster than the GPU portion is critical to maximizing training performance. This requires some amount of CPU run-ahead.

CPU-GPU synchronizations prevent this because they keep the CPU idle until the current GPU work completes, so removing CPU-GPU synchronizations is also critical for training performance.

We have done this in past rounds for submissions using the NVIDIA A100 Tensor Core GPU. However, the significant performance increase provided by the NVIDIA H100 Tensor Core GPU necessitated that more of these CPU-GPU synchronizations be removed.

This had a small impact on the NVIDIA A100 Tensor Core GPU results, as CPU overhead is not as pronounced for that GPU on Mask R-CNN. However, it improved performance on the H100 Tensor Core GPU by 25-30% for the 32-GPU configuration.

Runtime fusions in RPN and FPN

In previous rounds, we sped up the code by using the cuDNN v8 API to perform runtime fusions for the ResNet-50 backbone of Mask R-CNN.

In this round, the work previously done for RetinaNet was leveraged to extend runtime fusions to the RPN and FPN modules, further improving end-to-end performance by about 2%.

Boosting AI performance by 6.7x

The NVIDIA H100 GPU based on the NVIDIA Hopper architecture delivers the next large performance leap for the NVIDIA AI platform. It boosts performance by up to 6.7x compared to the first submission using the A100 GPU.

With software improvements alone, the A100 GPU demonstrated up to 2.5x more performance in this latest round compared to its debut submission, showcasing the continuous full-stack innovation delivered by the NVIDIA AI platform.

All software used for NVIDIA MLPerf submissions is available from the MLPerf repository, enabling you to reproduce our benchmark results. We constantly incorporate these cutting-edge MLPerf improvements into our deep learning framework containers, which are available on NGC, our software hub for GPU-optimized applications.

Categories
Misc

HORN Free! Roaming Rhinos Could Be Guarded by AI Drones

Call it the ultimate example of a job that’s sometimes best done remotely. Wildlife researchers say rhinos are magnificent beasts, but they like to be left alone, especially when they’re with their young. In the latest example of how researchers are using the latest technologies to track animals less invasively, a team of researchers has…

The post HORN Free! Roaming Rhinos Could Be Guarded by AI Drones appeared first on NVIDIA Blog.

Categories
Misc

New Volvo EX90 SUV Heralds AI Era for Swedish Automaker, Built on NVIDIA DRIVE

It’s a new age for safety. Volvo Cars unveiled the Volvo EX90 SUV today in Stockholm, marking the beginning of a new era of electrification, technology and safety for the automaker. The flagship vehicle is redesigned from tip to tail — with a new powertrain, branding and software-defined AI compute — powered by the centralized…

The post New Volvo EX90 SUV Heralds AI Era for Swedish Automaker, Built on NVIDIA DRIVE appeared first on NVIDIA Blog.