
Tuning AI Infrastructure Performance with MLPerf HPC v2.0 Benchmarks

As AI becomes increasingly capable and pervasive in high performance computing (HPC), MLPerf benchmarks have emerged as an invaluable tool. Developed by MLCommons, MLPerf benchmarks enable organizations to evaluate the performance of AI infrastructure across a set of important workloads traditionally performed on supercomputers. 

Peer-reviewed industry-standard benchmarks are a critical tool for evaluating HPC platforms, and NVIDIA believes access to reliable performance data will help guide HPC architects of the future in their design decisions. 

MLPerf HPC benchmarks measure training time and throughput for three types of high-performance simulations that have adopted machine learning techniques. 

Figure 1. All submitted MLPerf HPC v2.0 results, shown as two charts: one for strong scaling and one for weak scaling. The strong-scaling chart shows participation from five submitters, with NVIDIA Selene fastest on all three benchmarks and Jülich a close second. The weak-scaling chart shows three participants, with NVIDIA Selene ranging from about 2x faster than Jülich on OpenCatalyst to more than 3x faster on DeepCam.

This post walks through the steps the NVIDIA MLPerf team took to optimize each benchmark and measurement to extract maximum performance. We focus on the optimizations made for MLPerf HPC v2.0 on top of those made for MLPerf HPC v1.0. 

CosmoFlow 

Each instance of the CosmoFlow training application benchmark loads ~8 TB of training data and ~1 TB of validation data. These consist of 512K training samples and 64K validation samples. Each sample has a 16 MB data file and a tiny 144-character label file. In total, there are over 1 million small files that need to be loaded into node-local Nonvolatile Memory Express (NVMe) before training can begin.

In MLPerf HPC v1.0, this resulted in data staging taking a significant amount of time both for strong-scale and weak-scale cases. For the weak-scale case, having many instances—each loading over 1 million files from the file system—puts additional strain on the shared disk system. 

At the instance counts used for the weak-scale submission, this causes staging performance to degrade non-linearly with the number of instances. These problems were addressed in several ways, as outlined below.

Data staging on NVMe

For a single strong-scaling training instance, analysis of the NVIDIA MLPerf HPC v1.0 submission showed that only a small fraction of the maximum theoretical read bandwidth of the Selene Lustre file system was used. The same was true of the storage network interface cards (NICs) on the nodes when staging the input dataset. 

During the staging phase of training, the allocated CPU resources are dedicated entirely to sourcing data from the shared file system to node-local NVMe storage. Increasing the number of threads dedicated to staging, and staging the training and validation data in parallel, reduced the staging time by ~75%. This equates to a ~4x speedup in staging and a 40% reduction in end-to-end time for the strong-scale scenario. 
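As an illustration of the idea, the sketch below stages the training and validation splits in parallel with a thread pool. The paths, thread counts, and flat directory layout are hypothetical, not the actual staging scripts used in the submission.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def stage_split(src_dir: str, dst_dir: str, num_threads: int = 32) -> None:
    """Copy every file under src_dir to node-local dst_dir with a thread pool."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = [p for p in src.rglob("*") if p.is_file()]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Each thread copies one file at a time; many copies run in parallel.
        list(pool.map(lambda p: shutil.copy(p, dst / p.name), files))

# Stage the training and validation splits in parallel, as described above.
with ThreadPoolExecutor(max_workers=2) as splits:
    splits.submit(stage_split, "/lustre/cosmoflow/train", "/nvme/cosmoflow/train")
    splits.submit(stage_split, "/lustre/cosmoflow/val", "/nvme/cosmoflow/val")
```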

Data compression

Loading many tiny files is inherently inefficient. In the case of CosmoFlow, more than half a million of the files are tiny 144-byte label files. To further improve staging performance, each data file and its associated label were combined offline, ahead of time, into a single compressed file. 

In parallel with the data being staged from disk, the files are uncompressed locally onto the compute node disk. This reduces the number of files to be read from disk by 50% and the total data transferred from disk by ~85%, yielding an additional 13% staging speedup for the strong-scale scenario. This translates into a 7% end-to-end improvement in overall training time for the strong-scale submission. 

This approach achieved over 900 GB/s read bandwidth for data staging of a strong-scale scenario. 
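The sketch below shows the offline packing idea: each sample's data file and its tiny label file are merged into a single compressed archive, and the archive is unpacked on the compute node during staging. The file formats and paths are illustrative assumptions, not the actual preprocessing pipeline.

```python
import gzip
import pickle
from pathlib import Path

def pack_sample(data_path: str, label_path: str, out_path: str) -> None:
    """Offline step: merge one data file and its label into a compressed archive."""
    sample = {
        "data": Path(data_path).read_bytes(),    # ~16 MB data file
        "label": Path(label_path).read_bytes(),  # tiny 144-character label
    }
    with gzip.open(out_path, "wb") as f:
        pickle.dump(sample, f)

def unpack_sample(packed_path: str):
    """Runs on the compute node while other files are still being staged."""
    with gzip.open(packed_path, "rb") as f:
        sample = pickle.load(f)
    return sample["data"], sample["label"]
```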

Increasing effective bandwidth when running multiple instances 

For additional algorithmic details, refer to the DeepCam explanation from the 2021 MLPerf HPC submission, MLPerf HPC v1.0: Deep Dive into Optimizations Leading to Record-Setting NVIDIA Performance.

When running multiple instances at the same time, for weak scaling, every instance must stage a copy of the training and validation data on its local nodes. This year, the NVIDIA submission implemented the distributed staging mechanism for CosmoFlow. 

All of the nodes, regardless of which instance they are associated with, load a fraction of total data (1/N where N is the total number of nodes, which is 512 in this case) from the shared file system. Given the optimizations already discussed, this takes only a few seconds. 

Then, every node uses MPI_Allgather to distribute the data it loaded from remote storage to the other nodes that need it. This exchange takes place over the higher-bandwidth InfiniBand fabric. In other words, a large portion of the data transfer that previously went over the storage network is offloaded to the InfiniBand fabric. As a result of distributed staging, staging time scales linearly with the number of instances (at least up to 128 instances) for weak-scale scenarios. 
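A minimal mpi4py sketch of this distributed staging pattern is shown below, assuming one rank per node and placeholder file paths; each rank reads a 1/N slice from the shared file system, and the shards are then exchanged with an all-gather over InfiniBand.

```python
from pathlib import Path
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nranks = comm.Get_rank(), comm.Get_size()

# Each rank (node) reads only its 1/N slice of the packed sample files.
all_files = sorted(Path("/lustre/cosmoflow/train").glob("*.gz"))
my_shard = [(p.name, p.read_bytes()) for p in all_files[rank::nranks]]

# Exchange the shards over the InfiniBand fabric instead of re-reading Lustre.
gathered = comm.allgather(my_shard)

out_dir = Path("/nvme/cosmoflow/train")
out_dir.mkdir(parents=True, exist_ok=True)
for shard in gathered:
    for name, blob in shard:
        (out_dir / name).write_bytes(blob)
```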

For the v1.0 submission, 32 instances were run, each staging ~9 TB of data. This took 10.76 minutes for an effective bandwidth of ~460 GB/s.

For this year's submission, 128 instances were run, each staging ~9 TB of data, and total staging took 6.7 minutes. This means input data was staged for 4x the number of instances in about 60% of the time, for an effective bandwidth of ~2,900 GB/s, a 6.5x increase in effective bandwidth. Effective bandwidth assumes the total amount of data staged from the file system is the same as it would be for a non-distributed algorithm with the same number of instances.

Smaller instance sizes for weak-scale training

All the staging improvements enabled the size of the individual instances to be reduced for weak scaling (and hence a larger number of parallel instances), which would not have been possible with the storage access bottlenecks that existed before the optimizations were implemented. In v1.0, with 32 instances of 128 GPUs each, staging time already scaled non-linearly: increasing the number of instances caused a superlinear increase in staging time. 

Without the improvements to efficiently stage for many instances, the staging time would have continued to grow superlinearly with the number of instances, resulting in more time being spent for data staging than the actual training. 

With the optimizations described above, the number of instances was increased from 32 to 128 for the weak-scale submission, with each instance using four nodes instead of the 16 nodes used in MLPerf HPC v1.0. In v2.0, staging completed in less time, while the number of models running simultaneously for the weak-scale submission increased by 4x.

CUDA graphs and graph capture

CUDA graphs allow a sequence of kernels to be launched as a single graph, instead of launching each kernel individually from the CPU. This feature minimizes CPU involvement in each iteration, substantially improving performance by minimizing latencies, especially for strong-scaling scenarios. 

CUDA graphs support was recently added to PyTorch. See Accelerating PyTorch with CUDA Graphs for more details. CUDA graphs support in PyTorch resulted in around a 15% end-to-end performance gain in CosmoFlow for the strong scaling scenario, which is most sensitive to latency and jitter. 
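The sketch below illustrates the PyTorch CUDA graph workflow with torch.cuda.make_graphed_callables on a toy model; the network, shapes, and optimizer are placeholders rather than the actual CosmoFlow training code.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv3d(4, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Flatten(),
    torch.nn.Linear(16 * 8 * 8 * 8, 4),
).cuda()
loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

# Capture the model's forward and backward work into CUDA graphs; each
# iteration then replays the graphs instead of launching kernels one by one.
sample_input = torch.randn(8, 4, 8, 8, 8, device="cuda")
graphed_model = torch.cuda.make_graphed_callables(model, (sample_input,))

for _ in range(10):
    x = torch.randn(8, 4, 8, 8, 8, device="cuda")   # static shape is required
    y = torch.randn(8, 4, device="cuda")
    loss = loss_fn(graphed_model(x), y)
    loss.backward()
    opt.step()
    opt.zero_grad(set_to_none=True)
```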

OpenCatalyst

Load balancing across GPUs

Data parallelism splits the global batch equally between GPUs. However, data-parallel task partitioning does not, by default, consider load imbalance within the batch. In Open Catalyst, load imbalance exists between the samples in a batch because the number of atoms per molecule, and the number of edges and triplets in the graph derived from each molecule, vary substantially (Figure 2).

This imbalance results in a large synchronization overhead in the multi-GPU setting. For the strong-scaling scenario, this results in 32% of the computation time being wasted. Lawrence Berkeley National Laboratory (LBNL) introduced an algorithm to balance the load across GPUs in MLPerf HPC v1.0, and this was adopted in the NVIDIA submission this round. 

This algorithm first preprocesses the training data to obtain the number of edges for each sample. In the sampling stage, each GPU is given the indices of its local samples and performs a global all-gather to obtain the indices of all samples in the global batch. 

Then the global samples are sorted by the number of edges and distributed across workers, so that each GPU processes as close to an equal number of edges as possible. This algorithm balances the workload well but introduces a large communication overhead especially as the application scales to more GPUs. This is the same algorithm used in the Open Catalyst submission from LBNL in v1.0.

NVIDIA also improved the sampling function in v2.0. The load balancing sampler avoids global (inter-GPU) communication by fetching the indices of all the samples in the global batch to all workers at the beginning. As before, samples are sorted by the number of edges, and partitioned into different buckets such that each bucket has the same approximate number of edges. Finally, each worker gets its bucket containing the indices of the samples that correspond to its global rank.
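One way to realize this bucketing, sketched below with made-up names, is a largest-first greedy assignment that keeps the per-rank edge totals close; the actual sampler in the submission may differ in its details.

```python
from typing import List, Sequence

def balance_global_batch(sample_ids: Sequence[int],
                         num_edges: Sequence[int],
                         world_size: int) -> List[List[int]]:
    """Assign samples to ranks so each rank gets a similar total edge count."""
    buckets: List[List[int]] = [[] for _ in range(world_size)]
    edge_load = [0] * world_size
    # Largest-first greedy assignment keeps the per-rank edge totals close.
    for i in sorted(sample_ids, key=lambda s: num_edges[s], reverse=True):
        lightest = min(range(world_size), key=edge_load.__getitem__)
        buckets[lightest].append(i)
        edge_load[lightest] += num_edges[i]
    return buckets

# Rank r then trains on buckets[r] for this global batch.
buckets = balance_global_batch(range(16), [3, 40, 7, 22, 9, 31, 5, 18,
                                           12, 27, 4, 35, 8, 15, 6, 20], 4)
```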

Figure 2. Load imbalance in Open Catalyst, showing the number of atoms, number of edges, and ratio of angles to edges all varying significantly across iterations

Kernel fusion using nvFuser and cuGraph-ops

There are more than 10K kernels in the original OpenCatalyst model as downloaded from MLCommons GitHub. nvFuser, a deep learning compiler for PyTorch, uses just-in-time (JIT) compilation to fuse multiple operations into a single kernel. This approach decreases both the number of kernels and the number of global memory transactions. 

To achieve this, NVIDIA modified the model script to enable JIT in PyTorch. Optimized fused kernels were also implemented in cuGraph-ops that were exposed through the RAPIDS framework. With the help of nvFuser and cuGraph-ops, the total number of kernels can be reduced by more than 90%. 
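The toy example below shows the general pattern: a TorchScript-compiled chain of pointwise operations that a fuser such as nvFuser can compile into a single kernel. The function is illustrative, not an Open Catalyst subgraph, and it assumes a PyTorch build in which the nvFuser backend is still available.

```python
import torch

@torch.jit.script
def radial_filter(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # A chain of pointwise ops that a fuser can combine into one kernel.
    return torch.sigmoid(x * scale) * torch.exp(-x * x) + 0.5 * x

x = torch.randn(1_000_000, device="cuda")
scale = torch.rand_like(x)

with torch.jit.fuser("fuser2"):      # selects the nvFuser backend, if present
    for _ in range(3):               # warm-up runs trigger JIT compilation
        y = radial_filter(x, scale)
```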

Fusing small GEMMs to improve GPU utilization

In the original computation graph, there are many small general matrix multiplications (GEMMs) that execute sequentially and cannot saturate the GPU. These small GEMM operations can be fused to reduce the number of kernels and improve GPU utilization. Three kinds of GEMM fusion were applied: packing, batching, and horizontal fusion, as explained below (a toy sketch of the first two follows Figure 5). Only the model script was changed to implement these fusions.

Packing – Several linear layers share the same input. One large GEMM was used to replace a set of several small GEMMs. 

Figure 3. Linear layers sharing input

Batching – Several linear layers have no dependency on each other. These linear layers were bundled into batch operations to improve the degree of parallelism.

Figure 4. Linear layers computed independently

Horizontal fusion – The output reduction can be expressed as w1 × o1 + w2 × o2 + w3 × o3 + w4 × o4 + w5 × o5, which matches the block multiplication of matrices, so the terms can be packed together into a single matrix multiplication.

Figure 5. Matrix multiplication reduction
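The toy sketch below shows the packing and batching fusions in plain PyTorch, with made-up layer sizes: packing concatenates the weights of linears that share an input into one GEMM, and batching runs independent, equally shaped linears as a single batched GEMM.

```python
import torch

x = torch.randn(1024, 256, device="cuda")

# Packing: three linears on the same input become one concatenated GEMM.
w1, w2, w3 = (torch.randn(128, 256, device="cuda") for _ in range(3))
packed_w = torch.cat([w1, w2, w3], dim=0)                    # (384, 256)
y1, y2, y3 = torch.nn.functional.linear(x, packed_w).split(128, dim=1)

# Batching: independent linears of equal shape become one batched GEMM.
inputs = torch.randn(3, 1024, 256, device="cuda")            # one input per layer
weights = torch.randn(3, 256, 128, device="cuda")
outputs = torch.bmm(inputs, weights)                          # (3, 1024, 128)
```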

Eliminating redundant computation on triplets

In the original computation graph, each edge feature is expanded to triplets, and each triplet then performs an elementwise multiplication. The number of triplets is about 30x the number of edges, which results in a large amount of redundant computation. To remove it, the elementwise multiplication is performed on the edge features first, and the result is then expanded to triplets.
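The snippet below illustrates why the reordering is safe: multiplying on edges and then gathering to triplets produces the same result as gathering first, while performing roughly 30x fewer multiplications. Tensor names and sizes are made up.

```python
import torch

num_edges, num_triplets, feat = 10_000, 300_000, 64
edge_feat = torch.randn(num_edges, feat, device="cuda")
edge_weight = torch.randn(num_edges, feat, device="cuda")
# Index of the source edge for each triplet.
triplet_to_edge = torch.randint(num_edges, (num_triplets,), device="cuda")

# Before: expand to triplets, then multiply (num_triplets * feat multiplies).
before = edge_feat[triplet_to_edge] * edge_weight[triplet_to_edge]

# After: multiply on edges, then expand (num_edges * feat multiplies).
after = (edge_feat * edge_weight)[triplet_to_edge]

assert torch.equal(before, after)
```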

Pipeline optimization

An all-reduce communication across all the workers is required before the loss stage to obtain the total number of atoms in the current global batch. Because the forward pass takes longer than the all-reduce, the communication can be fully overlapped.

Figure 6 shows the training process timeline. The global batch is first loaded by multiple processes into CPU memory. The memcpy from CPU memory to GPU memory and the all-reduce (to get the number of atoms in the global batch) are overlapped with the forward pass.

Figure 6. Training process timeline
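A simplified sketch of this overlap is shown below, assuming torch.distributed is already initialized with the NCCL backend; the model, batch layout, and loss function are placeholders.

```python
import torch
import torch.distributed as dist

def training_step(model, batch, loss_fn):
    natoms = batch["natoms"].cuda(non_blocking=True).sum()
    # Launch the all-reduce asynchronously; NCCL runs it on its own stream.
    work = dist.all_reduce(natoms, op=dist.ReduceOp.SUM, async_op=True)

    # The forward pass runs while the communication is in flight.
    pred = model(batch["inputs"].cuda(non_blocking=True))

    work.wait()   # the global atom count is only needed for loss normalization
    return loss_fn(pred, batch["target"].cuda(non_blocking=True)) / natoms
```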

Data staging

The training data of the Open Catalyst benchmark is 300 GB, and one NVIDIA DGX A100 node has 2,048 GB of system memory and 256 hardware threads (128 threads per socket, with two sockets per node). As a result, the whole training dataset can be preloaded into CPU memory at the beginning, so there is no need to load each minibatch from disk in every training step. 

To accelerate the data preload, NVIDIA launched 256 processes, each loading 300/256 (~1.2) GB of the training dataset. The preload took about 10 to 15 seconds, which is negligible with respect to the end-to-end training time.
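A minimal sketch of such a preload with a multiprocessing pool follows; the dataset path, file format, and in-memory layout are assumptions for illustration.

```python
from multiprocessing import Pool
from pathlib import Path

NUM_WORKERS = 256

def read_file(path: Path):
    # Each worker reads roughly 1/NUM_WORKERS of the 300 GB dataset.
    return path.name, path.read_bytes()

if __name__ == "__main__":
    files = sorted(Path("/raid/open_catalyst/train").glob("*.bin"))
    with Pool(NUM_WORKERS) as pool:
        in_memory_dataset = dict(pool.map(read_file, files))
```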

DeepCam

Loading data

Previously, the transparent in-memory data loader used background processes to cache data locally in dynamic random-access memory (DRAM). This caused a large overhead, so the loader was reimplemented to use threads instead. 

Performance was previously limited by the Python Global Interpreter Lock (GIL). This time, the C++ based IO helper class was optimized to release the GIL. This approach allows the background loading to overlap with other CPU work. The same optimization was applied to the distributed data stager for the weak scaling score, improving end-to-end performance by about 15%.
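The pattern is sketched below in pure Python: a background thread prefetches samples into a queue while the main thread consumes them, and because the heavy read happens in C code (NumPy and the OS I/O layer) that releases the GIL, the overlap is effective. This is an illustration of the approach, not the DeepCam loader itself.

```python
import queue
import threading
import numpy as np

def prefetch_worker(paths, out_queue: queue.Queue) -> None:
    for p in paths:
        # np.fromfile releases the GIL during the actual read.
        out_queue.put(np.fromfile(p, dtype=np.float32))
    out_queue.put(None)   # sentinel marks the end of the epoch

def iterate_samples(paths, depth: int = 8):
    q: queue.Queue = queue.Queue(maxsize=depth)
    threading.Thread(target=prefetch_worker, args=(paths, q), daemon=True).start()
    while (sample := q.get()) is not None:
        yield sample      # training consumes samples while the thread reads ahead
```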

Full iteration CUDA graph capture

Compared to MLPerf HPC v1.0, the scope of CUDA graph capture was extended to the full iteration: forward and backward pass, optimizer, and learning rate scheduler step. For this purpose, the sync-free FusedMixedPrecisionLAMB and DistributedLAMB optimizers from the NVIDIA APEX package were employed for the weak-scaling and strong-scaling benchmarks, respectively. 

Additionally, all DeepCAM learning rate schedulers were ported to GPU. By increasing the fraction of the computation that is executed inside the CUDA graph, performance variability across devices that stems from CPU execution variability is reduced. Scale-out performance improves as a result.
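The sketch below captures a full iteration (forward, backward, and optimizer step) with torch.cuda.graph, using a capturable stock Adam optimizer as a stand-in for the sync-free LAMB variants named above; the model, shapes, and data are placeholders.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
loss_fn = torch.nn.MSELoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, capturable=True)

static_x = torch.randn(64, 512, device="cuda")
static_y = torch.randn(64, 512, device="cuda")
loader = [(torch.randn(64, 512), torch.randn(64, 512)) for _ in range(10)]

# Warm up on a side stream before capture (initializes cuBLAS/cuDNN, etc.).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        loss_fn(model(static_x), static_y).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture forward + backward + optimizer step as one graph.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = loss_fn(model(static_x), static_y)
    static_loss.backward()
    opt.step()

for x, y in loader:
    static_x.copy_(x)     # refill the static input buffers
    static_y.copy_(y)
    g.replay()            # replays the whole captured iteration
```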

Distributed optimizer

For improving strong scaling performance, the DistributedLAMB optimizer was used. This optimizer is especially suited for small per-GPU local batch sizes and large scales, since optimizer cost is more pronounced in such settings. The performance gain at scale is about 3% end-to-end for DeepCAM. 

cuDNN kernel optimizations

DeepCAM features a large number of compute kernels with different performance characteristics. NVIDIA improved the performance of grouped convolutions in v1.0; in v2.0, the performance of pointwise convolutions was also improved. The two are used together to form depthwise-separable convolutions. 

MLPerf HPC v2.0 final results 

AI is changing how science is done with high performance computing. Each year, new and more accurate surrogate models are built and shown to vastly outpace physics-based simulations while remaining accurate enough to be useful. Protein folding has been revolutionized by this AI-based approach, with OpenFold, RoseTTAFold, and AlphaFold 2 bringing protein structure-based drug discovery within reach.

MLPerf HPC reflects the supercomputing industry’s need for an objective, peer-reviewed method to measure and compare AI training performance for use cases relevant to HPC. 

NVIDIA has made significant progress since the MLPerf HPC v1.0 submission in 2021. The Selene supercomputer shows that the NVIDIA A100 Tensor Core GPU and the NVIDIA DGX A100 SuperPOD, though nearly three years old, still deliver leading AI training performance for HPC use cases and beyond.

For more information, see MLPerf HPC Benchmarks Show the Power of HPC+AI.


Leading MLPerf Training 2.1 with Full Stack Optimizations for AI

As AI becomes increasingly capable and pervasive, MLPerf benchmarks, developed by MLCommons, have emerged as an invaluable tool for organizations to evaluate the performance of AI infrastructure across a wide range of popular AI-based workloads.

MLPerf Training v2.1—the seventh iteration of this AI training-focused benchmark suite—tested performance across a breadth of popular AI use cases, including the following:

  • Image classification
  • Object detection
  • Medical imaging
  • Speech recognition
  • Natural language processing
  • Recommendation
  • Reinforcement learning

Many AI applications take advantage of multiple AI models deployed in a pipeline. This means that it is critical for an AI platform to be able to run the full range of models available today as well as provide both the performance and flexibility to support new model innovations.

The NVIDIA AI platform submitted results on all workloads in this round and it continues to be the only platform to have submitted results on all MLPerf Training workloads.

Figure 1. Real-world AI example: a user asks their phone to identify the type of flower in an image, a task that may use many AI models across several domains, including audio, vision, recommendation, and text-to-speech

NVIDIA Hopper delivers a big performance boost

In this round, NVIDIA submitted its first MLPerf Training results using the new H100 Tensor Core GPU, demonstrating up to 6.7x higher performance compared to the first A100 Tensor Core GPU submission and up to 2.6x more performance compared to the latest A100 results.

Figure 2. Performance improvements of the latest H100 and A100 submissions compared to the first A100 submission on each MLPerf Training workload

ResNet-50 v1.5: 8x NVIDIA 0.7-18, 8x NVIDIA 2.1-2060,  8x NVIDIA 2.1-2091 | BERT:  8x NVIDIA 0.7-19, 8x NVIDIA 2.1-2062, 8x NVIDIA 2.1-2091 | DLRM: 8x NVIDIA 0.7-17, 8x NVIDIA 2.1-2059, 8x NVIDIA 2.1-2091 | Mask R-CNN: 8x NVIDIA 0.7-19, 8x NVIDIA 2.1-2062, 8x NVIDIA 2.1-2091 | RetinaNet: 8x NVIDIA 2.0-2091, 8x NVIDIA 2.1-2061, 8x NVIDIA 2.1-2091 | RNN-T: 8x NVIDIA 1.0-1060, 8x NVIDIA 2.1-2061, 8x NVIDIA 2.1-2091 | Mini Go: 8x NVIDIA 0.7-20, 8x NVIDIA 2.1-2063, 8x NVIDIA 2.1-2091 | 3D U-Net: 8x NVIDIA 1.0-1059, 8x NVIDIA 2.1-2060, 8x NVIDIA 2.1-2091
First NVIDIA A100 Tensor Core GPU results normalized for throughput due to higher accuracy requirements introduced in MLPerf Training 2.0 where applicable. 
The MLPerf name and logo are trademarks. For more information, see www.mlperf.org.

In addition, in its fifth MLPerf Training round, the A100 continued to deliver excellent performance across the full suite of workloads, delivering up to 2.5x more performance compared to its first submission as a result of extensive software optimizations.

This post offers a closer look at the work done by NVIDIA to deliver these results.

BERT

For this round of MLPerf, several optimizations were implemented for our BERT submission, including the use of the FP8 format and optimizations for FP8 operations, reduced CPU overhead, and sequence packing for small scales.

Integration with NVIDIA Transformer Engine

One of the key optimizations employed in our BERT submission in MLPerf Training v2.1 was the use of the NVIDIA Transformer Engine library. The library accelerates transformer models on NVIDIA GPUs and takes advantage of the FP8 data format supported by the NVIDIA Hopper fourth-generation Tensor Cores.

In BERT, FP8 inputs were used for the fully connected layers, as well as for the fused multihead attention kernel that implements multihead attention in a single kernel. Using the FP8 format improves memory access times by reducing the amount of data transferred between memory and streaming multiprocessors (SMs) compared to the FP16 format.

Using the FP8 format for the inputs of matrix multiplications also takes advantage of the higher computational rates of FP8 format compared to the FP16 format on NVIDIA Hopper architecture GPUs. By taking advantage of the FP8 format, Transformer Engine accelerates the end-to-end training time by 37% compared to not using the Transformer Engine on the same hardware.

Transformer Engine abstracts away the FP8 tensor type from the user. As a result, the tensor format at the input and output of the encoder layers remains as FP16. The details of FP8 usage are handled by the Transformer Engine library inside the encoder layer.

Both E4M3 and E5M2 formats are employed for FP8, referred to as a hybrid recipe in Transformer Engine. For more information about FP8 format and recipes, see Using FP8 with Transformer Engine.
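A minimal sketch of the usage pattern follows, assuming the transformer_engine package and an FP8-capable GPU; the layer sizes are arbitrary, and the hybrid E4M3/E5M2 recipe mirrors the description above.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# Hybrid recipe: E4M3 for forward tensors, E5M2 for gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(4096, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    # Inputs and outputs stay in higher precision; the FP8 casts and GEMMs
    # happen inside the Transformer Engine layer.
    y = layer(x)
y.sum().backward()
```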

FP8 general matrix multiply layers

The Transformer Engine library features custom fused kernel implementations to accelerate commonly used NLP and data transformation operations.

Figure 3 shows the FP8 implementation of the forward and backward passes for the Linear layer in PyTorch. The inputs to the GEMM layers are converted to FP8 using the Cast+Transpose (C+T) fused kernel provided by the Transformer Engine library and the GEMM outputs are saved in FP16 precision. FP8 GEMM layers result in a 29% improvement in end-to-end training time. In Figure 3, the C+T operations in gray are redundant and are shown for illustrative purposes only. FP8 GEMM layers use the cuBLAS library in the backend of the Transformer Engine library.

Figure 3. FP8 GEMM patterns employed in BERT through Transformer Engine

Higher-efficiency, fused multihead attention for FP8

In this round, we implemented a different version of fused multihead attention that is more efficient for the BERT use case, inspired by the FlashAttention algorithm.

This implementation does not write out the softmax output or dropout mask in the forward pass to be used in the backward pass. Instead, it recomputes the softmax output in the backward pass and uses the random number generator states directly from the forward pass to regenerate the dropout mask in the backward pass.

This approach is much more efficient, particularly when FP8 inputs and outputs are used due to reduced register pressure. It results in an 8% improvement in end-to-end time-to-train.

Minimize overhead using dataset packing

Previously, for small scales, we used an unpadding strategy to minimize the overhead that stems from varying sequence lengths and additional padding.

An alternative approach is to pack the sequences in a way such that they almost completely fill the batch matrix, making the additional padding negligible while keeping the buffer sizes static across iterations.

In our latest submission, we used a sequence packing algorithm to preprocess training data for small and medium-scale (64 GPUs or less) NVIDIA Hopper submissions. This is similar to the technique employed in previous rounds for larger scales with 1,024 GPUs and more.
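A simplified first-fit packing sketch is shown below; the real preprocessing differs in details such as the maximum number of sequences per pack and how attention masks are built.

```python
from typing import List

def pack_sequences(seq_lengths: List[int], max_len: int = 512) -> List[List[int]]:
    """Greedy first-fit packing of sequence indices into fixed-size packs."""
    order = sorted(range(len(seq_lengths)), key=seq_lengths.__getitem__, reverse=True)
    packs: List[List[int]] = []
    free: List[int] = []                  # remaining space in each pack
    for idx in order:
        length = seq_lengths[idx]
        for p in range(len(packs)):       # first pack with enough room
            if length <= free[p]:
                packs[p].append(idx)
                free[p] -= length
                break
        else:
            packs.append([idx])           # open a new pack
            free.append(max_len - length)
    return packs

# Seven sequences fit into four 512-token packs instead of seven padded rows.
packs = pack_sequences([480, 60, 200, 310, 128, 500, 20], max_len=512)
```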

Overlap CPU preprocessing with GPU operations to improve training time

Each training step in BERT involves preprocessing the input sequences (also known as mini-batches) on the CPU before copying them to the GPU.

In this round, an optimization was introduced to pipeline the forward pass execution of the current mini-batch with preprocessing of the next mini-batch. This optimization reduces idle GPU time, which is especially important as GPU execution gets faster. It resulted in a 2% improvement in end-to-end training time.

Figure 4. Pipelining of data preprocessing and forward pass
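The sketch below illustrates the general pipelining pattern with a background preprocessing thread and pinned-memory, non-blocking host-to-device copies; the preprocessing function, model, and loader are placeholders rather than the BERT input pipeline.

```python
import queue
import threading
import torch

def preprocess(raw):
    # CPU-side stub: tokenize/pad, then pin so the H2D copy can be async.
    return torch.as_tensor(raw).pin_memory()

def prefetcher(loader, q: queue.Queue) -> None:
    for raw in loader:
        q.put(preprocess(raw))      # preprocessing of the next batch happens here
    q.put(None)

def train(model, loader):
    q: queue.Queue = queue.Queue(maxsize=2)
    threading.Thread(target=prefetcher, args=(loader, q), daemon=True).start()
    batch = q.get()
    while batch is not None:
        gpu_batch = batch.cuda(non_blocking=True)   # async copy from pinned memory
        out = model(gpu_batch)                      # forward pass on the GPU
        batch = q.get()                             # next batch is already prepared
        out.sum().backward()
```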

New hyperparameter and batch sizes optimized for H100

With the new H100 Tensor Core GPUs based on the NVIDIA Hopper architecture, throughput scales extremely well with growing local batch sizes. As a result, we increased per-accelerator batch sizes and optimized the training hyperparameters accordingly.

ResNet-50

In this round of MLPerf, we extended the fusion of convolution and memory-bound operations beyond epilog fusions and improved the performance of the pooling operation.

Conv-BN fprop Fusion

The ResNet-50 model consists of a Conv->BN->ReLU->Conv->BN->ReLU pattern, which leaves Tensor Cores idle while the memory-bound normalization layers execute.

In MLPerf Training 2.1, BatchNorm was split into BatchNorm Stats calculation and BatchNorm Apply.

The programmability of NVIDIA GPUs enabled us to fuse the stats calculation in the epilog of the previous convolution, and fuse Apply in the mainloop of the next convolution.

For the weight gradient calculation, however, this means the input must be recomputed by fusing BatchNorm Apply and ReLU into the wgrad kernel. With new high-performance kernels in cuDNN, this feature yielded a 4.2% end-to-end speedup at small scales.

Faster pooling operations

The ResNet-50 model employs maxPool and AvgPool operations in the stem and classifier blocks. By using the new graph API in cuDNN and taking advantage of the higher DRAM bandwidth in the NVIDIA H100 Tensor Core GPU, we sped up the pooling operations by over 3x. This resulted in a speedup of over 3% in MLPerf Training v2.1.

RetinaNet

The major optimization for RetinaNet in this round of MLPerf was improving score computation in the NVCOCO library to remove CPU bottlenecks for the max-scale submission. Additional optimizations include new fusions, extending the reach of CUDA Graphs for reducing CPU overheads, and using the DALI library to improve data preprocessing in the evaluation phase.  

NVCOCO: Accelerating scoring

As GPU execution gets faster, the portions of the code that execute on the CPU can bottleneck performance. This is especially true at large scales where the amount of work executed per GPU is smaller.

Currently, the mAP metric computation during the evaluation phase runs on the CPU, and it was the performance bottleneck in our previous max-scale submission.

In this MLPerf round, we optimized this evaluation computation to eliminate CPU bottlenecks and enable our GPU optimizations to shine. This has particularly helped with the max-scale submission.

The C++ extensions in NVIDIA cocoapi were also further optimized. For mAP metric computation, performance was improved by 3x in this round and by 20x overall compared to the original cocoapi implementation. These optimizations are mostly focused on file I/O, memory access, and load balancing.

We replaced pybind11 with native CPython PyModules as the interface between Python and C++. By reading JSON files directly on the C++ side and interacting with the CPython pointers of NumPy objects, we eliminated deep copies that might have existed before.

Also, loop transformations, such as loop fusion and loop reordering, have significantly improved cache locality and memory access efficiency for multithreading.

We added more parallel regions in OpenMP to exploit additional parallelism and adjusted tasking schedules to better load balance across the threads.

These optimizations in the metric computation have overall resulted in a ~60% end-to-end performance improvement at the 160-node scale. Removing CPU bottlenecks from the critical path also enabled us to increase our maximum scale to 256 nodes from 160 nodes. This yields an additional ~30% reduction in the total time-to-train, despite an increase in the number of epochs required to achieve the target accuracy.

The total end-to-end speedup from the COCO optimization is 2.3x. 

Extending CUDA graphs: Sync-free Adam optimizer

CUDA Graphs provides a mechanism to launch multiple GPU kernels without CPU intervention, mitigating CPU overheads.

In MLPerf Training v2.0, CUDA Graphs was used extensively in our RetinaNet submission. However, gradient scaling and the Adam optimizer step were left out of the region that was graph-captured, due to CPU-GPU synchronization in the optimizer implementation.

In this MLPerf Training v2.1 submission, the Adam optimizer was modified to achieve a sync-free operation. This enabled us to further extend the reach of CUDA graphs and reduce CPU overhead.

Additional cuDNN runtime fusion

In addition to the conv-bias-relu fusion used in the previous MLPerf submission, a conv-scale-bias-relu fusion was employed in the RetinaNet backbone by using cuDNN runtime fusion. This enabled us to avoid kernel launch latency and data movements, resulting in a 1.5% end-to-end speedup.

Using NVIDIA DALI during evaluation

The significant speedups achieved in the training passes have resulted in an increase in the proportion of the time spent on the evaluation stages.

NVIDIA DALI was previously employed during training but not during evaluation. To address the relatively slow evaluation iteration times, we used DALI to efficiently load and preprocess data.

Mask R-CNN

In this round of MLPerf, beyond improving the parallelization of different blocks of Mask R-CNN, we enabled the use of new kernel fusions and reduced the CPU overhead in training iterations.

Faster JSON interpreter

Switching from ujson to orjson reduced the loading time of the COCO 2017 annotations file by approximately 1.5 seconds.

Faster evaluation and NVCOCO optimizations

The NVCOCO improvements explained for RetinaNet reduce evaluation time by approximately 2 seconds per epoch. Because only the last evaluation is exposed in the end-to-end time, this reduces the end-to-end time of all Mask R-CNN configurations by about 2 seconds.

The optimized NVCOCO library is a drop-in replacement, making the optimizations directly available to end users.

Vectorized batched ROI Align

Region of Interest (ROI) Align performs bilinear interpolation, which requires a fair bit of math work. As this work is the same for all channels, vectorizing across the channel dimension reduced the amount of work needed by about 4x.

The way launch configurations are calculated was also changed to avoid launching more CUDA threads than needed.

Combined, these efforts improved performance for ROI Align forward propagation by about 5x.

Exposing more parallelism in model code

Mask R-CNN, like most models, incorporates many sections of code that can be executed in parallel. For example, mask-head loss calculation involves calculating a loss for multiple proposals, where each proposal loss can be calculated independently.

We achieved a 3-5% speedup by identifying such sections that can be parallelized and placing them on separate CUDA streams.
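The toy helper below shows the general pattern of branching independent computations onto side CUDA streams and rejoining afterward; the per-proposal loss functions are placeholders for the actual Mask R-CNN heads.

```python
import torch

def parallel_losses(loss_fns, args_list):
    """Run independent loss computations on separate CUDA streams."""
    main = torch.cuda.current_stream()
    streams = [torch.cuda.Stream() for _ in loss_fns]
    results = []
    for fn, args, s in zip(loss_fns, args_list, streams):
        s.wait_stream(main)               # inputs are ready before branching off
        with torch.cuda.stream(s):
            results.append(fn(*args))
    for r, s in zip(results, streams):
        main.wait_stream(s)               # rejoin before the results are used
        r.record_stream(main)             # keep the caching allocator stream-aware
    return sum(results)
```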

Removing more CPU-GPU syncs

In sections of the code where CUDA Graphs is not employed, the GPU kernels are launched from the CPU code. CPU code primarily performs bookkeeping tasks, such as managing memory, keeping track of pointers and indices, and so on.

If the CPU code is not fast enough, the GPU kernel finishes and sits idle before the next kernel is launched. Improving CPU performance to the point where the CPU portion of the code runs faster than the GPU portion is critical to maximizing training performance. This requires some amount of CPU run-ahead.

CPU-GPU synchronizations prevent this because they keep the CPU idle until the current GPU work completes, so removing CPU-GPU synchronizations is also critical for training performance.
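A common example of such a hidden synchronization, sketched below with a stand-in loss source, is calling .item() on the loss every iteration; accumulating on the GPU and reading back only occasionally keeps the CPU running ahead.

```python
import torch

def loss_stream():
    # Stand-in for per-iteration training losses produced on the GPU.
    for _ in range(1000):
        yield torch.rand((), device="cuda")

running_loss = torch.zeros((), device="cuda")
for step, loss in enumerate(loss_stream()):
    running_loss += loss.detach()         # stays on the GPU, no synchronization
    if step % 100 == 99:
        # One synchronization every 100 steps instead of one per step.
        print(f"avg loss: {(running_loss / 100).item():.4f}")
        running_loss.zero_()
```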

We have done this in past rounds for submissions using the NVIDIA A100 Tensor Core GPU. However, the significant performance increase provided by the NVIDIA H100 Tensor Core GPU necessitated that more of these CPU-GPU synchronizations be removed.

This had a small impact on the NVIDIA A100 Tensor Core GPU results, as CPU overhead is not as pronounced for that GPU on Mask R-CNN. However, it improved performance on the H100 Tensor Core GPU by 25-30% for the 32-GPU configuration.

Runtime fusions in RPN and FPN

In previous rounds, we sped up the code by using the cudnn v8 API to perform runtime fusions for the ResNet-50 backbone of Mask R-CNN.

In this round, the work done for RetinaNet was leveraged to extend runtime fusions to the RPN and FPN modules, further improving end-to-end performance by about 2%.

Boosting AI performance by 6.7x

The NVIDIA H100 GPU based on the NVIDIA Hopper architecture delivers the next large performance leap for the NVIDIA AI platform. It boosts performance by up to 6.7x compared to the first submission using the A100 GPU.

With software improvements alone, the A100 GPU demonstrated up to 2.5x more performance in this latest round compared to its debut submission, showcasing the continuous full-stack innovation delivered by the NVIDIA AI platform.

All software used for NVIDIA MLPerf submissions is available from the MLPerf repository, enabling you to reproduce our benchmark results. We constantly incorporate these cutting-edge MLPerf improvements to our deep learning framework containers. These containers are available on NGC, our software hub for GPU-optimized applications.


HORN Free! Roaming Rhinos Could Be Guarded by AI Drones

Call it the ultimate example of a job that’s sometimes best done remotely. Wildlife researchers say rhinos are magnificent beasts, but they like to be left alone, especially when they’re with their young. In the latest example of how researchers are using the latest technologies to track animals less invasively, a team of researchers has…



New Volvo EX90 SUV Heralds AI Era for Swedish Automaker, Built on NVIDIA DRIVE

It’s a new age for safety. Volvo Cars unveiled the Volvo EX90 SUV today in Stockholm, marking the beginning of a new era of electrification, technology and safety for the automaker. The flagship vehicle is redesigned from tip to tail — with a new powertrain, branding and software-defined AI compute — powered by the centralized…



3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’

The warm, friendly animation Mushroom Spirit is featured In the NVIDIA Studio this week, modeled by talented 3D illustrator Julie Greenberg, aka Juliestrator.



Open Source Simulation Expands with NVIDIA PhysX 5 Release

NVIDIA today released the latest version of the NVIDIA PhysX 5 SDK under the same open source license terms as PhysX 4 to help expand simulation workflows and applications across global industries. We are pleased to release this much-anticipated update on the NVIDIA-Omniverse/PhysX GitHub repository. 

A longtime GameWorks technology, PhysX has become the primary physics engine and a key foundational technology pillar of NVIDIA Omniverse. It is a powerful simulation engine currently used by industry leaders for robotics, deep reinforcement learning, autonomous driving, factory automation, and visual effects. For next-generation robotics applications, it will enable high fidelity simulations at real-time speeds that are needed for simulating and testing autonomous machines.

“Having a powerful, open-source tool for physics like NVIDIA’s new PhysX 5 library is a critical part of the realism delivered by the Open 3D Engine,” said Royal O’Brien, Executive Director at the Open 3D Foundation and General Manager of Digital Media and Games at the Linux Foundation.

“As PhysX use cases spread to other important 3D domains like simulation and digital twins, we are excited to see NVIDIA working with open source, allowing everyone to harness the innovation and collaboration that these communities can bring,” O’Brien said.

PhysX has become a key reference implementation of the similarly open source Pixar Universal Scene Description (USD) Physics standard available at PixarAnimationStudios/USD on GitHub. This informed the decision to return to the more permissive licensing terms used for PhysX 4. All CPU source code is available under the simple BSD3 open source license, and NVIDIA GPU binaries are included at no cost.

“This release of the PhysX SDK goes hand in hand with USD Physics, a description of a scene’s physical properties that was co-developed with Pixar,” said Dave Eberle, Tools-Sim Lead at Pixar. “Pixar’s ongoing USD collaboration with NVIDIA and other parties is aimed at enabling creators to imbue physics into their scenes with more ease, and we are excited that the open sourcing of the SDK will accelerate the adoption of simulation behaviors in more creative tools.”

What’s new in PhysX 5 open source

The NVIDIA Flow and NVIDIA Blast libraries, while technically not dependent on PhysX, are now part of the PhysX product family and licensed together. Flow is available now in the same GitHub repo, with Blast to follow.

PhysX 5 SDK now supports the capabilities of NVIDIA Flex, which enables various new features. These features include finite element model-based soft body dynamics as well as liquid, cloth, and inflatable objects using position-based dynamics, optimized to run on GPUs. A signed distance field collision feature on GPU has also been added, which allows the user to perform collision detection using a voxelized version of the source mesh, eliminating the need to create a convex decomposition.

Video 1. An NVIDIA Flow dust emitter moving around a scene in Omniverse Create

In terms of new CPU features, PhysX 5 users can now define custom geometries, meaning cylinder shapes or implicit block-based worlds can now be supported. Both CPU and GPU parallel computing performance for large simulations has been significantly improved.

The evolved role of PhysX also brings some fundamental technical changes. Formerly a game physics engine with optimized ports available for a broad range of video game consoles, PhysX is now a high-fidelity GPU-accelerated physics simulation engine used in robotics, deep reinforcement learning, autonomous driving, factory automation, and visual effects, just to name a few.  As a result, video game console ports are no longer available from NVIDIA, though given our permissive licensing, the community is now able to create and maintain ports to such platforms.

Video 2. A digital twin of a kinetic sculpture simulated using gears and cams modeled with PhysX 5

As part of the update, some of the tools and utilities such as digital content creation tool exporters, debugging telemetry and diagnostics, demos, and samples have now been merged into the Omniverse platform.

Advanced demos are no longer bundled with the SDK. Visit the physics demos in NVIDIA Omniverse at NVIDIA On-Demand for more advanced examples of what is possible with PhysX. NVIDIA Omniverse is also where you should look for any content creation tools. NVIDIA is investing in creating the best possible physics toolset in Omniverse, which will continue to evolve and improve.  

The future of PhysX

NVIDIA continues to embrace open source in support of building an inclusive ecosystem. This is a first step in the process of opening up more and more Omniverse source code. As you browse through the source, you might come across some files that have existed as far back as 2001 and can still be used today. 

“PhysX is essential in making video game worlds feel more realistic and believable, not to mention fun. We are excited to see NVIDIA going open source with the latest version,” said Mika Vehkala, Director of Technology at Remedy.

In the near future, watch for source code releases showing how to build a user-modified version of this PhysX SDK into a custom Omniverse extension. NVIDIA also plans to have a full reference implementation of a USD Physics parser and simulation stack available with full source. 

You can access the open source code by visiting the NVIDIA-Omniverse/PhysX GitHub repository, which also includes the NVIDIA Flow library. Watch the latest tutorials on PhysX at NVIDIA On-Demand.

Visit the Omniverse Developer Resource Center and the USD page for additional resources, view the latest tutorials on Omniverse, and check out the forums for support. Join the Omniverse community, Discord server, and Twitch Channel to chat with the community, and subscribe to get the latest Omniverse news.

Follow NVIDIA Omniverse on Instagram, Twitter, YouTube, and Medium for additional resources and inspiration.


Anyone Can Build Metaverse Applications With New Beta Release of Omniverse

The new beta release of NVIDIA Omniverse is now available with major updates to core reference applications and tools for developers, creators, and novices looking to build metaverse applications.

Each of the core components of the Omniverse platform has been updated to make the platform even faster, more accessible, and more flexible for collaborative workflows across applications. These updates empower developers of any background to easily build their custom applications, connections, and extensions anywhere. Learn more about how to develop on NVIDIA Omniverse.

Powered by support for new NVIDIA Ada Generation GPUs and advances in NVIDIA simulation technology, this new beta release focuses on maximizing ease of ingesting large, complex scenes from multiple third-party applications, and maximizing real-time rendering, path tracing, and physics simulation.

Figure 1. The five core components of NVIDIA Omniverse: Nucleus, Connect, Kit, Simulation, and RTX Renderer

Nucleus, the central database and collaboration engine of NVIDIA Omniverse, now enables faster live collaboration and copying between servers. Nucleus Navigator 3.2 makes it possible to move files and folders seamlessly between servers located on-premises and in the cloud. It also adds enhanced search functionality to quickly retrieve images, objects, and other assets. OmniObjects with Omniverse Live 2.0 allows faster collaboration between Connectors.

New and updated Connectors for popular apps are available through Omniverse Connect, the libraries that allow you to create Connectors from your favorite apps to the Omniverse platform. The beta release includes new and updated Connectors for PTC Creo, Autodesk Alias, Kitware ParaView, Siemens JT, and Autodesk Maya, among others.

PhysX 5, the flagship tool of Omniverse Simulation, has been open sourced so you can easily modify, build, and distribute your own physics simulation applications. The new version of PhysX comes with exciting new features like support for multiple scenes, collision-triggered audio, and an inspector for robotic applications. Experience Omniverse Simulation by downloading Omniverse and testing technical demos in Omniverse Showroom to see the power of PhysX 5 and real-time RTX Rendering.

New features and capabilities across Omniverse applications are driven by Omniverse Kit 104, which now allows novice or experienced Python and C++ developers to more easily develop, package, and publish their own custom metaverse applications and extensions to accelerate industry-specific workflows.

Connecting to Omniverse with Universal Scene Description

Our software partners are leading the way building useful extensions and Connectors on Omniverse Kit. Some of the more recently published extensions and Connectors include:

  • Updates to Omniverse Connectors for Autodesk 3ds Max, Autodesk Maya, Autodesk Revit, Epic Games’ Unreal Engine, McNeel Rhino, Trimble SketchUp, Graphisoft Archicad, and Kitware’s ParaView
  • New Omniverse Connectors for Autodesk Alias and PTC Creo
  • Reallusion iClone 8.1.0 live sync Connector for seamless interactions between Omniverse apps and iClone 8
  • The OTOY OctaneRender hydra render delegate, which enables Omniverse users to use OctaneRender directly in the Omniverse Create or View viewport
  • The Nextspace digital twin platform extension for normalizing data and geometry to drive the use of AI, analytics, and simulation
  • SmartCow’s Omniverse extension for synthetic data generation of large datasets of license plates for license plate recognition AI

More extensions and Connectors are on the way from companies like Lumirithmic, which is connecting their Hollywood-grade avatar scan provider to Omniverse.

“We’ve been using NVIDIA Omniverse as our primary content delivery engine to serve our enterprise customers,” said Jayanth Kannan, VP of Software Engineering at Lumirithmic. “NVIDIA Omniverse does all the heavy lifting and enables seamless integration of our Avatars with industry standard DCC tools, helping our customers readily use our assets in their commercial projects.”

Move.ai, another partner extending the Omniverse, will soon be publishing an extension to put markerless motion capture in the hands of Omniverse users. 

“We’re excited by the potential to enable users to enhance their creative pipelines with our Move extension, which will allow users of Omniverse to access our free Motion Library,” said Niall Hendry, Head of Partnerships & Delivery at Move.ai. “The Omniverse team has been super responsive, guiding us every step of the way.”

Developers are invited to apply for early access to the new Omniverse Exchange Publishing Portal, which offers a new channel to distribute their custom tools and applications.

A new foundation for developing metaverse tools with Omniverse Kit 104

Omniverse Kit is the SDK on which every Omniverse microservice (like DeepSearch) or reference application (such as Omniverse Create, View, or Isaac Sim) is built. These microservices and reference applications are built as samples for developers to copy and customize.

Most Omniverse development work is exposed in Python workflows. This Omniverse Kit 104 beta release includes a new set of extension templates for C++ developers and technical artists to build extensions using C++. 

Omniverse Kit extension templates contain various example extensions to act as references for developing UI widgets, Universal Scene Description (USD) interactions, and more. These templates remove the need to create extensions from scratch and speed your application development. 

New Omniverse Kit app templates are also now available to make it easier than ever to build advanced 3D tools similar to NVIDIA’s reference applications that leverage core Omniverse technologies like RTX, PhysX, OmniGraph, and USD.

Figure 2. Use the new Omniverse Kit application template to create your own apps leveraging technologies from the Omniverse platform like RTX, PhysX, Nucleus, OmniGraph, and USD

Other key updates in Omniverse Kit include the following:

  • Viewport 2.0 for fully customizable, open workflows 
  • New navigation possibilities for user interfaces in Omni.ui.menu
  • The ability to encapsulate extension features in Actions
  • A centralized API and UI to manage Hotkeys

To learn more about Omniverse Kit 104, see Create Your Own Metaverse Applications with C++ and Python in Omniverse Kit 104. You can also watch the GTC session, How to Build Extensions and Apps for Virtual Worlds with NVIDIA Omniverse on demand.

See Omniverse Kit 104 in action with Omniverse reference applications

Omniverse Code is the integrated development environment (IDE) where developers can take advantage of all the new features of Kit 104. All the latest documentation and samples for building Omniverse applications, extensions, and microservices are integrated in Omniverse Code, making it easy for developers of all backgrounds to learn to develop and use Kit extensions. Omniverse Code makes it easier than ever to leverage Omniverse’s extensibility so that non-traditional developers can quickly build tools and applications to make their workflows more efficient and personalized.

The Omniverse Create application has been updated as part of the beta release with animation improvements and better capabilities for large world authoring. Creators can collaborate more seamlessly on large worlds with layer-based live workflows and Viewport icons showing locations of other users in a scene. 

This release also supports the new DLSS 3 included in the Ada Generation GeForce RTX and NVIDIA RTX GPUs, enabling massive improvements in performance and quality in the RTX renderer by generating additional high-quality frames in real time. 

You can also use many new PhysX extensions in Omniverse Create, including PhysX Authoring Toolbar and Signed Distance Field (SDF) Colliders.

  • PhysX Authoring Toolbar – A simple authoring toolbar to make all your content behave correctly in a simulated environment.
  • SDF Colliders – SDF-based collision detection can now be used for physics objects, enabling direct real-time simulation of gears and cams.

This year, Omniverse Create has launched over 300 extensions built in Kit, including the following:

  • ActionGraph – A special type of OmniGraph in Create, allows you to create event-driven behaviors and logic inside scenes with node-based visual programming.
  • Omni.ui.scene – An extension in Omni.ui that allows you to build interactable UI for widgets and manipulators directly inside the viewport or 3D environment.
  • DeepSearch – An AI-powered microservice that enables instant natural language or 2D image-based search into Omniverse Nucleus’s asset database to retrieve images, objects, or other assets.
Figure 3. Use Action Graph to add event-driven behaviors to an asset. For the car shown, you can open/close doors, raise/lower the spoiler, and change paint colors

“For architectural design/visualization workloads, normally we use software out of the box, but you can run into limitations with those out of the box implementations,” said Eric Craft, XR & Visualization Program Manager at architectural firm Mead & Hunt. “NVIDIA’s Omniverse development platform gives me the ability to easily tweak and customize their tools, so I can build a more efficient, more effective toolkit for our company.” 

“Since it’s based on USD,” Craft added, “the platform interconnects with other popular industry tools, which means I can build a custom Omniverse tool in one place but use it across our multi-app workflows. And because of the USD layer-based workflow, changes made in Omniverse persist even when the design export is updated.” 

Audio2Gesture, an AI-powered tool that creates realistic body gestures based on an audio file, is now available in Omniverse Machinima.

Omniverse View, a simple review and approval app, now features a focused, collaborative review and markup experience. 

NVIDIA Omniverse Replicator, an SDK for generating 3D synthetic data for AI and simulation workflows, is now available as a container for easy deployment on your preferred Cloud Service Provider (CSP). AWS users can leverage the Omniverse GPU-Optimized AMI available on AWS marketplace and deploy the replicator container seamlessly on an EC2 instance. 

Get started with NVIDIA Omniverse

With a new set of diverse tools and updated applications now available in Omniverse, there has never been a better time to get started. Download the Omniverse free license for individuals to start building with the beta release of Omniverse. 

The Omniverse team is eager to hear your feedback about the beta release and actively looking for input in our Omniverse forums to improve the experience for individual users. Join our community Omniverse livestream on Wednesday, November 9 to learn more about the beta release of Omniverse and get ideas for how to take advantage of the new features.

Subscribe to the Omniverse newsletter to receive updates about Omniverse Enterprise. Follow us on Instagram, Twitter, YouTube, and Medium to stay up to date with the next releases and use cases.

Visit the Omniverse Developer Resource Center and the USD page for additional resources, view the latest tutorials on Omniverse, and check out the forums for support. Join the Omniverse community, Discord server, and Twitch Channel to chat with the community, and subscribe to get the latest Omniverse news.


Enabling Enterprise AI Transformations for Telcos with NVIDIA and VMware

AI has the power to transform every industry, but transformation takes time, and it’s rarely easy. For enterprises across industries to be as successful as possible in their own transformations, they need access to AI-ready technology platforms. They also must be able to use 5G connectivity at the edge to harness valuable data and inform their AI and ML models.

The advantages of 5G, such as lower latency and improved mobility as well as data throughput, also increase the application footprint of AI/ML applications within enterprises. According to an analysis by Verified Market Research, the market size for enterprise AI is projected to hit over $88 billion by 2030, up from $7 billion in 2022.

The future lives on the edge with AI, and any business that doesn’t stake its claim now risks falling behind. Industries across the spectrum are only beginning to unlock the tremendous value of AI when it is operationalized:

  • Banks are looking to understand the behavior of customers using AI-powered mobile apps to customize the experience and provide personalized service.
  • Manufacturers are beginning to use real-time data to prevent issues and challenges in their processes proactively, lowering maintenance costs and optimizing their operations.
  • Educators are using AI learning platforms to give students uninterrupted access to lessons from any device or location.

One of the most powerful of these potential applications combines AI and visual computing to begin pushing the boundaries of the metaverse, the 3D evolution of the internet.

By analyzing the unending stream of data generated by connected devices, it is possible to generate a digital twin of anything from a car to the factory in which the car is built. These digital twins are virtual simulations of the physical objects that can be manipulated and altered using AI. Digital twins can simulate how these objects would behave before applying changes in the real world.

Across all these use cases, a common factor is the need to combine AI-ready, digital compute platforms with 5G connectivity at the edge to best leverage the data that increasingly resides there.

As the telecommunications industry has tackled the evolution to 5G connectivity, it has also recognized the value of helping other industries embrace AI to transform their own businesses.

For enterprises that want to realize the immense value of AI but lack the necessary IT infrastructure, telcos represent the best option to provide a managed offering to deliver these services. Telcos are uniquely positioned to take the core connectivity services they’ve perfected and combine them with AI-ready infrastructure to provide enterprises with a managed, connected, and end-to-end AI offering.

For telcos looking to offer new B2B services outside of their core connectivity services, this represents an enormous opportunity to grow revenue and increase profitability.

Most telcos today are not necessarily experts in IT infrastructure platforms or AI. Thankfully, they don’t have to be.

The AI-Ready Enterprise Platform, which runs VMware Cloud Director, VMware’s cloud service delivery platform, with NVIDIA AI Enterprise, offers telcos a suite of data science tools and frameworks they can use to harness countless AI applications, reduce time to ROI, and avoid the problems posed by unplanned implementations. It helps telcos transform their business and capture the opportunity AI represents.

Challenges of AI implementation

Companies that don’t adopt AI risk being left behind. But even those that embrace its potential find that it requires a high degree of operational effectiveness and cooperation between AI development teams and business stakeholders.

A recent report by Gartner predicted that 85% of AI projects would fail to deliver on their promises, due in part to a lack of internal skills within the implementing enterprise. Initiating an AI trial is one thing; several factors have led enterprises to find that scaling those trials to generate financial impact is beyond their means:

  • A well-trained AI system requires quality data to function; poor data will give bad results.
  • The cost of replacing outdated hardware with AI-based systems can be prohibitive.
  • Gaps in AI enablement for network and device performance monitoring lead to problems in gathering real-time insights.

Without a concerted, platform-based approach, costs can quickly spin out of control and a company’s return on investment is slowed. That’s why many enterprises looking to embrace AI will be on the lookout for a managed solution.

As a telco, if you can overcome these challenges yourself and deliver on that promise, you can set yourself up as the long-term AI and connectivity platform provider for enterprises looking to transform.

AI-Ready Enterprise Platform value

NVIDIA and VMware have partnered to make it as easy as possible for telcos to surmount any potential hurdles and begin offering AI as a service.

By supplying the application frameworks—including SDKs, tools, APIs, and documentation—NVIDIA and VMware enable telcos to become true SaaS players with the AI-Ready Enterprise Platform. NVIDIA GPUs and DPUs enable these new applications while VMware provides a unified, multi-cloud infrastructure for networking, security, and compute services out to the edge. This combination enables operators and enterprises to start from any compute workload and expand to other workloads on the same infrastructure.

AI-Ready Enterprise Platform with VMware Cloud Director and NVIDIA AI Enterprise unlocks the power of AI by delivering an end-to-end enterprise platform optimized for AI workloads. VMware Cloud Director virtualizes the GPU and enables multiple tenants to share and consume the GPU as a service. When telcos implement AI-Ready Enterprise Platform, the full value of AI and ML applications is achievable and can be delivered alongside a telco’s connectivity services as a true managed AI offer.

Ultimately, this sets up telcos to provide end-to-end infrastructure including the connectivity, edge computing, and applications that are key for AI democratization. Enterprises are free to scale without compromise, enabling more complex AI training and data analytics. This can include services like the following:

  • Intelligent video analytics: Helps retailers keep a closer eye on shopper experiences and merchandise loss.
  • Immersive digital twins over 5G VR: Helps teams virtually collaborate on product designs without being limited by location.
  • AI-enabled traffic monitoring systems: Help municipalities take advantage of a telco’s subscriber base to reduce congestion.

Conclusion

As new functions for AI continue to spread across every sector of the economy, late adopters may find themselves at a competitive disadvantage.

Telcos, with their development of 5G edge connectivity and vast troves of consumer data, find themselves on the front lines of this burgeoning frontier. With access to a growing ecosystem of AI applications built around the NVIDIA platform, telcos have a unique opportunity to deliver AI services. They can drive profitability for enterprises in the world’s largest industries, from transportation and healthcare to retail.



Accelerating Load Times for DirectX Games and Apps with GDeflate for DirectStorage


Load times. They are the bane of any developer trying to construct a seamless experience. Trying to hide loading in a game by forcing a player to shimmy through narrow passages or take extremely slow elevators breaks immersion.

Now, developers have a better solution. NVIDIA collaborated with Microsoft and IHV partners to develop GDeflate for DirectStorage 1.1, an open standard for GPU compression. The current Game Ready Driver (version 526.47) contains NVIDIA RTX IO technology, including optimizations for GDeflate.

GDeflate: An Open GPU Compression Standard

GDeflate is a high-performance, scalable, GPU-optimized data compression scheme that can help applications make use of the sheer amount of data throughput available on modern NVMe devices. It makes streaming decompression from such devices practical by eliminating CPU bottlenecks from the overall I/O pipeline. GDeflate also provides bandwidth amplification effects, further improving the effective throughput of the I/O subsystem.

An open source implementation of GDeflate will be released on GitHub under a permissive license for IHVs and ISVs. We want to encourage quick adoption of GDeflate as a data-parallel compression standard across the PC ecosystem and on other platforms.

To show the benefits of GDeflate, we measured system performance without compression, with standard CPU-side decompression, and with GPU-accelerated GDeflate decompression on a representative game-focused dataset, containing texture and geometry data.

A plot depicting the achieved bandwidth over varying staging buffer sizes using no compression, Zlib, a CPU implementation of GDeflate, and the GPU version of GDeflate.
Figure 1. Data throughput of various compressed data formats across varying staging buffer sizes
A plot depicting the processing cycles over varying staging buffer sizes using no compression, Zlib, a CPU implementation of GDeflate, and the GPU version of GDeflate.
Figure 2. Processing cycles of various compressed data formats across varying staging buffer sizes

As you can see from Figures 1 and 2, the data throughput of uncompressed streaming is limited by the system bus bandwidth at about 3 GB/s, which happens to be the limit of a Gen3 PCIe interconnect.

When applying traditional compression with decompression happening on the CPU, the CPU becomes the overall bottleneck, resulting in lower throughput than would otherwise be possible with uncompressed streaming. Not only does this underutilize the available I/O resources of the system, but it also takes CPU cycles away from other tasks.

With GPU-accelerated GDeflate decompression, the system can deliver effective bandwidth well in excess of what’s possible without compression, effectively multiplying data throughput by the compression ratio. The CPU remains fully available for performing other important tasks, maximizing system-level performance.
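As a rough worked example (the 2:1 ratio here is purely illustrative, not a measured GDeflate figure): an NVMe device delivering 7 GB/s of compressed data at a 2:1 compression ratio yields roughly 7 GB/s × 2 = 14 GB/s of effective uncompressed throughput, double what the raw link could supply on its own.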

GDeflate is available as a standard GPU decompression option in DirectStorage 1.1—a modern I/O streaming API from Microsoft. We’re looking forward to next-generation game engines benefiting from GDeflate by dramatically reducing loading times.

Resource streaming and data compression

Today’s video games feature extremely detailed interactive environments, requiring the management of enormous assets. This data must be delivered first to the end user’s system, and then, at runtime, actively streamed to the GPU for processing. The bulk of a game’s content package is made up of resources that naturally target the GPU: textures, materials, and geometry data.

Traditional data compression techniques are applicable to game content that rarely changes. For example, a texture that is authored only one time may have to be loaded multiple times as the player advances through a game level. Such assets are usually compressed when they are packaged for distribution and decompressed on demand when the game is played. It has become standard practice to apply compression to game assets to reduce the size of the downloadable (and its installation footprint).

However, most data compression schemes are designed for CPUs and assume serial execution semantics. In fact, the process of data compression is usually described in fundamentally serial terms: a stream of data is scanned serially for redundancies or repeated patterns, and later occurrences of a pattern are replaced with references to an earlier one. As a result, such algorithms can’t easily scale to data-parallel architectures or accommodate the faster decompression rates demanded by modern game content.

At the same time, recent advances in I/O technology have dramatically improved available I/O bandwidth on the end user system. It’s typical to see a consumer system equipped with a PCIe Gen3 or Gen4 NVMe device, capable of delivering up to 7 GB/s of data bandwidth.

To put this in perspective, at this rate, it is possible to fill the entire 24 GB of frame buffer memory on the high-end NVIDIA GeForce RTX 4090 GPU in a little over 3 seconds!

To keep up with these system-level I/O speed improvements, we need dramatic advances in data compression technology. At these rates, it is no longer practical to use the CPU for data decompression on the end user’s system. That requires an unacceptably large fraction of precious CPU cycles to be spent on this auxiliary task. It may also slow down the entire system.

The CPU shouldn’t become the bottleneck that holds back the I/O subsystem.

Data-parallel decompression and GDeflate architecture

With Moore’s law ending, we can no longer expect to get “free” performance improvements from serial processors.

High-performance systems have long embraced large-scale data parallelism to continue scaling performance for many applications. On the other hand, parallelizing the traditional data compression algorithms has been challenging, due to fundamental serial assumptions “baked” into their design.

What we need is a GPU-friendly data compression approach that can scale performance as GPUs become wider and more parallel.

This is the problem that we set out to address with GDeflate, a novel data-parallel compression scheme optimized for high-throughput GPU decompression. We designed GDeflate with the following goals:

  • Deliver high-performance, GPU-optimized decompression that keeps up with the fastest NVMe devices
  • Offload the CPU so it does not become the bottleneck during I/O operations
  • Stay portable across a variety of data-parallel architectures, including CPUs and GPUs
  • Allow cheap implementation in fixed-function hardware, using existing IP
  • Serve as a data-parallel data compression standard

As you might guess from its name, GDeflate builds upon the well-established RFC 1951 DEFLATE algorithm, expanding and adapting it for data-parallel processing. While more sophisticated compression schemes exist, the simplicity and robustness of the original DEFLATE data coding make it an appealing choice for highly tuned GPU-based implementations.

Existing fixed-function implementations of DEFLATE can also be easily adapted to support GDeflate for improved compatibility and performance.

Two-level parallelism

The GDeflate bitstream is designed to be consumed by a many-core SIMD machine, and it explicitly exposes parallelism at two levels.

First, the original data stream is segmented into 64 KB tiles, which are processed independently. This coarse-grained decomposition provides thread-level parallelism, enabling multiple tiles to be processed concurrently on multiple cores of the target processor. This also enables random access to the compressed data at tile granularity. For example, a streaming engine may request a sparse set of tiles to be decompressed in accordance with the required working set for a given frame.

Also, 64 KB happens to be the standard tile size for tiled or sparse resources in graphics APIs (DirectX and Vulkan), which makes GDeflate compatible with future on-demand streaming architectures leveraging these API features.

Second, the bitstream within tiles is specifically formatted to expose finer-grained, SIMD-level parallelism. We expect that a cooperative group of threads will process individual tiles, as the group can directly parse the GDeflate bitstream using hardware-accelerated data-parallel operations, commonly available on most SIMD architectures.

All threads in the SIMD group share the decompression state. The formatting of the bitstream is carefully constructed to enable highly optimized cooperative processing of compressed data.

This two-level parallelization strategy enables GDeflate implementations to scale easily across a wide range of data-parallel architectures, while providing the headroom needed to support future, even wider data-parallel machines without compromising decompression performance.
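To make the coarse-grained level of this decomposition concrete, the following is a minimal CPU-side sketch. The CompressedTile record and decompressTile routine are hypothetical stand-ins, not part of the GDeflate specification or any shipping API; in a real implementation this work runs on the GPU, with one cooperative SIMD group (the fine-grained level) decoding each tile.

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    // Hypothetical description of one independently decodable 64 KB tile.
    struct CompressedTile {
        const uint8_t* data; // start of this tile's compressed bitstream
        size_t size;         // compressed size of the tile
        size_t dstOffset;    // where the decoded 64 KB lands in the output
    };

    // Stand-in for a real GDeflate tile decoder.
    void decompressTile(const CompressedTile& tile, uint8_t* dst) {
        // A real decoder parses the tile's bitstream cooperatively across a
        // SIMD group; this stub only marks where that work would happen.
        (void)tile;
        (void)dst;
    }

    // Coarse-grained parallelism: tiles are independent, so they can be
    // decoded concurrently, or sparsely on demand by a streaming engine.
    void decompressAllTiles(const std::vector<CompressedTile>& tiles, uint8_t* output) {
        std::vector<std::thread> workers;
        workers.reserve(tiles.size());
        for (const CompressedTile& tile : tiles) {
            workers.emplace_back(decompressTile, std::cref(tile), output + tile.dstOffset);
        }
        for (std::thread& worker : workers) {
            worker.join();
        }
    }

Spawning one thread per tile is only for illustration; a production implementation would map tiles to GPU thread groups or to a fixed-size worker pool.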

NVIDIA RTX IO supports DirectStorage 1.1

NVIDIA RTX IO is now included in the current Game Ready Driver (version 526.47), delivering accelerated decompression throughput.

Both DirectStorage and RTX IO leverage the GDeflate compression standard.

“Microsoft is delighted to partner with NVIDIA to bring the benefits of next-generation I/O to Windows gamers. DirectStorage for Windows will enable games to leverage NVIDIA’s cutting-edge RTX IO and provide game developers with a highly efficient and standard way to get the best possible performance from the GPU and I/O system. With DirectStorage, game sizes are minimized, load times reduced, and virtual worlds are free to become more expansive and detailed, with smooth and seamless streaming.”

Bryan Langley, Group Program Manager for Windows Graphics and Gaming

Getting started with DirectStorage in RTX IO drivers

We have a few more recommendations to help ensure the best possible experience using DirectStorage with GPU decompression on NVIDIA GPUs.

Preparing your application for DirectStorage

Achieving maximum end-to-end throughput with DirectStorage GPU decompression requires enqueuing enough read requests to keep the pipeline fully saturated.

In preparation for DirectStorage integration, applications should group resource I/O and creation requests close together in time. Ideally, resource I/O and creation operations occur in their own CPU thread, separate from threads doing other loading screen activities like shader creation.

Assets on disk should also be packaged together in large enough chunks so that DirectStorage API call frequency is kept to a minimum and CPU costs are minimized. This ensures that enough work can be submitted to DirectStorage to keep the pipeline fully saturated.
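The sketch below shows what this batching might look like in code. It is illustrative only: interface, structure, and enum names follow Microsoft’s documented DirectStorage 1.1 API and should be verified against the current SDK headers, while the AssetChunk layout, resource setup, and error handling are hypothetical and omitted. A loading thread enqueues every GDeflate-compressed read in the batch and then issues a single Submit.

    #include <dstorage.h>
    #include <wrl/client.h>
    #include <cstdint>
    #include <vector>

    using Microsoft::WRL::ComPtr;

    // Hypothetical description of one compressed chunk in a packed asset file.
    struct AssetChunk {
        uint64_t fileOffset;       // where the compressed chunk starts on disk
        uint32_t compressedSize;   // size of the compressed chunk
        uint32_t uncompressedSize; // size after GDeflate decompression
        uint64_t destOffset;       // destination offset in the GPU buffer
    };

    void LoadAssetBatch(IDStorageFactory* factory, ID3D12Device* device,
                        IDStorageFile* file, ID3D12Resource* destBuffer,
                        const std::vector<AssetChunk>& chunks,
                        ID3D12Fence* fence, uint64_t fenceValue) {
        // One queue sourcing from files; capacity sized for large batches.
        DSTORAGE_QUEUE_DESC queueDesc{};
        queueDesc.SourceType = DSTORAGE_REQUEST_SOURCE_FILE;
        queueDesc.Capacity   = DSTORAGE_MAX_QUEUE_CAPACITY;
        queueDesc.Priority   = DSTORAGE_PRIORITY_NORMAL;
        queueDesc.Device     = device;

        ComPtr<IDStorageQueue> queue;
        factory->CreateQueue(&queueDesc, IID_PPV_ARGS(&queue));

        // Enqueue every read before submitting so the pipeline stays saturated.
        for (const AssetChunk& chunk : chunks) {
            DSTORAGE_REQUEST request{};
            request.Options.SourceType        = DSTORAGE_REQUEST_SOURCE_FILE;
            request.Options.DestinationType   = DSTORAGE_REQUEST_DESTINATION_BUFFER;
            request.Options.CompressionFormat = DSTORAGE_COMPRESSION_FORMAT_GDEFLATE;
            request.Source.File.Source          = file;
            request.Source.File.Offset          = chunk.fileOffset;
            request.Source.File.Size            = chunk.compressedSize;
            request.UncompressedSize            = chunk.uncompressedSize;
            request.Destination.Buffer.Resource = destBuffer;
            request.Destination.Buffer.Offset   = chunk.destOffset;
            request.Destination.Buffer.Size     = chunk.uncompressedSize;
            queue->EnqueueRequest(&request);
        }

        // One signal and one submit cover the whole batch; wait on the fence
        // before using the decompressed data.
        queue->EnqueueSignal(fence, fenceValue);
        queue->Submit();
    }

Keeping requests grouped like this on a dedicated loading thread minimizes per-call CPU overhead and gives DirectStorage enough in-flight work to hide NVMe and decompression latency.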

For more information about general best practices, see Using DirectStorage and the DirectStorage 1.1 Now Available Microsoft post.

Deciding the staging buffer size

  • Make sure to change the default staging buffer size whenever GPU decompression is used. The current 32 MB default isn’t sufficient to saturate modern GPU capabilities.
  • Make sure to benchmark different platforms with varying NVMe, PCIe, and GPU capabilities when deciding on the staging buffer size. We found that a 128 MB staging buffer is a reasonable default; smaller GPUs may require less and larger GPUs may require more. See the snippet after this list.
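The snippet below shows the corresponding one-line change as a sketch. The 128 MB value is simply this post’s suggested starting point, and IDStorageFactory::SetStagingBufferSize takes a size in bytes; verify the call against the current DirectStorage headers.

    // Raise the staging buffer from the 32 MB default, typically during initialization.
    factory->SetStagingBufferSize(128 * 1024 * 1024); // 128 MB; tune per platform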

Compression ratio considerations

  • Make sure to measure the impact that different resource types have on compression savings and GPU decompression performance.
  • In general, various data types, such as texture and geometry, compress at different ratios. This can cause some variation in GPU decompression execution performance.
  • This won’t have a significant effect on end-to-end throughput. However, it may result in variation in latency when delivering the resource contents to their final locations.

Windows File System

  • Try to keep disk files accessed by DirectStorage separate from files accessed by other I/O APIs. Shared file use across different I/O APIs may result in the loss of bypass I/O improvements.

Command queue scheduling when background streaming

  • In Windows 10, command queue scheduling contention can occur between DirectStorage copy and compute command queues, and application-managed copy and compute command queues.
  • The NVIDIA Nsight Systems, PIX, and GPUView tools can assist in determining whether background streaming with DirectStorage is in contention with important application-managed command queues.
  • In Windows 11, overlapped execution between DirectStorage and application command queues is fully expected.
  • If overlapped execution results in suboptimal performance of application workloads, we recommend throttling back DirectStorage reads. This helps maintain critical application performance while background streaming is occurring.

Summary

Next-generation game engines require streaming huge amounts of data to create increasingly realistic, detailed game worlds. That makes it necessary to rethink the resource streaming architecture of game engines and fully leverage improvements in I/O technology.

Using the GPU as an accelerator for compute-intensive data decompression becomes critical for maximizing system performance and reducing load times.

The NVIDIA RTX IO implementation of GDeflate is a scalable, GPU-optimized compression technology that enables applications to benefit from the computational power of the GPU for I/O acceleration. It acts as a bandwidth amplifier for the high-performance I/O capabilities of today’s and future systems.


Data Storytelling Best Practices for Data Scientists and AI Practitioners


Storytelling with data is a crucial soft skill for AI and data professionals. To ensure that stakeholders understand the technical requirements, value, and impact of data science team efforts, it is necessary for data scientists, data engineers, and machine learning (ML) engineers to communicate effectively.

This post provides a framework and tips you can adopt to incorporate key elements of data storytelling into your next presentation, pitch, or proposal. It aims to accomplish the following:

  • Introduce storytelling within the context of data science and machine learning
  • Highlight the benefits of effective storytelling for data science practitioners
  • Provide tips on how to cultivate data storytelling skills

What is storytelling with data?

Data storytelling is the ability to add contextual information to key data and insights to help develop viewpoints and realizations for project stakeholders. Data scientists and AI practitioners must effectively convey the impact of data-driven action or reasoning.  

Data and machine learning practitioners can use data storytelling to more effectively communicate with clients, project stakeholders, team members, and other business entities. A compelling narrative can help your audience understand complex concepts and can help win new projects.

Data storytelling case study

This section explores the key structural components of a data-driven story. 

The article What Africa Will Look Like in 100 Years uses data and visualizations to tell the story of the ongoing transformation in Africa from the viewpoint of major African cities such as Lagos, Dakar, and Cairo.

The strategic composition of this article presents the problem, background, and solution. This approach provides a strong foundation for any data-driven narrative. The article also includes facts, anecdotes, data, and charts and graphs. Together, these produce a free-flowing, well-structured, engaging, and informative account of the subject matter.

The opening sections of this article describe the context and main point: “Can Africa translate its huge population growth into economic development and improved quality of life?” 

Information such as key dates, figures, and first-person statements creates a picture grounded in reality, allowing the reader to form a full understanding of the subject matter. The presentation of data using charts and graphs allows readers to visualize the transformations of Africa’s major cities. Specific data points include population growth, education rate, and life expectancy. Personal experiences and first-hand accounts from citizens of the focus cities provide additional context.

An effective framework for storytelling in data science

This section explores how storytelling in the data science field should be structured and presented. The goal is to equip you with an easy-to-follow framework for your next presentation, article, or video to stakeholders. 

The recipe for success when storytelling can be distilled into three individual components: context, dispute, and solution (Figure 1). These components can be combined with other methods to tell a compelling story with data. 

  • Context: Lay the foundation for your narrative and provide some background
  • Dispute: Discuss the problem associated with the context
  • Solution: Explain and discuss the solution that either ends or mitigates the identified problem
Graphic showing the components of storytelling: context, dispute, and solution.
Figure 1. The components of storytelling

Context

In storytelling, context involves providing information to reinforce, support, and reveal the key findings extracted from data samples. Without context, collated data are only collections of alphanumeric representations of information that alone don’t provide any actionable insight into the issue or topic. Presenting data together with reinforcing context and other supporting elements can aid understanding and help audiences reach meaningful conclusions. 

You can use many different methods to create context when storytelling. Context within data is produced by leveraging a collection of reinforcing materials such as actors, anecdotes, visualizations, data labels, diagrams, and more.

To provide an example, consider the sentence below:

“200,000 plug-in electric vehicles were sold in the United Kingdom in 2021, representing an approximate 140% year-on-year increase.” 

Adding contextual information and supporting anecdotes can increase relatability, as shown in the paragraph below: 

“James’s interest in electric vehicles was sparked by a conversation he overheard on the radio about climate change. He did some research and found that a Volkswagen ID.3 would be a great choice for him. James decided to buy the car and by mid-2021, he was one of the many UK residents who had made the switch to electric vehicles. Sales of electric vehicles in 2021 more than doubled what they were in 2020, due to the public’s increasing awareness of climate change and its effects.”

Charts and diagrams are also important to include. They visualize data to aid understanding and provide additional support (Figure 2).

Bar chart showing the sales volume of plug-in electric vehicles in selected European countries in 2021, as an example of data visualization.
Figure 2. A bar chart is an example of data visualization that helps to provide context in data storytelling

Dispute

Dispute, in the context of data storytelling, is a problem, conflict, argument, debate, or issue. To drive home the impact of introducing a new tool or adopting a new methodology, it helps to name the key dispute.

Below is an example of a dispute that helps drive the point of the initial electric vehicle data:

“The United Kingdom is a net importer of fossil fuels for the use of energy and electricity generation. Fossil fuels power our transportation, electrical, and technological services, and even domestic items heavily reliant on fossil fuels’ energy output. The problem is that the UK is determined to significantly reduce its dependence on fossil fuels by 2050. Hence, the question is how the UK can reduce its fossil fuel consumption and move to low-carbon energy sources as an alternative. In addition, fossil fuels are a massive contributor to climate change and extreme weather.”

Solution

The third, and final element to consider when connecting storytelling with data is the solution. The solution can come in many forms, such as reconfiguring an existing system, implementing new methodologies, or becoming aware of educational materials and how to best use them.

The proposed solution should be direct, obvious, and memorable. If proposed solutions are ambiguous, stakeholders will ask more questions. A direct solution, on the other hand, allows for action and the formation of future steps.

Below is an example of a proposed solution:

“Awareness is the first step to making the national UK goal of reducing fossil fuel dependency by 2050. To reach more people like James, we propose a scale-up of the WWF Carbon footprint app to include AI-powered functionality that enables services such as energy consumption prediction per household based on historical data and predicted energy demands. This scale-up initiative will require funding of £100 million and will be delivered to the public a year after project approval.”

The proposed solution references the story to make it easier to remember. It also includes the project cost and timeline, making it direct and actionable.

Sample outline 

Use the sample outline below as a reference for your next data storytelling project.

Opening section

  • Start with a factual statement of your key data point or dataset summary that highlights the impact of the dispute, lack of solution, or the impact of a possible solution. For example, “305,300 plug-in electric vehicles were sold in the United Kingdom in 2021, representing an approximate 140% year-on-year increase.”
  • Expand on the initial opening section by including several paragraphs introducing, explaining, and expanding on the context.

Middle section

  • Introduce, explain, and expand on the dispute.
  • Include anecdotes, facts, figures, charts, and diagrams to contextualize the dispute and present the problem.
  • Introduce, explain, and expand on the solution to the dispute.
  • Include anecdotes, facts, figures, charts, and diagrams to illustrate the impact and value of the proposed solution.

Closing section

  • Summarize your main points. Show the benefits a solution would bring, and the undesired consequences of not having a solution.
  • Include a call to action as a next step that encapsulates the desired outcome of the story told with data.
Complete diagram of the components, elements, and considerations for storytelling.
Figure 3. The key components and accompanying attributes of effective data storytelling

Summary

Companies and organizations are becoming more data-driven every day. As a result, AI and data professionals of all levels need to develop data storytelling skills to bridge gaps of understanding related to technicalities, datasets, and technologies. The information in this post will give you a strong foundation from which to start building your data storytelling skills.