Leading MLPerf Inference v3.1 Results with NVIDIA GH200 Grace Hopper Superchip Debut

AI is transforming computing, and inference is how the capabilities of AI are deployed in the world’s applications. Intelligent chatbots, image and video synthesis from simple text prompts, personalized content recommendations, and medical imaging are just a few examples of AI-powered applications.

Inference workloads are both computationally demanding and diverse, requiring that platforms be able to process many predictions on never-seen-before data quickly as well as run inference on a breadth of AI models. Organizations looking to deploy AI need a way to evaluate the performance of infrastructure objectively across a breadth of workloads, environments, and deployment scenarios. This is true for both AI training and inference.

MLPerf Inference v3.1, developed by the MLCommons consortium, is the latest edition of an industry-standard AI inference benchmark suite. It complements MLPerf Training and MLPerf HPC. MLPerf Inference v3.1 measures inference performance across a variety of important workloads, including image classification, object detection, natural language processing, speech recognition, and recommender systems, across common data center and edge deployment scenarios.

MLPerf Inference v3.1 includes two important updates to better reflect modern AI use cases:

  • The addition of a large language model (LLM) test based on GPT-J, an open-source, 6B-parameter LLM, to represent text summarization, a form of generative AI.
  • An updated DLRM test with a new model architecture and a substantially larger dataset that mirrors the DLRM update introduced in MLPerf Training v3.0. The update better reflects the scale and complexity of modern recommender systems.

Powered by the full NVIDIA AI Inference software stack, including the latest TensorRT 9.0, NVIDIA made submissions in MLPerf Inference v3.1 using a wide array of products. These included the debut submission of the NVIDIA GH200 Grace Hopper Superchip, which extended the great per-accelerator performance delivered by the NVIDIA H100 Tensor Core GPU. NVIDIA also submitted the NVIDIA L4 Tensor Core GPU for mainstream servers, as well as both the NVIDIA Jetson AGX Orin and Jetson Orin NX platforms for edge AI and robotics.  

The rest of this post provides highlights of the NVIDIA submissions as well as a peek into how these exceptional results were achieved.

Grace Hopper Superchip extends NVIDIA Hopper inference performance

The NVIDIA GH200 Grace Hopper Superchip combines the NVIDIA Hopper GPU and the NVIDIA Grace CPU through the coherent NVLink-C2C interconnect at 900 GB/s to create a single superchip. That bandwidth is 7x higher than PCIe Gen5, at 5x lower power. It also incorporates up to 576 GB of fast-access memory through the combination of 96 GB of HBM3 GPU memory and up to 480 GB of low-power, high-bandwidth LPDDR5X memory.

The GH200 Grace Hopper Superchip has integrated power management features that enable the GH200 to take advantage of the energy efficiency of the Grace CPU to balance efficiency and performance. For more information, see NVIDIA Grace Hopper Superchip Architecture In-Depth and the NVIDIA Grace Hopper Superchip Architecture whitepaper.

Figure 1. Logical overview of the NVIDIA GH200 Grace Hopper Superchip (the 96 GB HBM3 configuration was used for the MLPerf Inference v3.1 submission)

The NVIDIA GH200 Grace Hopper Superchip is designed for the versatility required to deliver leading performance across compute and memory-intensive workloads. It also delivers substantially higher performance on the most demanding frontier workloads, such as large transformer-based models with hundreds of billions or trillions of parameters, recommender systems with multi-terabyte embedding tables, and vector databases.

In addition to being built for the most intensive AI workloads, the GH200 Grace Hopper Superchip also shines on the popular, mainstream workloads tested by MLPerf Inference. It ran every test, demonstrating its seamless support for the full NVIDIA software stack. It extended the exceptional performance achieved by NVIDIA’s single H100 SXM submission on every workload.

Figure 2. NVIDIA Grace Hopper MLPerf Inference data center performance results compared to DGX H100 SXM. Grace Hopper delivered up to 17% better performance than H100 SXM, aided by larger memory capacity, higher memory bandwidth, and sustained higher GPU clock frequencies.

MLPerf Inference: Datacenter v3.1, Closed. Submission IDs: NVIDIA 3.1-0107 (1x H100 SXM), 3.1-0110 (1x GH200 Grace Hopper Superchip)
The MLPerf name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. For more information, see www.mlcommons.org.

The GH200 Grace Hopper Superchip incorporates 96 GB of HBM3 and provides up to 4 TB/s of HBM3 memory bandwidth, compared to 80 GB and 3.35 TB/s for H100 SXM, respectively. This larger memory capacity and greater memory bandwidth enabled the use of larger batch sizes for workloads on the NVIDIA GH200 Grace Hopper Superchip compared to the NVIDIA H100 SXM. For example, both RetinaNet and DLRMv2 ran with up to double the batch sizes in the Server scenario and 50% greater batch sizes in the Offline scenario.

The GH200 Grace Hopper Superchip’s high-bandwidth NVLink-C2C link between the NVIDIA Hopper GPU and the Grace CPU enables fast communication between the CPU and GPU, which can help boost performance.

For example, in the MLPerf DLRMv2 workload, transferring a batch of tensors over PCIe takes approximately 22% of the batch inference time on H100 SXM. The GH200 Grace Hopper Superchip, however, performed the same transfer using just 3% of the inference time as a result of NVLink-C2C.

Thanks to its higher memory bandwidth and larger memory capacity, the Grace Hopper Superchip delivered up to a 17% per-chip performance advantage over the H100 GPU on MLPerf Inference v3.1 workloads. These results showcase the performance and versatility of both the GH200 Grace Hopper Superchip and the NVIDIA software stack.

Optimizing GPT-J 6B for LLM inference

To represent LLM inference workloads, MLPerf Inference v3.1 introduces a new test based on the GPT-J 6B model: an LLM with 6B parameters. The task tested by the new benchmark is text summarization using the CNN/DailyMail dataset.

The NVIDIA platform delivered strong results on the GPT-J workload, with the GH200 Grace Hopper Superchip delivering the highest per-accelerator performance in both the Offline and Server scenarios. The NVIDIA L4 GPU also delivered strong performance, outpacing the best CPU-only result by up to 6x with a single-slot PCIe card and a thermal design power (TDP) of just 72 watts.

To achieve these results, NVIDIA software for LLM inference intelligently applies both FP8 and FP16 precisions to increase performance while also meeting target accuracy requirements.

A key challenge for performing GPT-J inference is the high memory consumption of the key-value (KV) cache in the transformer block. By storing the KV cache in the FP8 data format, the NVIDIA submission significantly increased the batch size used. This boosted GPU memory utilization and enabled better use of the immense compute performance of NVIDIA GPUs.
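
The exact mechanics are internal to the NVIDIA inference software, but the idea behind an FP8 KV cache can be sketched in a few lines. The following is a minimal, hypothetical sketch, assuming PyTorch 2.1 or later with its experimental torch.float8_e4m3fn dtype; quantize_kv and dequantize_kv are illustrative names, not NVIDIA APIs. Storing the cache at one byte per element instead of two halves its footprint, and a per-tensor scale preserves dynamic range.

import torch

# Illustrative only: a per-tensor-scaled FP8 (E4M3) key/value cache.
# NVIDIA's production implementation differs; this just shows why FP8
# halves KV-cache memory relative to FP16 while a scale preserves range.

FP8_MAX = 448.0  # approximate maximum representable magnitude in E4M3

def quantize_kv(kv_fp16):
    """Store a KV tensor in FP8 with a single per-tensor scale."""
    scale = kv_fp16.abs().amax().float().clamp(min=1e-12) / FP8_MAX
    kv_fp8 = (kv_fp16 / scale).to(torch.float8_e4m3fn)
    return kv_fp8, scale

def dequantize_kv(kv_fp8, scale):
    """Recover an FP16 view of the cached keys/values for attention."""
    return kv_fp8.to(torch.float16) * scale

# Example cache entry of shape [batch, heads, seq_len, head_dim]
kv = torch.randn(8, 16, 1024, 64, dtype=torch.float16)
kv_fp8, scale = quantize_kv(kv)
print(kv.element_size(), "->", kv_fp8.element_size(), "bytes per element")  # 2 -> 1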

Figure 3. GPT-J model architecture

Enabling DLRM-DCNv2 submissions

MLPerf Inference v3.1 introduced an update to the DLRMv1 model used in prior versions of the benchmark. This DLRMv2 model replaces the interactions layer with a three-layer DCNv2 cross network. DLRMv2 also uses multi-hot categorical inputs rather than one-hot, which are synthetically generated from the Criteo Terabyte Click Logs Dataset.

One of the challenges of recommender inference arises from fitting the embedding tables on the system. By converting the model to FP16 precision, including the embedding table, we could both improve performance and halve the memory footprint of the embedding table, reducing it to 49 GB. This enables the entire embedding table to fit within a single H100 GPU. 

To enable our submission on the L4 GPU, which has 24 GB of memory, NVIDIA software intelligently splits the embedding table between GPU and host memory using row-frequency data obtained by analyzing the training dataset. Using this data, NVIDIA software can minimize memory transfers between the host CPU and GPU  by storing the most frequently used embedding table rows on the GPU.
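
The placement policy itself is part of the NVIDIA submission software; the following is a rough sketch of the idea, using NumPy and hypothetical helper names. Rows are ranked by access frequency, the hottest rows are kept in GPU memory up to a budget, and lookups fall back to host memory only for the long tail.

import numpy as np

# Hypothetical sketch of frequency-aware embedding placement: the most
# frequently accessed rows live in GPU memory, the long tail stays in host
# memory, so the vast majority of lookups never cross the CPU-GPU link.

def split_embedding_table(table, row_freq, gpu_budget_rows):
    hot = np.argsort(-row_freq)[:gpu_budget_rows]         # hottest rows by access frequency
    cold = np.setdiff1d(np.arange(len(table)), hot)
    gpu_slot = {int(r): i for i, r in enumerate(hot)}     # row id -> slot in the GPU copy
    host_slot = {int(r): i for i, r in enumerate(cold)}   # row id -> slot in the host copy
    return table[hot], table[cold], gpu_slot, host_slot   # table[hot] would be GPU resident

def lookup(row_id, gpu_rows, host_rows, gpu_slot, host_slot):
    if row_id in gpu_slot:
        return gpu_rows[gpu_slot[row_id]]    # fast path: embedding already in GPU memory
    return host_rows[host_slot[row_id]]      # slow path: requires a host-to-device copy

# Example: 200K rows of 128-dim FP16 embeddings, 20K hottest rows kept on the GPU
table = np.random.rand(200_000, 128).astype(np.float16)
row_freq = np.random.zipf(1.2, size=200_000)
gpu_rows, host_rows, gpu_slot, host_slot = split_embedding_table(table, row_freq, 20_000)
vec = lookup(123, gpu_rows, host_rows, gpu_slot, host_slot)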

The NVIDIA platform demonstrated exceptional results on DLRMv2, with GH200 showing up to a 17% increase compared to the great performance delivered by H100 SXM. 

Maximizing parallelism on NVIDIA Jetson Orin with Programmable Vision Accelerator

The Jetson AGX Orin series and Jetson Orin NX series are embedded modules for edge AI and robotics, based on the NVIDIA Orin system-on-chip (SoC). To deliver exceptional AI performance and efficiency across a range of use cases, Jetson Orin incorporates many compute engines, including the GPU, deep learning accelerators (DLAs), and the programmable vision accelerator (PVA).

These accelerators can be used to offload the GPU and enable additional AI inference performance on the Jetson Orin modules.

NVDLA is a fixed-function accelerator optimized for deep learning operations and is designed to do full hardware acceleration of convolutional neural network inferencing. 

Figure 4. NVIDIA Orin system-on-chip, showing the CPU, GPU, dedicated accelerators, cache, and memory interface

For the first time in MLPerf Inference v3.1, we demonstrate the concurrent use of the PVA alongside GPU and DLA for inference. The second-generation PVA provides dedicated hardware for various computer vision kernels such as filtering, warping, and fast Fourier transforms (FFT). It also supports advanced programmed kernels, which can serve as the backend runtime of TensorRT custom plug-ins.

With the 23.08 Jetson CUDA-X AI Developer Preview, we’ve included a sample PVA SDK. This package provides runtime support for a non-maximum suppression (NMS) layer. It demonstrates that the PVA can serve as a highly capable accelerator, complementing the powerful Jetson Orin GPU.

NVIDIA has developed a TensorRT custom NMS PVA plug-in as a reference for Jetson Orin users and it was included as part of the NVIDIA MLPerf Inference v3.1 submission.
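
As background, NMS itself is a short greedy algorithm: keep the highest-scoring box, discard the boxes that overlap it beyond a threshold, and repeat. A reference formulation in plain Python is shown below for illustration only; the PVA plug-in implements an optimized equivalent of this operator.

import numpy as np

# Reference (CPU) formulation of non-maximum suppression, the operator that
# the NMS PVA plug-in offloads from the GPU in the RetinaNet pipeline.

def iou(box, boxes):
    """Intersection-over-union between one box and an array of boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(-scores)             # candidate boxes, best score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))              # keep the highest-scoring remaining box
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]  # drop heavy overlaps
    return keep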

In the NVIDIA MLPerf Inference v3.0 RetinaNet submission on NVIDIA Orin platforms, the GPU handled the outputs of the ResNeXt + FPN backbone running on the GPU itself as well as on the two DLAs.

Figure 5. GPU responsible for GPU and DLA outputs in MLPerf Inference v3.0

Figure 5 shows how, in MLPerf Inference v3.0 submissions, the GPU was responsible for outputs from the ResNeXt + FPN backbone from both the GPU and the DLAs.

By using the NMS PVA plug-in, the NMS operator is now offloaded from the GPU to the PVA, enabling three fully parallel inference flows on Jetson AGX Orin and Jetson Orin NX. The output of the ResNeXt + FPN backbone running on the two DLAs is now consumed by the two PVAs running the NMS PVA plug-in inside the end-to-end RetinaNet TensorRT engine.

Figure 6. Fully parallel computations across the GPU, DLAs, and PVAs in MLPerf Inference v3.1

In Figure 6, the NVIDIA MLPerf Inference v3.1 submission enables computations to run fully in parallel through optimized use of Jetson Orin PVAs.

This careful use of PVA along with GPU and DLA boosts performance by 30% on both the Jetson AGX Orin 64GB and the Jetson Orin NX 16GB modules. When this use of PVA is coupled with a newly optimized NMS Opt GPU plug-in, Jetson AGX Orin delivers 61% higher performance and 38% better power efficiency on the RetinaNet workload. The Jetson Orin NX 16GB showed an even larger gain, with an 84% performance boost on the same test.

Algorithmic optimizations further improve BERT performance

In MLPerf Inference v3.1, NVIDIA made a submission on the BERT Large workload using the L4 GPU in the open division using techniques developed by the OmniML team. OmniML is a startup acquired by NVIDIA in early 2023 that brought expertise in machine learning algorithmic model optimization for use cases spanning cloud platforms to edge devices.

The open division submission on BERT applied structured pruning with distillation techniques to improve performance by up to 4.7x while maintaining 99% accuracy. This submission demonstrates the potential of algorithmic optimizations to significantly enhance the already exceptional performance of the NVIDIA platform.

NVIDIA deployed a proprietary, automatic, structured pruning tool that uses gradient-based sensitivity analysis to prune the model to a given FLOP target and then fine-tunes it with distillation to recover most of the accuracy. The number of transformer layers, the number of attention heads, and the linear layer dimensions were pruned across all transformer layers in the model, while the embedding dimension was kept unchanged.

Compared to the original MLPerf Inference BERT INT8 model, our pruned model reduced the number of parameters by 4x and the number of FLOPs by 5.6x. This model has a varying number of heads and linear layer dimensions in each layer. The resulting TensorRT engine built from the pruned model is 3.4x smaller, 177 MB compared to 607 MB.

The fine-tuned model is quantized to INT8 precision using the same technique employed in the NVIDIA closed division submission. The submission also employed distillation during quantization-aware training (QAT) to achieve an accuracy of 99% of the reference model or higher.
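
NVIDIA's pruning and QAT tooling is proprietary, but the distillation objective used during fine-tuning can be written generically: the pruned (and quantized) student is trained to match the teacher's soft predictions, mixed with the usual task loss on the labels. A minimal PyTorch sketch follows, with hypothetical teacher and student models; it is not the exact NVIDIA recipe.

import torch
import torch.nn.functional as F

# Generic knowledge-distillation objective: the student matches the teacher's
# softened output distribution, weighted against the hard-label task loss.

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.9):
    """alpha weights the soft-target term against the hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Inside the fine-tuning loop (teacher frozen, pruned student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids)
# student_logits = student(input_ids)
# loss = distillation_loss(student_logits, teacher_logits, labels)
# loss.backward(); optimizer.step()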

Scenario                         Closed Division    Open Division    Speedup
Offline (samples/sec)            1029               4609             4.5x
Server (samples/sec)             899                4265             4.7x
Single Stream p90 latency (ms)   2.58               0.82             3.1x
Table 1. BERT Large performance metrics for both closed division and open division

To better understand how each of the model optimizations affects performance, NVIDIA performed a stacking analysis, applying the different model optimization methods individually (Figure 7).

Figure 7. Stacking performance analysis: quantization stacked on optimization and pruning (closed division), then on distillation (open division); the accuracy baseline is the FP32 model (not shown)

Figure 7 shows that, through model pruning and distillation, the NVIDIA open division submission on the BERT workload using L4 provides a 4.5x speedup compared to the same GPU running the closed division workload in the offline scenario.

The model optimization methods can be easily combined with one another. Together, they yielded a substantial performance improvement over the baseline model.

NVIDIA accelerated computing boosts performance for inference and AI training workloads

In its MLPerf debut, the GH200 Grace Hopper Superchip turned in exceptional performance on all workloads and scenarios in the closed division of the data center category, boosting performance by up to 17% over the NVIDIA single-chip H100 SXM submission. The NVIDIA software stack fully supports the GH200 Grace Hopper Superchip today.

For mainstream servers, the L4 GPU delivered a large performance leap over CPU-only offerings in a compact, low-power PCIe add-in card.

For edge AI and robotics applications, the Jetson AGX Orin and Jetson Orin NX modules achieved great performance. Software optimizations helped further unlock the potential of the powerful NVIDIA Orin SoC that powers those modules, boosting performance on RetinaNet, a popular AI network for object detection, by up to 84%.

In this round, NVIDIA also submitted results in the open division, providing a first look at the potential for model optimizations to speed inference performance dramatically while still achieving excellent accuracy.

The latest MLPerf Inference v3.1 benchmarks show that the NVIDIA accelerated computing platform continues to deliver leadership performance and versatility. There’s innovation at every layer of the technology stack, from cloud to edge, at the speed of light.

Webinar: Boost Your AI Development with ClearML and NVIDIA TAO

On Sept. 19, learn how NVIDIA TAO integrates with the ClearML platform to deploy and maintain machine learning models in production environments.

NVIDIA Partners with India Giants to Advance AI in World’s Most Populous Nation

The world’s largest democracy is poised to transform itself and the world, embracing AI on an enormous scale. Speaking with the press Friday in Bengaluru, in the context of announcements from two of India’s largest conglomerates, Reliance Industries Limited and Tata Group, NVIDIA founder and CEO Jensen Huang detailed plans to bring AI technology and…

NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on NVIDIA H100 GPUs

Large language models offer incredible new capabilities, expanding the frontier of what is possible with AI. But their large size and unique execution characteristics can make them difficult to use in cost-effective ways. 

NVIDIA has been working closely with leading companies, including Meta, Anyscale, Cohere, Deci, Grammarly, Mistral AI, MosaicML, now a part of Databricks, OctoML, Tabnine and Together AI, to accelerate and optimize LLM inference. 

Those innovations have been integrated into the open-source NVIDIA TensorRT-LLM software, set for release in the coming weeks. TensorRT-LLM consists of the TensorRT deep learning compiler and includes optimized kernels, pre- and post-processing steps, and multi-GPU/multi-node communication primitives for groundbreaking performance on NVIDIA GPUs. It enables developers to experiment with new LLMs, offering peak performance and quick customization capabilities, without requiring deep knowledge of C++ or NVIDIA CUDA.

TensorRT-LLM improves ease of use and extensibility through an open-source, modular Python API for defining, optimizing, and executing new architectures and enhancements as LLMs evolve.

For example, MosaicML has seamlessly added the specific features it needs on top of TensorRT-LLM and integrated them into its inference serving. Naveen Rao, vice president of engineering at Databricks, notes that “it has been an absolute breeze.”

“TensorRT-LLM is easy to use, feature-packed with streaming of tokens, in-flight batching, paged-attention, quantization, and more, and is efficient,” Rao said. “It delivers state-of-the-art performance for LLM serving using NVIDIA GPUs and allows us to pass on the cost savings to our customers.”

Performance comparison

Summarizing articles is just one of the many applications of LLMs. The following benchmarks show performance improvements brought by TensorRT-LLM on the latest NVIDIA Hopper architecture. 

The following figures reflect article summarization using an NVIDIA A100 and NVIDIA H100 with CNN/Daily Mail, a well-known dataset for evaluating summarization performance.  

In Figure 1, H100 alone is 4x faster than A100. Adding TensorRT-LLM and its benefits, including in-flight batching, results in an 8x total increase, delivering the highest throughput.

Figure 1. GPT-J-6B performance, A100 compared to H100 with and without TensorRT-LLM

Text summarization, variable I/O length, CNN / DailyMail dataset | A100 FP16 PyTorch eager mode | H100 FP8 | H100 FP8, in-flight batching, TensorRT-LLM

On Llama 2—a popular language model released recently by Meta and used widely by organizations looking to incorporate generative AI—TensorRT-LLM can accelerate inference performance by 4.6x compared to A100 GPUs.

Figure 2. Llama 2 70B performance, A100 compared to H100 with and without TensorRT-LLM

Text summarization, variable I/O length, CNN / DailyMail dataset | A100 FP16 PyTorch eager mode | H100 FP8 | H100 FP8, in-flight batching, TensorRT-LLM

LLM ecosystem explosion

The ecosystem is innovating rapidly, developing new and diverse model architectures. Larger models unleash new capabilities and use cases. Some of the largest, most advanced language models, like Meta’s 70-billion-parameter Llama 2, require multiple GPUs working in concert to deliver responses in real time. Previously, developers looking to achieve the best performance for LLM inference had to rewrite and manually split the AI model into fragments and coordinate execution across GPUs.

TensorRT-LLM uses tensor parallelism, a type of model parallelism in which individual weight matrices are split across devices. This enables efficient inference at scale, with each model running in parallel across multiple GPUs connected through NVLink and across multiple servers, without developer intervention or model changes.
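
TensorRT-LLM performs this split automatically. Conceptually, a column-parallel linear layer looks like the sketch below (illustrative PyTorch only, with the all-gather reduced to a concatenation and the NVLink/NCCL communication that TensorRT-LLM manages omitted).

import torch

# Conceptual sketch of tensor parallelism: the weight matrix of a layer is
# split across devices, each device computes its slice of the output, and
# the slices are gathered back together.

def column_parallel_linear(x, full_weight, devices):
    shards = torch.chunk(full_weight, len(devices), dim=0)          # split output features
    partials = [x.to(dev) @ w.to(dev).T for w, dev in zip(shards, devices)]  # per-device slice
    return torch.cat([p.to(devices[0]) for p in partials], dim=-1)  # "all-gather" of the slices

# Example: a 4096 -> 16384 projection split across two devices
devices = ["cuda:0", "cuda:1"] if torch.cuda.device_count() >= 2 else ["cpu", "cpu"]
x = torch.randn(8, 4096)
w = torch.randn(16384, 4096)
y = column_parallel_linear(x, w, devices)   # shape [8, 16384]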

As new models and model architectures are introduced, developers can optimize their models with the latest NVIDIA AI kernels available open source in TensorRT-LLM. The supported kernel fusions include cutting-edge implementations of FlashAttention and masked multi-head attention for the context and generation phases of GPT model execution, along with many others.

Additionally, TensorRT-LLM includes fully optimized, ready-to-run versions of many LLMs widely used in production today. This includes Meta Llama 2, OpenAI GPT-2 and GPT-3, Falcon, Mosaic MPT, BLOOM, and a dozen others, all of which can be implemented with the simple-to-use TensorRT-LLM Python API.

These capabilities help developers create customized LLMs faster and more accurately to meet the needs of virtually any industry.

In-flight batching

Today’s large language models are extremely versatile. A single model can be used simultaneously for a variety of tasks that look very different from one another. From a simple question-and-answer response in a chatbot to the summarization of a document or the generation of a long chunk of code, workloads are highly dynamic, with outputs varying in size by several orders of magnitude. 

This versatility can make it difficult to batch requests and execute them in parallel effectively—a common optimization for serving neural networks—which could result in some requests finishing much earlier than others.

To manage these dynamic loads, TensorRT-LLM includes an optimized scheduling technique called in-flight batching. This takes advantage of the fact that the overall text generation process for an LLM can be broken down into multiple iterations of execution on the model. 

With in-flight batching, rather than waiting for the whole batch to finish before moving on to the next set of requests, the TensorRT-LLM runtime immediately evicts finished sequences from the batch. It then begins executing new requests while other requests are still in flight. In-flight batching and the additional kernel-level optimizations improve GPU utilization and at least double throughput on a benchmark of real-world LLM requests on H100 Tensor Core GPUs, helping to minimize TCO.
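
The TensorRT-LLM runtime implements this scheduling natively; the core idea can be illustrated with a toy loop, shown below with hypothetical names and no real model, in which finished sequences free their batch slots immediately.

from collections import deque

# Toy illustration of in-flight (continuous) batching: after every generation
# step, finished sequences leave the batch at once and queued requests take
# their place, instead of waiting for the whole batch to drain.

def serve(requests, max_batch_size, generate_one_token, is_finished):
    queue, active, completed = deque(requests), [], []
    while queue or active:
        while queue and len(active) < max_batch_size:
            active.append(queue.popleft())        # admit new requests into free slots
        for seq in active:
            generate_one_token(seq)               # one model iteration, one token per sequence
        completed.extend(s for s in active if is_finished(s))
        active = [s for s in active if not is_finished(s)]   # evict finished sequences now
    return completed

# Example with dummy sequences that each need a different number of tokens
def generate_one_token(seq):
    seq["tokens"].append("tok")
    seq["remaining"] -= 1

def is_finished(seq):
    return seq["remaining"] == 0

requests = [{"remaining": n, "tokens": []} for n in (3, 1, 5, 2)]
done = serve(requests, max_batch_size=2, generate_one_token=generate_one_token, is_finished=is_finished)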

H100 Transformer Engine with FP8

LLMs contain billions of model weights and activations, typically trained and represented with 16-bit floating point (FP16 or BF16) values where each value occupies 16 bits of memory. At inference time, however, most models can be effectively represented at lower precision, like 8-bit or even 4-bit integers (INT8 or INT4), using modern quantization techniques. 

Quantization is the process of reducing the precision of a model’s weights and activations without sacrificing accuracy. Using lower precision means that each parameter is smaller, and the model takes up less space in GPU memory. This enables inference on larger models with the same hardware while spending less time on memory operations during execution. 
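
As a concrete illustration of the mechanics, the sketch below shows symmetric per-tensor INT8 quantization in plain NumPy; the H100 FP8 path is analogous but uses an 8-bit floating-point format and is handled by the Transformer Engine rather than by user code.

import numpy as np

# Symmetric per-tensor INT8 quantization: each FP16 weight is mapped to an
# 8-bit integer plus one shared scale, halving memory relative to FP16.

w_fp16 = np.random.randn(4096, 4096).astype(np.float16)

scale = float(np.abs(w_fp16).max()) / 127.0                 # one scale for the whole tensor
w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float16) * scale               # reconstructed values used at compute time

print(w_fp16.nbytes // 2**20, "MiB ->", w_int8.nbytes // 2**20, "MiB")   # 32 MiB -> 16 MiB
print("max abs error:", np.abs(w_fp16.astype(np.float32) - w_dequant.astype(np.float32)).max())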

NVIDIA H100 GPUs with TensorRT-LLM give users the ability to convert their model weights into a new FP8 format easily and compile their models to take advantage of optimized FP8 kernels automatically. This is made possible through Hopper Transformer Engine technology and done without having to change any model code. 

The FP8 data format introduced by the H100 enables developers to quantize their models and radically reduce memory consumption without degrading model accuracy. FP8 quantization retains higher accuracy compared to other data formats like INT8 or INT4 while achieving the fastest performance and offering the simplest implementation.

Summary

LLMs are advancing rapidly. Diverse model architectures are being developed daily and contribute to a growing ecosystem. In turn, larger models unleash new capabilities and use cases, driving adoption across all industries.

LLM inference is reshaping the data center. Higher performance with increased accuracy yields better TCO for enterprises. Model innovations enable better customer experiences, translating into higher revenue and earnings.

When planning inference deployment projects, there are still many other considerations to achieve peak performance using state-of-the-art LLMs. Optimization rarely happens automatically. Users must consider fine-tuning factors such as parallelism, end-to-end pipelines, and advanced scheduling techniques. And they require a computing platform that can handle mixed precision without diminishing accuracy.

TensorRT-LLM comprises TensorRT’s Deep Learning Compiler, optimized kernels, pre- and post-processing, and multi-GPU/multi-node communication in a simple, open-source Python API for defining, optimizing, and executing LLMs for inference in production. 

Get started with TensorRT-LLM

NVIDIA TensorRT-LLM is now available in early access and soon will be integrated into the NVIDIA NeMo framework—part of NVIDIA AI Enterprise, an enterprise-grade AI software platform with security, stability, manageability, and support. Developers and researchers will be able to access TensorRT-LLM through the NeMo framework on NGC or through the source repository on GitHub. 

Note that you must be registered in the NVIDIA Developer Program to apply for the early access release. You must also be logged in using your organization’s email address. We cannot accept applications from accounts using Gmail, Yahoo, QQ, or other personal email accounts.

To participate, fill out the short application form and provide details about your use case.

Workshop: Fundamentals of Deep Learning

Learn key techniques and tools required to train a deep learning model in this virtual hands-on workshop.

Tata Partners With NVIDIA to Build Large-Scale AI Infrastructure

NVIDIA today announced an extensive collaboration with Tata Group to deliver AI computing infrastructure and platforms for developing AI solutions. The collaboration will bring state-of-the-art AI capabilities within reach to thousands of organizations, businesses and AI researchers, and hundreds of startups in India.

Reliance and NVIDIA Partner to Advance AI in India, for India

In a major step to support India’s industrial sector, NVIDIA and Reliance Industries today announced a collaboration to develop India’s own foundation large language model trained on the nation’s diverse languages and tailored for generative AI applications to serve the world’s most populous nation.

A novel computational fluid dynamics framework for turbulent flow research

Turbulence is ubiquitous in environmental and engineering fluid flows, and is encountered routinely in everyday life. A better understanding of these turbulent processes could provide valuable insights across a variety of research areas — improving the prediction of cloud formation by atmospheric transport and the spreading of wildfires by turbulent energy exchange, understanding sedimentation of deposits in rivers, and improving the efficiency of combustion in aircraft engines to reduce emissions, to name a few. However, despite its importance, our current understanding and our ability to reliably predict such flows remains limited. This is mainly attributed to the highly chaotic nature and the enormous spatial and temporal scales these fluid flows occupy, ranging from energetic, large-scale movements on the order of several meters on the high-end, where energy is injected into the fluid flow, all the way down to micrometers (μm) on the low-end, where the turbulence is dissipated into heat by viscous friction.

A powerful tool to understand these turbulent flows is the direct numerical simulation (DNS), which provides a detailed representation of the unsteady three-dimensional flow-field without making any approximations or simplifications. More specifically, this approach utilizes a discrete grid with small enough grid spacing to capture the underlying continuous equations that govern the dynamics of the system (in this case, variable-density Navier-Stokes equations, which govern all fluid flow dynamics). When the grid spacing is small enough, the discrete grid points are enough to represent the true (continuous) equations without the loss of accuracy. While this is attractive, such simulations require tremendous computational resources in order to capture the correct fluid-flow behaviors across such a wide range of spatial scales.

The actual span in spatial resolution to which direct numerical calculations must be applied depends on the task and is determined by the Reynolds number, which compares inertial to viscous forces. Typically, the Reynolds number can range from 10² up to 10⁷ (even larger for atmospheric or interstellar problems). In 3D, the grid size for the resolution required scales roughly with the Reynolds number to the power of 4.5! Because of this strong scaling dependency, simulating such flows is generally limited to flow regimes with moderate Reynolds numbers, and typically requires access to high-performance computing systems with millions of CPU/GPU cores.

In “A TensorFlow simulation framework for scientific computing of fluid flows on tensor processing units”, we introduce a new simulation framework that enables the computation of fluid flows with TPUs. By leveraging the latest advances in TensorFlow software and TPU hardware architecture, this software tool allows detailed large-scale simulations of turbulent flows at unprecedented scale, pushing the boundaries of scientific discovery and turbulence analysis. We demonstrate that this framework scales efficiently to accommodate larger problems or, alternatively, to improve run times, which is remarkable since most large-scale distributed computation frameworks exhibit reduced efficiency with scaling. The software is available as an open-source project on GitHub.

Large-scale scientific computation with accelerators

The software solves variable-density Navier-Stokes equations on TPU architectures using the TensorFlow framework. The single-instruction, multiple-data (SIMD) approach is adopted for parallelization of the TPU solver implementation. The finite difference operators on a colocated structured mesh are cast as filters of the convolution function of TensorFlow, leveraging the TPU's matrix multiply unit (MXU). The framework takes advantage of the low-latency, high-bandwidth inter-chip interconnect (ICI) between the TPU accelerators. In addition, by leveraging single-precision floating-point computations and the highly optimized executables produced by the accelerated linear algebra (XLA) compiler, it's possible to perform large-scale simulations with excellent scaling on TPU hardware architectures.
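
For intuition, here is a toy sketch (not the framework's actual kernels) of how a second-order central-difference operator can be expressed as a TensorFlow convolution filter, the pattern that lets finite-difference stencils run on the TPU's MXU.

import numpy as np
import tensorflow as tf

# Toy example: d/dx approximated by the 3-point central-difference stencil
# [-1, 0, 1] / (2*dx), expressed as a conv3d filter over a structured mesh.

dx = 0.01
nx, ny, nz = 64, 32, 32

# Field u(x, y, z) = sin(x), so du/dx should approximate cos(x)
x = np.arange(nx) * dx
u = np.tile(np.sin(x)[:, None, None], (1, ny, nz)).astype(np.float32)

# conv3d expects input [batch, depth, height, width, channels] and a filter
# of shape [depth, height, width, in_channels, out_channels]
u_t = tf.constant(u[None, :, :, :, None])
stencil = tf.constant(
    np.array([-1.0, 0.0, 1.0], dtype=np.float32).reshape(3, 1, 1, 1, 1) / (2.0 * dx)
)

dudx = tf.nn.conv3d(u_t, stencil, strides=[1, 1, 1, 1, 1], padding="VALID")

# Interior points now approximate cos(x); the error is O(dx^2)
err = tf.reduce_max(tf.abs(dudx[0, :, 0, 0, 0] - np.cos(x[1:-1], dtype=np.float32)))
print(float(err))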

This research effort demonstrates that graph-based TensorFlow, in combination with new types of ML special-purpose hardware, can be used as a programming paradigm to solve partial differential equations representing multiphysics flows. The latter is achieved by augmenting the Navier-Stokes equations with physical models to account for chemical reactions, heat transfer, and density changes to enable, for example, simulations of cloud formation and wildfires.

It’s worth noting that this framework is the first open-source computational fluid dynamics (CFD) framework for high-performance, large-scale simulations to fully leverage the cloud accelerators that have become common (and become a commodity) with the advancement of machine learning (ML) in recent years. While our work focuses on using TPU accelerators, the code can be easily adjusted for other accelerators, such as GPU clusters.

This framework demonstrates a way to greatly reduce the cost and turn-around time associated with running large-scale scientific CFD simulations and enables even greater iteration speed in fields, such as climate and weather research. Since the framework is implemented using TensorFlow, an ML language, it also enables the ready integration with ML methods and allows the exploration of ML approaches on CFD problems. With the general accessibility of TPU and GPU hardware, this approach lowers the barrier for researchers to contribute to our understanding of large-scale turbulent systems.

Framework validation and homogeneous isotropic turbulence

Beyond demonstrating the performance and the scaling capabilities, it is also critical to validate the correctness of this framework to ensure that when it is used for CFD problems, we get reasonable results. For this purpose, researchers typically use idealized benchmark problems during CFD solver development, many of which we adopted in our work (more details in the paper).

One such benchmark for turbulence analysis is homogeneous isotropic turbulence (HIT), which is a canonical and well-studied flow in which the statistical properties, such as kinetic energy, are invariant under translations and rotations of the coordinate axes. By pushing the resolution to the limits of the current state of the art, we were able to perform direct numerical simulations with more than eight billion degrees of freedom — equivalent to a three-dimensional mesh with 2,048 grid points along each of the three directions. We used 512 TPU-v4 cores, distributing the computation of the grid points along the x, y, and z axes to a [2, 2, 128] core layout, respectively, optimized for performance on the TPU. The wall clock time per timestep was around 425 milliseconds and the flow was simulated for a total of 400,000 timesteps. A total of 50 TB of data, including the velocity and density fields, was stored for 400 timesteps (every 1,000th step). To our knowledge, this is one of the largest turbulent flow simulations of its kind conducted to date.

Due to the complex, chaotic nature of the turbulent flow field, which spans several orders of magnitude in spatial scale, simulating the system at high resolution is necessary. Because we employ a fine-resolution grid with eight billion points, we are able to accurately resolve the field.

Contours of x-component of velocity along the z midplane. The high resolution of the simulation is critical to accurately represent the turbulent field.

The turbulent kinetic energy and dissipation rates are two statistical quantities commonly used to analyze a turbulent flow. The temporal decay of these properties in a turbulent field without additional energy injection is due to viscous dissipation and the decay asymptotes follow the expected analytical power law. This is in agreement with the theoretical asymptotes and observations reported in the literature and thus, validates our framework.

Solid line: Temporal evolution of turbulent kinetic energy (k). Dashed line: Analytical power laws for decaying homogeneous isotropic turbulence (n=1.3) (l: eddy turnover time).
Solid line: Temporal evolution of dissipation rate (ε). Dashed line: Analytical power laws for decaying homogeneous isotropic turbulence (n=1.3).

The energy spectrum of a turbulent flow represents the energy content across wavenumber, where the wavenumber k is proportional to the inverse wavelength λ (i.e., k ∝ 1/λ). Generally, the spectrum can be qualitatively divided into three ranges: source range, inertial range and viscous dissipative range (from left to right on the wavenumber axis, below). The lowest wavenumbers in the source range correspond to the largest turbulent eddies, which have the most energy content. These large eddies transfer energy to turbulence in the intermediate wavenumbers (inertial range), which is statistically isotropic (i.e., essentially uniform in all directions). The smallest eddies, corresponding to the largest wavenumbers, are dissipated into thermal energy by the viscosity of the fluid. By virtue of the fine grid having 2,048 points in each of the three spatial directions, we are able to resolve the flow field up to the length scale at which viscous dissipation takes place. This direct numerical simulation approach is the most accurate as it does not require any closure model to approximate the energy cascade below the grid size.
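
As a small worked example (independent of the framework, on a coarse NumPy grid rather than the 2,048³ production mesh), the spectrum is obtained by Fourier-transforming the velocity field and summing the kinetic energy into shells of constant wavenumber magnitude.

import numpy as np

# Radially binned kinetic energy spectrum E(k) of a periodic velocity field,
# the quantity plotted in the spectrum figure (here on a toy 64^3 grid with
# random data standing in for a real turbulent field).

n = 64
rng = np.random.default_rng(0)
u, v, w = (rng.standard_normal((n, n, n)) for _ in range(3))

# Fourier transform each velocity component and form the spectral energy density
uh, vh, wh = (np.fft.fftn(c) / n**3 for c in (u, v, w))
e_spec = 0.5 * (np.abs(uh) ** 2 + np.abs(vh) ** 2 + np.abs(wh) ** 2)

# Wavenumber magnitude |k| at every point of the Fourier grid
k1d = np.fft.fftfreq(n, d=1.0 / n)                 # integer wavenumbers
kx, ky, kz = np.meshgrid(k1d, k1d, k1d, indexing="ij")
k_mag = np.sqrt(kx**2 + ky**2 + kz**2)

# Sum the energy in shells of unit width: E(k) for k = 1, 2, ..., n/2 - 1
k_bins = np.arange(0.5, n // 2)
E_k = np.array([e_spec[(k_mag >= k0) & (k_mag < k0 + 1.0)].sum() for k0 in k_bins])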

Spectrum of turbulent kinetic energy at different time instances. The spectrum is normalized by the instantaneous integral length (l) and the turbulent kinetic energy (k).

A new era for turbulent flows research

More recently, we extended this framework to predict wildfires and atmospheric flows, which is relevant for climate-risk assessment. Apart from enabling high-fidelity simulations of complex turbulent flows, this simulation framework also provides capabilities for scientific machine learning (SciML) — for example, downsampling from a fine to a coarse grid (model reduction) or building models that run at lower resolution while still capturing the correct dynamic behaviors. It could also provide avenues for further scientific discovery, such as building ML-based models to better parameterize microphysics of turbulent flows, including physical relationships between temperature, pressure, vapor fraction, etc., and could improve upon various control tasks, e.g., to reduce the energy consumption of buildings or find more efficient propeller shapes. While attractive, a main bottleneck in SciML has been the availability of data for training. To explore this, we have been working with groups at Stanford and Kaggle to make the data from our high-resolution HIT simulation available through a community-hosted web-platform, BLASTNet, to provide broad access to high-fidelity data to the research community via a network-of-datasets approach. We hope that the availability of these emerging high-fidelity simulation tools in conjunction with community-driven datasets will lead to significant advances in various areas of fluid mechanics.

Acknowledgements

We would like to thank Qing Wang, Yi-Fan Chen, and John Anderson for consulting and advice, and Tyler Russell and Carla Bromberg for program management.

How Industries Are Meeting Consumer Expectations With Speech AI

Thanks to rapid technological advances, consumers have become accustomed to an unprecedented level of convenience and efficiency. Smartphones make it easier than ever to search for a product and have it delivered right to the front door. Video chat technology lets friends and family on different continents connect with ease. With voice command tools, AI…

NVIDIA CUDA Toolkit Symbol Server

NVIDIA has already made available a GPU driver binary symbols server for Windows. Now, NVIDIA is making available a repository of CUDA Toolkit symbols for Linux.

What are we providing?

NVIDIA is introducing CUDA Toolkit symbols for Linux to enhance application development. During application development, you can now download obfuscated symbols for NVIDIA libraries that are being debugged or profiled in your application. This is shipping initially for the CUDA Driver (libcuda.so) and the CUDA Runtime (libcudart.so), with more libraries to be added.

For instance, when an issue appears to relate to a CUDA API, it may not always be possible to provide NVIDIA with a reproducing example, core dump, or unsymbolized stack traces with all library load information. Providing a symbolized call stack can help speed up the debug process.

We are only hosting symbol files, so debug data will not be distributed. The symbol files contain obfuscated symbol names.

Quickstart guide

There are two recommended ways to use the obfuscated symbols for each library:

  • By unstripping the library
  • By deploying the .sym file as a symbol file for the library
# Determine the symbol file to fetch and obtain it
$ readelf -n /usr/local/cuda/lib64/libcudart.so

# ... Build ID: 70f26eb93e24216ffc0e93ccd8da31612d277030
# Browse to https://cudatoolkit-symbols.nvidia.com/libcudart.so/70f26eb93e24216ffc0e93ccd8da31612d277030/index.html to determine filename to download
$ wget https://cudatoolkit-symbols.nvidia.com/libcudart.so/70f26eb93e24216ffc0e93ccd8da31612d277030/libcudart.so.12.2.128.sym

# Then with appropriate permissions, either unstrip,
$ eu-unstrip /usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudart.so.12.2.128 libcudart.so.12.2.128.sym -o /usr/local/cuda-12.2/targets/x86_64-linux/lib/libcudart.so.12.2.128

# Or, with appropriate permissions, deploy as symbol file
# By splitting the Build ID into first two characters as directory, then remaining with ".debug" extension
$ cp libcudart.so.12.2.128.sym /usr/lib/debug/.build-id/70/f26eb93e24216ffc0e93ccd8da31612d277030.debug

Example: Symbolizing

Here is a simplified example to show the uses of symbolizing. The sample application test_shared has a data corruption that leads to an invalid handle being passed to the CUDA Runtime API cudaStreamDestroy. With a default install of CUDA Toolkit and no obfuscated symbols, the output in gdb might look like the following:

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault.
0x00007ffff65f9468 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffff65f9468 in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff6657e1f in ?? () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff6013845 in ?? () from /usr/local/cuda/lib64/libcudart.so.12
#3  0x00007ffff604e698 in cudaStreamDestroy () from /usr/local/cuda/lib64/libcudart.so.12
#4  0x00005555555554e3 in main ()

After applying the obfuscated symbols in one of the ways described earlier, it would give a stack trace like the following example:

Thread 1 "test_shared" received signal SIGSEGV, Segmentation fault.
0x00007ffff65f9468 in libcuda_8e2eae48ba8eb68460582f76460557784d48a71a () from /lib/x86_64-linux-gnu/libcuda.so.1
(gdb) bt
#0  0x00007ffff65f9468 in libcuda_8e2eae48ba8eb68460582f76460557784d48a71a () from /lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff6657e1f in libcuda_10c0735c5053f532d0a8bdb0959e754c2e7a4e3d () from /lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff6013845 in libcudart_43d9a0d553511aed66b6c644856e24b360d81d0c () from /usr/local/cuda/lib64/libcudart.so.12
#3  0x00007ffff604e698 in cudaStreamDestroy () from /usr/local/cuda/lib64/libcudart.so.12
#4  0x00005555555554e3 in main ()

The symbolized call stack can then be documented as part of the bug description provided to NVIDIA for analysis.

Conclusion

When you have to profile and debug applications using CUDA and want to share a call stack with NVIDIA for analysis, use the CUDA symbol server. Profiling and debugging will be faster and easier.

For questions or issues, dive into the forum at Developer Tools.