
NVIDIA CUDA Toolkit 12.2 Unleashes Powerful Features for Boosting Applications


The latest release of NVIDIA CUDA Toolkit 12.2 introduces a range of essential new features, modifications to the programming model, and enhanced support for hardware capabilities accelerating CUDA applications.

Now generally available from NVIDIA, CUDA Toolkit 12.2 includes many new capabilities, both major and minor.

This post offers an overview of many of the key capabilities, including:

  • NVIDIA Hopper (H100) GPU support.
  • Early access to NVIDIA Confidential Computing (CC) for Hopper GPUs.
  • Heterogeneous Memory Management (HMM) support.
  • A lazy loading default setting.
  • Application prioritization with CUDA Multi-Process Service (MPS).
  • Enhanced math libraries like cuFFT.
  • NVIDIA Nsight Compute and NVIDIA Nsight Systems Developer Tools updates.

As pioneers in accelerated computing, NVIDIA creates solutions for helping solve the world’s toughest computing challenges. Accelerated computing requires full-stack optimization, from chip architecture, systems, and acceleration libraries, to security and network connectivity. It all begins with the CUDA Toolkit.

Watch the CUDA Toolkit 12.2 YouTube Premiere webinar.

Hopper GPU support

New H100 GPU architecture features are now supported with programming model enhancements for all GPUs, including new PTX instructions and exposure through higher-level C and C++ APIs. One example is Hopper Confidential Computing (see the following section to learn more), which is available as an early-access deployment exclusively on the Hopper GPU architecture.

Confidential computing for Hopper

The Hopper Confidential Computing early-access software release features a complete software stack targeting a single H100 GPU in passthrough mode, with a single session key for encryption and authentication, and basic use of NVIDIA Developer Tools. User code and data are encrypted and authenticated to the AES-GCM standard.

There is no need for any specific H100 SKUs, drivers, or toolkit downloads. Confidential computing with H100 GPUs requires a CPU that supports virtual machine (VM)–based TEE technology, such as AMD SEV-SNP and Intel TDX.

Read the Protecting Sensitive Data and AI Models with Confidential Computing post, which highlights OEM partners shipping CC-compatible servers.

The following figure compares data flow when using a VM when CC is on and off.

This figure shows two green rectangles with workflow diagrams showing data transfer between the CPU, a virtual machine, and the GPU. The left rectangle has Confidential Computing off, while the right rectangle has Confidential Computing on. With CC on, a new blue rectangle within the green one is introduced, labeled the TEE (trusted execution environment). It shows a new protected area depicting encrypted transfers between the VM and GPU.
Figure 1. Comparing the flow of instructions and data with confidential computing on and off

In Figure 1, a traditional VM is set up on the left. In this mode, the hypervisor assigns an H100 GPU (without CC mode enabled). While the hypervisor is isolated and protected from a malicious VM, the reverse isn’t true: the hypervisor can access the entire VM memory space and has direct access to the GPU.

The right side of Figure 1 shows the same environment but on a confidential computing-capable machine. The CPU architecture isolates the now confidential virtual machine (CVM) from the hypervisor such that it can no longer access its memory pages. The H100 is also configured so all external accesses are disabled, except for the path between it and the CVM. The CVM and the H100 have encrypted and signed transfers across the PCIe bus, preventing an attacker with a bus analyzer from making use of, or silently corrupting, the data.

While using the early-access release, employ good practices and test only synthetic data and non-proprietary AI models. Security reviews, performance enhancements, and audits aren’t finalized.

Hopper Confidential Computing does not include encryption key rotation at this time. To learn more, see the post What Is Confidential Computing?

Heterogeneous memory management

The release also introduces heterogeneous memory management (HMM). This technology extends unified virtual memory support for seamless sharing of data between host memory and accelerator devices without needing memory allocated by or managed through CUDA. This makes porting applications into CUDA, or working with external frameworks and APIs, significantly easier. 

Currently, HMM is supported on Linux only and requires a recent kernel (6.1.24+ or 6.2.11+) along with using the NVIDIA GPU Open Kernel Modules driver.

Some limitations exist with the first release and the following are not yet supported:

  • GPU atomic operations on file-backed memory.
  • Arm CPUs.
  • HugeTLBfs pages on HMM.
  • The fork() system call when attempting to share GPU-accessible memory between parent and child processes.

HMM is also not yet fully optimized and may perform slower than programs using cudaMalloc(), cudaMallocManaged(), or other existing CUDA memory management APIs.
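
As a minimal illustration of what HMM enables (assuming a Linux system that meets the kernel and open-kernel-module requirements above), the following sketch launches a kernel that directly dereferences memory obtained from plain malloc, with no cudaMalloc or cudaMallocManaged call:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    // Plain CPU allocation: no CUDA allocation or registration call is made.
    float *data = static_cast<float *>(malloc(n * sizeof(float)));
    for (int i = 0; i < n; ++i) data[i] = 1.0f;

    // With HMM, the GPU can access this host-allocated memory directly.
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);  // expected: 2.000000
    free(data);
    return 0;
}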

Lazy loading

A feature NVIDIA initially introduced in CUDA 11.7 as an opt-in, lazy loading is now enabled by default on Linux with the R535 driver and beyond. Lazy loading can substantially reduce both the host and device memory footprint by loading only CUDA kernels and library functions as needed. It’s common for complex libraries to contain thousands of different kernels and variants. This results in substantial savings.

Lazy loading remains under user control; only the default value has changed. You can disable the feature on Linux by setting the environment variable before launching your application:

CUDA_MODULE_LOADING=EAGER 

While disabling lazy loading on Windows is currently unavailable, you can enable it on Windows by setting the environment variable before launch:

CUDA_MODULE_LOADING=LAZY
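
To confirm which mode is active at run time, you can query the CUDA runtime; a minimal sketch using cudaModuleGetLoadingMode:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaModuleLoadingMode mode;
    if (cudaModuleGetLoadingMode(&mode) == cudaSuccess) {
        // cudaModuleLazyLoading is reported when lazy loading is in effect.
        printf("Module loading: %s\n",
               mode == cudaModuleLazyLoading ? "LAZY" : "EAGER");
    }
    return 0;
}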

Application prioritization with CUDA MPS

When running applications with CUDA MPS, each application is often coded as the only application present in the system. As such, its individual stream priorities may assume no system-level contention. In practice, however, users often want to make certain processes a higher or lower priority globally.

To help address this requirement, a coarse-grained per-client priority mapping at runtime for CUDA MPS is now available. This allows multiple processes running under MPS to arbitrate priority at a coarse-grained level without changing the application code.

A new environment variable, CUDA_MPS_CLIENT_PRIORITY, accepts two values: NORMAL priority (0) and BELOW_NORMAL priority (1).

For example, given two clients, a potential configuration is as follows:

Client 1 environment: export CUDA_MPS_CLIENT_PRIORITY=0  // NORMAL
Client 2 environment: export CUDA_MPS_CLIENT_PRIORITY=1  // BELOW NORMAL
Table 1. An example configuration for setting priority variables

It’s worth noting that this doesn’t introduce priority-preemptive scheduling or hard real-time processing into the GPU scheduler. It does provide additional information to the scheduler about which kernels should enqueue when.

cuFFT LTO preview

An early access preview of the cuFFT library containing support for new and enhanced LTO-enabled callback routines is now available for download on Linux and Windows. LTO-enabled callbacks bring callback support for cuFFT on Windows for the first time. On Linux, these new enhanced callbacks offer a significant boost to performance in many callback use cases. You can learn more and download this new update on the cuFFT web page.

In CUDA 12.0, NVIDIA introduced the nvJitLink library for supporting Just-In-Time Link-Time Optimization (JIT LTO) in CUDA applications. This preview builds upon nvJitLink to leverage JIT LTO for LTO-enabled callbacks by enabling runtime fusion of user callback code and library kernel code. 
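
The cuFFT preview performs this linking for you, but as a rough, hedged sketch of what nvJitLink does underneath, the following links an LTO-IR blob into a CUBIN at run time (ltoir_data, ltoir_size, and the sm_80 target are illustrative placeholders; consult the nvJitLink documentation for the exact options your case requires):

#include <cstddef>
#include <vector>
#include <nvJitLink.h>

std::vector<char> link_ltoir(const void *ltoir_data, size_t ltoir_size) {
    nvJitLinkHandle handle;
    const char *options[] = {"-lto", "-arch=sm_80"};   // assumed target architecture
    nvJitLinkCreate(&handle, 2, options);

    // Add the LTO-IR produced at compile time (for example, user callback code).
    nvJitLinkAddData(handle, NVJITLINK_INPUT_LTOIR, ltoir_data, ltoir_size, "callback");

    nvJitLinkComplete(handle);                          // performs the link-time optimization

    size_t cubin_size = 0;
    nvJitLinkGetLinkedCubinSize(handle, &cubin_size);
    std::vector<char> cubin(cubin_size);
    nvJitLinkGetLinkedCubin(handle, cubin.data());      // final device code, ready to load

    nvJitLinkDestroy(&handle);
    return cubin;
}

Error checking is omitted for brevity; each call returns an nvJitLinkResult that should be checked in real code.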

Nsight Developer Tools

Nsight Developer Tools are included in the CUDA Toolkit to help with debugging and performance profiling for CUDA applications. Tools for GPU development are already compatible with the H100 architecture. Support for the NVIDIA Grace CPU architecture is now available in Nsight Systems, for system-wide performance profiling.

Nsight Systems traces and analyzes platform hardware metrics, like CPU and GPU interactions, as well as CUDA apps, APIs, and libraries on a unified timeline. Version 2023.2, available in CUDA Toolkit 12.2, introduces Python backtrace sampling.

GPU-accelerated Python is transforming AI workloads. With periodic sampling of Python code, the Nsight Systems timeline offers a deeper understanding of which algorithms are involved when refactoring toward maximum GPU usage. Python sampling joins multi-node analysis and network metric collection to help optimize computing at data center scale; learn more about accelerating data center and HPC performance analysis with Nsight Systems.

Nsight Compute provides detailed performance profiling and analysis of CUDA kernels running on a GPU. Version 2023.2 adds a new sorted list of detected performance issues on the summary page, including estimated speedups for correcting the issue. This list guides performance tuning focus and helps users avoid spending time on unnecessary issues.

Another key feature added is performance rule markers at the source-line level on the source page. Previously, issues detected with the built-in performance rules were only displayed on the details page. Now, issues are marked with a warning icon on the source page. Performance metrics identify the location.   

This figure shows an Nsight Compute screen capture displaying issues with the source code that may reduce performance.
Figure 3. Examine performance issues line by line in the source code viewer

These new features extend the guided analysis at both the high-level summary view and low-level source view, further improving Nsight Compute performance profiling and analysis capabilities.  

CUDA Toolkit 12.2 also equips you with the latest debugging tools, including Compute Sanitizer. Learn how to debug CUDA code with Compute Sanitizer.

Summary

The latest CUDA Toolkit release introduces new features essential to boosting CUDA applications that create the foundation for accelerated computing applications. From chip architecture, NVIDIA DGX Cloud and NVIDIA DGX SuperPOD platforms, AI Enterprise software, and libraries, to security and accelerated network connectivity, the CUDA Toolkit offers incomparable full-stack optimization.

Do you still have questions? Register now to join our CUDA experts in a special AMA covering everything featured in CUDA 12 on July 26, 2023: https://nvda.ws/3XEcy2m.



New MLPerf Inference Network Division Showcases NVIDIA InfiniBand and GPUDirect RDMA Capabilities


In MLPerf Inference v3.0, NVIDIA made its first submissions to the newly introduced Network division, which is now part of the MLPerf Inference Datacenter suite. The Network division is designed to simulate a real data center setup and strives to include the effect of networking—including both hardware and software—in end-to-end inference performance.

In the Network division, there are two types of nodes: Frontend nodes generate the queries, which are sent over the network to be processed by the Accelerator nodes, which perform inference. These are connected through standard network fabrics such as Ethernet or InfiniBand.

Diagram shows CPUs, memory, storage, and GPUs all in one node. A separate Frontend node with CPUs, memory, and storage connects through the network to an Accelerator node with CPUs, memory, storage, and GPUs.
Figure 1. Closed division vs. Network division nodes

Figure 1 shows that the Closed division runs entirely on a single node. In the Network division, queries are generated on the Frontend node and transferred to the Accelerator node for inferencing.

In the Network division, the Accelerator nodes incorporate the inference accelerators as well as all networking components. This includes the network interface controller (NIC), network switch, and network fabric. So, while the Network division seeks to measure the performance of the Accelerator node and network, it excludes the impact of the Frontend node as the latter plays a limited role in the benchmarking.

NVIDIA performance in the MLPerf Inference v3.0 Network division

In MLPerf Inference v3.0, NVIDIA made Network division submissions on both the ResNet-50 and BERT workloads. The NVIDIA submissions achieved 100% of single-node performance on ResNet-50 by using the extremely high network bandwidth and low latency of GPUDirect RDMA technology on NVIDIA ConnectX-6 InfiniBand smart adapter cards.

Benchmark                  Scenario   DGX A100 (8x A100 80GB): Network division performance compared to the Closed division
ResNet-50 (Low Accuracy)   Offline    100%
ResNet-50 (Low Accuracy)   Server     100%
BERT (Low Accuracy)        Offline    94%
BERT (Low Accuracy)        Server     96%
BERT (High Accuracy)       Offline    90%
BERT (High Accuracy)       Server     96%
Table 1. ResNet-50 and BERT performance achieved in the Network division compared to the Closed division, with minimum network bandwidth required to sustain the achieved performance

The NVIDIA platform also showcased great performance on the BERT workload, with only a slight performance impact relative to the corresponding Closed division submission observed due to host-side batching overhead.

Technologies used in the NVIDIA Network division submission

A host of full-stack technologies enabled the strong NVIDIA Network division performance:

  • An NVIDIA TensorRT backend for the optimized inference engine
  • InfiniBand RDMA network transfers for low-latency, high-throughput tensor communication, built upon IBV verbs in the Mellanox OFED software stack
  • An Ethernet TCP socket for configuration exchange, run-state synchronization, and heartbeat monitoring
  • A NUMA-aware implementation that uses CPU, GPU, and NIC resources for the best performance

Network division implementation details

Here’s a closer look at the implementation details of the MLPerf Inference Network division:

  • InfiniBand for high-throughput, low-latency communication
  • Network division inference flow
  • Performance optimizations

InfiniBand for high-throughput, low-latency communication

The Network division requires the submitter to implement a query dispatch library (QDL) that takes the queries from the load generator and dispatches them to the Accelerator nodes in a way suited to the submitter’s setup.

  • In the Frontend node, where the input tensor sequence is generated, the QDL abstracts the LoadGen system under test (SUT) API to make it appear to the MLPerf Inference LoadGen that the accelerators are available to test locally.
  • In the Accelerator node, the QDL makes it appear that it interacts with LoadGen for inference requests and responses directly. Within the QDL implementation by NVIDIA, we implement seamless data communication and synchronization using InfiniBand IBV verbs and an Ethernet TCP socket.
Architecture diagram: IBDevice captures SmartNIC discovered in the system. IBCfgs provides user configuration about the NIC device to use, and its parameters such as buffer size, queue size, and NUMA information. IBResources maintains local and remote buffers as well as memory regions registered for RDMA and the associated IBDevice. IBConnection establishes InfiniBand queue pairs and provides communication hooks.
Figure 2. InfiniBand data exchange component inside the QDL

Figure 2 shows the data exchange component built in the QDL with InfiniBand technology.

Diagram shows the connection of the Frontend node and the Accelerator node through their respective network interface cards (NICs).
Figure 3. Example connection established between the Frontend and Accelerator nodes

Figure 3 shows how connections are established between two nodes using this data exchange component.

InfiniBand queue pairs (QPs) are the basic connection point between the nodes. The NVIDIA implementation uses the lossless reliable connection (RC) transfer mode, which is similar to TCP, while relying on an InfiniBand HDR optical fabric solution to sustain up to 200 Gbits/sec of throughput.

When the benchmarking begins, the QDL initialization discovers the InfiniBand NICs available in the system. Following the configuration stored in IBCfgs, the NICs designated for use are populated as IBDevice instances. During this population, memory regions for RDMA transfers are allocated, pinned, and registered as RDMA buffers and kept in IBResources, together with proper handles.

RDMA buffers of the Accelerator node reside in GPU memory to leverage GPUDirect RDMA. The RDMA buffer information, as well as the proper protection keys, are then communicated with the peer node through a TCP socket over Ethernet. This creates the IBConnection instance for the QDL.
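
A hedged sketch of that registration step with plain libibverbs follows; pd stands for an already-created protection domain, and the nvidia-peermem kernel module must be loaded for the NIC to register GPU memory:

#include <cstddef>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

// Allocate an RDMA buffer directly in GPU memory and register it with the NIC,
// so transfers can bypass host memory (GPUDirect RDMA).
ibv_mr *register_gpu_rdma_buffer(ibv_pd *pd, size_t bytes, void **gpu_ptr) {
    cudaMalloc(gpu_ptr, bytes);                 // RDMA buffer lives in GPU memory
    // The returned mr's rkey, plus the buffer address, are what get exchanged
    // with the peer over the Ethernet TCP socket, as described above.
    return ibv_reg_mr(pd, *gpu_ptr, bytes,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}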

The QDL implementation is NUMA-aware and maps the closest NUMA host memory, CPUs, and GPUs to each NIC. Each NIC uses the IBConnection to talk to a peer’s NIC.

Network division inference flow

Diagram shows the key components of the Frontend node and Accelerator nodes, as well as how an inference request is transmitted from the Frontend node to the Accelerator node.
Figure 4. Inference request flow from the Frontend node to the Accelerator node using GPUDirect RDMA

Figure 4 shows how the inference request is sent from the Frontend node and processed at the Accelerator node:

  1. LoadGen generates a query (inference request), which contains input tensors.
  2. The QDL redirects this query to the appropriate IBConnection based on the arbitration scheme.
  3. The query sample library (QSL) may be already registered within the RDMA buffer. If not, the QDL stages (copies) the query to the RDMA buffer.
  4. QDL initiates the RDMA transfer with the associated QP.
  5. The InfiniBand network transfer happens through the network switch.
  6. The query arrives at the peer’s QP.
  7. The query is then transferred to the destination RDMA buffer through Direct Memory Access.
  8. RDMA completion is acknowledged in the Accelerator node QDL.
  9. The QDL enables the Accelerator node to batch this query. The QDL tags the batch of queries to be issued to one of the Accelerator node’s accelerators.
  10. The Accelerator node’s accelerator performs inference using CUDA and TensorRT and produces a response in the RDMA buffer.

When the inference is ultimately performed as in step 10, the output tensors are generated and populated in the RDMA buffer. Then the Accelerator node starts transferring the response tensors to the Frontend node in a similar fashion but in the opposite direction.

Performance optimizations

The NVIDIA implementation uses InfiniBand RDMA Write to take advantage of its low latency. For an RDMA Write to succeed, the sender must explicitly manage the target memory buffers.

Both Frontend and Accelerator nodes manage buffer trackers to make sure that each query and response is kept in memory until consumed. As an example, ResNet-50 requires that up to 8K transactions be managed per connection (QP) to sustain the performance.
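
As a rough sketch of what posting one such one-sided transfer looks like with libibverbs (qp, local_mr, remote_addr, and rkey are placeholders for connection state the QDL would already hold after setup):

#include <cstdint>
#include <infiniband/verbs.h>

int post_rdma_write(ibv_qp *qp, ibv_mr *local_mr, void *local_buf, size_t len,
                    uint64_t remote_addr, uint32_t rkey, uint64_t wr_id) {
    ibv_sge sge = {};
    sge.addr   = reinterpret_cast<uintptr_t>(local_buf);
    sge.length = static_cast<uint32_t>(len);
    sge.lkey   = local_mr->lkey;

    ibv_send_wr wr = {}, *bad_wr = nullptr;
    wr.wr_id               = wr_id;              // lets the transaction tracker match completions
    wr.opcode              = IBV_WR_RDMA_WRITE;  // one-sided write into the peer's RDMA buffer
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = 0;                  // left unsignalled; see the optimizations below
    wr.wr.rdma.remote_addr = remote_addr;        // peer buffer address exchanged at setup
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);      // 0 on success
}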

Several key optimizations are included in the NVIDIA implementation.

The following key optimizations support better scalability:

  • Transaction tracker per IBConnection (QP):  Each IBConnection has an isolated transaction tracker resulting in lock-free, intra-connection transaction bookkeeping.
  • Multiple QP support per NIC: An arbitrary number of IBConnections can be instantiated on any NIC, making it easy to support a large number of simultaneous transactions.

The following key optimizations improve InfiniBand resource efficiency:

  • Use of INLINE transfer for small messages:  Transferring small messages (typically less than 64 bytes) through INLINE transfer significantly improves performance and efficiency by avoiding PCIe transfers.
  • Use of UNSIGNALLED RDMA Write: CQ maintenance becomes much more efficient, as UNSIGNALLED transactions wait in the CQ until a SIGNALLED transaction happens, triggering completion handling of all transactions queued up so far (bulk completion) in the same node.
  • Use of solicited IB transfers: Unsolicited RDMA transactions may queue up in the remote node until a solicited RDMA transaction happens, triggering bulk completion in the remote node.
  • Event-based CQ management:  Avoiding busy waiting for CQ management frees up CPU cycles.

The following key optimizations improve memory system efficiency:

  • RDMA transfer without staging in Frontend node:  When sending input tensors, avoid host memory copies by populating input tensors in the RDMA registered memory.
  • Aggregating (CUDA) memcpys in the Accelerator node: Improve the efficiency of GPU memory copies and PCIe transfers by gathering tensors in consecutive memory as much as possible.

Each vendor’s QP implementation details the maximum number of completion queue entries (CQEs) supported, as well as the supported maximum QP entry sizes. It’s important to scale the number of QPs per NIC to cover the latency while sustaining enough transactions on-the-fly to achieve maximum throughput.

Host CPUs can also be stressed significantly if an extremely large number of transactions are handled in a short time from CQ by polling. Event-based CQ management, together with a reduction in the number of notifications, helps greatly in this case. Maximize the memory access efficiency by aggregating data in contiguous space as much as possible and, if possible, in the RDMA registered memory space. This is crucial to achieving maximum performance.
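
A minimal libibverbs sketch of that event-based pattern, assuming the completion queue was created with a completion channel:

#include <infiniband/verbs.h>

// Sleep until the NIC signals a completion event, then drain the CQ in bulk.
int wait_and_drain_cq(ibv_comp_channel *channel, ibv_cq *cq) {
    ibv_cq *ev_cq = nullptr;
    void   *ev_ctx = nullptr;

    ibv_req_notify_cq(cq, 0);                        // arm the CQ for the next completion event
    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))  // blocks; no CPU spent busy waiting
        return -1;
    ibv_ack_cq_events(ev_cq, 1);

    ibv_wc wc[64];
    int total = 0, n;
    while ((n = ibv_poll_cq(ev_cq, 64, wc)) > 0)     // bulk completion handling
        total += n;
    return total;
}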

Summary

The NVIDIA platform delivered exceptional performance in its inaugural Network division submission, on top of our continued performance leadership in the MLPerf Inference: Datacenter Closed division. These results were achieved using many NVIDIA platform capabilities, including NVIDIA TensorRT and InfiniBand GPUDirect RDMA on NVIDIA ConnectX-6 adapters.

The results provide a further demonstration of the performance and versatility of the NVIDIA AI platform in real data center deployments on industry-standard, peer-reviewed benchmarks.


Here’s the Deal: Steam Summer Sale Games Streaming on GeForce NOW

GFN Thursday arrives alongside the sweet Steam Summer Sale — with hundreds of PC games playable on GeForce NOW available during Valve’s special event for PC gamers. Also on sale, OCTOPATH TRAVELER and OCTOPATH TRAVELER II join the GeForce NOW library as a part of five new games coming to the service this week.


‘My Favorite 3D App’: Blender Fanatic Shares His Japanese-Inspired Scene This Week ‘In the NVIDIA Studio’

A diverse range of artists, fashionistas, musicians and the cinematic arts inspired the creative journey of Pedro Soares, aka Blendeered, and helped him fall in love with using 3D to create art.


XPENG Unleashes G6 Coupe SUV for Mainstream Market

China electric vehicle maker XPENG Motors has announced its new G6 coupe SUV — featuring an NVIDIA-powered intelligent advanced driver assistance system — is now available to the China market. The G6 is XPENG’s first model featuring the company’s proprietary Smart Electric Platform Architecture (SEPA) 2.0, which aims to reduce development and manufacturing costs and…


NVIDIA CEO, European Generative AI Execs Discuss Keys to Success

Three leading European generative AI startups joined NVIDIA founder and CEO Jensen Huang this week to talk about the new era of computing. More than 500 developers, researchers, entrepreneurs and executives from across Europe and further afield packed into the Spindler and Klatt, a sleek, riverside gathering spot in Berlin. Huang started the reception by…


A Change in the Weather: AI, Accelerated Computing Promise Faster, More Efficient Predictions

The increased frequency and severity of extreme weather and climate events could take a million lives and cost $1.7 trillion annually by 2050, according to the Munich Reinsurance Company. This underscores a critical need for accurate weather forecasting, especially with the rise in severe weather occurrences such as blizzards, hurricanes and heatwaves. AI and accelerated…


Event: CUDA 12.2 YouTube Premiere


On July 6, join experts for a deep dive into CUDA 12.2, including support for confidential computing.


Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines


Deep learning is achieving significant success in various fields and areas, as it has revolutionized the way we analyze, understand, and manipulate data. There are many success stories in computer vision, natural language processing (NLP), medical diagnosis and health care, autonomous vehicles, recommendation systems, and climate and weather modeling.

In an era of ever-growing neural network models, the high demand for computational speed becomes a big challenge for hardware and software. Model pruning and low-precision inference are useful solutions.

Starting with the NVIDIA Ampere architecture and the introduction of the A100 Tensor Core GPU, NVIDIA GPUs have the fine-grained structured sparsity feature, which can be used to accelerate inference. For more information, see the NVIDIA A100 Tensor Core GPU Architecture: Unprecedented Acceleration at Every Scale whitepaper. 

In this post, we introduce some training recipes for such sparse models to maintain accuracy, including the basic recipes, the progressive recipes, and the combination with int8 quantization. We also discuss how to do the inference with the structured sparsity in the NVIDIA Ampere architecture.

Tencent’s Machine Learning Platform department (MLPD) used the progressive training techniques to simplify training and achieve better accuracy. With the sparsity feature and some quantization techniques, they achieved 1.3–1.8x acceleration in Tencent’s offline services.

Structured sparsity in the NVIDIA Ampere architecture

NVIDIA Ampere and NVIDIA Hopper architecture GPUs add the new feature of fine-grained structured sparsity, which can mainly be used to accelerate inference workloads. This feature is supported by sparse Tensor Cores, which require a 2:4 sparsity pattern: in each group of four contiguous values, at least two must be zero, which is a 50% sparsity rate.

This pattern can have efficient memory access, good speedup, and can easily recover accuracy. After compression, only non-zero values and the associated index metadata are stored (Figure 1). The sparse Tensor Cores process only the non-zero values when doing matrix multiplication and theoretically, the compute throughput would be 2x compared to the equivalent dense matrix multiplication.

Diagram shows a structured-sparse matrix with a 2:4 sparsity pattern. In four contiguous values, at least two values are zero. After compression, only non-zero values and the associated index metadata are stored.
Figure 1. 2:4 structured sparsity pattern and compression

Structured sparsity can mainly be applied on fully connected layers and convolution layers where 2:4 sparse weights are provided. If you prune the weights of these layers in advance, then these layers can be accelerated by structured sparsity.
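
As a concrete illustration of the pattern, the following host-side sketch enforces 2:4 sparsity on a weight buffer by keeping the two largest-magnitude values in each group of four; prune_n_of_4 is a hypothetical helper, written generically so it can also serve the progressive recipe later in this post:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Zero the n_zeros smallest-magnitude values in every contiguous group of four.
void prune_n_of_4(std::vector<float> &w, int n_zeros) {
    for (std::size_t g = 0; g + 4 <= w.size(); g += 4) {
        int idx[4] = {0, 1, 2, 3};
        std::sort(idx, idx + 4, [&](int a, int b) {
            return std::fabs(w[g + a]) < std::fabs(w[g + b]);
        });
        for (int j = 0; j < n_zeros; ++j) w[g + idx[j]] = 0.0f;
    }
}

// 2:4 structured sparsity: two zeros per group of four (50% sparsity).
// prune_n_of_4(weights, 2);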

Training recipes

Because directly pruning the weights decreases the model accuracy, you should do some training to recover the accuracy when using structured sparsity. Here, we introduce some basic recipes and new progressive recipes.

Basic recipes

The basic recipes maintain the model accuracy without any hyperparameter tuning. For more information, see Accelerating Sparse Deep Neural Networks.

The workflow is easy to follow:

  1. Train a model without sparsity.
  2. Prune the model in a 2:4 sparse pattern for the FC and convolution layers.
  3. Retrain the pruned model by following these rules:
    1. Initialize the weights to the values from Step 2.
    2. Use the same optimizer and hyperparameter (learning rate, schedule, number of epochs, and so on) as in Step 1.
    3. Maintain the sparsity pattern computed in Step 2.
Diagram shows that the basic training recipe is simply repeating the original dense training with the pruned weights and the masked optimizer.
Figure 2. Basic training recipe

There are also some advanced recipes for complicated cases.

For example, apply sparse training in multiple stages. For some object detection models, if the dataset in the downstream task is large enough, you can just repeat the fine-tuning with sparsity. For models like BERT-SQuAD, the dataset is relatively small, and you must apply the sparsity to the pretraining phase for better accuracy.

Also, you can easily combine the sparsity training with int8 QAT by inserting the quant nodes before fine-tuning. All these training and fine-tuning methods are one-shot, as the final model is obtained after one sparse training process.

Progressive training recipes

One-shot sparse fine-tuning can cover most tasks and achieve speedup without accuracy loss. On the other hand, for some difficult tasks that are sensitive to weight changes, one-shot sparsity for all weights causes large information loss. It’s difficult to recover accuracy by only fine-tuning with small datasets, and sparse pretraining is also needed for these tasks.

Sparse pretraining requires more data and is more time-consuming. So, inspired by pruning methods for CNNs, progressive sparsity is introduced to apply sparsity only in the fine-tuning phase for such tasks, without too much accuracy loss. For more information, see Learning Both Weights and Connections for Efficient Neural Networks.

Diagram compares one-shot sparsity and progressive sparsity to show that the latter divides the sparsity ratio into several steps for easy recovery.
Figure 3. The idea of progressive sparsity

The key idea of progressive sparsity is to divide the target sparsity ratio into several small steps.

S_{target} = \sum_{n=1}^{N} S_n

As shown in the equation and Figure 4, for a target sparsity ratio S, you divide it into N steps, which facilitates the rapid recovery of information during the fine-tuning process. Based on our experiments, progressive sparsity can achieve higher accuracy than one-shot sparsity with the same fine-tuning epochs.

Diagram shows three-step example case for progressive sparsity: prune 25% weights, fine-tune, and then prune 50% weights and finetune.
Figure 4. Progressive sparsity (50% sparsity in a 2:4 pattern)

Taking 50% sparsity in a 2:4 pattern as a simple case, divide the sparsity ratio into two steps, and progressively sparse and fine-tune the weights in the network.

As shown in Figure 4, you first compute the mask of weights to achieve 25% sparsity and then perform sparse fine-tuning to recover the accuracy. Finally, you recalculate the mask to 50% sparsity and fine-tune the network to obtain a sparse model without loss of accuracy.
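
Expressed with the hypothetical prune_n_of_4 helper sketched earlier, the two-step schedule amounts to:

prune_n_of_4(weights, 1);   // step 1: one zero per group of four (25% sparsity)
// ... sparse fine-tuning to recover accuracy, keeping the mask fixed ...
prune_n_of_4(weights, 2);   // step 2: two zeros per group of four (the final 2:4 pattern)
// ... sparse fine-tuning again before deployment ...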

Sparse-QAT: Combine sparsity with quantization and distillation

To obtain lighter models, you can further combine sparsity with quantization and distillation in a method called sparse-QAT.

Quantization (PTQ and QAT)

The following equation formulates a general quantization procedure. For a float32 value x, use Q[x] to denote its quantized value with K-bit representation.

Q[x] = s \times \mathrm{round}\left(\mathrm{clamp}\left(\frac{x}{s}, l_{min}, l_{max}\right)\right)

In general, you first quantize the original parameters to a certain range and round them to integers. Then, you use the quantization scale to recover the original value. This motivates your first quantization method, calibration, also known as post-training quantization (PTQ).
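
For example, a simulated 8-bit symmetric quantization of a single value follows the equation directly, with l_min = -128, l_max = 127, and a scale s assumed to come from calibration (the numbers are purely illustrative):

#include <algorithm>
#include <cmath>
#include <cstdio>

// Quantize x to 8-bit with scale s, then dequantize back to float ("fake quantization").
float quantize_dequantize(float x, float s) {
    const float l_min = -128.0f, l_max = 127.0f;
    float q = std::round(std::min(std::max(x / s, l_min), l_max));  // clamp, then round
    return s * q;
}

int main() {
    float s = 0.05f;                                   // example calibration scale
    printf("%f\n", quantize_dequantize(1.234f, s));    // prints 1.250000
    return 0;
}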

In calibration, a critical thing is to set an appropriate quantization scale. If the scale is too large, it is less accurate for numbers within the range. On the contrary, if the scale is too small, it results in too many numbers outside the range of l_{min} to l_{max}.

Therefore, to balance these two aspects, you first obtain the distribution of tensor values and then set the quantization scale to include 99.99% of the numbers. It has been proved in multiple works that this method is helpful in finding a good quantization scale during calibration.

However, although you have set a reasonable quantization scale for calibration, the accuracy still drops significantly for 8 bits. You then introduce quantization-aware training (QAT) to further improve the accuracy after calibration. The idea of QAT is to train a model with simulation quantization.

In the forward pass, you quantize the weights to int8 and dequantize them to float in the node to simulate quantization. In the backward pass, the gradients are used to update the model weights, which is called straight-through estimation (STE). The key idea is the following equation:

\frac{\partial f}{\partial x} = \frac{\partial f}{\partial Q[x]}

Gradients of values in the threshold ranges are passed directly in the backward pass, and gradients of values outside the threshold ranges are clipped to 0.

Knowledge distillation

In addition to the previous methods, we introduced knowledge distillation (KD) to further ensure the accuracy of the sparse-QAT model. Take the original model as the teacher and the quantized sparse model as the student.

During the fine-tuning process, we adopted mini-distillation, which is a layer-wise distillation. With MiniLM, you only have to use the self-attention output of the last transformer layer to do the distillation. You can even achieve higher accuracy than the teacher model after Sparse-QAT. For more information, see MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers.

Pipeline of Sparse-QAT

Figure 5 shows the pipeline of Sparse-QAT. You use sparsity, quantization, and KD together to get the final sparse-int8 model. There are three passes:

  • Sparsity pass: Apply progressive sparsity to get a sparse tensor.
  • Quantization pass: Use PTQ and QAT to get an int8 tensor.
  • Knowledge distillation pass: Use MiniLM to guarantee the accuracy of the final sparse-int8 model.
Workflow diagram shows that sparsity, quantization, and knowledge distillation are combined to get the final sparse-int8 model.
Figure 5. Pipeline of sparse-QAT

Inference with NVIDIA Ampere architecture sparsity

After the sparse model is trained, you can use TensorRT and cuSPARSELt to accelerate the inference with NVIDIA Ampere architecture structured sparsity.

Inference with NVIDIA TensorRT

NVIDIA TensorRT supports sparse convolution as of version 8.0; GEMMs should be replaced with 1×1 convolutions to use sparse inference. Enabling sparsity in TensorRT is easy. Before importing the model into TensorRT, its weights must already have 2:4 sparsity. If trtexec is used to build the engine, set the --sparsity=enable flag. If you are writing code or scripts to build the engine, set the build config as follows:

For C++: config->setFlag(BuilderFlag::kSPARSE_WEIGHTS)

For Python: config.set_flag(trt.BuilderFlag.SPARSE_WEIGHTS)
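
Putting it together, a minimal C++ sketch of building a sparsity-enabled engine might look like the following; builder and network are assumed to have been created and populated in the usual way, and combining sparsity with FP16 or INT8 is typical but optional:

#include <NvInfer.h>

nvinfer1::IHostMemory *build_sparse_engine(nvinfer1::IBuilder *builder,
                                           nvinfer1::INetworkDefinition *network) {
    nvinfer1::IBuilderConfig *config = builder->createBuilderConfig();
    config->setFlag(nvinfer1::BuilderFlag::kSPARSE_WEIGHTS);  // use 2:4 sparse Tensor Cores
    config->setFlag(nvinfer1::BuilderFlag::kFP16);            // commonly combined with sparsity
    // The weights imported into the network must already follow the 2:4 pattern.
    return builder->buildSerializedNetwork(*network, *config);
}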

Use NVIDIA cuSPARSELt to enhance TensorRT

For some use cases, the input sizes may vary and TensorRT may not provide the best performance. You can use NVIDIA cuSPARSELt to further accelerate these cases.

The solution is to write TensorRT plug-ins with cuSPARSELt. Initialize multiple descriptors and plans of cuSPARSELt sparse GEMM for different input sizes and choose an appropriate plan for each input.

Assuming that you implement the SpmmPluginDynamic plug-in, it inherits from nvinfer1::IPluginV2DynamicExt and you use a private struct to store the plans.

struct cusparseLtContext {
    cusparseLtHandle_t handle;
    std::vector<cusparseLtMatmulPlan_t> plans;
    std::vector<cusparseLtMatDescriptor_t> matAs, matBs, matCs;
    std::vector<cusparseLtMatmulDescriptor_t> matmuls;
    std::vector<cusparseLtMatmulAlgSelection_t> alg_sels;
};

A TensorRT plug-in should implement the configurePlugin method, which sets up the plug-in according to the input and output types and sizes. You initialize the cuSPARSELt-related structures in this method.

There is a constraint that the input size of cuSPARSELt should be a multiple of 4, 8, or 16 depending on the data type. For this post, we set it to a multiple of 16. For more information about the constraints, see cusparseLtDenseDescriptorInit.

// Build one set of descriptors and one plan per supported input size (multiples of 16);
// num_plans is a placeholder for however many size buckets the plug-in supports.
for (int i = 0; i < num_plans; ++i) {
    int m = 16 * (i + 1);
    // Initialize matAs[i], matBs[i], matCs[i], matmuls[i], and alg_sels[i] for this m,
    // then create plans[i] with cusparseLtMatmulPlanInit.
}

In the enqueue function, you can select the proper plan to do the matmul operation:

int m = inputDesc->dims.d[0];
int idx = (m + 15) / 16 - 1;      // pick the plan built for the next multiple of 16
float alpha = 1.0f;
float beta = 0.0f;
auto input = static_cast<const void *>(inputs[0]);
auto output = static_cast<void *>(outputs[0]);
cusparseStatus_t status = cusparseLtMatmul(
    &handle, &plans[idx], &alpha, input,
    weight_compressed, &beta, output, output, workSpace, &stream, 1);

Some applications in search engines

In this section, we show four applications that take advantage of sparsity in search engines:

  • Search relevance prediction aims to evaluate the relevance between the input text and the videos in the database.
  • Query performance prediction is used for the document recall delivery strategy.
  • A recall task for recalling the most relevant texts.
  • The text-to-image task automatically generates corresponding pictures according to the input prompt.

Search relevance cases

The results of these cases are evaluated with positive negative rate (PNR) or accuracy (Acc).

In relevance case 1, we ran Sparse-QAT and obtained a sparse-int8 model with higher PNR than the online int8 model in two important evaluation indices.

Model           Evaluation Index A   Evaluation Index B
float32         4.0138               3.1384
int8 (online)   3.9454               2.9120
Sparse-int8     4.0406               2.9591
Table 1. Relevance case 1 in a search engine (PNR)

In relevance case 2, the sparse-int8 model can achieve comparable Acc scores to the float32 model with a 1.4x inference speedup compared to the dense-int8 model.

Model                    Evaluation Index (Acc)
Bert_12L (float32)       0.8015
Bert_12L (sparse-int8)   0.8010
Table 2. Relevance case 2 in a search engine (Acc)

Query performance prediction cases

In this section, we show four cases of query performance prediction (QPP) that are evaluated with normalized discounted cumulative gain (NDCG). As shown in Table 3, these sparse-fp16 models achieve NDCG scores on par with, or slightly higher than, the original float32 models, with a four-fold speedup in inference.

                                          Case A   Case B   Case C   Case D
NDCG (float32 model)                      39723    39858    39878    32471
NDCG (sparse-float16)                     39744    39927    39909    32494
Relative inference time (float32)         1        1        1        1
Relative inference time (sparse-float16)  0.25     0.25     0.25     0.25
Inference speedup                         4x       4x       4x       4x
Table 3. Query performance prediction cases in a search engine

Query-doc case

Table 4 shows the result of a query-doc case in the search engine. With the proposed sparse-QAT pipeline, the sparse-int8 models can achieve 1.4x inference speedup with negligible accuracy loss compared with the dense-int8 model.

                    Acc      f1_345   recall_345
FP32 model          0.7839   0.8627   0.8312
Sparse-int8 model   0.7814   0.8622   0.8416
Table 4. Query-doc case in a search engine

Text-to-image cases

Figure 6 shows the results of text-to-image models. The top four images are the output of the dense float32 model, and the bottom four images are the output of the sparse float16 model.

From the results, you can see that, given the same prompt, the sparse model can produce comparable results with the ones from the dense model. Some of the results are more reasonable as the model pruning and extra progressive sparse fine-tuning make the model learn more from the data.

For the text-to-image cases, the sparse model produces more reasonable results than the dense model.
Figure 6. Text-to-image cases in search engine

Summary

The structured sparsity feature in the NVIDIA Ampere architecture can accelerate many deep learning workloads, and it is easy to use with TensorRT and cuSPARSELt.

For more information, see the Structured Sparsity in the NVIDIA Ampere Architecture and its Applications in Tencent WeChat Search GTC session. Download the latest TensorRT and cuSPARSELt versions.


AI, Digital Twins to Unleash Next Wave of Climate Research Innovation

AI and accelerated computing will help climate researchers achieve the miracles they need to achieve breakthroughs in climate research, NVIDIA founder and CEO Jensen Huang said during a keynote Monday at the Berlin Summit for the Earth Virtualization Engines initiative. “Richard Feynman once said that “what I can’t create, I don’t understand” and that’s the…