NVIDIA announces the newest CUDA Toolkit software release, 11.8. This release is focused on enhancing the programming model and CUDA application speedup through new hardware capabilities.
New architecture-specific features in NVIDIA Hopper and Ada Lovelace are initially being exposed through libraries and framework enhancements. The full programming model enhancements for the NVIDIA Hopper architecture will be released starting with the CUDA Toolkit 12 family.
CUDA 11.8 has several important features. This post offers an overview of the key capabilities.
NVIDIA Hopper and NVIDIA Ada architecture support
CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families.
CUDA and CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.
Lazy module loading
Building on the lazy kernel loading feature introduced in CUDA 11.7, NVIDIA has extended lazy loading to CPU-side modules. Functions and libraries now load faster on the CPU, sometimes with substantial reductions in memory footprint. The tradeoff is a small amount of latency at the point in the application where each function is first loaded. The overall latency is still lower than the total latency without lazy loading.
Libraries must be built with CUDA 11.7 or later to be eligible for lazy loading.
Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
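As a minimal sketch of how you might opt a single run into lazy loading without changing your shell configuration, the following Python snippet launches a command with CUDA_MODULE_LOADING=LAZY added to its environment. The helper name run_with_lazy_loading is hypothetical, and the probe command is only a stand-in for a real CUDA application: it echoes the variable back so that the example runs on a machine without a GPU.

```python
import os
import subprocess
import sys


def run_with_lazy_loading(cmd, base_env=None):
    """Run cmd with CUDA_MODULE_LOADING=LAZY added to its environment.

    Hypothetical helper for illustration; it does not check whether the
    CUDA libraries used by cmd were built with 11.7 or later.
    """
    # Copy the environment so the parent process is left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_MODULE_LOADING"] = "LAZY"
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=False)


# Stand-in for a real CUDA application: a child process that prints the variable.
probe = [sys.executable, "-c", "import os; print(os.environ.get('CUDA_MODULE_LOADING'))"]
result = run_with_lazy_loading(probe)
print(result.stdout.strip())
```

To compare behavior, you could run the same command once through this helper and once without it, then measure startup time and memory footprint with your usual tools.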
Improved MPS signal handling
You can now use SIGINT or SIGKILL to terminate any application running in an MPS environment without affecting other running processes. While not true error isolation, this enhancement enables more fine-grained application control, especially in bare-metal data center environments.
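The following Python sketch shows one way a job launcher might stop a single client process: send SIGINT first, then escalate to SIGKILL if the process does not exit within a grace period. It uses only generic POSIX signal handling from the Python standard library and nothing specific to MPS. The helper name stop_process, the grace period, and the sleeping child that stands in for an MPS client are all assumptions for illustration.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Send SIGINT, then escalate to SIGKILL if the process has not exited.

    Hypothetical helper; assumes a POSIX system.
    """
    proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # Sends SIGKILL on POSIX.
        return proc.wait()


# Stand-in for an application running as an MPS client.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(child, grace_seconds=2.0)
print("child exited with code", exit_code)
```

Only the targeted process receives the signals here; any other processes the launcher started are left untouched, which mirrors the per-application control described above.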
NVIDIA JetPack installation simplification
NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge on Jetson platforms. Starting from CUDA Toolkit 11.8, Jetson users on NVIDIA JetPack 5.0 and later can upgrade to the latest CUDA versions without updating the NVIDIA JetPack version or Jetson Linux BSP (board support package) to stay on par with the CUDA desktop releases.
For more information, see Simplifying CUDA Upgrades for NVIDIA Jetson Developers.
CUDA developer tool updates
Compute developer tools are designed in lockstep with the CUDA ecosystem to help you identify and correct performance issues.
Nsight Compute
In Nsight Compute, you can expose low-level performance metrics, debug API calls, and visualize workloads to help optimize CUDA kernels. New compute features are being introduced in CUDA 11.8 to aid performance tuning activity on the NVIDIA Hopper architecture.
You can now profile and debug NVIDIA Hopper thread block clusters, which provide performance boosts and increased control over the GPU. Cluster tuning is being released in combination with profiling support for the Tensor Memory Accelerator (TMA), the NVIDIA Hopper rapid data transfer system between global and shared memory.
A new sample is included in Nsight Compute for CUDA 11.8 as well. The sample provides source code and precollected results that walk you through an entire workflow to identify and fix an uncoalesced memory access problem. Explore more CUDA samples to equip yourself with the knowledge to use toolkit features and solve similar cases in your own application.
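To give a rough sense of why uncoalesced access is worth fixing, the following Python sketch counts how many memory sectors one warp touches for a given access stride. This is not the Nsight Compute sample itself. It is a simplified model that assumes 32 threads per warp, 4-byte elements, and 32-byte memory sectors, and the helper name sectors_touched is hypothetical.

```python
WARP_SIZE = 32      # Threads per warp on NVIDIA GPUs.
SECTOR_BYTES = 32   # Assumed granularity of a global memory transaction.


def sectors_touched(stride_elems, elem_bytes=4):
    """Count the distinct sectors one warp touches.

    Thread t is assumed to read the element at index t * stride_elems.
    """
    addresses = (t * stride_elems * elem_bytes for t in range(WARP_SIZE))
    return len({addr // SECTOR_BYTES for addr in addresses})


coalesced = sectors_touched(1)    # Neighboring threads read neighboring floats.
strided = sectors_touched(32)     # Each thread reads from a different 128-byte row.
print(coalesced, strided)         # prints "4 32"
```

Under these assumptions, the strided pattern moves eight times as many sectors for the same amount of useful data, which is the kind of inefficiency the sample workflow helps you find and fix.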
Nsight Systems
Profiling with Nsight Systems can provide insight into issues such as GPU starvation, unnecessary GPU synchronization, insufficient CPU parallelization, and expensive algorithms across the CPUs and GPUs. Understanding these behaviors and the load of deep learning frameworks, such as PyTorch and TensorFlow, helps you tune your models and parameters to increase overall single-GPU or multi-GPU utilization.
Other tools
Two other tools included in the CUDA Toolkit also support the NVIDIA Hopper architecture: CUDA-GDB for CPU and GPU thread debugging, and Compute Sanitizer for functional correctness checking.
Summary
This release of the CUDA 11.8 Toolkit has the following features:
- First release supporting NVIDIA Hopper and NVIDIA Ada Lovelace GPUs
- Lazy module loading extended to support lazy loading of CPU-side modules in addition to device-side kernels
- Improved MPS signal handling for interrupting and terminating applications
- NVIDIA JetPack installation simplification
- CUDA developer tool updates
For more information, see the following resources:
- Optimizing CUDA Machine Learning Codes with Nsight Profiling Tools
- CUDA Toolkit download
- NVIDIA Hopper architecture
- NVIDIA Ada Lovelace Architecture
- CUDA Compatibility
- NVIDIA Releases Open-Source GPU Kernel Modules
- NVIDIA Nsight Compute and NVIDIA Nsight Systems
- NVIDIA Jetson and NVIDIA JetPack SDKs