Explainer: What Is Extended Reality?

Extended reality, or XR, is a collective term that refers to immersive technologies, including virtual reality, augmented reality, and mixed reality.


Meet the Omnivore: Ph.D. Student Lets Anyone Bring Simulated Bots to Life With NVIDIA Omniverse Extension

When not engrossed in his studies toward a Ph.D. in statistics, conducting data-driven research on AI and robotics, or enjoying his favorite hobby of sailing, Yizhou Zhao is winning contests for developers who use NVIDIA Omniverse — a platform for connecting and building custom 3D pipelines and metaverse applications. 

The post Meet the Omnivore: Ph.D. Student Lets Anyone Bring Simulated Bots to Life With NVIDIA Omniverse Extension appeared first on NVIDIA Blog.


Researchers Use AI to Help Earbud Users Mute Background Noise

Thanks to earbuds, people can take calls anywhere, while doing anything. The problem: those on the other end of the call can hear all the background noise, too, whether it’s the roommate’s vacuum cleaner or neighboring conversations at a café. Now, work by a trio of graduate students at the University of Washington, who spent…

The post Researchers Use AI to Help Earbud Users Mute Background Noise appeared first on NVIDIA Blog.


Explain Your Machine Learning Model Predictions with GPU-Accelerated SHAP

Machine learning (ML) is increasingly used across industries. Fraud detection, demand sensing, and credit underwriting are a few examples of specific use cases. 

These machine learning models make decisions that affect everyday lives. Therefore, it’s imperative that model predictions are fair, unbiased, and nondiscriminatory. Accurate predictions become vital in high-risk applications where transparency and trust are crucial. 

One way to ensure fairness in AI is to analyze the predictions obtained from a machine learning model. This exposes disparities and provides the opportunity to take corrective actions to diagnose and rectify the underlying cause. 

Explainable AI (XAI) is a field of Responsible AI dedicated to studying techniques that explain how a machine learning model makes predictions. These explanations are human-understandable, enabling all stakeholders to make sense of the model’s output and make the necessary decisions. SHAP is one such technique used widely in industry to evaluate and explain a model’s prediction.

This post explains how you can train an XGBoost model, implement the SHAP technique in Python using a CPU and GPU, and finally compare results between the two. By the end of the post, you should be able to answer the following questions:

  • Why is it crucial to explain machine learning models, especially in high-stakes decisions?
  • How do we differentiate between interpretable and explainable techniques?
  • What is the SHAP technique, and how is it used to explain a model’s predictions?
  • What is the advantage of GPU-accelerated SHAP?

Explainability versus interpretability 

In the context of artificial intelligence and machine learning, it is helpful to distinguish explainability from interpretability. The terms have distinct meanings but are often used interchangeably.

In the seminal paper, Psychological Foundations of Explainability and Interpretability in Artificial Intelligence, the researchers at the US National Institute of Standards and Technology (NIST) have proposed the following definitions of explainability and interpretability:


Explainability is a low-level, detailed mental representation that seeks to describe some complex processes. An explanation describes how some model mechanism or output came to be.


Interpretability is a high-level, meaningful mental representation that contextualizes a stimulus and leverages human background knowledge. An interpretable model should provide users with a description of what a data point or model output means in context.

In addition, according to Explanation in Artificial Intelligence: Insights from the Social Sciences, interpretability refers to the degree to which humans can understand and trust an ML model’s predictions. 

Overview of explanation methods

The approaches to explaining model predictions can be broadly divided into model-specific and post-hoc techniques.


Algorithms like generalized linear models, decision trees, and generalized additive models are designed to be interpretable. These are called glassbox models because it is possible to trace and reason how a prediction was made. The techniques used to explain such models are model-specific because each method is based on some specific model’s internals. For instance, the interpretation of weights in linear models counts toward model-specific explanations.


Post-hoc explainability techniques, as the name suggests, are applied after a model has been trained. Some well-known post-hoc techniques include SHAP, LIME, and Partial Dependence Plots. These are model agnostic. They treat the model as a black box and assume access only to the model’s inputs and outputs. This makes them beneficial for complex algorithms, like boosted trees and deep neural nets, which are not explainable through model-specific techniques.

This post focuses on SHAP, a post-hoc technique for explaining model predictions.

Using the SHAP technique to explain models

SHAP is an acronym for SHapley Additive exPlanations. It is one of the most commonly used post-hoc explainability techniques. SHAP draws on cooperative game theory to break down a prediction and measure the impact of each feature on it.

Shapley values are defined as the average marginal contribution of a feature value across all possible feature coalitions. A technique with origins in economics and game theory, Shapley values assign fair payouts to players in a coalition depending upon their contribution to the total gain. Translating this into a machine learning scenario means assigning importance to features in a model depending on their contribution to the model’s prediction.
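The definition above can be made concrete with a small sketch that computes exact Shapley values by enumerating every coalition. The three feature names and the toy value function are hypothetical and chosen only for illustration; this brute-force approach is exponential in the number of features and is not how the shap library computes its values.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values: weighted average marginal contribution
    of each player over all coalitions of the other players."""
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Probability weight of a coalition of this size preceding p
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = frozenset(coalition)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition):
    """Hypothetical payout: additive effects plus an interaction bonus."""
    v = 0.0
    if "age" in coalition:
        v += 2.0
    if "education" in coalition:
        v += 3.0
    if {"age", "education"} <= coalition:
        v += 4.0  # interaction is split evenly between the two features
    return v


players = ["age", "education", "hours"]
phi = shapley_values(players, toy_value)
for name in players:
    print(f"{name}: {phi[name]:.2f}")

# The contributions add up to the total gain over the empty coalition
total_gain = toy_value(frozenset(players)) - toy_value(frozenset())
print(f"sum of contributions: {sum(phi.values()):.2f}, total gain: {total_gain:.2f}")
```

In this toy game the feature that never changes the payout receives a contribution of zero, and the contributions sum to the total gain, which is the additivity property that SHAP relies on.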

SHAP unifies several approaches to generate accurate local feature importance values using Shapley values which can then be aggregated to obtain global explanations. SHAP values interpret the impact on the model’s prediction of a given feature having a specific value, compared to the prediction we’d make if that feature took some baseline value. A baseline value is a value that the model would predict if it had no information about any feature values. 

SHAP is one of the most widely used post-hoc explainability techniques for calculating feature attributions. It is model agnostic, can be used as both a local and a global feature attribution technique, and has credible theoretical support from economics. Additionally, a variant of SHAP for tree-based models reduces the computation time considerably, thereby helping users to gain insights from models quickly.

The following section provides an example of how to use the SHAP technique.

Step 1: Training an XGBoost model and calculating SHAP values

Use the well-known Adult Income Dataset to perform the following:

  • Train an XGBoost model on the given dataset to predict whether a person earns more than $50K a year. Such data could be helpful in various use cases like targeted marketing.
  • Compute the SHAP values to explain the individual feature contributions.
  • Visualize and interpret the SHAP values.


SHAP can be installed using its stand-alone Python package called shap available on GitHub:

pip install shap

conda install -c conda-forge shap

SHAP is also natively supported by popular algorithms like LightGBM and XGBoost, and by several R packages.

Setting up the environment

Start by setting up the environment and importing the necessary libraries:

import numpy as np   
import pandas as pd  

# Visualization Libraries
import matplotlib.pyplot as plt
%matplotlib inline

## Machine learning packages
from sklearn.model_selection import train_test_split
import xgboost as xgb

## Model Interpretation package
import shap

# Ensuring Reproducibility
SEED = 12345

# Ignoring the warnings
import warnings  
warnings.filterwarnings(action = "ignore")


This dataset comes from the UCI Machine Learning Repository and is available on Kaggle. It contains demographic information about people, based on census data, with attributes such as age, education, and hours of work per week.

The shap library ships with some commonly used datasets, including the preprocessed version of the Adult Income Dataset used below. 

Figure 1. Accessing datasets from the shap library
# Assumed loader calls: shap.datasets.adult() returns the encoded features and labels,
# and display=True returns the human-readable version of the same data.
X, y = shap.datasets.adult()
X_view, y_view = shap.datasets.adult(display=True)
X_view.head()
Table 1. DataFrame showing the first five rows of the dataset

As shown above, the dataset consists of various predictor variables like Age, Work class, and Education, plus one target variable. The target variable is True if a person earns >$50K annually and False if the earned income is ≤$50K. After ensuring the dataset is preprocessed and clean, proceed with the model training.

Training an XGBoost model

XGBoost performs exceptionally well for tabular datasets and is very popular in the machine learning community. To begin, split the dataset into train and validation sets.

# create a train/test split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=7)

Next, train an XGBoost model on the training data for 5K boosting rounds.

# Load the split dataset into DMatrix, the optimized data structure required by XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

# Feed the model the global bias
base_score = np.mean(y_train)

# Set hyperparameters for model training
params = {
   'objective': 'binary:logistic',
   'eval_metric': 'logloss',
   'eta': 0.01,
   'subsample': 0.5,
   'colsample_bytree': 0.8,
   'max_depth': 5,
   'base_score': base_score,
   'seed': SEED
}

# Train using early stopping on the validation dataset.
watchlist = [(dtrain, 'X_train'), (dvalid, 'X_test')]
# num_boost_round follows the 5K rounds stated above;
# early_stopping_rounds=20 and verbose_eval=100 are assumed values.
model = xgb.train(params,
                  dtrain,
                  num_boost_round=5000,
                  evals=watchlist,
                  early_stopping_rounds=20,
                  verbose_eval=100)
Wall time: 14.3 s

The training execution takes 14.3 seconds on an Apple M1 8 Core CPU, with early stopping enabled.

See the XGBoost Parameters for more information on the configurable parameters within the XGBoost module.

Calculating SHAP values

SHAP comes in many different flavors depending on the nature of the algorithm. The most popular seem to be KernelSHAP, DeepSHAP, and TreeSHAP. While KernelSHAP is model-agnostic, TreeSHAP is only suitable for tree-based models (including the XGBoost model we just trained). Use the TreeExplainer class from the shap library to explain the entire dataset containing over 30K samples with over a thousand trees.

explainer = shap.TreeExplainer(model=model)
shap_values = explainer.shap_values(X) 
CPU times: user 4min 12s, sys: 116 ms, total: 4min 12s
Wall time: 1min 4s

Using the same hardware outlined above, the SHAP values were calculated in about 1 minute 4 seconds of wall time. Note the timing to compare it with the value obtained with GPU acceleration later in this post.

Now that the SHAP values have been determined, the next step is interpretation. To understand the values, shap provides several different types of visualizations like force plots, summary plots, decision plots, and more, each highlighting a specific aspect of the model. Two plots are included below.

SHAP force plot

A force plot is used to explain a single instance in the dataset. Generate a force plot for the first row in the dataset and see how the different features impact the prediction. First print the ground truth label and the model’s prediction for this data instance.

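classes = {0: 'False', 1: 'True'}

# Ground truth label for the first row of X
print('Ground truth:', classes[int(y[0])])

# Model prediction for the same row (predict returns a probability);
# the print statements are assumed, since only the comments survive in the source
y_pred = [int(round(float(value))) for value in model.predict(xgb.DMatrix(X.iloc[[0]]))]
print('Model prediction:', classes[y_pred[0]])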


The ground truth label for this person is False; that is, the person earns ≤$50K annually. The model also predicts the same. Figure 2 shows a force plot for the same person giving an insight into how the various features contributed towards the model’s prediction for this particular observation.

shap.force_plot(explainer.expected_value, shap_values[0,:],X.iloc[0,:])
A Force plot explaining the predictions for the first person in the dataset. The base_value here is -1.143, while the target value for the selected sample is -3.89. Features such as Age and Education Number are depicted in red and they push the prediction towards the base value. Features like Capital Gain, Relationship and marital status are shown in blue and they push the prediction away from  the base value.
Figure 2. SHAP force plot

The base_value here is –1.143, while the model output for the selected sample is –3.89. Both are in log-odds units, so an output above the base value indicates a higher-than-average predicted likelihood of income >$50K, and an output below it indicates a lower one. For the chosen sample, the features appearing in red in Figure 2 push the prediction toward the base value while those in blue push the prediction away from it. It is therefore possible to infer that having a capital gain of 2,174 and relationship status of 0 negatively influences the model’s prediction that this particular person earns >$50K.
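To make the two numbers easier to read, the sketch below converts them to probabilities. It assumes the force plot values are log-odds, which is the usual raw output for a binary:logistic XGBoost model explained with TreeExplainer; the sigmoid helper is written here for illustration and is not part of shap.

```python
import math


def sigmoid(margin):
    """Map a log-odds margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


base_value = -1.143    # expected model output, from the force plot
model_output = -3.89   # output for the selected sample, from the force plot

base_prob = sigmoid(base_value)
row_prob = sigmoid(model_output)

print(f"baseline probability of >$50K: {base_prob:.3f}")
print(f"predicted probability for this person: {row_prob:.3f}")
```

Read this way, the baseline corresponds to roughly a one-in-four chance of earning >$50K, and the features shown in blue pull this particular person well below that.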

SHAP summary plot

The previous section looked at how SHAP was able to provide local explanations, or explanations specific to a single prediction. These values can be aggregated to get insights into a global view. One way to do this is with SHAP summary plots.

SHAP summary plots provide an overview of which features are more important for the model. This can be accomplished by plotting the SHAP values of every feature for every sample in the dataset. Figure 3 depicts a summary plot where each point in the graph corresponds to a single row in the dataset.

shap.summary_plot(shap_values, X)
A SHAP summary  plot summarizing the explanations for the entire dataset.
Figure 3. SHAP summary plot

Each point in Figure 3 represents a row from the original dataset. For every point:

  • The y-axis indicates the features in order of importance from top to bottom. The x-axis refers to the actual SHAP values.
  • The horizontal location of a point represents the feature’s impact on the model’s prediction for that particular sample.
  • The color shows whether the value of a feature is high (red) or low (blue) for any row of the dataset.

From the summary plot, it is possible to infer that ‘Relationship status’ and ‘Age’ have a higher total model impact on predicting whether a person will earn a higher income or not, as compared to other features.
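The feature ordering in a summary plot comes from aggregating local SHAP values into a global score. A minimal sketch of that aggregation, using the mean absolute SHAP value per feature, is shown below. The small matrix of SHAP values is hypothetical and is not taken from the Adult Income model.

```python
# Hypothetical SHAP values: one row per sample, one column per feature
feature_names = ["Age", "Relationship", "Capital Gain"]
shap_rows = [
    [0.5, -1.2, 0.1],
    [-0.3, 0.9, 0.0],
    [0.4, -1.0, -0.2],
    [-0.6, 1.1, 0.1],
]


def mean_abs_shap(rows):
    """Global importance: average magnitude of each feature's SHAP values."""
    n_rows = len(rows)
    n_features = len(rows[0])
    return [sum(abs(row[j]) for row in rows) / n_rows for j in range(n_features)]


importance = mean_abs_shap(shap_rows)

# Sort features from most to least important, as a summary plot does top to bottom
ranking = sorted(zip(feature_names, importance), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Taking absolute values matters here: a feature that pushes some predictions up and others down would otherwise average out to near zero despite having a large impact.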

Step 2: GPU-accelerated SHAP

As discussed in the previous section, TreeSHAP is a version of SHAP tailored specifically for tree ensemble models. Although TreeSHAP is fast relative to model-agnostic variants, it can still become a computational bottleneck when the ensemble grows large. In such situations, GPU acceleration is advisable to speed up the process.

However, the complexity of the TreeSHAP algorithm causes difficulty when mapping to hardware accelerators. This has led to the development of GPUTreeShap, a variation of the TreeSHAP algorithm suited to work with GPUs. It is now possible to take advantage of GPU hardware while computing SHAP values, thereby speeding up the entire model explanation process.

GPUTreeShap enables massively parallel exact calculation of SHAP values for tree-based algorithms. Figure 4 shows the speedups achieved when computing SHAP values on a GPU rather than a CPU. According to GPUTreeShap: Massively Parallel Exact Calculation of SHAP Scores for Tree Ensembles, “With a single NVIDIA Tesla V100-32 GPU, we achieve speedups of up to 19x for SHAP values and speedups of up to 340x for SHAP interaction values over a state-of-the-art multi-core CPU implementation executed on two 20-core Xeon E5-2698 v4 2.2 GHz CPUs. We also experiment with multi-GPU computing using eight V100 GPUs, demonstrating throughput of 1.2M rows per second–equivalent CPU-based performance is estimated to require 6850 CPU cores.”

Bar chart showing GPUTreeShap speedups for different models and datasets on a single NVIDIA Tesla V100-32 GPU. The bar graph is annotated with labels from left to right: “Adult,” “Cal-housing,” “Covtype,” and “Fashion MNIST.” The vertical axis is labeled “Throughput (GPU/CPU).” For all four datasets, GPU-accelerated SHAP for XGBoost shows speedups of 20x or more.
Figure 4. SHAP acceleration with GPU


GPUTreeShap already comes integrated with the Python shap package. Another way to access GPUTreeShap is by installing the RAPIDS data science framework. This ensures access to GPUTreeShap and a host of different libraries for executing end-to-end data science pipelines entirely in the GPU. 

RAPIDS also comes integrated with XGBoost (as of 0.14). XGBoost was, in fact, the first popular ML Toolkit accelerated under what eventually became the RAPIDS ecosystem. Figure 5 highlights XGBoost speedup on GPU, comparing a single V100 GPU to a dual 20-core CPU.

Developers can take advantage of GPU acceleration for XGBoost and SHAP values with RAPIDS. The default open-source XGBoost packages already include support for CUDA-capable GPUs.

Bar chart showing XGBoost GPU speedups for different models and datasets on  a single NVIDIA Tesla V100-32 GPU.
Figure 5. RAPIDS works closely with the XGBoost community to accelerate GBDTs on GPU

Training an XGBoost model with GPU acceleration

The previous section demonstrates how to train an XGBoost model on the Adult Income Dataset. This section repeats the same process with GPU acceleration enabled. This requires a change in the value of a single parameter called tree_method and delivers a massive decrease in the computation time.

Specify the tree_method parameter as gpu_hist, keeping all other parameters unchanged.

# Feed the model the global bias
base_score = np.mean(y_train)

# Set hyperparameters for model training
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'eta': 0.01,
    'subsample': 0.5,
    'colsample_bytree': 0.8,
    'max_depth': 5,
    'base_score': base_score,
    'tree_method': "gpu_hist", # GPU accelerated training
    'seed': SEED
}

# Train using early stopping on the validation dataset.
watchlist = [(dtrain, 'X_train'), (dvalid, 'X_test')]

# num_boost_round follows the 5K rounds used for the CPU run;
# early_stopping_rounds=20 and verbose_eval=100 are assumed values.
model_gpu = xgb.train(params,
                      dtrain,
                      num_boost_round=5000,
                      evals=watchlist,
                      early_stopping_rounds=20,
                      verbose_eval=100)

CPU times: user 2.43 s, sys: 484 ms, total: 2.91 s
Wall time: 3.27 s

Training an XGBoost model with a single Tesla T4 GPU (available through Google Colab) helped in decreasing the training time from 14.3 seconds to just 3.27 seconds. A decrease in the compute time is beneficial since training machine learning models, especially on large datasets, can be both challenging and expensive. 
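The gpu_hist value reflects the XGBoost release used in this post. In XGBoost 2.0 and later, gpu_hist is deprecated in favor of tree_method set to hist together with device set to cuda. The sketch below shows one way to pick the right keys from a version string. The helper name gpu_training_params is hypothetical, and the function only builds a parameter dictionary without importing XGBoost.

```python
def gpu_training_params(base_params, xgboost_version):
    """Return a copy of base_params with GPU training enabled.

    Hypothetical helper: selects the parameter spelling that matches
    the major version of the installed XGBoost release.
    """
    major = int(xgboost_version.split(".")[0])
    params = dict(base_params)  # leave the caller's dictionary untouched
    if major >= 2:
        # XGBoost 2.0+: histogram method plus an explicit device
        params["tree_method"] = "hist"
        params["device"] = "cuda"
    else:
        # Older releases: GPU histogram method, as used in this post
        params["tree_method"] = "gpu_hist"
    return params


cpu_params = {"objective": "binary:logistic", "max_depth": 5}
legacy_params = gpu_training_params(cpu_params, "1.6.2")
modern_params = gpu_training_params(cpu_params, "2.0.3")
print(legacy_params)
print(modern_params)
```

In a notebook, the version string would typically come from xgb.__version__, so the same training cell can run on either release.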

Calculating SHAP values using GPU

When the GPU predictor is selected, XGBoost uses GPUTreeShap as a backend for computing SHAP values.

model_gpu.set_param({"predictor": "gpu_predictor"})
explainer_gpu = shap.TreeExplainer(model=model_gpu)
shap_values_gpu = explainer_gpu.shap_values(X)
CPU times: user 1.34 s, sys: 252 ms, total: 1.59 s
Wall time: 1.56 s

With the GPU, the time to compute SHAP values drops from about 1 minute 4 seconds to 1.56 seconds, a massive reduction in computation time. The gain would be even more prominent when the dataset involves millions of data points, which is typical in many industries.
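As a quick check on the size of these gains, the sketch below turns the wall-clock times printed in this post into speedup ratios. The CPU runs used an Apple M1 and the GPU runs used a Tesla T4 in Google Colab, so the ratios are indicative rather than a controlled benchmark.

```python
# Wall-clock times reported in this post, in seconds
cpu_train_s = 14.3
gpu_train_s = 3.27
cpu_shap_s = 64.0   # 1 min 4 s
gpu_shap_s = 1.56

train_speedup = cpu_train_s / gpu_train_s
shap_speedup = cpu_shap_s / gpu_shap_s

print(f"training speedup: {train_speedup:.1f}x")
print(f"SHAP speedup: {shap_speedup:.1f}x")
```

The SHAP computation benefits far more than training does here, which is consistent with explanation of every row being the more expensive step for this model.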


Techniques like SHAP can make machine learning systems more trustworthy. If a model can be faithfully explained, it can be analyzed to determine whether it is fit to be deployed. This is an essential step toward inculcating trust in any technology. With GPU acceleration, it is possible to compute SHAP values faster, allowing you to gain insights into predictive models more quickly.

However, SHAP is not a silver bullet and has its own set of limitations. The main criticism of SHAP is that it can be misinterpreted. SHAP essentially helps in answering the question of why a particular observation received a prediction rather than a baseline value. This baseline value is dictated by the choice of a background dataset and can offer contrasting results if the reference dataset changes.

Consequently, the same observation can result in different SHAP values, depending on the choice of the background dataset. It is therefore important to choose background datasets carefully, keeping in mind the context of their use, and to understand the assumptions and trade-offs associated with explainability techniques in machine learning.
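The dependence on the background dataset is easiest to see with a linear model, where, if features are treated as independent, the SHAP value of feature i reduces to weight_i * (x_i - mean of feature i in the background). The sketch below applies that closed form to one observation under two hypothetical background datasets. The weights and data are made up for illustration, and the linear_shap helper is not part of the shap library.

```python
def linear_shap(weights, x, background):
    """SHAP values for a linear model with independent features:
    weight * (feature value - background mean of that feature)."""
    n_rows = len(background)
    means = [sum(row[i] for row in background) / n_rows for i in range(len(x))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


weights = [0.5, 2.0]   # hypothetical model coefficients
x = [40.0, 10.0]       # the observation being explained

background_a = [[30.0, 9.0], [50.0, 11.0]]    # feature means: 40 and 10
background_b = [[20.0, 12.0], [30.0, 14.0]]   # feature means: 25 and 13

shap_a = linear_shap(weights, x, background_a)
shap_b = linear_shap(weights, x, background_b)

print("background A:", shap_a)
print("background B:", shap_b)
```

Against the first background the observation looks average and both attributions are zero. Against the second, the same observation receives one positive and one negative attribution, even though the model and the observation have not changed.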

To see the code used in this post, visit parulnith/Data-Science-Articles on GitHub. 


Upcoming Workshop: Fundamentals of Accelerated Computing with CUDA C/C++

Learn tools and techniques for accelerating C/C++ applications to run on massively parallel GPUs with CUDA.


Safeguarding Networks and Assets with Digital Fingerprinting

Use of stolen or compromised credentials remains at the top of the list as the most common cause of a data breach. Because an attacker is using credentials or passwords to compromise an organization’s network, they can bypass traditional security measures designed to keep adversaries out.

When they’re inside the network, attackers can move laterally and gain access to sensitive data, which can be extremely costly for an organization. In fact, it’s estimated that breaches caused by stolen or compromised credentials cost an average of $4.50 million in 2022.

Malicious activities in a network are hard to detect when performed by existing users, roles, or machine credentials. For this reason, these types of breaches take the longest to identify, 243 days on average, plus another 84 days on average to contain.

Companies might leverage user behavior analytics (UBA) to detect abnormal behavior based on a defined set of risks. With UBA, a baseline is created for each user or device, and deviations from normal behavior are detected by comparing current activity with past actions. UBA looks for patterns that might indicate anomalous behavior, based on known past behaviors.

There is an ever-increasing volume of data produced by a modern enterprise. Server logs, application logs, cloud logs, sensor telemetry, network, and disk information are now orders of magnitude larger than what can be stored by traditional security information and event management (SIEM) systems. The security operations team can examine only a fraction of that data. 

What is digital fingerprinting?

Because enterprises are generating more data than they can collect and analyze, the vast majority of the data coming in goes untapped. Without tapping into this data, enterprises can’t build robust and rich models to enable them to detect deviations in their environment. The inability to examine this data leads to undetected security breaches, long remediation times, and ultimately huge financial issues for the company being breached.

But what if you could analyze 100% of the data across an enterprise—every user, every machine? People have unique characteristics and different ways that they interact with the network depending on their role. Understanding the day-to-day and moment-by-moment interactions of every user and device across the network is what we refer to as digital fingerprinting. Every user account within an organization has a unique digital fingerprint.

The value of digital fingerprinting

UBA looks for patterns that correlate bad behavior and focuses on threshold-based alerting. Digital fingerprinting is different because it identifies anti-patterns, or when things deviate from their normal patterns. For example, when a user account starts performing atypical yet permissible actions, traditional security methods may not trigger an alert.

To detect these anti-patterns, there must be a model for each user, to measure deviation. UBA is a shortcut because it tries to predict indicators of bad behavior. With digital fingerprinting, there are individual models to measure against. 
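The idea of one model per account can be illustrated with a deliberately simple stand-in: a per-user mean and standard deviation of daily event counts, scored with a z-score. This is not the Morpheus workflow. The account names, counts, and helper functions are hypothetical and only show why the same activity can be normal for one account and anomalous for another.

```python
from statistics import mean, pstdev


def build_baselines(history):
    """history maps each account to its past daily event counts.
    Returns a (mean, standard deviation) baseline per account."""
    return {user: (mean(counts), pstdev(counts)) for user, counts in history.items()}


def deviation_score(baselines, user, observed):
    """How many standard deviations the observation is from this account's norm."""
    mu, sigma = baselines[user]
    if sigma == 0:
        return 0.0 if observed == mu else float("inf")
    return abs(observed - mu) / sigma


history = {
    "alice": [10, 12, 11, 9, 13],            # interactive user
    "build-bot": [400, 410, 395, 405, 390],  # automated service account
}
baselines = build_baselines(history)

# The same observation, 400 events in a day, scored against each account's baseline
alice_score = deviation_score(baselines, "alice", 400)
bot_score = deviation_score(baselines, "build-bot", 400)

print(f"alice: {alice_score:.1f} standard deviations from normal")
print(f"build-bot: {bot_score:.1f} standard deviations from normal")
```

A single global threshold would either miss the first case or flag the second. Keeping a separate baseline per account is what lets permissible but atypical activity stand out.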

Maximizing the value of digital fingerprinting requires granularity and the ability to deploy thousands of models using unsupervised learning on a massive scale.

This can be done with NVIDIA Morpheus, a GPU-accelerated AI cybersecurity framework enabling developers to build optimized AI pipelines for filtering, processing, and classifying large volumes of real-time data. Morpheus includes a prebuilt, end-to-end workflow for digital fingerprinting, making it possible to achieve 100% data visibility.

A typical user may interact with 100 or more applications while doing their job. Integrations between these applications mean that there may be tens of thousands of interconnections and permissions shared across those 100 applications. If you have 10,000 users, you’d need 10,000 models initially.

With the Morpheus digital fingerprinting pretrained workflow, massive amounts of data can be addressed, and hundreds of thousands, or even millions of models can be managed. Implementations of a digital fingerprinting workflow for cybersecurity enable organizations to analyze all the data across the network, as AI performs massive data filtration and reduction for real-time threat detection. Critical behavior anomalies can be rapidly identified for security analysts, so that they can more quickly identify and react to threats.

Screenshot of a cyberattack across an enterprise without NVIDIA Morpheus, compared to with NVIDIA Morpheus
Figure 1. NVIDIA Morpheus digital fingerprinting workflow deployed across an enterprise of 25,000 employees
Video 1. Enterprise-Scale Cybersecurity Pinpoints Threats Faster

Experience the NVIDIA digital fingerprinting prebuilt model with a free hands-on lab on NVIDIA LaunchPad.


AI Esperanto: Large Language Models Read Data With NVIDIA Triton

Julien Salinas wears many hats. He’s an entrepreneur, software developer and, until lately, a volunteer fireman in his mountain village an hour’s drive from Grenoble, a tech hub in southeast France. He’s nurturing a two-year-old startup, NLP Cloud, that’s already profitable, employs about a dozen people and serves customers around the globe. It’s one…

The post AI Esperanto: Large Language Models Read Data With NVIDIA Triton appeared first on NVIDIA Blog.


Simplifying CUDA Upgrades for NVIDIA Jetson Users

NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge on Jetson platforms. Previously, a standalone version of NVIDIA JetPack supported a single release of CUDA, and you could not upgrade CUDA on a given NVIDIA JetPack version. NVIDIA JetPack is released on a rolling cadence, with a single version of CUDA typically being supported throughout each major release cycle (for example, NVIDIA JetPack 4.x or NVIDIA JetPack 5.x).

Starting with CUDA Toolkit 11.8, Jetson users on NVIDIA JetPack 5.0 and later can upgrade to the latest CUDA release without updating the NVIDIA JetPack version or Jetson Linux BSP (Board Support Package). You can stay on par with the CUDA Desktop releases.
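The eligibility rule in the paragraph above can be written as a small check. This is a sketch of the stated rule only (JetPack 5.0 or later, target CUDA 11.8 or later). The function names are hypothetical, and the check does not query a device or account for any other compatibility constraints.

```python
def parse_version(text):
    """Turn a dotted version string such as '5.0.2' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def can_upgrade_cuda_in_place(jetpack_version, target_cuda_version):
    """True when the stated rule allows upgrading CUDA without changing
    the JetPack version or Jetson Linux BSP."""
    jetpack_ok = parse_version(jetpack_version) >= (5, 0)
    cuda_ok = parse_version(target_cuda_version) >= (11, 8)
    return jetpack_ok and cuda_ok


print(can_upgrade_cuda_in_place("5.0.2", "11.8"))  # JetPack 5.x with CUDA 11.8
print(can_upgrade_cuda_in_place("4.6.1", "11.8"))  # JetPack 4.x is outside the rule
print(can_upgrade_cuda_in_place("5.0", "11.4"))    # CUDA older than 11.8
```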

CUDA on Jetson compared with CUDA on desktop

To understand why the CUDA support model has been different between the desktop with discrete-GPU (dGPU) and Jetson with integrated-GPU (iGPU), it helps to understand the following:

  • How CUDA is packaged on Jetson
  • How CUDA is packaged on desktop
  • The differences between them

Figure 1 shows the Jetson software architecture, with a core of the Jetson Linux BSP and layers of the various software components that make up the NVIDIA JetPack SDK. For more information, see Jetson Software Architecture.

Block diagram image shows the key software modules that make up the Jetson software architecture and NVIDIA JetPack SDK for embedded applications.
Figure 1. Jetson software architecture

Figure 2 shows where CUDA resides in the overall NVIDIA JetPack SDK packaging structure and how it interacts with all other components of the Jetson Linux BSP. As you can see in Figure 2, the CUDA driver is part of the Jetson Linux BSP, along with other components. All these components update as per the release cadence and frequency of the Jetson Linux BSP, which has been different from the quarterly CUDA release cadence. The CUDA toolkit is separate from the BSP and does not package the CUDA driver.

When you install the NVIDIA JetPack SDK, the Jetson Linux BSP (containing the CUDA driver) and the CUDA toolkit get installed by default.

Block diagram shows the compatibility of software modules between the Jetson Linux BSP and the CUDA Toolkit.
Figure 2. CUDA packaging on Jetson (iGPU); the CUDA driver is baked into the Jetson Linux BSP
Block diagram shows the interdependency of software modules between a standard Linux OS distribution, the NVIDIA UDA package, and the CUDA Toolkit as managed with the CUDA Installer.
Figure 3. CUDA packaging on Desktop (dGPU); the CUDA driver is part of the NV Display driver and UDA package

Due to this packaging structure, CUDA developers on desktop have the flexibility to stay up to date with the latest CUDA releases, aligning with the CUDA quarterly release cadence. Moreover, features such as forward compatibility and minor version compatibility help you pick combinations of driver and toolkit and tailor them to your application needs.

CUDA upgradable package on Jetson

Starting from CUDA 11.8, CUDA has introduced an upgrade path that provides Jetson developers with an option to update the CUDA driver and the CUDA toolkit to the latest versions.

Figure 4 shows blue boxes that depict components that are present by default in the NVIDIA JetPack 5.0 SDK. The dotted line separates Jetson Linux BSP from the other components that are part of the NVIDIA JetPack SDK. The green boxes indicate the CUDA components that you can upgrade to through this feature.

Flow diagram of the steps needed to upgrade CUDA software from previous releases.
Figure 4. CUDA upgrade path on Jetson

These upgrades are made possible by the introduction of the CUDA driver upgrade (also referred to as the CUDA compatibility package), as shown in Figure 5.

This upgrade package mainly contains the CUDA driver (*) and its dependencies that enable you to access the latest and greatest CUDA functionalities that come with every quarterly CUDA release.

Without this package, you were previously limited to the functionality provided by the default CUDA driver that was packaged in the Jetson Linux BSP. You had no mechanism to upgrade to the latest CUDA driver and toolkit.

With this package, Jetson users who have invested in long and thorough validation cycles for the existing Jetson Linux BSP can upgrade to the latest CUDA versions. This upgrade is done over the existing Jetson Linux BSP, keeping it unchanged.

Figure shows which Jetson software modules are affected and how the new flexible upgrade path works to install the latest CUDA software release.
Figure 5. Introducing the new CUDA upgrade package

How to upgrade CUDA on Jetson

With CUDA 11.8, the CUDA Downloads page now displays a new architecture, aarch64-Jetson, as shown in Figure 6, along with the associated aarch64-Jetson CUDA installer. The page provides step-by-step instructions on how to download and use the local installer, or the CUDA network repositories, to install the latest CUDA release.

Screenshot of the CUDA downloads web page showing the different CUDA architecture versions available to download and use for Jetson.
Figure 6. CUDA 11.8 downloads page with the aarch64-Jetson installer download option

The new aarch64-Jetson CUDA installer packages both the CUDA Toolkit and the upgrade package together. The step-by-step installation instructions provided ensure that the CUDA upgrade package gets downloaded and installed along with the corresponding CUDA toolkit for Jetson devices.

Block diagram of Jetson and CUDA software modules that will be installed automatically when using the CUDA Installer utility.
Figure 7. aarch64-Jetson CUDA installer for Jetson devices

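The installed upgrade package is available in the versioned toolkit directory. For example, the 11.8 upgrade package is in the following directory, which is the same path used in the commands later in this section:

/usr/local/cuda-11.8/compat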

The upgrade package consists of the following files:

  •*: The CUDA driver.
  •*: Just-in-time link-time optimization (CUDA 11.8 and later only).
  •*: The JIT (just-in-time) compiler for PTX files.

These files together implement the CUDA driver interface. This package only provides the files and does not configure the system.

If you are working on an x86 host and cross-compiling to the aarch64-Jetson target, the Ubuntu 20.04 CUDA host installer can be found on the CUDA Downloads page. The cross-compile bits can be found in the following directory:

aarch64-jetson/cross/Ubuntu/20.04/deb installer


The following example shows how to install the CUDA upgrade package and use it to run applications.

$ sudo apt-get -y install cuda

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  cuda-11-8 cuda-cccl-11-8 cuda-command-line-tools-11-8 cuda-compat-11-8

The following NEW packages will be installed:
  cuda cuda-11-8 cuda-cccl-11-8 cuda-command-line-tools-11-8 cuda-compat-11-8

0 upgraded, 48 newly installed, 0 to remove and 38 not upgraded.
Need to get 15.7 MB/1,294 MB of archives.
After this operation, 4,375 MB of additional disk space will be used.
Get:1  cuda-compat-11-8 11.8.31058490-1 [15.8 MB]
Fetched 15.7 MB in 12s (1,338 kB/s)
Selecting previously unselected package cuda-compat-11-8.
(Reading database ... 148682 files and directories currently installed.)
Preparing to unpack .../00-cuda-compat-11-8_11.8.31058490-1_arm64.deb ...
Unpacking cuda-compat-11-8 (11.8.31058490-1) ...

Unpacking cuda-11-8 (11.8.0-1) ...
Selecting previously unselected package cuda.
Preparing to unpack .../47-cuda_11.8.0-1_arm64.deb ...
Unpacking cuda (11.8.0-1) ...
Setting up cuda-toolkit-config-common (11.8.56-1) ...
Setting up cuda-compat-11-8 (11.8.31058490-1) ...

$ ls -l /usr/local/cuda-11.8/compat
total 55300
lrwxrwxrwx 1 root root       12 Jan  6 19:14 ->
lrwxrwxrwx 1 root root       14 Jan  6 19:14 ->
-rw-r--r-- 1 root root 21702832 Jan  6 19:14
lrwxrwxrwx 1 root root       19 Jan  6 19:14 ->
lrwxrwxrwx 1 root root       23 Jan  6 19:14 ->
-rw-r--r-- 1 root root 24255256 Jan  6 19:14
-rw-r--r-- 1 root root 10665608 Jan  6 19:14
lrwxrwxrwx 1 root root       27 Jan  6 19:14 ->

Set LD_LIBRARY_PATH to include the libraries installed by the upgrade package before running the CUDA 11.8 application:

$ LD_LIBRARY_PATH=/usr/local/cuda-11.8/compat:$LD_LIBRARY_PATH ~/Samples/1_Utilities/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.8 / 11.8
  CUDA Capability Major/Minor version number:    8.7
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.8, CUDA Runtime Version = 11.8, NumDevs = 1
Result = PASS

The default drivers (originally installed with NVIDIA JetPack and part of the Jetson Linux BSP) are retained by the installer. The application can use either the default version of CUDA (originally installed with NVIDIA JetPack) or the one installed by the upgrade package. Use the LD_LIBRARY_PATH variable to choose the required version.
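
As a minimal sketch of that choice, the helper below runs a single command with the upgrade package's directory placed first in LD_LIBRARY_PATH, and otherwise leaves the default driver in effect. The function names `compat_library_path` and `run_with_cuda_compat` and the `CUDA_COMPAT_DIR` variable are hypothetical names introduced for illustration; only the `/usr/local/cuda-11.8/compat` path and the use of LD_LIBRARY_PATH come from this post.

```shell
#!/usr/bin/env bash
# Sketch: run a command against the CUDA upgrade (compat) driver libraries.
# compat_library_path, run_with_cuda_compat, and CUDA_COMPAT_DIR are
# hypothetical names for this example; they are not part of CUDA or JetPack.

# Location of the 11.8 upgrade package as shown in this post; override if needed.
CUDA_COMPAT_DIR="${CUDA_COMPAT_DIR:-/usr/local/cuda-11.8/compat}"

# Print an LD_LIBRARY_PATH value that puts the compat directory first.
# Avoids a trailing ':' when the current value is empty or unset.
compat_library_path() {
    local compat_dir="$1" current="${2:-}"
    if [ -n "$current" ]; then
        printf '%s:%s\n' "$compat_dir" "$current"
    else
        printf '%s\n' "$compat_dir"
    fi
}

# Run the given command with the upgrade package's libraries preferred.
# If the directory is missing, run the command unchanged so that it uses
# the default CUDA driver installed with NVIDIA JetPack.
run_with_cuda_compat() {
    if [ -d "$CUDA_COMPAT_DIR" ]; then
        LD_LIBRARY_PATH="$(compat_library_path "$CUDA_COMPAT_DIR" "${LD_LIBRARY_PATH:-}")" "$@"
    else
        echo "note: $CUDA_COMPAT_DIR not found; using the default CUDA driver" >&2
        "$@"
    fi
}
```

For example, `run_with_cuda_compat ~/Samples/1_Utilities/deviceQuery` would run the sample against the upgraded driver, while running the sample directly keeps the default one. Because the variable is set only for that one command, the calling shell's environment is left unchanged.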

Only a single CUDA upgrade package can be installed at any point in time on a given system. While installing a new CUDA upgrade package, the previous version of the installed upgrade package is removed and replaced with the new one. Installation of the upgrade package fails if it is not compatible with the NVIDIA JetPack version.

Applications built against earlier toolkits keep running after the upgrade. For example, applications that were previously compiled with CUDA 11.4 continue to work with the CUDA 11.8 upgrade package because the CUDA driver is backward compatible.

Table 1 shows the CUDA user-mode driver (UMD) and CUDA Toolkit version compatibility for the NVIDIA JetPack 5.0 release.

Table 1. CUDA UMD version compatibility with CUDA Toolkit release

part of NVIDIA JetPack)
part of NVIDIA JetPack)

(minor version compatibility)
(with the upgrade package)

C = Compatible; X = Not compatible

Points to note

  • This feature is available from CUDA 11.8 and NVIDIA JetPack 5.0 onwards and will be supported on the latest Jetson Linux releases.
  • The CUDA upgrade package updates only the CUDA driver interfaces and leaves the rest of the NVIDIA JetPack SDK components unchanged. If a new feature in the latest CUDA driver needs an updated NVIDIA JetPack SDK component or interface, it might return an error when called. For more information about feature compatibility, see the CUDA release notes.
  • Check that the new CUDA version is compatible with the NVIDIA JetPack SDK version that you are using, as not all NVIDIA JetPack SDKs support all versions of CUDA. For more information about compatible versions, see the CUDA for Tegra App Note.

On Jetson, the compute stack (CUDA, cuDNN, TensorRT, and so on) has been tightly tied to a particular version of Jetson Linux (L4T). To upgrade to a newer version of the compute stack, you also had to upgrade Jetson Linux.

We are working towards a future where Jetson developers can migrate to newer versions of the compute libraries without upgrading Jetson Linux. This CUDA feature that enables upgrading CUDA is a step in that direction.

Upgrade to the latest CUDA release on your Jetson today!

  • On the CUDA 11.8 Downloads page, download the CUDA installer for aarch64-Jetson and follow the installation instructions to upgrade your Jetson device to CUDA 11.8.
  • For more information about the CUDA upgradable package on Jetson, see CUDA for Tegra App Note.
  • For information about all the new features that CUDA 11.8 brings in, see CUDA 11.8 Omnibus.
  • If you have any questions or require support, post your questions on the Jetson forum.

Register for the NVIDIA JetPack 5 deep-dive webinar, where the CUDA and Jetson teams walk you through the details of this new feature and you can ask questions live.


Implementing Path Tracing in ‘Justice’: An Interview with Dinggen Zhan of NetEase

We sat down with Dinggen Zhan of NetEase to discuss his team’s implementation of path tracing in the popular martial arts game, Justice Online.

What is your professional background and current job role?

I have more than 20 years of experience in the gaming industry. I joined NetEase in 2012, and am now senior technical expert and lead programmer for Justice. 

Why did NetEase decide to integrate a path tracer into Justice?

Back in 2018, NVIDIA launched the first RTX GPU. At that time, we immediately integrated RTX features into Justice and quickly pushed them online. NVIDIA RTX Path Tracing is the ultimate solution for ray tracing. It has excellent visual results and solves all the pain points caused by illumination under rasterization. We stick to using cutting-edge technologies in our development work to create games with high image quality and enhance players’ immersive gaming experience.

A photo of a group of NetEase employees.
Figure 1. A group of NetEase employees

What NVIDIA technologies did you use to make the path tracing work?

We used DLSS 3, NVIDIA Real-Time Denoisers (NRD), Reflex, and ReSTIR GI.

How did the path tracer affect your lighting production during the Justice development process?

The path tracing technology provides a way to create realistic illumination systems, especially suitable for producing natural and delicate indirect illuminations. Therefore, we do not need to spend time manually adjusting lights in scenes. Instead, we only need to add the corresponding lights for emissive objects such as lanterns and leave the rest to the path tracer to complete the calculation. 

Why is physically accurate lighting important for the games you develop?

The rendering pipeline of Justice is built on physically based rendering (PBR). Realistic physical illumination is naturally implemented with path tracing, which improves visual appeal and reduces defects. The artists have more control over the look, and it is convenient to integrate.

What challenges did you face during the process of integrating ray tracing?

New technologies generally bring new problems, and the debugging process is particularly difficult. Fortunately, NVIDIA upgraded the NVIDIA Nsight debugging tool in time, which made development work easier. The current real-time path tracer still needs improvement for several optical effects, including caustics, translucency, and subsurface scattering in skin materials.

Screenshot showing RTX path tracing in a temple scene from the NetEase game, Justice.
Figure 2. RTX path tracing in a temple scene from the NetEase game, Justice

What challenges were you looking to solve with the path tracer?

In the past, rasterized rendering of direct illumination, indirect illumination, reflection, and shadow was done in separate passes, which could not ensure accuracy. Path tracing unifies the computation of light transport, simplifies the whole rendering pipeline, and makes the final results immediately visible, allowing artists more control for content creation.

How long did it take for you to get the path tracer up and running?

From beginning to end, it took us about five to six months. The first three months were mainly for function integration, while the later stage was focused on effect tuning, performance optimization, and debugging.

Did you encounter any surprises during the integration process?

The realism of the path-traced pictures is amazing, and one notch above basic ray tracing. NVIDIA DLSS 3 also boosts the performance of the path tracer beyond all expectations.

How has path tracing affected your visuals and gameplay?

Path tracing can help game visuals reach cinematic realism, bringing the real-time rendering experience to the film production level. Video game players will feel like they are in the real world of each game. The visual experience is unprecedented, and there are infinite possibilities for the current metaverse development.

A screenshot of a sunset reflecting off a pond in Justice.
Figure 3. A sunset reflecting off a pond in Justice

Can you share any tips or lessons learned for other developers looking to integrate path tracing technology?

First, make sure that your game engine has a physically based rendering pipeline, which will reduce integration issues. For certain special materials, the current path tracer cannot work completely without rasterization, so we recommend using it in conjunction with a rasterizer.

Second, pay attention to the coherence of motion vectors and depth because the denoiser is quite sensitive to motion vectors, whether the motion vectors are in world space or screen space. The flag settings of the denoiser must be correct too. The depth buffer is in the floating-point range (0-1), and if it is reversed, it can affect the denoising and anti-aliasing results. 

Third, our path tracing is based on the NVIDIA Falcor engine, which is written in the shader language Slang. Integrating it is a complicated and time-consuming task. We chose to translate Slang into HLSL at first. Since manually translating all of the Falcor shaders would have been an onerous task, we simplified the Falcor codebase. Debugging cost us significant time. Looking back now, it would have been wise to take the time to support Slang in full at the beginning of the integration and bring in the whole Falcor path tracing codebase. The integration process might have gone more smoothly, saved us some time, and helped support Falcor’s future functionalities and features.

Do you plan to integrate path tracing into future NetEase games?

The amazing visual quality of path tracing is beyond the reach of any rasterization technique. In the future, we will continue investing more resources to develop path-traced levels and to improve quality and performance in the game.

Visit the NetEase website for more information about the company. 

Learn more about the NVIDIA RTX Path Tracing SDK, and sign up to be notified when it is publicly available. For more resources, visit NVIDIA Game Development.


CUDA Toolkit 11.8 New Features Revealed

NVIDIA announces the newest CUDA Toolkit software release, 11.8. This release is focused on enhancing the programming model and CUDA application speedup through new hardware capabilities.

New architecture-specific features in NVIDIA Hopper and Ada Lovelace are initially being exposed through libraries and framework enhancements. The full programming model enhancements for the NVIDIA Hopper architecture will be released starting with the CUDA Toolkit 12 family.

CUDA 11.8 has several important features. This post offers an overview of the key capabilities.

NVIDIA Hopper and NVIDIA Ada architecture support

CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families.

CUDA and CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.

Lazy module loading

Building on the lazy kernel loading feature in 11.7, NVIDIA added lazy loading to the CPU module side. This means that functions and libraries load faster on the CPU, sometimes with substantial memory footprint reductions. The tradeoff is a minimal amount of latency at the point in the application where the functions are first loaded. This is lower overall than the total latency without lazy loading.

To be eligible for lazy loading, libraries must be built with CUDA 11.7 or later.

Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
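
As a small sketch of how to try this for a single run, the wrapper below sets the variable for one command only, so the rest of the shell session keeps the default loading behavior. The `run_lazy` name is a hypothetical helper introduced for illustration; only the `CUDA_MODULE_LOADING=LAZY` setting comes from this post.

```shell
#!/usr/bin/env bash
# Sketch: run one command with CUDA lazy module loading requested.
# run_lazy is a hypothetical helper name, not part of the CUDA Toolkit.

run_lazy() {
    # env applies the setting to this one command only; the calling
    # shell's environment is left unchanged.
    env CUDA_MODULE_LOADING=LAZY "$@"
}
```

Running `run_lazy ./my_app` and then `./my_app` (where `my_app` stands in for your own CUDA binary) gives a side-by-side comparison of startup time and memory footprint with and without lazy loading.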

Improved MPS signal handling

You can now use SIGINT or SIGKILL to terminate any application running in an MPS environment without affecting other running processes. While not true error isolation, this enhancement enables more fine-grained application control, especially in bare-metal data center environments.
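
The process-level mechanics can be sketched as follows: signal one process by PID, then confirm that a sibling is still running. This is only an illustration under stated assumptions. The two `sleep` commands are stand-ins for CUDA applications sharing an MPS server, and the snippet does not start MPS or use a GPU, so it shows the signaling pattern rather than MPS behavior itself.

```shell
#!/usr/bin/env bash
# Sketch: terminate one process by PID and confirm a sibling keeps running.
# The `sleep` commands are stand-ins for CUDA applications that share an
# MPS server; nothing here starts MPS or touches a GPU.

sleep 300 &
app_a=$!
sleep 300 &
app_b=$!

# SIGKILL is one of the two signals named above (the other is SIGINT).
kill -KILL "$app_a"
app_a_status=0
wait "$app_a" 2>/dev/null || app_a_status=$?   # 128 + 9 (SIGKILL) = 137

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$app_b" 2>/dev/null; then
    app_b_alive=yes
else
    app_b_alive=no
fi
echo "app_a exit status: $app_a_status, app_b alive: $app_b_alive"

# Clean up the remaining stand-in process.
kill -KILL "$app_b"
wait "$app_b" 2>/dev/null || true
```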

NVIDIA JetPack installation simplification

NVIDIA JetPack provides a full development environment for hardware-accelerated AI-at-the-edge on Jetson platforms. Starting from CUDA Toolkit 11.8, Jetson users on NVIDIA JetPack 5.0 and later can upgrade to the latest CUDA versions without updating the NVIDIA JetPack version or Jetson Linux BSP (board support package) to stay on par with the CUDA desktop releases.

For more information, see Simplifying CUDA Upgrades for NVIDIA Jetson Developers.

CUDA developer tool updates

Compute developer tools are designed in lockstep with the CUDA ecosystem to help you identify and correct performance issues.

Nsight Compute

In Nsight Compute, you can expose low-level performance metrics, debug API calls, and visualize workloads to help optimize CUDA kernels. New compute features are being introduced in CUDA 11.8 to aid performance tuning activity on the NVIDIA Hopper architecture.

You can now profile and debug NVIDIA Hopper thread block clusters, which provide performance boosts and increased control over the GPU. Cluster tuning is being released in combination with profiling support for the Tensor Memory Accelerator (TMA), the NVIDIA Hopper rapid data transfer system between global and shared memory.

A new sample is included in Nsight Compute for CUDA 11.8 as well. The sample provides source code and precollected results that walk you through an entire workflow to identify and fix an uncoalesced memory access problem. Explore more CUDA samples to equip yourself with the knowledge to use toolkit features and solve similar cases in your own application.

Nsight Systems

Profiling with Nsight Systems can provide insight into issues such as GPU starvation, unnecessary GPU synchronization, insufficient CPU parallelization, and expensive algorithms across the CPUs and GPUs. Understanding these behaviors and the load of deep learning frameworks, such as PyTorch and TensorFlow, helps you tune your models and parameters to increase overall single or multi-GPU utilization.

Other tools

CUDA-GDB, for CPU and GPU thread debugging, and Compute Sanitizer, for functional correctness checking, are also included in the CUDA Toolkit, and both support the NVIDIA Hopper architecture.


This release of the CUDA 11.8 Toolkit has the following features:

  • First release supporting NVIDIA Hopper and NVIDIA Ada Lovelace GPUs
  • Lazy module loading extended to support lazy loading of CPU-side modules in addition to device-side kernels
  • Improved MPS signal handling for interrupting and terminating applications
  • NVIDIA JetPack installation simplification
  • CUDA developer tool updates

For more information, see the following resources: