Learn about the newest CUDA features such as release compatibility, dynamic parallelism, lazy module loading, and support for the new NVIDIA Hopper and NVIDIA Ada Lovelace GPU architectures.
The latest version of the NVIDIA TAO Toolkit 4.0 boosts developer productivity with all-new AutoML capability, integration with third-party MLOps services, and new pretrained vision AI models. The enterprise version now includes access to the full source code and model weights for pretrained models.
The toolkit enables efficient model training for vision and conversational AI. By abstracting away the complexity of AI models and deep learning frameworks, it lets even developers without AI expertise produce AI models. Using transfer learning to fine-tune NVIDIA pretrained models with your own data, you can now optimize model inference throughput without AI expertise or large training datasets.
Developers can create custom production-ready models optimized for specific environments and scenarios with TAO. A notable new feature helps developers build object detection models without massive amounts of data. Use cases include detecting assembly-line defects, translating particular phrases across languages, and managing city traffic.
Access to TAO source code and model weights for pretrained models.
Train high-quality models with AutoML without the hassle of manually fine-tuning hundreds of parameters.
Deploy on VMs from leading cloud providers and Kubernetes services like Amazon EKS or Azure AKS.
Simplify infrastructure management and scaling on cloud machine learning services such as Google Colab, Google Vertex AI, and Microsoft Azure Machine Learning.
New cloud integrations and third-party MLOps services, such as W&B and ClearML, provide developers and enterprises with an optimized AI workflow.
Integrate with REST APIs to quickly build a new AI service or add capabilities to an existing one.
Use new transformer-based pretrained models (CitySemSegformer, PeopleNet Transformer) and retail-specific pretrained models (RetailObjectDetection, RetailObjectRecognition, and ReIdentificationNet).
Watch a video to experiment with the NVIDIA TAO Toolkit and pretrained models on Google Colab.
Solutions using TAO Toolkit
Trifork, a next-generation IT and business services provider, uses TAO Toolkit to accelerate the development of an AI-based baggage-tracking solution for airports.
Fingermark, a company developing data-driven solutions for decision-making, uses TAO Toolkit to develop industrial vision AI solutions that improve worker safety.
A fundamental shift is currently taking place in how AI applications are built and deployed. AI applications are becoming more sophisticated and applied to broader use cases. This requires end-to-end AI lifecycle management—from data preparation, to model development and training, to deployment and management of AI apps. This approach can lower upfront costs, improve scalability, and decrease risk for customers using AI applications.
While the cloud-native approach to app development can be appealing to developers, machine learning (ML) projects are notoriously time-intensive and cost-intensive, as they require a team with a varied skill set to build and maintain.
This post explains how you can accelerate your vision AI model development using NVIDIA TAO Toolkit and deploy it for inference with NVIDIA Triton Inference Server—all on the Azure Machine Learning (Azure ML) platform.
NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production.
Azure ML is a cloud service for accelerating and managing the machine learning project lifecycle that enables developers to automate AI workflows, from data preparation and model training to model deployment. Developers can easily train, deploy, and manage AI models at scale with Azure ML.
Watch the video below to see a complete walkthrough of how to fine-tune your models with NVIDIA TAO Toolkit and deploy them for inference with NVIDIA Triton Inference Server—all on Azure ML.
The overall workflow comprises three main steps:
Install the NGC Azure ML Quick Launch Toolkit
Train and optimize a pretrained object detection model
Deploy the optimized model on Azure ML with NVIDIA Triton Inference Server
This section covers the steps required to install the NGC quick launch toolkit, which configures the Azure ML resources and uploads the necessary containers and models for training. The required config files are provided on the AzureML Quick Launch – TAO resource on the NVIDIA NGC Catalog.
Create a conda environment in which to install the Azure ML Quick Launch Toolkit, so that it does not conflict with any existing libraries on your machine.
The azure_config.json file contains the details pertaining to the user credentials, Azure workspace, and GPU compute resources that need to be updated. Edit the azureml_user section with your Azure subscription ID, resource group, and workspace name. Next, edit the aml_compute section with GPU cluster details.
Recommended VMs: NCsv3, NDv2, NC A100 v4, or ND A100 v4 series
The ngc_config.json file contains the content from the NGC Catalog, such as Docker containers and Jupyter notebooks, that you can upload into the Azure Machine Learning Resources.
Several scripts used later for model deployment are also packaged.
This uploads all the resources to the Azure ML Datastore. After the upload, a URL is generated for the Jupyter session, which enables you to interact with the session from your local web browser. You can verify that all the resources have been uploaded in the Azure ML portal. The steps to check the resources in the Azure ML portal are shown in the video above.
Train and optimize an object detection model with NVIDIA TAO Toolkit
This section covers the steps for training the model with NVIDIA TAO Toolkit on the Azure ML platform.
Before you begin the training process, you need to run the auxiliary notebook called CopyData.ipynb. The notebook is automatically generated with azureml-ngc-tools. This copies the notebooks from the Datastore to the compute cluster.
A new folder called tao is created with all the additional data provided. This folder contains the Jupyter notebook, along with the required configuration files for training. Navigate to the TAO_detectnet_v2.ipynb notebook in the tao/detectnet_V2 folder. DetectNet_v2 is one of the many computer vision notebooks available for training.
Once you load up the notebook, simply execute each cell shown. For more information on this network or on how to configure hyperparameters, refer to the TAO Toolkit documentation. Some of the main steps covered in the notebook include:
Setting environment variables
Downloading and converting training data
Downloading the model from the NGC catalog
Training the model
Pruning the model to remove unwanted layers and reduce model size
Retraining the pruned model to recover lost accuracy
Quantization-aware training (QAT), which changes the precision of the model to INT8, reducing model size without sacrificing accuracy
Exporting the model for inference
Once the model is generated on the compute cluster, you need to upload it to the Azure ML workspace. To upload, run the UploadData.ipynb notebook to copy the model to the Datastore. This model will be used for deployment.
Deploy the model using NVIDIA Triton Inference Server
Next, deploy the exported model using NVIDIA Triton Inference Server. Use the model trained in the previous step with NVIDIA TAO Toolkit and stored in the Datastore. Pull the model directly from the Datastore.
Once the model has been uploaded, push the NVIDIA Triton container to the Azure Container Registry (ACR), create the inference end point, and test it using some sample images.
Register the model for inference
The next steps entail uploading the model for inference. Upload the trained model from the Datastore. Navigate to Azure Machine Learning Studio to load the model.
1. After logging in, go to the Azure ML workspace created earlier using the azureml-ngc-tools script. Select ‘Models’ from the left menu, then click ‘Register’ and ‘From datastore.’
2. Select the ‘Triton’ model type to upload the model.
3. Browse and select the path: tao/detectnet_v2/model_repository
4. Name the model ‘DetectNet’ and set the version to ‘1’
5. Once the model has successfully uploaded, you should be able to see the directory structure.
Build and upload NVIDIA Triton image to Azure Container Registry
Next, build the NVIDIA Triton container with necessary dependencies and upload the image to Azure Container Registry (ACR).
1. On your local machine, run the following script to create the NVIDIA Triton container with the necessary dependencies:
bash scripts/build_container.sh
2. Verify that the image has been created locally by executing:
docker image ls
If successful, you should see the repo named nvcr.io/nvidia/tao/triton-apps.
3. Push the Docker image to ACR using the following script:
bash scripts/push_container_to_ACR.sh
The registryname parameter is the name of the default container registry provided with the Azure ML workspace. To find it, navigate to the Workspace essential properties dashboard in the Azure portal. This script pushes the Docker image to ACR and tags it as ${registryname}.azurecr.io/tao:latest.
Once the script completes, navigate to ACR to see the container in the tao repository.
Create the Azure ML endpoint and deployment
On your local machine, run the following script to create an Azure ML Endpoint followed by the deployment:
bash scripts/create_endpoint_and_deployment.sh
The script will create the Azure ML Endpoint with the endpoint names provided in the endpoint_aml.yml file. It then deploys the NVIDIA Triton service on Azure using the deployment_aml.yml file.
In this file, you can specify VM size. For this example, use the Standard_NC6s_v3 VM. These files are provided in scripts/auxiliary_files.
Once the script execution is complete, you should be able to see the deployment information, REST endpoint, and authentication key on the Azure portal. The deployment information can be found by clicking into the endpoint and navigating to the Deployment Logs tab.
Validate the endpoint
You can validate the endpoint by using the REST endpoint URL and the primary key found under the Endpoints tab on the Azure portal. To query the Azure ML endpoint from the user local machine, run the following script:
bash scripts/infer.sh
This script is provided in the zip file pulled from AzureML Quick Launch: TAO in the first step.
Next, provide the REST endpoint URL from the NVIDIA Triton deployment endpoint. The endpoint is queried with the test image passed through the script options, and the output image with bounding boxes is stored at the specified output path.
Summary
This post showed the end-to-end workflow for fine-tuning a model with NVIDIA TAO Toolkit and deploying the trained object detection model using NVIDIA Triton Inference Server, all on Azure Machine Learning. These tools abstract away the AI framework complexity, enabling you to build and deploy AI applications in production without the need for AI expertise.
There has been tremendous growth in AI over the years. With that, comes a larger demand for AI models and applications. Creating production-quality AI requires expertise in AI and data science and can still be intimidating for many developers.
To develop accurate AI, you must choose what model architecture to use, what data to collect, and finally how to tune the model to meet the desired KPIs. There are thousands of combinations of model architectures and hyperparameters that you’d have to try to get the best model for your specific use case. This process is extremely laborious and requires model architecture expertise to tune hyperparameters.
Automated machine learning (AutoML) automates the manual task of finding the best models and hyperparameters for the desired KPI. It can algorithmically derive the best model for your given KPI and abstract away a lot of the complexity of AI model creation and optimization.
AutoML makes it easy for even a novice developer to create a highly accurate AI model.
AutoML in TAO
AutoML in TAO is fully configurable for automatically optimizing the hyperparameters of a model, which reduces the need for manual tuning. It caters to both AI experts and non-experts.
For non-experts, the guided Jupyter notebook provides a simple, efficient way to create an accurate AI model.
For experts, TAO gives you full control of which hyperparameters to tune and which algorithm to use for sweeps.
TAO currently supports two optimization algorithms: Bayesian optimization and Hyperband. These algorithms can effectively sweep across a range of hyperparameters to find the best combination to optimize the user-provided metric.
Hyperband yields results faster because it doesn't have to run every training configuration to completion. It runs for a limited number of epochs, discards the runs that are performing poorly, and only continues the remaining runs. This process of elimination continues until there is a single configuration that gives the best results.
For Bayesian, the training runs to completion for all the sweeps.
AutoML is supported for a wide range of CV tasks: image classification, object detection, segmentation, and OCR. Table 1 shows the full list of supported networks.
The entire AutoML workflow can be run from the provided Jupyter notebooks. AutoML uses the TAO API services to manage all training jobs.
TAO API services
TAO API is a Kubernetes service that enables deployment of TAO as a microservice either on your own Kubernetes cluster or with cloud Kubernetes services such as Amazon EKS or Azure AKS.
TAO API services provide an additional layer of abstraction over containers. You can manage and deploy TAO services using Helm charts and remotely run jobs using REST API calls. With the APIs, you can remotely create and upload datasets, run training jobs, evaluate models, and export models for deployment.
API services make it easy to integrate TAO into your own custom application or build a web-UI application on top of TAO. To get started with building your custom applications with REST APIs, see the API guide and the API notebooks in TAO Toolkit Getting Started. For more information, see the Notebook section later in this post.
To train using the CLI, use the lightweight CLI client application, which you can install on a client system to access TAO services, together with the CLI notebooks. CLI notebooks are available on NGC in the TAO getting started resources.
AutoML requires a higher-level service on top of your training runs to determine and manage the set of experiments. TAO services keep track of all the experiments that they have tried with the KPIs and build the next set of experiments to improve on the KPIs. You can run AutoML with TAO API services either through the remote CLI application or directly using REST APIs. Jupyter notebooks for both are provided. For more information, see the Notebook section.
The REST API notebooks mainly serve as a reference if you are building your own applications or UI on top of TAO.
Set up TAO services
The TAO API service can run on any Kubernetes platform. To simplify deployment, we have provided a one-click deploy script that sets up TAO services on a bare-metal setup or on Amazon EKS. This post uses the bare-metal setup; instructions to deploy on the cloud are provided in the API guide.
tar -xvf tao-toolkit-api-bare-metal.tar
cd tao-toolkit-api-bare-metal
Add the host IP address and login credentials in the hosts file. This is the system where you plan to run the TAO services. It could be a local or remote system but you must have sudo privileges.
For credentials, you can use either a password (ansible_ssh_pass) or an SSH private key file (ansible_ssh_private_key_file). For a single-node cluster, you can list only the master node.
You can validate SSH credentials for remote machines with the following command. The expected answer is root.
ssh <username>@<host-ip-address> 'sudo whoami'
Next, modify the tao-toolkit-api-ansible-values.yml file to add your NGC credentials and the Helm chart. This pulls the Helm chart from the NGC registry. For more information, see Generating Your NGC API Key.
Install the dependencies and deploy the TAO service. Before installation, first check whether all the dependencies are met by running check-inventory.yml. If everything looks good, you should see a message that says 0 failed. Then, run install, which takes 10–15 minutes.
For this post, use the object detection notebook (TAO API Starter Kit/Notebooks/client/automl/object_detection.ipynb) but you can also do AutoML on other computer vision tasks.
Use AutoML to fine-tune an object detection model with TAO
Here is a quick walkthrough of the AutoML workflow with the Object Detection AutoML notebook. For this walkthrough, you use the client/automl/object_detection.ipynb notebook from the hierarchy shown earlier. We highlight the key steps here, but all the steps are captured in the Jupyter notebook.
Select a model topology
Choose any one of the available models listed for that notebook. Each notebook has a default model for that domain. In this example, the default model is DetectNet V2, but you can change it to FasterRCNN, SSD, DSSD, RetinaNet, EfficientDet, YOLOv3, YOLOv4, or YOLOv4-tiny.
model_name = "detectnet-v2"
Create a dataset
The next step is to use the dataset given as an example in the notebook or use your own dataset. The folder structure of the dataset requirement is provided in the notebooks.
When you have the dataset ready, upload it to the machine where the TAO Toolkit REST API is deployed, using the Unix rsync command for TAO-Client notebooks. You must upload the images and labels for both training and validation data.
After the dataset has been uploaded, convert the dataset to tfrecords through the dataset-convert action. All object detection models require dataset conversion, but some models from other domains, like classification, can operate on the raw data uploaded.
The next step is to choose which AutoML algorithm to run. There are options to tweak some AutoML-specific parameters. You can view the parameters that are enabled by default for the AutoML search for a model, along with all the parameters that are available for a network:
tao-client {model_name} model-automl-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/automl_defaults.json
This outputs a list of hyperparameters that are used for AutoML. For this experiment, you are choosing five different hyperparameters to sweep.
You can add additional parameters or remove existing defaulted parameters. For example, to sweep the soft_start hyperparameter, add it to the list of AutoML hyperparameters in your notebook.
There are also options to tweak algorithm-specific parameters, but the default parameters work well. For more information, see AutoML.
Train with AutoML
At this point, you have all the tools necessary to start the AutoML run. You can also change default training specs, like image extension or class mapping, before triggering the AutoML run:
When the AutoML run starts, you can see various stats, such as the best accuracy score so far, how many experiments have been completed, an estimated time to completion, and so on. You should see an output log similar to the following.
{
"best_map": 0.59636,
"Estimated time for automl completion": "23.13 minutes remaining approximately",
"Current experiment number": 3,
"Number of epochs yet to start": 429.0,
"Time per epoch in seconds": 3.24
}
Compare models
At the end of the AutoML run, you can see the results of all experiments. You are presented with the spec file and the binary weight file of the model achieving the highest accuracy among the AutoML sweeps.
The best model for this experiment was ID 9, with a mAP of 0.627. This is stored in the best_model/recommendation_9.kitti file.
After saving the best model obtained from AutoML, you can plug the model and spec file in the end-to-end notebook and then prune and optimize the model for inference.
To plug the model into the new notebook, copy the train job ID from the AutoML notebook. The AutoML train job ID is printed when you run the training job.
When you have the train job ID, open the end-to-end notebook from the notebook hierarchy from earlier. For this post, use the TAO API Starter Kit/Notebooks/client/end2end/detectnet_v2.ipynb notebook. As you have already trained a model, just run the import statement in the first cell and skip all the way down to the Run Evaluate section. In this section, create a code cell before evaluating.
train_job_id = "id_you_copied"
After adding the job_map code cell, you can evaluate the model, prune the model for compression, and even do quantization-aware training of the original model or the pruned model, as showcased in the end-to-end notebooks.
Results
We trained various models with AutoML on public datasets to see how much improvement in accuracy we could achieve. We compared the best AutoML-based accuracy against a baseline accuracy number from the default spec file provided in the packages. The results are in Table 2.
For object detection, we trained on the FLIR dataset, which contains images from both thermal and RGB sensors.
For semantic segmentation, we used the ISBI dataset.
For accuracy, we used mAP (mean average precision) for object detection, average accuracy of all classes and tasks for image classification, and mean IoU (intersection over union) score for semantic segmentation.
Task | Model | Baseline Accuracy (default spec) | Best AutoML accuracy | Dataset
Object Detection | DetectNet_v2 – ResNet18 | 44.16 | 51.37 | FLIR
Object Detection | FasterRCNN – ResNet18 | 56.42 | 60.44 | FLIR
Object Detection | YOLOv4 – ResNet18 | 40.12 | 63.46 | FLIR
Object Detection | YOLOv3 – ResNet18 | 42.36 | 61.84 | FLIR
Object Detection | RetinaNet – ResNet18 | 50.54 | 63.09 | FLIR
Image Classification | ResNet18 | 53.95 | 66.28 | Pascal VOC
Semantic Segmentation | UNET | 71.64 | 76.65 | ISBI
Table 2. Accuracy gain across networks by using AutoML
Across all the models that we tested, the increase in model accuracy is substantial compared to static default hyperparameters. The amount of improvement varies depending on the model, but we have generally seen improvements ranging from 5% to more than 20%. This shows that AutoML can work on various datasets to train the best model for a given KPI.
Summary
As the number of use cases and customization grows, it becomes imperative to accelerate the AI creation process. AutoML can eliminate the need for manual tuning, saving valuable time for developers.
With TAO AutoML, you can now automatically tune models for object detection, classification, and segmentation use cases using various popular model architectures. TAO AutoML provides simplicity for novice users to get started, as well as configurability for experts to choose their own hyperparameters to sweep.
CUDA Graphs significantly reduce the overhead of launching a large batch of user operations by defining them as a task graph, which may be launched in a single operation. Knowing the workflow upfront enables the CUDA driver to apply various optimizations, which cannot be performed when launching through a stream model.
However, this performance comes at the cost of flexibility: if the full workflow is not known in advance, then GPU execution must be interrupted to return to the CPU to make a decision.
CUDA device graph launch solves this problem by enabling a task graph to be performantly launched from a running GPU kernel, based on data that is determined at run time. CUDA device graph launch offers two distinct launch modes—fire and forget, and tail launch—to enable a wide range of applications and use cases.
This post demonstrates how to use device graph launch and the two launch modes. It features the example of a device-side work scheduler, which decompresses files for data processing.
Device graph initialization
Executing a task graph involves the four-step process outlined below:
Create the graph
Instantiate the graph into an executable graph
Upload the executable graph’s work descriptors to the GPU
Launch the executable graph
By separating the launch step from the other steps, CUDA is able to optimize the workflow and keep graph launch as lightweight as possible. As a convenience, CUDA will also combine the upload step with the launch step the first time a graph is launched if the upload step has not been called explicitly.
In order to launch a graph from a CUDA kernel, the graph first must have been initialized for device launch during the instantiation step. Additionally, before it can be launched from the device, the device graph must have been uploaded to the device, either explicitly through a manual upload step or implicitly through a host launch. The code below, which performs the host-side steps to set up the device scheduler example, shows both options:
// This is the signature of our scheduler kernel
// The internals of this kernel will be outlined later
__global__ void schedulerKernel(
fileData *files,
int numFiles,
int *currentFile,
void **currentFileData,
cudaGraphExec_t zipGraph,
cudaGraphExec_t lzwGraph,
cudaGraphExec_t deflateGraph);
void setupAndLaunchScheduler() {
cudaGraph_t zipGraph, lzwGraph, deflateGraph, schedulerGraph;
cudaGraphExec_t zipExec, lzwExec, deflateExec, schedulerExec;
// Create the source graphs for each possible operation we want to perform
// We pass the currentFileData ptr to this setup, as this ptr is how the scheduler will
// indicate which file to decompress
create_zip_graph(&zipGraph, currentFileData);
create_lzw_graph(&lzwGraph, currentFileData);
create_deflate_graph(&deflateGraph, currentFileData);
// Instantiate the graphs for these operations and explicitly upload
cudaGraphInstantiate(&zipExec, zipGraph, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphUpload(zipExec, stream);
cudaGraphInstantiate(&lzwExec, lzwGraph, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphUpload(lzwExec, stream);
cudaGraphInstantiate(&deflateExec, deflateGraph, cudaGraphInstantiateFlagDeviceLaunch);
cudaGraphUpload(deflateExec, stream);
// Create and instantiate the scheduler graph
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
schedulerKernel<<<1, 1, 0, stream>>>(files, numFiles, currentFile, currentFileData, zipExec, lzwExec, deflateExec);
cudaStreamEndCapture(stream, &schedulerGraph);
cudaGraphInstantiate(&schedulerExec, schedulerGraph, cudaGraphInstantiateFlagDeviceLaunch);
// Launch the scheduler graph - this will perform an implicit upload
cudaGraphLaunch(schedulerExec, stream);
}
It is important to note here that device graphs can be launched either from the host or from the device. Therefore, the same cudaGraphExec_t handles may be passed to the scheduler for launch on the device as for launch on the host.
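As a minimal sketch of this point (the names reuse the zipExec handle and stream from the setup code above), the same executable graph can be launched from a host function into an ordinary stream, or from a kernel into one of the device-side named streams covered in the next sections:
// Host-side launch of the device-launchable graph into a regular stream
void launchZipFromHost(cudaGraphExec_t zipExec, cudaStream_t stream)
{
    cudaGraphLaunch(zipExec, stream);
}
// Device-side launch of the same executable graph from within a kernel
__global__ void launchZipFromDevice(cudaGraphExec_t zipExec)
{
    cudaGraphLaunch(zipExec, cudaStreamGraphFireAndForget);
}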
Fire and forget launch
A scheduler kernel dispatches work based on incoming data. For work dispatch, fire and forget launch is the preferred launch method.
When a graph is launched using fire and forget launch, it is dispatched immediately. It executes independently of both the launching graph and subsequent graphs launched using fire and forget mode. Because the work executes immediately, fire and forget launch is preferable for work dispatched by a scheduler, as it starts running as quickly as possible. CUDA introduces a new device-side named stream to perform a fire and forget launch of a graph. See below for an example of a simple dispatcher.
enum compressionType {
zip = 1,
lzw = 2,
deflate = 3
};
struct fileData {
compressionType comprType;
void *data;
};
__global__ void schedulerKernel(
fileData *files,
int numFiles,
int *currentFile,
void **currentFileData,
cudaGraphExec_t zipGraph,
cudaGraphExec_t lzwGraph,
cudaGraphExec_t deflateGraph)
{
// Set the data ptr to the current file so the dispatched graph
// is operating on the correct file data
*currentFileData = files[*currentFile].data;
switch (files[*currentFile].comprType) {
case zip:
cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
break;
case lzw:
cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
break;
case deflate:
cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
break;
default:
break;
}
}
It is also important to note that graph launches can be nested and recursive, so additional device graphs can be dispatched from fire and forget launches. Although not shown in this example, the graphs that are decompressing the file data could dispatch more graphs to do further processing on that data once it is fully decompressed (image processing, for example). Device graph flow is hierarchical, just like graphs themselves.
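For example, a kernel at the end of one of the decompression graphs could itself dispatch a follow-on processing graph. The sketch below assumes a hypothetical postProcessExec handle that, like the decompression graphs, was instantiated with cudaGraphInstantiateFlagDeviceLaunch and uploaded ahead of time:
// Hypothetical kernel running as the final node of a decompression graph
__global__ void onDecompressionComplete(void **decompressedData, cudaGraphExec_t postProcessExec)
{
    // Nested fire and forget launch: the processing graph starts immediately and
    // executes independently of the decompression graph that dispatched it
    cudaGraphLaunch(postProcessExec, cudaStreamGraphFireAndForget);
}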
Tail launch
CUDA work is launched asynchronously to the GPU, which means the launching thread must explicitly wait for the work to complete before consuming any result or output. This is typically done from a CPU thread using a synchronization operation such as cudaDeviceSynchronize or cudaStreamSynchronize.
It is not possible for a launching thread on the GPU to synchronize on device graph launches through traditional methods such as cudaDeviceSynchronize. Instead, when operation ordering is desired, tail launch should be used.
When a graph is submitted for tail launch, it does not execute immediately, but rather upon completion of the launching graph. CUDA encapsulates all dynamically generated work as part of the parent graph, so a tail launch will also wait for all generated fire and forget work before executing.
This is true whether the tail launch was issued before or after any fire and forget launches. Tail launches themselves execute in the order in which they are enqueued. A special case is self-relaunch, where the currently running device graph is enqueued to relaunch through tail launch. Only one pending self-relaunch is permitted at a time.
Using tail launch, you can upgrade the previous dispatcher to become a full scheduler kernel by having it relaunch itself repeatedly, effectively creating a loop in the execution flow:
__global__ void schedulerKernel(
fileData *files,
int numFiles,
int *currentFile,
void **currentFileData,
cudaGraphExec_t zipGraph,
cudaGraphExec_t lzwGraph,
cudaGraphExec_t deflateGraph)
{
// Set the data ptr to the current file so the dispatched graph
// is operating on the correct file data
*currentFileData = files[*currentFile].data;
switch (files[*currentFile].comprType) {
case zip:
cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
break;
case lzw:
cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
break;
case deflate:
cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
break;
default:
break;
}
// If we have not finished iterating over all the files, relaunch
if (*currentFile < numFiles - 1) {
    *currentFile += 1;
    // Tail launch: relaunch this scheduler graph once all dispatched fire and forget work completes
    cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
}
}
Notice how the relaunch operation uses cudaGetCurrentGraphExec to retrieve a handle to the currently executing graph. It can relaunch itself without needing a handle to its own executable graph.
Use of tail launch for the self-relaunch has the added effect of synchronizing on (waiting for) the dispatched fire and forget work before the next scheduler kernel relaunch begins. A device graph can only have one pending launch at a time (plus one self-relaunch). In order to relaunch the graph that was just dispatched, you need to make sure that the previous launch completed first. Performing a self-relaunch accomplishes this goal, so that you can dispatch whatever graph is needed for the next iteration.
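Tail launch is not limited to self-relaunch. As a minimal sketch, a kernel could enqueue a hypothetical summary graph (summaryExec, instantiated for device launch like the other graphs) so that it runs only after the current graph and all of the fire and forget work it generated have completed:
__global__ void finalizeScheduler(cudaGraphExec_t summaryExec)
{
    // Tail launch: summaryExec will not begin until the launching graph and all
    // of its dispatched fire and forget work have finished
    cudaGraphLaunch(summaryExec, cudaStreamGraphTailLaunch);
}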
Device compared to host launch performance
How would this example fare against a host-launched graph? Figure 1 compares fire and forget launch, tail launch, and host launch latencies for various topologies.
This chart shows that not only is the device-side launch latency more than 2x lower than that of host launch, but it is also not impacted by graph structure. The latency is identical for each of the given topologies.
Device launch also scales much better to the width of the graph, as shown in Figure 2.
By comparison with host launch, device launch latency stays almost constant regardless of how much parallelism is in the graph.
Conclusion
CUDA device graph launch offers a performant way to enable dynamic control flow within CUDA kernels. While the example presented in this post provides a means of getting started with the feature, it is but a small representation of the ways this feature can be used.
NVIDIA announces the newest CUDA Toolkit software release, 12.0. This release is the first major release in many years and it focuses on new programming models and CUDA application acceleration through new hardware capabilities.
You can now target architecture-specific features and instructions in the NVIDIA Hopper and NVIDIA Ada Lovelace architectures with CUDA custom code, enhanced libraries, and developer tools.
CUDA 12.0 includes many changes, both major and minor. Not all changes are listed here, but this post offers an overview of the key capabilities.
Overview
Support for new NVIDIA Hopper and NVIDIA Ada Lovelace architecture features with additional programming model enhancements for all GPUs, including new PTX instructions and exposure through higher-level C and C++ APIs
Support for revamped CUDA dynamic parallelism APIs, offering substantial performance improvements compared to the legacy APIs
Enhancements to the CUDA Graphs API:
You can now schedule graph launches from GPU device-side kernels by calling built-in functions. With this ability, user code in kernels can dynamically schedule graph launches, greatly increasing the flexibility of CUDA Graphs.
The cudaGraphInstantiate API has been refactored to remove unused parameters.
Support for the GCC 12 host compiler
Support for C++20
New nvJitLink library in the CUDA Toolkit for JIT LTO
Library optimizations and performance improvements
Updates to Nsight Compute and Nsight Systems Developer Tools
NVIDIA Hopper and NVIDIA Ada Lovelace architecture support
CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families. The CUDA and CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.
CUDA 12.0 exposes programmable functionality for many features of the NVIDIA Hopper and NVIDIA Ada Lovelace architectures:
Many tensor operations are now available through public PTX:
TMA operations
TMA bulk operations
32x Ultra xMMA (including FP8 and FP16)
Launch parameters control membar domains in NVIDIA Hopper GPUs
Support for the smem sync unit PTX and C++ API
Support for C intrinsics for cooperative grid array (CGA) relaxed barriers
Support for programmatic L2 Cache to SM multicast (NVIDIA Hopper GPUs only)
Support for public PTX for SIMT collectives: elect_one
Genomics and DPX instructions are now available for NVIDIA Hopper GPUs to provide faster combined-math arithmetic operations (three-way max, fused add+max, and so on).
Lazy loading
Lazy loading is a technique for delaying the loading of both kernels and CPU-side modules until they are required by the application. The default behavior is to preemptively load all modules the first time a library is initialized. Lazy loading can result in significant savings, not only of device and host memory, but also in the end-to-end execution time of your algorithms.
Lazy loading has been part of CUDA since the 11.7 release. Subsequent CUDA releases have continued to augment and extend it. From the application development perspective, nothing specific is required to opt into lazy loading. Your existing applications work with lazy loading as-is.
If you have operations that are particularly latency-sensitive, you may want to profile your applications. The tradeoff with lazy loading is a minimal amount of latency at the point in the application where the functions are first loaded. This is overall lower than the total latency without lazy loading.
Metric | Baseline | CUDA 11.7 | CUDA 11.8+ | Improvement
End-to-end runtime [s] | 2.9 | 1.7 | 0.7 | 4x
Binary load time [s] | 1.6 | 0.8 | 0.01 | 118x
Device memory footprint [MB] | 1245 | 435 | 435 | 3x
Host memory footprint [MB] | 1866 | 1229 | 60 | 31x
Table 1. Example application speedup with lazy loading
All libraries used with lazy loading must be built with 11.7+ to be eligible.
Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
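As a minimal sketch of opting in programmatically (an assumption worth verifying for your platform: the variable must be set before the first CUDA call initializes the driver, and setting it in the launching shell is equivalent):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    // Assumption: CUDA_MODULE_LOADING is read during CUDA initialization, so it
    // must be set before the first CUDA API call in the process
    setenv("CUDA_MODULE_LOADING", "LAZY", 1);

    // First CUDA call; the runtime initializes with lazy loading enabled
    cudaFree(0);

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    printf("CUDA devices visible with lazy loading enabled: %d\n", deviceCount);
    return 0;
}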
Compatibility
CUDA minor version compatibility is a feature introduced in 11.x that gives you the flexibility to dynamically link your application against any minor version of the CUDA Toolkit within the same major release. Compile your code one time, and you can dynamically link against libraries, the CUDA runtime, and the user-mode driver from any minor version within the same major version of CUDA Toolkit.
For example, 11.6 applications can link against the 11.8 runtime and the reverse. This is accomplished through API or ABI consistency within the library files. For more information, see CUDA Compatibility.
Minor version compatibility continues into CUDA 12.x. However, as 12.0 is a new major release, the compatibility guarantees are reset. Applications that used minor version compatibility in 11.x may have issues when linking against 12.0. Either recompile your application against 12.0 or statically link to the needed libraries within 11.x to ensure the continuity of your development.
Likewise, applications recompiled or built in 12.0 will link to future versions of 12.x but are not guaranteed to work in CUDA Toolkit 11.x.
JIT LTO support
The CUDA 12.0 Toolkit introduces a new nvJitLink library for JIT LTO support. NVIDIA is deprecating support for the driver version of this feature. For more information, see Deprecated Features.
C++20 compiler support
CUDA Toolkit 12.0 adds support for the C++20 standard. C++20 is enabled for the following host compilers and their minimal versions:
GCC 10
Clang 11
MSVC 2022
NVC++ 22.x
Arm C/C++ 22.x
For more information about features, see the corresponding host compiler documentation.
While the majority of C++20 features are available in both host and device code, some are restricted.
Module support
Modules are introduced in C++20 as a new way to import and export entities across translation units.
Because modules require complex interaction between the CUDA device compiler and the host compiler, they are not supported in CUDA C++, in either host or device code. Uses of the module, export, and import keywords are diagnosed as errors.
Coroutine support
Coroutines are resumable functions. Execution can be suspended, in which case control is returned to the caller. Subsequent invocations of the coroutine resume at the point where it was suspended.
Coroutines are supported in host code but are not supported in device code. Uses of the co_await, co_yield, and co_return keywords in the scope of a device function are diagnosed as errors during device compilation.
Three-way comparison operator
The three-way comparison operator is a new kind of relational operator that enables the compiler to synthesize other relational operators.
Because it is tightly coupled with utility functions from the Standard Template Library, its use is restricted in device code whenever a host function is implicitly called.
Uses where the operator is called directly and does not require implicit calls are enabled.
Nsight Developer Tools
Nsight Developer Tools are receiving updates coinciding with CUDA Toolkit 12.0.
NVIDIA Nsight Systems 2022.5 introduces a preview of InfiniBand switch metrics sampling. NVIDIA Quantum InfiniBand switches offer high-bandwidth, low-latency communication. Viewing switch metrics on the Nsight Systems timeline enables you to better understand your application’s network usage. You can use this information to optimize the application’s performance.
Nsight tools are built to be used collaboratively. Performance analysis in Nsight Systems often informs a deeper dive into kernel activity in Nsight Compute.
To streamline this process, Nsight Compute 2022.4 introduces Nsight Systems integration. This feature enables you to launch system trace activity and view the report in the Nsight Compute interface. You can then inspect the report and initiate kernel profiling from within the context menu.
With this workflow, you don’t have to run two different applications: it can all be done within one.
Nsight Compute 2022.4 also introduces a new inline function table that provides performance metrics split out for multiple inlined instances of a function. This heavily requested feature enables you to understand whether a function is suffering from performance issues in general or only in specific inlined cases.
It also enables you to understand where inlining is occurring, which can often lead to confusion when this level of detail is not available. The main source view continues to show the aggregation of metrics at a per-line level while the table lists the multiple locations where the function was inlined and the performance metrics for each location.
The Acceleration Structure viewer has also received a variety of optimizations and improvements, including support for NVIDIA OptiX curve profiling.
CUDA libraries
All optimizations and features added to the CUDA libraries come at a cost, usually in the form of binary size. The binary size of each library has slowly increased over its lifespan. NVIDIA has made significant efforts to shrink these binaries without sacrificing performance. cuFFT saw the largest size reduction, of over 50% between CUDA Toolkit 11.8 and 12.0.
There are also a few library-specific features worth calling out.
cuBLAS
cuBLASLt exposes mixed-precision multiplication operations with the new FP8 data types. These operations also support BF16 and FP16 bias fusions, as well as FP16 bias with GELU activation fusions for GEMMs with FP8 input and output data types.
Regarding performance, FP8 GEMMs can be up to 3x and 4.5x faster on H100 PCIe and SXM, respectively, compared to BF16 on A100. The CUDA Math API provides FP8 conversions to facilitate the use of the new FP8 matrix multiplication operations.
cuBLAS 12.0 extends the API to support 64-bit integer problem sizes, leading dimensions, and vector increments. These new functions have the same API as their 32-bit integer counterparts except that they have the _64 suffix in the name and declare the corresponding parameters as int64_t.
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result);
Performance is the focus for cuBLAS. When the arguments passed to 64-bit integer API fit into the 32-bit range, the library uses the same kernels as if you called the 32-bit integer API. To try the new API, the migration should be as simple as just adding the _64 suffix to cuBLAS functions, thanks to the C/C++ autoconversion from int32_t values to int64_t.
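For reference, the 64-bit counterpart of the declaration above, following the naming and typing rules just described, looks like this:
cublasStatus_t cublasIsamax_64(cublasHandle_t handle, int64_t n, const float *x, int64_t incx, int64_t *result);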
cuFFT
During plan initialization, cuFFT conducts a series of steps, including heuristics, to determine which kernels are used as well as kernel module loads.
Starting with CUDA 12.0, cuFFT delivers a larger portion of kernels using the CUDA Parallel Thread eXecution (PTX) assembly form, instead of the binary form.
The PTX code of cuFFT kernels is loaded and compiled further to binary code by the CUDA device driver at runtime when a cuFFT plan is initialized. The first improvement enabled by the new implementation is many new accelerated kernels for the NVIDIA Maxwell, NVIDIA Pascal, NVIDIA Volta, and NVIDIA Turing architectures.
cuSPARSE
To reduce the amount of required workspace for sparse-sparse matrix multiplication (SpGEMM), NVIDIA is releasing two new algorithms with lower memory usage. The first algorithm computes a strict bound on the number of intermediate products, while the second one enables partitioning the computation in chunks. These new algorithms are beneficial for customers on devices with smaller memory storage.
INT8 support has been added to cusparseGather, cusparseScatter, and cusparseCsr2cscEx2.
Finally, for SpSV and SpSM, the preprocessing time is improved by an average factor of 2.5x. For the execution phase, SpSV is improved by an average factor of 1.1x, while SpSM is improved by an average factor of 3.0x.
Math API
The new NVIDIA Hopper architecture comes with new Genomics and DPX instructions for faster means of computing combined arithmetic operations like three-way max, fused add+max, and so on.
New DPX instructions accelerate dynamic programming algorithms by up to 7x over the A100 GPU. Dynamic programming is an algorithmic technique for solving a complex recursive problem by breaking it down into simpler sub-problems. For a better user experience, these instructions are now exposed through the Math API.
An example would be a three-way max + ReLU operation, max(max(max(a, b), c), 0).
int __vimax3_s32_relu ( const int a, const int b, const int c )
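As a minimal sketch (illustrative only, not from the original post), a kernel could apply the fused three-way max + ReLU to each element of three input arrays:
__global__ void threeWayMaxRelu(const int *a, const int *b, const int *c, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Equivalent to max(max(max(a[i], b[i]), c[i]), 0), computed as one operation
        out[i] = __vimax3_s32_relu(a[i], b[i], c[i]);
    }
}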
nvJPEG
nvJPEG now has an improved implementation that significantly reduces the GPU memory footprint. This is accomplished by using zero-copy memory operations, fusing kernels, and in-place color space conversion.
Summary
We continue to focus on helping researchers, scientists, and developers solve the world’s most complicated AI/ML and data sciences challenges through simplified programming models.
This CUDA 12.0 release is the first major release in many years and is foundational to help accelerate applications through the use of next-generation NVIDIA GPUs. New architecture-specific features and instructions in the NVIDIA Hopper and NVIDIA Ada Lovelace architectures are now targetable with CUDA custom code, enhanced libraries, and developer tools.
With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, a runtime library, and access to many advanced C/C++ and Python libraries.
For more information, see the following resources: