
CUDA 12.0: New Features and Beyond on YouTube Premiere


Learn about the newest CUDA features such as release compatibility, dynamic parallelism, lazy module loading, and support for the new NVIDIA Hopper and NVIDIA Ada Lovelace GPU architectures.


2023 Predictions: AI That Bends Reality, Unwinds the Golden Screw and Self-Replicates

After three years of uncertainty caused by the pandemic and its post-lockdown hangover, enterprises in 2023 — even with recession looming and uncertainty abounding — face the same imperatives as before: lead, innovate and problem solve. AI is becoming the common thread in accomplishing these goals.



Face All Fears With Creative Studio Fabian&Fred This Week ‘In the NVIDIA Studio’

The short film I Am Not Afraid! by creative studio Fabian&Fred embodies childlike wonder, curiosity and imagination this week In the NVIDIA Studio.



Boost AI Development with Pretrained Models and the NVIDIA TAO Toolkit


The latest version of the NVIDIA TAO Toolkit 4.0 boosts developer productivity with all-new AutoML capability, integration with third-party MLOps services, and new pretrained vision AI models. The enterprise version now includes access to the full source code and model weights for pretrained models.

The toolkit enables efficient model training for vision and conversational AI. By abstracting away the complexity of AI models and deep learning frameworks, it lets developers without AI expertise produce AI models. Using transfer learning to fine-tune NVIDIA pretrained models with your own data, you can now build accurate, inference-optimized models without deep AI expertise or large training datasets.

Developers can use TAO to create custom production-ready models optimized for specific environments and scenarios. A notable new feature helps developers build object detection models without massive amounts of data. Use cases include detecting assembly-line defects, translating particular phrases across languages, and managing city traffic.

Download the TAO Toolkit 4.0.

Release highlights

  • Access to TAO source code and model weights for pretrained models. 
  • Train high-quality models with AutoML without the hassle of manually fine-tuning hundreds of parameters.
  • Deploy on VMs from leading cloud providers and Kubernetes services like Amazon EKS or Azure AKS. 
  • Simplify infrastructure management and scaling on cloud machine learning services such as Google Colab, Google Vertex AI, and Microsoft Azure Machine Learning.
  • New cloud integrations and third-party MLOps services, such as W&B and ClearML, provide developers and enterprises with an optimized AI workflow.
  • Integrate with REST APIs to quickly build a new AI service or add capabilities to an existing one.
  • Use new transformer-based pretrained models (CitySemSegformer, Peoplenet Transformer) and retail-specific pretrained models (RetailObjectDetection, RetailObjectRecognition, and ReIdentificationNet).

Resources

Get started

New Tutorials

Solutions using TAO Toolkit 

  • Trifork, a next-generation IT and business service provider, uses the TAO Toolkit to accelerate development of an AI-based baggage-tracking solution for airports.
  • Fingermark, a company developing data-driven solutions for decision-making, uses the TAO Toolkit to develop industrial vision AI solutions that improve worker safety.

Creating Custom AI Models Using NVIDIA TAO Toolkit with Azure Machine Learning


A fundamental shift is currently taking place in how AI applications are built and deployed. AI applications are becoming more sophisticated and applied to broader use cases. This requires end-to-end AI lifecycle management—from data preparation, to model development and training, to deployment and management of AI apps. This approach can lower upfront costs, improve scalability, and decrease risk for customers using AI applications. 

While the cloud-native approach to app development can be appealing to developers, machine learning (ML) projects are notoriously time-intensive and cost-intensive, as they require a team with a varied skill set to build and maintain.

This post explains how you can accelerate vision AI model development using NVIDIA TAO Toolkit and deploy the model for inference with NVIDIA Triton Inference Server—all on the Azure Machine Learning (Azure ML) platform.

NVIDIA Triton Inference Server is an open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production.

Azure ML is a cloud service for accelerating and managing the machine learning project lifecycle that enables developers to automate AI workflows, from data preparation and model training to model deployment. Developers can easily train, deploy, and manage AI models at scale with Azure ML. 

Watch the video below to see a complete walkthrough of how to fine-tune your models with NVIDIA TAO Toolkit and deploy them for inference with NVIDIA Triton Inference Server—all on Azure ML.

Video 1. Create and deploy a custom AI model with NVIDIA TAO Toolkit on Azure Machine Learning

The overall workflow comprises three main steps:

  • Install the NGC Azure ML Quick Launch Toolkit 
  • Train and optimize a pretrained object detection model
  • Deploy the optimized model on Azure ML with NVIDIA Triton Inference Server
Figure 1. Workflow for running NVIDIA TAO Toolkit on Azure ML

Install the NGC Azure ML Quick Launch Toolkit

This section covers the steps required to install the NGC quick launch toolkit, which configures the Azure ML resources and uploads the necessary containers and models for training. The required config files are provided on the AzureML Quick Launch – TAO resource on the NVIDIA NGC Catalog.

To avoid any possible conflicts with existing libraries on your machine, create a conda environment in which to install the Azure ML Quick Launch Toolkit, using the following code:

conda create -n azureml-ngc-tools python=3.8
conda activate azureml-ngc-tools

Install the Azure ML Quick Launch Toolkit using the following code:

pip install azureml-ngc-tools

Download resources from AzureML Quick Launch: TAO:

wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao_detectnet_aml/versions/version1/zip -O azureml_quick_launch_tao_1.zip

File content

  • The azure_config.json file contains the details pertaining to the user credentials, Azure workspace, and GPU compute resources that need to be updated. Edit the azureml_user section with your Azure subscription ID, resource group, and workspace name. Next, edit the aml_compute section with GPU cluster details.
    • Recommended VMs: NCsv3, NDv2, NC A100 v4, or ND A100 v4 series
    • OS: Ubuntu 20.04
    • To learn more about Azure VMs, refer to the Virtual Machine series. To learn how to spin up an Azure VM instance, refer to the Azure documentation.
  • The ngc_config.json file contains the content from the NGC Catalog, such as Docker containers and Jupyter notebooks, that you can upload into the Azure Machine Learning Resources.
  • Several scripts that will be used for model deployment are also packaged.

Run the toolkit using the following code:

azureml-ngc-tools --login azure_config.json --app ngc_config.json

This will upload all the resources to the Azure ML Datastore. After the upload, a URL is generated for the Jupyter session, which enables you to interact with the session from your local web browser. You can verify that all the resources have been uploaded in the Azure ML portal. The steps to check the resources in Azure ML portal are provided in the video above. 

Train and optimize an object detection model with NVIDIA TAO Toolkit

This section covers the steps for training the model with NVIDIA TAO Toolkit on the Azure ML platform. 

Before you begin the training process, you need to run the auxiliary notebook called CopyData.ipynb. The notebook is automatically generated with azureml-ngc-tools. This copies the notebooks from the Datastore to the compute cluster. 

Figure 2. CopyData.ipynb Jupyter notebook for copying notebooks from the Datastore to the compute cluster

A new folder called tao is created with all the additional data provided. This folder contains the Jupyter notebook, along with the required configuration files for training. Navigate to the TAO_detectnet_v2.ipynb notebook under the tao/detectnet_V2 folder. DetectNet_v2 is one of the many computer vision notebooks available for training.

Figure 3. TAO DetectNet_v2 Jupyter notebook used for training the model

Once you load up the notebook, simply execute each cell shown. For more information on this network or on how to configure hyperparameters, refer to the TAO Toolkit documentation. Some of the main steps covered in the notebook include: 

  • Setting environment variables
  • Downloading and converting training data 
  • Downloading the model from the NGC catalog
  • Training the model
  • Pruning the model to remove unwanted layers and reduce model size 
  • Retraining the pruned model to recover lost accuracy 
  • Quantization-aware training (QAT), which changes the precision of the model to INT8, reducing model size without sacrificing accuracy 
  • Exporting the model for inference

Once the model is generated on the compute cluster, you will need to upload the model to the Azure ML workspace. To upload, run the UploadData.ipynb notebook to copy the model to the Datastore. This model will be used for deployment.

Deploy the model using NVIDIA Triton Inference Server

Next, deploy the exported model using NVIDIA Triton Inference Server. Use the model trained with NVIDIA TAO Toolkit in the previous step and stored in the Datastore; it can be pulled directly from the Datastore.

Once the model has been uploaded, push the NVIDIA Triton container to the Azure Container Registry (ACR), create the inference end point, and test it using some sample images.

Register the model for inference

The next step is to register the trained model for inference. Upload the trained model from the Datastore by navigating to Azure Machine Learning Studio.

1. After logging in, go to the Azure ML workspace created earlier using the azureml-ngc-tools script. Select ‘Models’ from the left menu, then click ‘Register’ and ‘From datastore.’

Figure 4. Register the model for deployment from Azure ML Studio

2. Select the ‘Triton’ model type to upload the model.

Figure 5. Upload the model from Azure ML Studio

3. Browse and select the path: tao/detectnet_v2/model_repository

Figure 6. Datastore selection

4. Name the model ‘DetectNet’ and set the version to ‘1’

Figure 7. Register the model

5. Once the model has successfully uploaded, you should be able to see the directory structure.

Figure 8. Model artifacts in Azure ML model store

Build and upload NVIDIA Triton image to Azure Container Registry 

Next, build the NVIDIA Triton container with necessary dependencies and upload the image to Azure Container Registry (ACR). 

1. On your local machine, run the following script to create the NVIDIA Triton container with the necessary dependencies:

bash scripts/build_container.sh

2. Verify that the image has been created locally by executing:

docker image ls

If successful, you should see the repo named nvcr.io/nvidia/tao/triton-apps.

3. Push the Docker image to ACR using the following script:

bash scripts/push_container_to_ACR.sh 

The registryname parameter is the name of the default container registry provided with the Azure ML workspace. To find it, navigate to the workspace essential properties dashboard in the Azure portal. This script will push the Docker image to ACR and tag it as ${registryname}.azurecr.io/tao:latest.

Figure 9. Get the Container Registry name, where the container will be uploaded

Once the script completes, navigate to ACR to see the container in the tao repository.

Create the Azure ML endpoint and deployment

On your local machine, run the following script to create an Azure ML Endpoint followed by the deployment:

bash scripts/create_endpoint_and_deployment.sh

The script will create the Azure ML Endpoint with the endpoint name provided in the endpoint_aml.yml file. It then deploys the NVIDIA Triton service on Azure using the deployment_aml.yml file.

In the deployment_aml.yml file, you can specify the VM size; this example uses the Standard_NC6s_v3 VM. Both files are provided in scripts/auxiliary_files.

Once the script execution is complete, you should be able to see the deployment information, REST endpoint, and authentication key on the Azure portal. The deployment information can be found by clicking into the endpoint and navigating to the Deployment Logs tab.

Figure 10. NVIDIA Triton deployment endpoint

Validate the endpoint 

You can validate the endpoint by using the REST endpoint URL and the primary key found under the Endpoints tab on the Azure portal. To query the Azure ML endpoint from your local machine, run the following script:

bash scripts/infer.sh 

This script is provided in the zip file pulled from AzureML Quick Launch: TAO in the first step. 

Next, provide the REST endpoint URL from the NVIDIA Triton deployment endpoint. The endpoint is queried with the provided test image, and the output image with bounding boxes is stored in the output location configured in the script.

Summary 

This post showed the end-to-end workflow for fine-tuning a model with NVIDIA TAO Toolkit and deploying the trained object detection model using NVIDIA Triton Inference Server, all on Azure Machine Learning. These tools abstract away the AI framework complexity, enabling you to build and deploy AI applications in production without the need for AI expertise. 


Training Like an AI Pro Using NVIDIA TAO AutoML


There has been tremendous growth in AI over the years. With that comes a larger demand for AI models and applications. Creating production-quality AI requires expertise in AI and data science and can still be intimidating for many developers.

To develop accurate AI, you must choose what model architecture to use, what data to collect, and finally how to tune the model to meet the desired KPIs. There are thousands of combinations of model architectures and hyperparameters that you’d have to try to get the best model for your specific use case. This process is extremely laborious and requires model architecture expertise to tune hyperparameters.

Automated machine learning (AutoML) automates the manual task of finding the best models and hyperparameters for the desired KPI. It can algorithmically derive the best model for your given KPI and abstract away a lot of the complexity of AI model creation and optimization.

AutoML makes it easy for even a novice developer to create a highly accurate AI model.

AutoML in TAO

AutoML in TAO is fully configurable for automatically optimizing the hyperparameters of a model, which reduces the need for manual tuning. It caters to both AI experts and non-experts.

  • For non-experts, the guided Jupyter notebook provides a simple, efficient way to create an accurate AI model.
  • For experts, TAO gives you full control of which hyperparameters to tune and which algorithm to use for sweeps.

TAO currently supports two optimization algorithms: Bayesian and Hyperband optimization. These algorithms can effectively sweep across a range of hyperparameters to find the best combination to optimize the user-provided metric.

Hyperband yields results faster because it doesn’t run every configuration through the entire training schedule. It runs for a limited number of epochs, discards the runs that are performing poorly, and only continues on the remaining runs. This process of elimination continues until there is a single configuration that gives the best results.

With Bayesian optimization, training runs to completion for all the sweeps.
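
To make the difference concrete, here is a minimal, illustrative C++ sketch of the successive-halving idea that Hyperband builds on. It is not TAO's implementation; the train_for_epochs scoring function and the candidate list are hypothetical stand-ins.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <utility>
#include <vector>

// Hypothetical stand-in: train candidate configuration `config_id` for a few
// epochs and return a validation score (higher is better).
double train_for_epochs(int config_id, int epochs) {
    return (config_id % 7) * 0.1 + epochs * 0.01;  // placeholder score
}

int main() {
    std::vector<int> candidates = {0, 1, 2, 3, 4, 5, 6, 7};  // hyperparameter configs
    int epochs_per_round = 2;

    // Successive halving: train each survivor briefly, keep the best half,
    // double the training budget, and repeat until one configuration remains.
    while (candidates.size() > 1) {
        std::vector<std::pair<double, int>> scored;
        for (int id : candidates)
            scored.emplace_back(train_for_epochs(id, epochs_per_round), id);

        std::sort(scored.rbegin(), scored.rend());  // best score first
        candidates.clear();
        for (std::size_t i = 0; i < scored.size() / 2; ++i)
            candidates.push_back(scored[i].second);

        epochs_per_round *= 2;
    }
    std::cout << "Best configuration: " << candidates.front() << "\n";
    return 0;
}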

AutoML is supported for a wide range of CV tasks: image classification, object detection, segmentation, and OCR. Table 1 shows the full list of supported networks.

Image Classification: ResNet10/18/34/50/101, EfficientNet_B0-B7, DarkNet19/53, CSPDarkNet19/53/Tiny, MobileNet_v1/v2, SqueezeNet, VGG16/19, GoogleNet
Object Detection: YoloV3/V4/V4-Tiny, EfficientNet, RetinaNet, FasterRCNN, DetectNet_v2, SSD/DSSD
Segmentation: UNET, MaskRCNN
OCR: LPRNet
Table 1. Networks supported by TAO AutoML

Getting started with AutoML

The entire AutoML workflow can be run from the provided Jupyter notebooks. AutoML uses the TAO API services to manage all training jobs.

TAO API services

TAO API is a Kubernetes service that enables deployment of TAO as a microservice either on your own Kubernetes cluster or with cloud Kubernetes services such as Amazon EKS or Azure AKS.

TAO API services provide an additional layer of abstraction over containers. You can manage and deploy TAO services using Helm charts and remotely run jobs using REST API calls. With the APIs, you can remotely create and upload datasets, run training jobs, evaluate models, and export models for deployment.

Diagram shows a cloud-native TAO API service architecture with a front-end service container that receives the HTTP request for downstream applications and orchestrates the appropriate TAO training container.
Figure 1. Cloud-native TAO API service architecture

API services make it easy to integrate TAO into your own custom application or build a web-UI application on top of TAO. To get started with building your custom applications with REST APIs, see the API guide and the API notebooks in TAO Toolkit Getting Started. For more information, see the Notebook section later in this post.

To train using the CLI, use the lightweight CLI client application that you can install on a client system to access TAO services and CLI notebooks. CLI notebooks are available on NGC in TAO getting started resources.

AutoML requires a higher-level service on top of your training runs to determine and manage the set of experiments. TAO services keep track of all the experiments that they have tried with the KPIs and build the next set of experiments to improve on the KPIs. You can run AutoML with TAO API services either through the remote CLI application or directly using REST APIs. Jupyter notebooks for both are provided. For more information, see the Notebook section.

The REST API notebooks mainly serve as a reference if you are building your own applications or UI on top of TAO.

Set up TAO services

The TAO API service can run on any Kubernetes platform. To simplify the deployment of TAO services, we have provided a one-click deploy script that handles setup on bare metal or on Amazon EKS. For this post, we use the bare-metal setup, but instructions for deploying on the cloud are provided in the API guide.

Prerequisites

  • An NVIDIA GPU (on-premises or in the cloud):
    • NVIDIA Volta architecture
    • NVIDIA Turing architecture
    • NVIDIA Ampere architecture
    • NVIDIA Hopper architecture
  • TAO Toolkit 4.0
  • Ubuntu 18.04 or 20.04

Download the one-click deploy tar file and untar the package:

tar -xvf tao-toolkit-api-bare-metal.tar
cd tao-toolkit-api-bare-metal

Add the host IP address and login credentials in the hosts file. This is the system where you plan to run the TAO services. It could be a local or remote system but you must have sudo privileges.

For credentials, you can use either a password (ansible_ssh_pass) or an SSH private key file (ansible_ssh_private_key_file). For a single-node cluster, you can list only the master node.

File name: hosts:

[master]
<IP address> ansible_ssh_user='<username>' ansible_ssh_pass='<password>'
[nodes]
<IP address> ansible_ssh_user='<username>' ansible_ssh_pass='<password>'

You can validate SSH credentials for remote machines with the following command; the expected answer is root.

ssh <username>@<IP address> 'sudo whoami'

Next, modify the tao-toolkit-api-ansible-values.yml file to add your NGC credentials and the Helm chart. This pulls the Helm chart from the NGC registry. For more information, see Generating Your NGC API Key.

File name: tao-toolkit-api-ansible-values.yml

ngc_api_key: <your NGC API key>
ngc_email: <your NGC email>
api_chart: https://helm.ngc.nvidia.com/nvidia/tao/charts/tao-toolkit-api-4.0.0.tgz
api_values: ./tao-toolkit-api-helm-values.yml
cluster_name: tao-automl-demo

Install the dependencies and deploy the TAO service. Before installation, first check whether all the dependencies are met by running check-inventory.yml. If everything looks good, you should see a message that says 0 failed. Then, run install, which takes 10–15 minutes.

bash setup.sh check-inventory.yml
bash setup.sh install

Download AutoML notebooks

Download the computer vision training resources from TAO Toolkit Getting Started on NGC.

wget --content-disposition \
  https://api.ngc.nvidia.com/v2/resources/nvidia/tao/getting_started/versions/4.0.0/zip \
  -O getting_started_v4.0.0.zip
unzip -u getting_started_v4.0.0.zip -d ./getting_started_v4.0.0 && \
  rm -rf getting_started_v4.0.0.zip && cd ./getting_started_v4.0.0

All the AutoML-related notebooks are inside the TAO API directory. The notebooks are provided in the following directory structure:

notebooks
     |--> tao_api_starter_kit
          |--> API
               |--> automl
               |--> end2end
               |--> dataset_prepare
          |--> client
               |--> automl
                    |--> classification.ipynb
                    |--> object_detection.ipynb
                    |--> segmentation.ipynb
                    |--> lprnet.ipynb        
               |--> end2end
               |--> dataset_prepare

For this post, use the object detection notebook (TAO API Starter Kit/Notebooks/client/automl/object_detection.ipynb) but you can also do AutoML on other computer vision tasks.

Use AutoML to fine-tune an object detection model with TAO

Here is a quick walkthrough of the AutoML workflow with the Object Detection AutoML notebook. For this walkthrough, you use the client/automl/object_detection.ipynb notebook from the hierarchy shown earlier. We highlight the key steps here, but all the steps are captured in the Jupyter notebook.

Figure 2. Step-by-step TAO AutoML workflow

Select a model topology

Choose any one of the available models listed for that notebook. Each notebook has a default model for that domain. In this example, the default model is DetectNet V2 but you can change it to FasterRCNN, SSD, DSSD, Retinanet, EfficientDet, Yolo V3, Yolo V4, or YoloV4 tiny.

model_name = "detectnet-v2"

Create a dataset

The next step is to use the example dataset provided in the notebook or use your own dataset. The required dataset folder structure is described in the notebooks.

train_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format {ds_format}")
print(train_dataset_id)

eval_dataset_id = subprocess.getoutput(f"tao-client {model_name} dataset-create --dataset_type object_detection --dataset_format {ds_format}")
print(eval_dataset_id)

Upload the dataset

When you have the dataset ready, upload it to the machine where the TAO Toolkit REST API is deployed, using the Unix rsync command for TAO-Client notebooks. You must upload the images and labels for both training and validation data.

rsync -ah --info=progress2 {TRAIN_DATA_DIR}/images ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
rsync -ah --info=progress2 {TRAIN_DATA_DIR}/labels ~/shared/users/{os.environ['USER']}/datasets/{train_dataset_id}/
rsync -ah --info=progress2 {VAL_DATA_DIR}/images ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/
rsync -ah --info=progress2 {VAL_DATA_DIR}/labels ~/shared/users/{os.environ['USER']}/datasets/{eval_dataset_id}/

Convert the dataset

After the dataset has been uploaded, convert the dataset to tfrecords through the dataset-convert action. All object detection models require dataset conversion, but some models from other domains, like classification, can operate on the raw data uploaded.

train_convert_job_id = subprocess.getoutput(f"tao-client {model_name} dataset-convert --id {train_dataset_id}  --action {convert_action} ")

Configure the AutoML parameters

The next step is to choose which AutoML algorithm to run. There are options to tweak some AutoML-specific parameters. You can view the parameters that are enabled by default for the AutoML search on a model, along with all the parameters that are available for a network:

tao-client {model_name} model-automl-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/automl_defaults.json

This outputs a list of hyperparameters that are used for AutoML. For this experiment, you are choosing five different hyperparameters to sweep.

[
  "bbox_rasterizer_config.deadzone_radius",
  "training_config.learning_rate.soft_start_annealing_schedule.min_learning_rate",
  "training_config.learning_rate.soft_start_annealing_schedule.annealing",
  "training_config.regularizer.type",
  "classwise_config.postprocessing_config.clustering_config.dbscan_confidence_threshold"
]

You can add additional parameters or remove existing defaulted parameters. For example, to sweep the soft_start hyperparameter, add the following in your notebook:

additional_automl_parameters = ["training_config.learning_rate.soft_start_annealing_schedule.soft_start"]

There are also options to tweak algorithm-specific parameters, but the default parameters work well. For more information, see AutoML.

Train with AutoML

At this point, you have all the tools necessary to start the AutoML run. You can also change default training specs, like image extension or class mapping, before triggering the AutoML run:

train_job_id = subprocess.getoutput(f"tao-client {model_name} model-train --id " + model_id)

When the AutoML run starts, you can see various stats, such as the best accuracy score at that time, how many experiments have been completed, an approximate estimated time for completion, and so on. You should see an output log similar to the following.

{
  "best_map": 0.59636,
  "Estimated time for automl completion": "23.13 minutes remaining approximately",
  "Current experiment number": 3,
  "Number of epochs yet to start": 429.0,
  "Time per epoch in seconds": 3.24
}

Compare models

At the end of the AutoML run, you can see the results of all experiments. You are presented with the spec file and the binary weight file of the model achieving the highest accuracy among the AutoML sweeps.

Checkpoints for the best performing experiment
Folder: /home/nvidia/shared/users/95af85a9-805c-5680-b01a-3c85ed70f009/models/4f22c462-1d97-4537-99b2-15ee69eb2660/168d6149-6c47-40e6-b6a3-267867cea551/best_model/weights
Files: ['epoch-80.tlt']

Results of all experiments
    id   result
0    0  0.43636
1    1  0.41818
2    2  0.53636
3    3  0.44545
4    4  0.33636
5    5  0.44545
6    6  0.53636
7    7  0.53636
8    8  0.61636
9    9  0.62727
10  10  0.593636
11  11  0.52727
12  12  0.53636
13  13  0.54545
14  14  0.61636
15  15  0.60909
16  16  0.5636
17  17  0.54545
18  18  0.53636
19  19  0.53636

The spec file for the best-performing experiment is stored in the following directory:

 {home}/shared/users/{os.environ['USER']}/models/{model_id}/{train_job_id}/{automl_job_dir}/best_model

The best model for this experiment was ID 9 with a mAP of 0.627. This is stored in the best_model/recommendation_9.kitti file.

After saving the best model obtained from AutoML, you can plug the model and spec file in the end-to-end notebook and then prune and optimize the model for inference.

Figure 3. End-to-end workflow from AutoML training to model optimization

To plug the model into the new notebook, copy the train job ID from the AutoML notebook. The AutoML train job ID is printed when you run the training job.

train_job_id = subprocess.getoutput(f"tao-client {model_name} model-train --id " + model_id)
print(train_job_id)

When you have the train job ID, open the end-to-end notebook from the notebook hierarchy from earlier. For this post, use the TAO API Starter Kit/Notebooks/client/end2end/detectnet_v2.ipynb notebook. As you have already trained a model, just run the import statement in the first cell and skip all the way down to the Run Evaluate section. In this section, create a code cell before evaluating.

train_job_id = "id_you_copied"

After adding the job_map code cell, you can evaluate the model, prune the model for compression, and even do quantization-aware training of the original model or the pruned model, as showcased in the end-to-end notebooks.

Results

We trained various models with AutoML on public datasets to see how much improvement in accuracy we could achieve. We compared the best AutoML-based accuracy against a baseline accuracy number from the default spec file provided in the packages. The results are in Table 2.

For accuracy, we used mAP (mean average precision) for object detection, average accuracy of all classes and tasks for image classification, and mean IoU (intersection over union) score for semantic segmentation.

Task | Model | Baseline accuracy (default spec) | Best AutoML accuracy | Dataset
Object Detection | DetectNet_v2 – ResNet18 | 44.16 | 51.37 | FLIR
Object Detection | FasterRCNN – ResNet18 | 56.42 | 60.44 | FLIR
Object Detection | YOLOv4 – ResNet18 | 40.12 | 63.46 | FLIR
Object Detection | YOLOv3 – ResNet18 | 42.36 | 61.84 | FLIR
Object Detection | RetinaNet – ResNet18 | 50.54 | 63.09 | FLIR
Image Classification | ResNet18 | 53.95 | 66.28 | Pascal VOC
Semantic Segmentation | UNET | 71.64 | 76.65 | ISBI
Table 2. Accuracy gain across networks by using AutoML

Across all the models that we tested, the increase in model accuracy is substantial compared to static default hyperparameters. The amount of improvement varies depending on the model, but we have generally seen improvements ranging from 5% to more than 20%. This shows that AutoML can work on various datasets to train the best model for a given KPI.

Summary

As the number of use cases and customization grows, it becomes imperative to accelerate the AI creation process. AutoML can eliminate the need for manual tuning, saving valuable time for developers.

With TAO AutoML, you can now automatically tune models for object detection, classification, and segmentation use cases using various popular model architectures. TAO AutoML provides simplicity for novice users to get started, as well as configurability for experts to choose their own hyperparameters to sweep.


Ferrari of Finance: Accelerated Computing Drives Milan Bank Forward

Banks require more than cash in the vault these days; they also need accelerated computing in the back room. “The boost we’re getting with GPUs not only significantly improved our performance at the same cost, it helped us redefine our business and sharpen our focus on customers,” said Marco Airoldi.



Enabling Dynamic Control Flow in CUDA Graphs with Device Graph Launch


CUDA Graphs significantly reduce the overhead of launching a large batch of user operations by defining them as a task graph, which may be launched in a single operation. Knowing the workflow upfront enables the CUDA driver to apply various optimizations, which cannot be performed when launching through a stream model. 

However, this performance comes at the cost of flexibility: if the full workflow is not known in advance, then GPU execution must be interrupted to return to the CPU to make a decision.

CUDA device graph launch solves this problem by enabling a task graph to be performantly launched from a running GPU kernel, based on data that is determined at run time. CUDA device graph launch offers two distinct launch modes—fire and forget, and tail launch—to enable a wide range of applications and use cases.

This post demonstrates how to use device graph launch and the two launch modes. It features the example of a device-side work scheduler, which decompresses files for data processing.

Device graph initialization

Executing a task graph involves the four-step process outlined below: 

  1. Create the graph
  2. Instantiate the graph into an executable graph
  3. Upload the executable graph’s work descriptors to the GPU
  4. Launch the executable graph

By separating the launch step from the other steps, CUDA is able to optimize the workflow and keep graph launch as lightweight as possible. As a convenience, CUDA will also combine the upload step with the launch step the first time a graph is launched if the upload step has not been called explicitly.

In order to launch a graph from a CUDA kernel, the graph first must have been initialized for device launch during the instantiation step. Additionally, before it can be launched from the device, the device graph must have been uploaded to the device, either explicitly through a manual upload step or implicitly through a host launch. The code below, which performs the host-side steps to set up the device scheduler example, shows both options:

// This is the signature of our scheduler kernel
// The internals of this kernel will be outlined later
__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph);

void setupAndLaunchScheduler() {
    cudaGraph_t zipGraph, lzwGraph, deflateGraph, schedulerGraph;
    cudaGraphExec_t zipExec, lzwExec, deflateExec, schedulerExec;

    // Create the source graphs for each possible operation we want to perform
    // We pass the currentFileData ptr to this setup, as this ptr is how the scheduler will
    // indicate which file to decompress
    create_zip_graph(&zipGraph, currentFileData);
    create_lzw_graph(&lzwGraph, currentFileData);
    create_deflate_graph(&deflateGraph, currentFileData);

    // Instantiate the graphs for these operations and explicitly upload
    cudaGraphInstantiate(&zipExec, zipGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(zipExec, stream);
    cudaGraphInstantiate(&lzwExec, lzwGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(lzwExec, stream);
    cudaGraphInstantiate(&deflateExec, deflateGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(deflateExec, stream);

    // Create and instantiate the scheduler graph
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    schedulerKernel<<<1, 1, 0, stream>>>(files, numFiles, currentFile, currentFileData, zipExec, lzwExec, deflateExec);
    cudaStreamEndCapture(stream, &schedulerGraph);
    cudaGraphInstantiate(&schedulerExec, schedulerGraph, cudaGraphInstantiateFlagDeviceLaunch);

    // Launch the scheduler graph - this will perform an implicit upload
    cudaGraphLaunch(schedulerExec, stream);
}

It is important to note here that device graphs can be launched either from the host or from the device. Therefore, the same cudaGraphExec_t handles may be passed to the scheduler for launch on the device as for launch on the host.

Fire and forget launch

A scheduler kernel dispatches work based on incoming data. For work dispatch, fire and forget launch is the preferred launch method.

When a graph is launched using fire and forget launch, it is dispatched immediately. It executes independently of both the launching graph and subsequent graphs launched using fire and forget mode. Because the work executes immediately, fire and forget launch is preferable for work dispatched by a scheduler, as it starts running as quickly as possible. CUDA introduces a new device-side named stream to perform a fire and forget launch of a graph. See below for an example of a simple dispatcher.

enum compressionType {
    zip     = 1,
    lzw     = 2,
    deflate = 3
};

struct fileData {
    compressionType comprType;
    void *data; 
};

__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph)
{
    // Set the data ptr to the current file so the dispatched graph
    // is operating on the correct file data
    *currentFileData = files[*currentFile].data;

    switch (files[*currentFile].comprType) {
        case zip:
            cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
            break;
        case lzw:
            cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
            break;
        case deflate:
            cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
            break;
        default:
            break;
    }
}

It is also important to note that graph launches can be nested and recursive, so additional device graphs can be dispatched from fire and forget launches. Although not shown in this example, the graphs that are decompressing the file data could dispatch more graphs to do further processing on that data once it is fully decompressed (image processing, for example). Device graph flow is hierarchical, just like graphs themselves.
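
As a minimal sketch of that nesting, the final kernel of one of the decompression graphs could itself dispatch a follow-on graph. The imageProcessExec handle and the doneDecompressing flag here are hypothetical additions for illustration:

// Hypothetical epilogue kernel at the end of a decompression graph: once the
// file has been fully decompressed, it fires off another device graph that
// performs further processing on the decompressed data.
__global__ void decompressEpilogue(bool *doneDecompressing,
                                   cudaGraphExec_t imageProcessExec)
{
    if (*doneDecompressing) {
        // Nested fire and forget launch: the follow-on graph starts
        // immediately and runs independently of the launching graph.
        cudaGraphLaunch(imageProcessExec, cudaStreamGraphFireAndForget);
    }
}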

Tail launch

CUDA work is launched asynchronously to the GPU, which means the launching thread must explicitly wait for the work to complete before consuming any result or output. This is typically done from a CPU thread using a synchronization operation such as cudaDeviceSynchronize or cudaStreamSynchronize.

It is not possible for a launching thread on the GPU to synchronize on device graph launches through traditional methods such as cudaDeviceSynchronize. Instead, when operation ordering is desired, tail launch should be used.

When a graph is submitted for tail launch, it does not execute immediately, but rather upon completion of the launching graph. CUDA encapsulates all dynamically generated work as part of the parent graph, so a tail launch will also wait for all generated fire and forget work before executing.

This is true whether the tail launch was issued before or after any fire and forget launches. Tail launches themselves execute in the order in which they are enqueued. A special case is self-relaunch, where the currently running device graph is enqueued to relaunch through tail launch. Only one pending self-relaunch is permitted at a time.

Using tail launch, you can upgrade the previous dispatcher to become a full scheduler kernel by having it relaunch itself repeatedly, effectively creating a loop in the execution flow:

__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph)
{
    // Set the data ptr to the current file so the dispatched graph
    // is operating on the correct file data
    *currentFileData = files[*currentFile].data;

    switch (files[*currentFile].comprType) {
        case zip:
            cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
            break;
        case lzw:
            cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
            break;
        case deflate:
            cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
            break;
        default:
            break;
    }

    // If we have not finished iterating over all the files, relaunch
    if (*currentFile < numFiles - 1) {
        (*currentFile)++;
        cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
    }
}

Notice how the relaunch operation uses cudaGetCurrentGraphExec to retrieve a handle to the currently executing graph. It can relaunch itself without needing a handle to its own executable graph.

Use of tail launch for the self-relaunch has the added effect of synchronizing on (waiting for) the dispatched fire and forget work before the next scheduler kernel relaunch begins. A device graph can only have one pending launch at a time (plus one self-relaunch). In order to relaunch the graph that was just dispatched, you need to make sure that the previous launch completed first. Performing a self-relaunch accomplishes this goal, so that you can dispatch whatever graph is needed for the next iteration.

Device compared to host launch performance

How would this example fare against a host-launched graph? Figure 1 compares fire and forget launch, tail launch, and host launch latencies for various topologies.

Figure 1. A comparison of device and host launch latencies for three topologies: a straight-line graph, a graph that forks and joins repeatedly, and a graph that forks once into parallel straight-line sections

This chart shows that device-side launch latency is not only more than 2x lower than host launch latency, but is also unaffected by graph structure. The latency is identical for each of the given topologies.

Device launch also scales much better to the width of the graph, as shown in Figure 2.

Figure 2. A comparison of device and host launch latencies for graphs containing 1, 2, 4, and 8 parallel straight-line sections

By comparison with host launch, device launch latency stays almost constant regardless of how much parallelism is in the graph.

Conclusion

CUDA device graph launch offers a performant way to enable dynamic control flow within CUDA kernels. While the example presented in this post provides a means of getting started with the feature, it is but a small representation of the ways this feature can be used.

For more documentation, see the device graph launch section of the programming guide. To try device graph launch, download CUDA Toolkit 12.0.


Explainer: What Is Computer Vision?


Computer vision is achieved with convolutional neural networks that can use images and video to perform segmentation, classification and detection for many applications.


CUDA Toolkit 12.0 Released for General Availability


NVIDIA announces the newest CUDA Toolkit software release, 12.0. This release is the first major release in many years and it focuses on new programming models and CUDA application acceleration through new hardware capabilities.

You can now target architecture-specific features and instructions in the NVIDIA Hopper and NVIDIA Ada Lovelace architectures with CUDA custom code, enhanced libraries, and developer tools.

CUDA 12.0 includes many changes, both major and minor. Not all changes are listed here, but this post offers an overview of the key capabilities.

Overview

  • Support for new NVIDIA Hopper and NVIDIA Ada Lovelace architecture features with additional programming model enhancements for all GPUs, including new PTX instructions and exposure through higher-level C and C++ APIs
  • Support for revamped CUDA dynamic parallelism APIs, offering substantial performance improvements compared to the legacy APIs
  • Enhancements to the CUDA Graphs API:
    • You can now schedule graph launches from GPU device-side kernels by calling built-in functions. With this ability, user code in kernels can dynamically schedule graph launches, greatly increasing the flexibility of CUDA Graphs.
    • The cudaGraphInstantiate API has been refactored to remove unused parameters.
  • Support for the GCC 12 host compiler
  • Support for C++20
  • New nvJitLink library in the CUDA Toolkit for JIT LTO
  • Library optimizations and performance improvements
  • Updates to Nsight Compute and Nsight Systems Developer Tools
  • Updated support for the latest Linux versions

For more information, see CUDA Toolkit 12.0 Release Notes. CUDA Toolkit 12.0 is available to download.

NVIDIA Hopper and NVIDIA Ada Lovelace architecture support

CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families. CUDA and the CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.

CUDA 12.0 exposes programmable functionality for many features of the NVIDIA Hopper and NVIDIA Ada Lovelace architectures:

  • Many tensor operations are now available through public PTX:
    • TMA operations
    • TMA bulk operations
    • 32x Ultra xMMA (including FP8 and FP16)
  • Launch parameters control membar domains in NVIDIA Hopper GPUs
  • Support for the smem sync unit PTX and C++ API
  • Support for C intrinsics for cooperative grid array (CGA) relaxed barriers
  • Support for programmatic L2 Cache to SM multicast (NVIDIA Hopper GPUs only)
  • Support for public PTX for SIMT collectives: elect_one
  • Genomics and DPX instructions are now available for NVIDIA Hopper GPUs to provide faster combined-math arithmetic operations (three-way max, fused add+max, and so on).

Lazy loading

Lazy loading is a technique for delaying the loading of both kernels and CPU-side modules until loading is required by the application. The default is preemptively loading all the modules the first time a library is initialized. This can result in significant savings, not only of device and host memory, but also in the end-to-end execution time of your algorithms.

Lazy loading has been part of CUDA since the 11.7 release. Subsequent CUDA releases have continued to augment and extend it. From the application development perspective, nothing specific is required to opt into lazy loading. Your existing applications work with lazy loading as-is.

If you have operations that are particularly latency-sensitive, you may want to profile your applications. The tradeoff with lazy loading is a minimal amount of latency at the point in the application where the functions are first loaded. This is overall lower than the total latency without lazy loading.​

Metric | Baseline | CUDA 11.7 | CUDA 11.8+ | Improvement
End-to-end runtime [s] | 2.9 | 1.7 | 0.7 | 4x
Binary load time [s] | 1.6 | 0.8 | 0.01 | 118x
Device memory footprint [MB] | 1245 | 435 | 435 | 3x
Host memory footprint [MB] | 1866 | 1229 | 60 | 31x
Table 1. Example application speedup with lazy loading

All libraries used with lazy loading must be built with 11.7+ to be eligible.

Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
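
As a quick sanity check, a small program can report which mode the CUDA runtime picked up for the current process; this is a minimal sketch that assumes CUDA 11.7 or later, where cudaModuleGetLoadingMode is available.

// Build: nvcc lazy_check.cu -o lazy_check
// Run:   CUDA_MODULE_LOADING=LAZY ./lazy_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaFree(0);  // force CUDA initialization

    // Query the module loading mode the runtime selected for this process.
    cudaModuleLoadingMode mode;
    if (cudaModuleGetLoadingMode(&mode) != cudaSuccess) {
        printf("cudaModuleGetLoadingMode failed\n");
        return 1;
    }
    printf("Module loading mode: %s\n",
           mode == cudaModuleLazyLoading ? "lazy" : "eager");
    return 0;
}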

Compatibility

CUDA minor version compatibility is a feature introduced in 11.x that gives you the flexibility to dynamically link your application against any minor version of the CUDA Toolkit within the same major release. Compile your code one time, and you can dynamically link against libraries, the CUDA runtime, and the user-mode driver from any minor version within the same major version of CUDA Toolkit.

For example, 11.6 applications can link against the 11.8 runtime and the reverse. This is accomplished through API or ABI consistency within the library files. For more information, see CUDA Compatibility.

Minor version compatibility continues into CUDA 12.x. However, as 12.0 is a new major release, the compatibility guarantees are reset. Applications that used minor version compatibility in 11.x may have issues when linking against 12.0. Either recompile your application against 12.0 or statically link to the needed libraries within 11.x to ensure the continuity of your development.

Likewise, applications recompiled or built in 12.0 will link to future versions of 12.x but are not guaranteed to work in CUDA Toolkit 11.x.

JIT LTO support

The CUDA 12.0 Toolkit introduces a new nvJitLink library for JIT LTO support. NVIDIA is deprecating support for the driver version of this feature. For more information, see Deprecated Features.

C++20 compiler support

CUDA Toolkit 12.0 adds support for the C++20 standard. C++20 is enabled for the following host compilers and their minimal versions:

  • GCC 10
  • Clang 11
  • MSVC 2022
  • NVC++ 22.x
  • Arm C/C++ 22.x

For more information about features, see the corresponding host compiler documentation.

While the majority of C++20 features are available in both host and device code, some are restricted.

Module support

Modules are introduced in C++20 as a new way to import and export entities across translation units.

Because it requires complex interaction between the CUDA device compiler and the host compiler, modules are not supported in CUDA C++, in either host or device code. Uses of the module and export and import keywords are diagnosed as errors.

Coroutine support

Coroutines are resumable functions. Execution can be suspended, in which case control is returned to the caller. Subsequent invocations of the coroutine resume at the point where it was suspended.

Coroutines are supported in host code but are not supported in device code. Uses of the co_await, co_yield, and co_return keywords in the scope of a device function are diagnosed as errors during device compilation.

Three-way comparison operator

The three-way comparison operator is a new kind of relational enabling the compiler to synthetize other relational operators.

Because it is tightly coupled with utility functions from the Standard Template Library, its use is restricted in device code whenever a host function is implicitly called.

Uses where the operator is called directly and does not require implicit calls are enabled.

Nsight Developer Tools

Nsight Developer Tools are receiving updates coinciding with CUDA Toolkit 12.0.

NVIDIA Nsight Systems 2022.5 introduces a preview of InfiniBand switch metrics sampling. NVIDIA Quantum InfiniBand switches offer high-bandwidth, low-latency communication. Viewing switch metrics on the Nsight Systems timeline enables you to better understand your application’s network usage. You can use this information to optimize the application’s performance.

Figure 1. InfiniBand Switch metrics in Nsight Systems

Nsight tools are built to be used collaboratively. Performance analysis in Nsight Systems often informs a deeper dive into kernel activity in Nsight Compute.

To streamline this process, Nsight Compute 2022.4 introduces Nsight Systems integration. This feature enables you to launch system trace activity and view the report in the Nsight Compute interface. You can then inspect the report and initiate kernel profiling from within the context menu.

With this workflow, you don’t have to run two different applications: it can all be done within one.

Figure 2. Nsight Systems integration in Nsight Compute

Nsight Compute 2022.4 also introduces a new inline function table that provides performance metrics split out for multiple inlined instances of a function. This heavily requested feature enables you to understand whether a function is suffering from performance issues in general or only in specific inlined cases.

It also enables you to understand where inlining is occurring, which can often lead to confusion when this level of detail is not available. The main source view continues to show the aggregation of metrics at a per-line level while the table lists the multiple locations where the function was inlined and the performance metrics for each location.

Figure 3. Inline function table in Nsight Compute

The Acceleration Structure viewer has also received a variety of optimizations and improvements, including support for NVIDIA OptiX curve profiling.

For more information, see NVIDIA Nsight Compute, NVIDIA Nsight Systems, and Nsight Visual Studio Code Edition.

Math library updates

All optimizations and features added to a library come at a cost, usually in the form of binary size. The binary size of each library has slowly increased over the course of its lifespan. NVIDIA has made significant efforts to shrink these binaries without sacrificing performance. cuFFT saw the largest size reduction, with over 50% between CUDA Toolkit 11.8 and 12.0.

There are also a few library-specific features worth calling out.

cuBLAS

cuBLASLt exposes mixed-precision multiplication operations with the new FP8 data types. These operations also support BF16 and FP16 bias fusions, as well as FP16 bias with GELU activation fusions for GEMMs with FP8 input and output data types.

Regarding performance, FP8 GEMMs can be up to 3x and 4.5x faster on H100 PCIe and SXM, respectively, compared to BF16 on A100. The CUDA Math API provides FP8 conversions to facilitate the use of the new FP8 matrix multiplication operations.

cuBLAS 12.0 extends the API to support 64-bit integer problem sizes, leading dimensions, and vector increments. These new functions have the same API as their 32-bit integer counterparts except that they have the _64 suffix in the name and declare the corresponding parameters as int64_t.

cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result);

The 64-bit integer counterpart is as follows:

cublasStatus_t cublasIsamax_64(cublasHandle_t handle, int64_t n, const float *x, int64_t incx, int64_t *result);

Performance is the focus for cuBLAS. When the arguments passed to the 64-bit integer API fit into the 32-bit range, the library uses the same kernels as if you called the 32-bit integer API. To try the new API, the migration should be as simple as adding the _64 suffix to cuBLAS functions, thanks to the C/C++ autoconversion from int32_t values to int64_t.
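
As a minimal sketch of the 64-bit API in use (error checking omitted), the following finds the index of the element with the largest absolute value in a device vector. Like its 32-bit counterpart, cublasIsamax_64 returns a 1-based index.

// Build: nvcc isamax64.cu -lcublas -o isamax64
#include <cstdio>
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    std::vector<float> h_x = {0.5f, -3.0f, 2.0f, 1.0f};
    int64_t n = static_cast<int64_t>(h_x.size());

    float *d_x = nullptr;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Identical to cublasIsamax, except sizes, increments, and the result are int64_t.
    int64_t result = 0;
    cublasIsamax_64(handle, n, d_x, /*incx=*/1, &result);

    printf("Index of max |x[i]| (1-based): %lld\n", (long long)result);  // expect 2

    cublasDestroy(handle);
    cudaFree(d_x);
    return 0;
}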

cuFFT

During plan initialization, cuFFT conducts a series of steps, including heuristics, to determine which kernels are used as well as kernel module loads.

Starting with CUDA 12.0, cuFFT delivers a larger portion of kernels using the CUDA Parallel Thread eXecution (PTX) assembly form, instead of the binary form.

The PTX code of cuFFT kernels is loaded and compiled further to binary code by the CUDA device driver at runtime when a cuFFT plan is initialized. As a first improvement enabled by the new implementation, many new accelerated kernels are available for the NVIDIA Maxwell, NVIDIA Pascal, NVIDIA Volta, and NVIDIA Turing architectures.

cuSPARSE

To reduce the amount of required workspace for sparse-sparse matrix multiplication (SpGEMM), NVIDIA is releasing two new algorithms with lower memory usage. The first algorithm computes a strict bound on the number of intermediate products, while the second one enables partitioning the computation in chunks. These new algorithms are beneficial for customers on devices with smaller memory storage.

INT8 support has been added to cusparseGather, cusparseScatter, and cusparseCsr2cscEx2.

Finally, for SpSV and SpSM, the preprocessing time is improved by an average factor of 2.5x. For the execution phase, SpSV is improved by an average factor of 1.1x, while SpSM is improved by an average factor of 3.0x.

Math API

The new NVIDIA Hopper architecture comes with new Genomics and DPX instructions for faster means of computing combined arithmetic operations like three-way max, fused add+max, and so on.

New DPX instructions accelerate dynamic programming algorithms by up to 7x over the A100 GPU. Dynamic programming is an algorithmic technique for solving a complex recursive problem by breaking it down into simpler sub-problems. For a better user experience, these instructions are now exposed through the Math API.

An example would be a three-way max + ReLU operation, max(max(max(a, b), c), 0).

int __vimax3_s32_relu ( const int  a, const int  b, const int  c )
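
Here is a minimal sketch of how the intrinsic is used in device code, assuming a CUDA 12.0 toolkit; the operation is hardware-accelerated on NVIDIA Hopper and, on earlier architectures, the Math API is expected to fall back to an equivalent instruction sequence.

#include <cstdio>
#include <cuda_runtime.h>

// Computes out[i] = max(max(max(a[i], b[i]), c[i]), 0) for each element.
__global__ void threeWayMaxRelu(const int *a, const int *b, const int *c,
                                int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __vimax3_s32_relu(a[i], b[i], c[i]);
    }
}

int main() {
    const int n = 4;
    int ha[n] = {-5,  3, -2, 7};
    int hb[n] = { 1, -4, -8, 2};
    int hc[n] = {-9,  0, -1, 5};
    int *da, *db, *dc, *dout;
    cudaMalloc(&da, n * sizeof(int));
    cudaMalloc(&db, n * sizeof(int));
    cudaMalloc(&dc, n * sizeof(int));
    cudaMalloc(&dout, n * sizeof(int));
    cudaMemcpy(da, ha, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dc, hc, n * sizeof(int), cudaMemcpyHostToDevice);

    threeWayMaxRelu<<<1, n>>>(da, db, dc, dout, n);

    int hout[n];
    cudaMemcpy(hout, dout, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) {
        printf("%d ", hout[i]);  // expected: 1 3 0 7
    }
    printf("\n");
    cudaFree(da); cudaFree(db); cudaFree(dc); cudaFree(dout);
    return 0;
}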

For more information, see Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions.

Image processing updates: nvJPEG

nvJPEG now has an improved implementation that significantly reduces the GPU memory footprint. This is accomplished by using zero-copy memory operations, fusing kernels, and in-place color space conversion.

Summary

We continue to focus on helping researchers, scientists, and developers solve the world’s most complicated AI/ML and data sciences challenges through simplified programming models. 

This CUDA 12.0 release is the first major release in many years and is foundational to help accelerate applications through the use of next-generation NVIDIA GPUs. New architecture-specific features and instructions in the NVIDIA Hopper and NVIDIA Ada Lovelace architectures are now targetable with CUDA custom code, enhanced libraries, and developer tools.

With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, a runtime library, and access to many advanced C/C++ and Python libraries.

For more information, see the following resources: