The latest NVIDIA Cybersecurity Hackathon brought together 10 teams to create exciting cybersecurity innovations using the NVIDIA Morpheus cybersecurity AI…
The event featured seven onsite Israeli teams and three remote teams from India and the UK. Working around the clock for 24 hours, the teams set out to develop new solutions to modern cybersecurity problems.
“NVIDIA hackathons are a welcoming launchpad for innovation. We put DOCA and Morpheus developers in the center, providing them with everything they need to bring their ideas to fruition and into the spotlight. We see traction as the DOCA developer community keeps growing, and we believe hackathons play a significant role in that,” said Dror Goldenberg, SVP of Software Architecture at NVIDIA.
Figure 1. Cybersecurity hackathon in NVIDIA Tel-Aviv
NVIDIA Cybersecurity Hackathon winners
First place
Team Yahalom, C4I Unit
The Yahalom team created a next-generation load balancer that supports dynamic node addition or removal and load-balancing based on user-defined fields.
The design uses BlueField DPUs as a tailor-made network device, implemented with NVIDIA DOCA FLOW APIs. Using the DPU results in accelerated throughput at scale.
Figure 2. First place Team Yahalom with the judging team
Second place
Team GAPU, Ministry of Defense with Octopus Computer Solutions
Team GAPU focused on developing a new layer of security and governance on the DPU between the platform and infrastructure. This delivers a modular and scalable first line of defense against malicious packets, including a 5-tuple firewall, DNS filtering, and deep packet inspection.
Named ARMadillo, after the BlueField Arm-based cores and the protective shield of the animal, the solution uses DOCA FLOW. ARMadillo accelerates security workflows and illustrates offloading security workloads from the host CPU and memory to the DPU.
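To make the 5-tuple firewall idea concrete, here is a minimal sketch in plain Python. It is not the ARMadillo implementation and does not use DOCA FLOW; the `Rule` class, the `decide` function, and the sample rules are all hypothetical names chosen for illustration. A packet is described by its source IP, destination IP, source port, destination port, and protocol, and the first matching rule decides its fate.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Optional, Sequence, Tuple

# (src_ip, dst_ip, src_port, dst_port, protocol)
Packet = Tuple[str, str, int, int, str]


@dataclass(frozen=True)
class Rule:
    """One hypothetical firewall rule; None means 'match anything' for that field."""
    src_net: Optional[str] = None
    dst_net: Optional[str] = None
    src_port: Optional[int] = None
    dst_port: Optional[int] = None
    protocol: Optional[str] = None
    action: str = "drop"


def rule_matches(rule: Rule, packet: Packet) -> bool:
    """Check every field of the 5-tuple that the rule constrains."""
    src_ip, dst_ip, src_port, dst_port, protocol = packet
    if rule.src_net is not None and ip_address(src_ip) not in ip_network(rule.src_net):
        return False
    if rule.dst_net is not None and ip_address(dst_ip) not in ip_network(rule.dst_net):
        return False
    if rule.src_port is not None and rule.src_port != src_port:
        return False
    if rule.dst_port is not None and rule.dst_port != dst_port:
        return False
    if rule.protocol is not None and rule.protocol != protocol:
        return False
    return True


def decide(rules: Sequence[Rule], packet: Packet, default: str = "allow") -> str:
    """Return the action of the first matching rule, or the default action."""
    for rule in rules:
        if rule_matches(rule, packet):
            return rule.action
    return default


rules = [
    Rule(dst_port=23, protocol="tcp", action="drop"),   # block telnet
    Rule(src_net="10.0.0.0/8", action="allow"),         # trust the internal range
    Rule(action="drop"),                                # drop everything else
]

telnet = ("10.1.2.3", "192.168.0.5", 40000, 23, "tcp")
internal = ("10.1.2.3", "192.168.0.5", 40000, 443, "tcp")
external = ("203.0.113.7", "192.168.0.5", 40000, 443, "tcp")
for packet in (telnet, internal, external):
    print(packet, "->", decide(rules, packet))
```

On a DPU, the same first-match logic would be expressed as hardware flow rules rather than a Python loop, which is where the offload benefit described above comes from.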
Figure 3. The GAPU team, second place winners of the NVIDIA Cybersecurity Hackathon
Third place
Team Ariel-2, Ariel University
This team worked on a solution for detecting malware in encrypted traffic, based on Morpheus and GPU acceleration. Using machine learning, the team created a Morpheus training model based on random forest, an ensemble learning method for classification, regression, and other tasks. The model operates by constructing a multitude of decision trees at training time, on a variety of datasets.
The team fed meaningful attributes selected from each dataset into the model to classify malicious data, even when it is encrypted. The team demonstrated efficient machine learning and lower AI training costs using Morpheus and GPU acceleration.
Figure 4. Team Ariel-2, third place winners of the NVIDIA Cybersecurity Hackathon
Honorable mention
Team 8200-2B, Aharai-Tech organization
Team 8200-2B was composed of high-school students who take part in the tech-leadership organization Aharai-Tech. The group worked on a cybersecurity solution that identifies malicious log-in attacks in real time. This is a marked improvement over most existing solutions, which identify a breach only after it has occurred.
The team used Morpheus pipelines for filtering, processing, and classifying large-scale data.
Figure 5. Team 8200-2B received an honorable mention
Join the DOCA Community
NVIDIA is building a broad community of DOCA developers to create applications and services on top of BlueField DPUs for efficient data centers.
Learn about the newest CUDA features such as release compatibility, dynamic parallelism, lazy module loading, and support for the new NVIDIA Hopper and NVIDIA Ada Lovelace GPU architectures.
After three years of uncertainty caused by the pandemic and its post-lockdown hangover, enterprises in 2023, even with recession looming and uncertainty abounding, face the same imperatives as before: lead, innovate and problem solve. AI is becoming the common thread in accomplishing these goals. On average, 54% of enterprise AI projects made it…
The latest version of the NVIDIA TAO Toolkit 4.0 boosts developer productivity with all-new AutoML capability, integration with third-party MLOps services, and new pretrained vision AI models. The enterprise version now includes access to the full source code and model weights for pretrained models.
The toolkit enables efficient model training for vision and conversational AI. Because it simplifies complex AI models and deep learning frameworks, even developers without AI expertise can use the toolkit to produce AI models. Using transfer learning to fine-tune NVIDIA pretrained models with your own data, it is now possible to optimize model inference throughput without AI expertise or large training datasets.
Developers can create custom production-ready models optimized for specific environments and scenarios with TAO. A notable new feature helps developers build object detection models without massive amounts of data. The use cases include detecting assembly line defects, translating particular phrases across languages, or managing city traffic.
Access to TAO source code and model weights for pretrained models.
Train high-quality models with AutoML without the hassle of manually fine-tuning hundreds of parameters.
Deploy on VMs from leading cloud providers and Kubernetes services like Amazon EKS or Azure AKS.
Simplify infrastructure management and scaling on cloud machine learning services such as Google Colab, Google Vertex AI, and Microsoft Azure Machine Learning.
New cloud integrations and third-party MLOps services, such as W&B and ClearML, provide developers and enterprises with an optimized AI workflow.
Integrate with REST APIs. Quickly build a new AI service or integrate into an existing one with REST APIs.
Use new transformer-based pretrained models (CitySemSegformer, Peoplenet Transformer) and retail-specific pretrained models (RetailObjectDetection, RetailObjectRecognition, and ReIdentificationNet).
Watch a video to experiment with the NVIDIA TAO Toolkit and pretrained models on Google Colab.
Solutions using TAO Toolkit
Trifork, a next-generation IT and business service provider, uses TAO Toolkit to accelerate the development of its AI-based baggage tracking solution for airports.
Fingermark, a company developing data-driven solutions for decision-making, uses TAO Toolkit in developing industrial vision AI solutions to improve worker safety.
A fundamental shift is currently taking place in how AI applications are built and deployed. AI applications are becoming more sophisticated and applied to broader use cases. This requires end-to-end AI lifecycle management—from data preparation, to model development and training, to deployment and management of AI apps. This approach can lower upfront costs, improve scalability, and decrease risk for customers using AI applications.
While the cloud-native approach to app development can be appealing to developers, machine learning (ML) projects are notoriously time-intensive and cost-intensive, as they require a team with a varied skill set to build and maintain.
This post explains how you can accelerate your vision AI model development using NVIDIA TAO Toolkit and deploy the model for inference with NVIDIA Triton Inference Server, all on the Azure Machine Learning (Azure ML) platform.
NVIDIA Triton Inference Server is open-source inference serving software that helps standardize model deployment and execution and delivers fast and scalable AI in production.
Azure ML is a cloud service for accelerating and managing the machine learning project lifecycle that enables developers to automate AI workflows, from data preparation and model training to model deployment. Developers can easily train, deploy, and manage AI models at scale with Azure ML.
Watch the video below to see a complete walkthrough of how to fine-tune your models with NVIDIA TAO Toolkit and deploy them for inference with NVIDIA Triton Inference Server—all on Azure ML.
Video 1. Create and deploy a custom AI model with NVIDIA TAO Toolkit on Azure Machine Learning
The overall workflow comprises three main steps:
Install the NGC Azure ML Quick Launch Toolkit
Train and optimize a pretrained object detection model
Deploy the optimized model on Azure ML with NVIDIA Triton Inference Server
Figure 1. Workflow for running NVIDIA TAO Toolkit on Azure ML
This section covers the steps required to install the NGC quick launch toolkit, which configures the Azure ML resources and uploads the necessary containers and models for training. The required config files are provided on the AzureML Quick Launch – TAO resource on the NVIDIA NGC Catalog.
To avoid conflicts with existing libraries on your machine, create a conda environment and install the Azure ML Quick Launch Toolkit in it using the following code:
The azure_config.json file contains the details pertaining to the user credentials, Azure workspace, and GPU compute resources that need to be updated. Edit the azureml_user section with your Azure subscription ID, resource group, and workspace name. Next, edit the aml_compute section with GPU cluster details.
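Before launching, it can help to confirm that no required entry was left blank. The following is a minimal sketch of such a check. Only the section names `azureml_user` and `aml_compute` come from the description above; the individual key names (`subscription_id`, `resource_group`, `workspace_name`, `vm_size`) and the helper `find_unset_fields` are assumptions made for illustration, so compare them against the actual azure_config.json that ships with the toolkit.

```python
# Hypothetical key names: only the section names azureml_user and aml_compute
# are taken from the text. Check your own azure_config.json for the real keys.
REQUIRED_FIELDS = {
    "azureml_user": ("subscription_id", "resource_group", "workspace_name"),
    "aml_compute": ("vm_size",),
}


def find_unset_fields(config):
    """Return 'section.key' names that are missing or left blank."""
    unset = []
    for section, keys in REQUIRED_FIELDS.items():
        values = config.get(section, {})
        for key in keys:
            if not str(values.get(key, "")).strip():
                unset.append(f"{section}.{key}")
    return unset


# Example with one blank value and one missing section entry.
example_config = {
    "azureml_user": {
        "subscription_id": "0000-demo",
        "resource_group": "",
        "workspace_name": "demo-ws",
    },
    "aml_compute": {},
}
print(find_unset_fields(example_config))
```

In practice you would load the dictionary with `json.load` from the file rather than define it inline.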
Recommended VMs: NCsv3, NDv2, NC A100 v4, or ND A100 v4 series
The ngc_config.json file contains the content from the NGC Catalog, such as Docker containers and Jupyter notebooks, that you can upload into the Azure Machine Learning Resources.
Several scripts are packaged that will be used for model deployment.
This will upload all the resources to the Azure ML Datastore. After the upload, a URL is generated for the Jupyter session, which enables you to interact with the session from your local web browser. You can verify that all the resources have been uploaded in the Azure ML portal. The steps to check the resources in Azure ML portal are provided in the video above.
Train and optimize an object detection model with NVIDIA TAO Toolkit
This section covers the steps for training the model with NVIDIA TAO Toolkit on the Azure ML platform.
Before you begin the training process, you need to run the auxiliary notebook called CopyData.ipynb. The notebook is automatically generated with azureml-ngc-tools. This copies the notebooks from the Datastore to the compute cluster.
Figure 2. CopyData.ipynb Jupyter notebook for copying notebooks from the Datastore to the compute cluster
A new folder called tao is created with all the additional data provided. This folder contains the Jupyter notebook, along with the required configuration files for training. Navigate to the TAO_detectnet_v2.ipynb notebook under folder tao/detectnet_V2. The DetectNet_v2 is one of the many computer vision notebooks available for training.
Figure 3. TAO DetectNet_v2 Jupyter notebook used for training the model
Once you load up the notebook, simply execute each cell shown. For more information on this network or on how to configure hyperparameters, refer to the TAO Toolkit documentation. Some of the main steps covered in the notebook include:
Setting environment variables
Downloading and converting training data
Downloading the model from the NGC catalog
Training the model
Pruning the model to remove unwanted layers and reduce model size
Retraining the pruned model to recover lost accuracy
Quantization-aware training (QAT), which changes the precision of the model to INT8, reducing model size without sacrificing accuracy
Exporting the model for inference
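To give a feel for what moving to INT8 precision means, the sketch below quantizes a handful of floating-point weights to 8-bit integers and maps them back. It uses a simple symmetric, per-tensor scheme and is only an illustration of the rounding that quantization-aware training exposes the model to during training; it is not the TAO Toolkit implementation, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 representation."""
    return [q * scale for q in quantized]


weights = [0.82, -1.0, 0.25, 0.0, -0.333]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

for original, q, approx in zip(weights, q_weights, restored):
    print(f"{original:+.4f} -> {q:+4d} -> {approx:+.4f}")

# Each restored value differs from the original by at most half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("scale:", scale, "max error:", max_error)
```

Each weight now needs one byte instead of four, and the rounding error per value is bounded by half the scale. Training with this rounding in the loop is what lets the network adapt to it.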
Once the model is generated on the compute cluster, you will need to upload the model to the Azure ML workspace. To upload, run the UploadData.ipynb notebook to copy the model to the Datastore. This model will be used for deployment.
Deploy the model using NVIDIA Triton Inference Server
Next, deploy the exported model using NVIDIA Triton Inference Server. Use the model trained in the previous step with NVIDIA TAO Toolkit and stored in the Datastore. Pull the model directly from the Datastore.
Once the model has been uploaded, push the NVIDIA Triton container to the Azure Container Registry (ACR), create the inference endpoint, and test it using some sample images.
Register the model for inference
The next steps entail uploading the model for inference. Upload the trained model from the Datastore. Navigate to the Azure Machine Learning Studio to load the local model.
1. After logging in, go to the Azure ML workspace created earlier with the azureml-ngc-tools script. Select ‘Models’ from the left menu, then click ‘Register’ and ‘From datastore.’
Figure 4. Register the model for deployment from Azure ML Studio
2. Select the ‘Triton’ model type to upload the model.
Figure 5. Upload the model from Azure ML Studio
3. Browse and select the path: tao/detectnet_v2/model_repository
Figure 6. Datastore selection
4. Name the model ‘DetectNet’ and set the version to ‘1’
Figure 7. Register the model
5. Once the model has successfully uploaded, you should be able to see the directory structure.
Figure 8. Model artifacts in Azure ML model store
Build and upload NVIDIA Triton image to Azure Container Registry
Next, build the NVIDIA Triton container with necessary dependencies and upload the image to Azure Container Registry (ACR).
1. On your local machine, run the following script to create the NVIDIA Triton container with the necessary dependencies:
bash scripts/build_container.sh
2. Verify that the image has been created locally by executing:
docker image ls
If successful, you should see the repo named nvcr.io/nvidia/tao/triton-apps.
3. Push the Docker image to ACR using the following script:
bash scripts/push_container_to_ACR.sh
The registryname parameter is the name of the provided Azure ML workspace default container registry. Navigate to the Workspace essential properties dashboard to find it in the Azure portal. This script will push the Docker image to ACR and tag it as ${registryname}.azurecr.io/tao:latest.
Figure 9. Get the Container Registry name, where the container will be uploaded
Once the script completes, navigate to ACR to see the container in the tao repository.
Create the Azure ML endpoint and deployment
On your local machine, run the following script to create an Azure ML Endpoint followed by the deployment:
bash scripts/create_endpoint_and_deployment.sh
The script will create the Azure ML Endpoint with the endpoint names provided in the endpoint_aml.yml file. It then deploys the NVIDIA Triton service on Azure using the deployment_aml.yml file.
In this file, you can specify VM size. For this example, use the Standard_NC6s_v3 VM. These files are provided in scripts/auxiliary_files.
Once the script execution is complete, you should be able to see the deployment information, REST endpoint, and authentication key on the Azure portal. The deployment information can be found by clicking into the endpoint and navigating to the Deployment Logs tab.
Figure 10. NVIDIA Triton deployment endpoint
Validate the endpoint
You can validate the endpoint by using the REST endpoint URL and the primary key found under the Endpoints tab on the Azure portal. To query the Azure ML endpoint from your local machine, run the following script:
bash scripts/infer.sh
This script is provided in the zip file pulled from AzureML Quick Launch: TAO in the first step.
Next, provide the REST endpoint URL from the NVIDIA Triton deployment endpoint. The endpoint is queried with the test image provided with the option. The output image with bounding boxes is stored in .
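If you want to check the endpoint from Python rather than through the provided script, the sketch below builds a readiness request by hand. It rests on two assumptions that are not confirmed by this post: that the deployment exposes the standard Triton HTTP health route `/v2/health/ready` under the REST endpoint URL, and that the primary key is passed as a bearer token. The URL and key shown are placeholders, and `build_health_request` is a hypothetical helper name.

```python
import urllib.request


def build_health_request(rest_endpoint_url, primary_key):
    """Build (but do not send) a GET request for the Triton readiness route.

    Assumes the standard Triton route /v2/health/ready and bearer-token
    authentication with the endpoint's primary key.
    """
    base = rest_endpoint_url.rstrip("/")
    return urllib.request.Request(
        url=f"{base}/v2/health/ready",
        headers={"Authorization": f"Bearer {primary_key}"},
        method="GET",
    )


# Placeholder values; substitute the URL and key shown in the Azure portal.
request = build_health_request("https://example.invalid/", "PRIMARY_KEY")
print(request.get_method(), request.full_url)

# To actually send it against a live endpoint:
# with urllib.request.urlopen(request, timeout=10) as response:
#     print(response.status)
```

A successful status from this route only tells you the server is up; running an inference on a test image, as the provided script does, is still the real validation.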
Summary
This post showed the end-to-end workflow for fine-tuning a model with NVIDIA TAO Toolkit and deploying the trained object detection model using NVIDIA Triton Inference Server, all on Azure Machine Learning. These tools abstract away the AI framework complexity, enabling you to build and deploy AI applications in production without the need for AI expertise.
There has been tremendous growth in AI over the years. With that, comes a larger demand for AI models and applications. Creating production-quality AI requires expertise in AI and data science and can still be intimidating for many developers.
To develop accurate AI, you must choose what model architecture to use, what data to collect, and finally how to tune the model to meet the desired KPIs. There are thousands of combinations of model architectures and hyperparameters that you’d have to try to get the best model for your specific use case. This process is extremely laborious and requires model architecture expertise to tune hyperparameters.
Automated machine learning (AutoML) automates the manual task of finding the best models and hyperparameters for the desired KPI. It can algorithmically derive the best model for your given KPI and abstract away a lot of the complexity of AI model creation and optimization.
AutoML makes it easy for even a novice developer to create a highly accurate AI model.
AutoML in TAO
AutoML in TAO is fully configurable for automatically optimizing the hyperparameters of a model, which reduces the need for manual tuning. It caters to both AI experts and non-experts.
For non-experts, the guided Jupyter notebook provides a simple, efficient way to create an accurate AI model.
For experts, TAO gives you full control of which hyperparameters to tune and which algorithm to use for sweeps.
TAO currently supports two optimization algorithms: Bayesian and Hyperband optimization. These algorithms can effectively sweep across a range of hyperparameters to find the best combination to optimize the user-provided metric.
Hyperband yields results faster because it doesn’t have to run through the entire training configuration. It runs for a limited number of epochs, discards the runs that are performing poorly, and only continues on the remaining runs. This process of elimination continues until there is a single configuration that gives the best results.
For Bayesian, the training runs to completion for all the sweeps.
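The elimination process described for Hyperband can be sketched in a few lines of Python. The code below implements a single round of successive halving, which is the building block that Hyperband repeats with different starting budgets. It is not TAO's implementation: the `successive_halving` function, the toy scoring function, and the learning-rate candidates are all made up for illustration.

```python
import math


def successive_halving(configs, evaluate, min_epochs=1, eta=2):
    """Train all configs briefly, keep the best 1/eta, and repeat with more epochs."""
    survivors = list(configs)
    epochs = min_epochs
    history = []  # (epochs used, number of configs trained) per round
    while len(survivors) > 1:
        history.append((epochs, len(survivors)))
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta
    return survivors[0], history


def toy_score(learning_rate, epochs):
    """Stand-in for a validation metric: best near lr=1e-3, improves with epochs."""
    closeness = 1 - abs(math.log10(learning_rate) + 3) / 3
    return closeness * (1 - 0.5 ** epochs)


candidates = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 3e-3, 3e-4]
best, history = successive_halving(candidates, toy_score)
print("best learning rate:", best)
for epochs, count in history:
    print(f"trained {count} configs for {epochs} epoch(s)")
```

Only the most promising configurations ever receive the larger epoch budgets, which is why this style of search tends to finish sooner than running every sweep to completion.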
AutoML is supported for a wide range of CV tasks: image classification, object detection, segmentation, and OCR. Table 1 shows the full list of supported networks.
The entire AutoML workflow can be run from the provided Jupyter notebooks. AutoML uses the TAO API services to manage all training jobs.
TAO API services
TAO API is a Kubernetes service that enables deployment of TAO as a microservice either on your own Kubernetes cluster or with cloud Kubernetes services such as Amazon EKS or Azure AKS.
TAO API services provide an additional layer of abstraction over containers. You can manage and deploy TAO services using Helm charts and remotely run jobs using REST API calls. With the APIs, you can remotely create and upload datasets, run training jobs, evaluate models, and export models for deployment.
Figure 1. Cloud-native TAO API service architecture
API services make it easy to integrate TAO into your own custom application or build a web-UI application on top of TAO. To get started with building your custom applications with REST APIs, see the API guide and the API notebooks in TAO Toolkit Getting Started. For more information, see the Notebook section later in this post.
To train using the CLI, use the lightweight CLI client application that you can install on a client system to access TAO services and CLI notebooks. CLI notebooks are available on NGC in TAO getting started resources.
AutoML requires a higher-level service on top of your training runs to determine and manage the set of experiments. TAO services keep track of all the experiments that they have tried with the KPIs and build the next set of experiments to improve on the KPIs. You can run AutoML with TAO API services either through the remote CLI application or directly using REST APIs. Jupyter notebooks for both are provided. For more information, see the Notebook section.
The REST API notebooks mainly serve as a reference if you are building your own applications or UI on top of TAO.
Set up TAO services
The TAO API service can run on any Kubernetes platform. To simplify the deployment of TAO services, we have provided a one-click deploy script. This simplifies the deployment of TAO services on bare-metal setup or on Amazon EKS. For this post, we use the bare-metal setup but instructions to deploy on cloud are provided in the API guide.
tar -xvf tao-toolkit-api-bare-metal.tar
cd tao-toolkit-api-bare-metal
Add the host IP address and login credentials in the hosts file. This is the system where you plan to run the TAO services. It could be a local or remote system but you must have sudo privileges.
For credentials, you can use either a password (ansible_ssh_pass) or an SSH private key file (ansible_ssh_private_key_file). For a single-node cluster, you can list only the master node.
You can validate SSH credentials for remote machines with the following command. The expected output is root.
ssh USERNAME@HOST_IP 'sudo whoami'
Next, modify the tao-toolkit-api-ansible-values.yml file to add your NGC credentials and the Helm chart. This pulls the Helm chart from the NGC registry. For more information, see Generating Your NGC API Key.
Install the dependencies and deploy the TAO service. Before installation, first check whether all the dependencies are met by running check-inventory.yml. If everything looks good, you should see a message that says 0 failed. Then, run install, which takes 10–15 minutes.
For this post, use the object detection notebook (TAO API Starter Kit/Notebooks/client/automl/object_detection.ipynb) but you can also do AutoML on other computer vision tasks.
Use AutoML to fine-tune an object detection model with TAO
Here is a quick walkthrough of the AutoML workflow with the Object Detection AutoML notebook. For this walkthrough, you use the client/automl/object_detection.ipynb notebook from the hierarchy shown earlier. We highlight the key steps here, but all the steps are captured in the Jupyter notebook.
Figure 2. Step-by-step TAO AutoML workflow
Select a model topology
Choose any one of the available models listed for that notebook. Each notebook has a default model for that domain. In this example, the default model is DetectNet V2 but you can change it to FasterRCNN, SSD, DSSD, Retinanet, EfficientDet, Yolo V3, Yolo V4, or YoloV4 tiny.
model_name = "detectnet-v2"
Create a dataset
The next step is to use the dataset given as an example in the notebook or use your own dataset. The folder structure of the dataset requirement is provided in the notebooks.
When you have the dataset ready, upload it to the TAO Toolkit REST API-deployed machine through the Unix rsync command for TAO-Client notebooks. You must upload the images and labels for both training and validation data.
After the dataset has been uploaded, convert the dataset to tfrecords through the dataset-convert action. All object detection models require dataset conversion, but some models from other domains, like classification, can operate on the raw data uploaded.
The next step is to choose which AutoML algorithm to run. There are options to tweak some AutoML-specific parameters. You can view the parameters that are enabled by default for the AutoML search for a model, along with all the parameters that are available for a network:
tao-client {model_name} model-automl-defaults --id {model_id} | tee ~/shared/users/{os.environ['USER']}/models/{model_id}/specs/automl_defaults.json
This outputs a list of hyperparameters that are used for AutoML. For this experiment, you are choosing five different hyperparameters to sweep.
You can add additional parameters or remove existing defaulted parameters. For example, to sweep the soft_start hyperparameter, add the following in your notebook:
There are also options to tweak algorithm-specific parameters, but the default parameters work well. For more information, see AutoML.
Train with AutoML
At this point, you have all the tools necessary to start the AutoML run. You can also change default training specs, like image extension or class mapping, before triggering the AutoML run:
When the AutoML run starts, you can see various stats, such as the best accuracy score at that time, how many experiments have been completed, an approximate estimated time for completion, and so on. You should see an output log similar to the following.
{
"best_map": 0.59636,
"Estimated time for automl completion": "23.13 minutes remaining approximately",
"Current experiment number": 3,
"Number of epochs yet to start": 429.0,
"Time per epoch in seconds": 3.24
}
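A status message like this is plain JSON, so it is easy to inspect programmatically. The sketch below parses the example above and makes a rough cross-check of the time estimate by multiplying the remaining epochs by the time per epoch. How TAO actually derives its estimate is not stated here, so treat the calculation as a back-of-the-envelope approximation.

```python
import json

status_text = """
{
  "best_map": 0.59636,
  "Estimated time for automl completion": "23.13 minutes remaining approximately",
  "Current experiment number": 3,
  "Number of epochs yet to start": 429.0,
  "Time per epoch in seconds": 3.24
}
"""

status = json.loads(status_text)

# Rough estimate: remaining epochs multiplied by the average time per epoch.
remaining_seconds = (
    status["Number of epochs yet to start"] * status["Time per epoch in seconds"]
)
remaining_minutes = remaining_seconds / 60

print(f"best mAP so far: {status['best_map']}")
print(f"experiment number: {status['Current experiment number']}")
print(f"rough time remaining: {remaining_minutes:.2f} minutes")
```

The rough figure comes out at a little over 23 minutes, in line with the estimate reported in the log.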
Compare models
At the end of the AutoML run, you can see the results of all experiments. You are presented with the spec file and the binary weight file of the model achieving the highest accuracy among the AutoML sweeps.
The best model for this experiment was ID 9 with a mAP of 0.627. This is stored in the best_model/recommendataion_9.kitti file.
After saving the best model obtained from AutoML, you can plug the model and spec file in the end-to-end notebook and then prune and optimize the model for inference.
Figure 3. End-to-end workflow from AutoML training to model optimization
To plug the model into the new notebook, copy the train job ID from the AutoML notebook. The AutoML train job ID is printed when you run the training job.
When you have the train job ID, open the end-to-end notebook from the notebook hierarchy from earlier. For this post, use the TAO API Starter Kit/Notebooks/client/end2end/detectnet_v2.ipynb notebook. As you have already trained a model, just run the import statement in the first cell and skip all the way down to the Run Evaluate section. In this section, create a code cell before evaluating.
train_job_id = "id_you_copied"
After adding the job_map code cell, you can evaluate the model, prune the model for compression, and even run quantization-aware training of the original model or the pruned model, as showcased in the end-to-end notebooks.
Results
We trained various models with AutoML on public datasets to see how much improvement in accuracy we could achieve. We compared the best AutoML-based accuracy against a baseline accuracy number from the default spec file provided in the packages. The results are in Table 2.
For object detection, we trained on the FLIR dataset, which contains images from both thermal and RGB sensors.
For semantic segmentation, we used the ISBI dataset.
For accuracy, we used mAP (mean average precision) for object detection, average accuracy of all classes and tasks for image classification, and mean IoU (intersection over union) score for semantic segmentation.
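As a reminder of what the mean IoU score measures, here is a minimal sketch that computes it for two tiny label maps. Each pixel carries a class ID; for every class, IoU is the number of pixels where prediction and ground truth agree on that class, divided by the number of pixels where either of them shows it. The function name and the sample data are illustrative and are not taken from the TAO evaluation code.

```python
def mean_iou(predicted, target, num_classes):
    """Average the per-class intersection over union across classes that appear."""
    ious = []
    for class_id in range(num_classes):
        intersection = sum(
            1 for p, t in zip(predicted, target) if p == class_id and t == class_id
        )
        union = sum(
            1 for p, t in zip(predicted, target) if p == class_id or t == class_id
        )
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Flattened 2x2 label maps with two classes (0 = background, 1 = object).
predicted_labels = [0, 0, 1, 1]
target_labels = [0, 1, 1, 1]

score = mean_iou(predicted_labels, target_labels, num_classes=2)
print(f"mean IoU: {score:.4f}")
```

Here class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12. A perfect prediction gives 1.0.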
| Task | Model | Baseline accuracy (default spec) | Best AutoML accuracy | Dataset |
| --- | --- | --- | --- | --- |
| Object Detection | DetectNet_v2 – ResNet18 | 44.16 | 51.37 | FLIR |
| Object Detection | FasterRCNN – ResNet18 | 56.42 | 60.44 | FLIR |
| Object Detection | YOLOv4 – ResNet18 | 40.12 | 63.46 | FLIR |
| Object Detection | YOLOv3 – ResNet18 | 42.36 | 61.84 | FLIR |
| Object Detection | RetinaNet – ResNet18 | 50.54 | 63.09 | FLIR |
| Image Classification | ResNet18 | 53.95 | 66.28 | Pascal VOC |
| Semantic Segmentation | UNET | 71.64 | 76.65 | ISBI |
Table 2. Accuracy gain across networks by using AutoML
Across all the models that we tested, the increase in model accuracy is substantial compared to static default hyperparameters. The amount of improvement varies depending on the model, but we have generally seen improvements ranging from 5% to more than 20%. This shows that AutoML can work on various datasets to train the best model for a given KPI.
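To see the per-model gains at a glance, the short script below recomputes them from the numbers in Table 2. It reports both the absolute gain in accuracy points and the gain relative to the baseline; the variable names are arbitrary.

```python
# (model, baseline accuracy, best AutoML accuracy) copied from Table 2
results = [
    ("DetectNet_v2 - ResNet18", 44.16, 51.37),
    ("FasterRCNN - ResNet18", 56.42, 60.44),
    ("YOLOv4 - ResNet18", 40.12, 63.46),
    ("YOLOv3 - ResNet18", 42.36, 61.84),
    ("RetinaNet - ResNet18", 50.54, 63.09),
    ("ResNet18 (classification)", 53.95, 66.28),
    ("UNET", 71.64, 76.65),
]

# Absolute gain in accuracy points and gain relative to the baseline.
gains = {}
for model, baseline, best in results:
    absolute = round(best - baseline, 2)
    relative = round(100 * (best - baseline) / baseline, 1)
    gains[model] = (absolute, relative)
    print(f"{model:28s} +{absolute:5.2f} points ({relative:4.1f}% relative)")

smallest = min(gains, key=lambda m: gains[m][0])
largest = max(gains, key=lambda m: gains[m][0])
print("smallest gain:", smallest, gains[smallest][0])
print("largest gain:", largest, gains[largest][0])
```

In absolute terms, the gains run from about 4 points for FasterRCNN to about 23 points for YOLOv4.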
Summary
As the number of use cases and customization grows, it becomes imperative to accelerate the AI creation process. AutoML can eliminate the need for manual tuning, saving valuable time for developers.
With TAO AutoML, you can now automatically tune models for object detection, classification, and segmentation use cases using various popular model architectures. TAO AutoML provides simplicity for novice users to get started, as well as configurability for experts to choose their own hyperparameters to sweep.
Banks require more than cash in the vault these days; they also need accelerated computing in the back room. “The boost we’re getting with GPUs not only significantly improved our performance at the same cost, it helped us redefine our business and sharpen our focus on customers,” said Marco Airoldi, who’s been head of financial…
CUDA Graphs significantly reduce the overhead of launching a large batch of user operations by defining them as a task graph, which may be launched in a single operation. Knowing the workflow upfront enables the CUDA driver to apply various optimizations, which cannot be performed when launching through a stream model.
However, this performance comes at the cost of flexibility: if the full workflow is not known in advance, then GPU execution must be interrupted to return to the CPU to make a decision.
CUDA device graph launch solves this problem by enabling a task graph to be performantly launched from a running GPU kernel, based on data that is determined at run time. CUDA device graph launch offers two distinct launch modes—fire and forget, and tail launch—to enable a wide range of applications and use.
This post demonstrates how to use device graph launch and the two launch modes. It features the example of a device-side work scheduler, which decompresses files for data processing.
Device graph initialization
Executing a task graph involves the four-step process outlined below:
Create the graph
Instantiate the graph into an executable graph
Upload the executable graph’s work descriptors to the GPU
Launch the executable graph
By separating the launch step from the other steps, CUDA is able to optimize the workflow and keep graph launch as lightweight as possible. As a convenience, CUDA will also combine the upload step with the launch step the first time a graph is launched if the upload step has not been called explicitly.
In order to launch a graph from a CUDA kernel, the graph first must have been initialized for device launch during the instantiation step. Additionally, before it can be launched from the device, the device graph must have been uploaded to the device, either explicitly through a manual upload step or implicitly through a host launch. The code below, which performs the host-side steps to set up the device scheduler example, shows both options:
// This is the signature of our scheduler kernel
// The internals of this kernel will be outlined later
__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph);

// stream, files, numFiles, currentFile, and currentFileData are assumed to be
// created and allocated elsewhere before this function is called
void setupAndLaunchScheduler() {
    cudaGraph_t zipGraph, lzwGraph, deflateGraph, schedulerGraph;
    cudaGraphExec_t zipExec, lzwExec, deflateExec, schedulerExec;

    // Create the source graphs for each possible operation we want to perform
    // We pass the currentFileData ptr to this setup, as this ptr is how the scheduler will
    // indicate which file to decompress
    create_zip_graph(&zipGraph, currentFileData);
    create_lzw_graph(&lzwGraph, currentFileData);
    create_deflate_graph(&deflateGraph, currentFileData);

    // Instantiate the graphs for these operations and explicitly upload
    cudaGraphInstantiate(&zipExec, zipGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(zipExec, stream);
    cudaGraphInstantiate(&lzwExec, lzwGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(lzwExec, stream);
    cudaGraphInstantiate(&deflateExec, deflateGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(deflateExec, stream);

    // Create and instantiate the scheduler graph
    // The scheduler runs as a single thread so that each graph is dispatched once
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    schedulerKernel<<<1, 1, 0, stream>>>(files, numFiles, currentFile, currentFileData, zipExec, lzwExec, deflateExec);
    cudaStreamEndCapture(stream, &schedulerGraph);
    cudaGraphInstantiate(&schedulerExec, schedulerGraph, cudaGraphInstantiateFlagDeviceLaunch);

    // Launch the scheduler graph - this will perform an implicit upload
    cudaGraphLaunch(schedulerExec, stream);
}
It is important to note here that device graphs can be launched either from the host or from the device. Therefore, the same cudaGraphExec_t handles may be passed to the scheduler for launch on the device as for launch on the host.
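To illustrate that a graph instantiated for device launch remains launchable from the host, here is a minimal sketch. The kernel incrementKernel, the check helper, and the function hostLaunchDeviceGraph are hypothetical names introduced for illustration and are not part of the scheduler example. It assumes a CUDA 12 or later toolkit and a GPU that supports device graph launch.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper)
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical stand-in for real work: bumps a counter in device memory
__global__ void incrementKernel(int *value) { *value += 1; }

// Launches a device-launchable graph from the host `launches` times
// and returns the final counter value
int hostLaunchDeviceGraph(int launches) {
    cudaStream_t stream;
    int *value;
    check(cudaStreamCreate(&stream), "cudaStreamCreate");
    check(cudaMalloc(&value, sizeof(int)), "cudaMalloc");
    check(cudaMemset(value, 0, sizeof(int)), "cudaMemset");

    // Capture a single kernel node into a graph
    cudaGraph_t graph;
    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal), "begin capture");
    incrementKernel<<<1, 1, 0, stream>>>(value);
    check(cudaStreamEndCapture(stream, &graph), "end capture");

    // Instantiate for device launch, then upload explicitly
    cudaGraphExec_t exec;
    check(cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch),
          "cudaGraphInstantiate");
    check(cudaGraphUpload(exec, stream), "cudaGraphUpload");

    // The same handle is launched from the host here. Waiting after each
    // launch keeps at most one launch of this graph in flight.
    for (int i = 0; i < launches; ++i) {
        check(cudaGraphLaunch(exec, stream), "cudaGraphLaunch");
        check(cudaStreamSynchronize(stream), "cudaStreamSynchronize");
    }

    int result = 0;
    check(cudaMemcpy(&result, value, sizeof(int), cudaMemcpyDeviceToHost), "cudaMemcpy");

    check(cudaGraphExecDestroy(exec), "cudaGraphExecDestroy");
    check(cudaGraphDestroy(graph), "cudaGraphDestroy");
    check(cudaFree(value), "cudaFree");
    check(cudaStreamDestroy(stream), "cudaStreamDestroy");
    return result;
}
```

Calling hostLaunchDeviceGraph(3) from a host program runs the graph three times. The handle exec could equally be passed to a kernel for device-side launch.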
Fire and forget launch
A scheduler kernel dispatches work based on incoming data. For work dispatch, fire and forget launch is the preferred launch method.
When a graph is launched using fire and forget launch, it is dispatched immediately. It executes independently of both the launching graph and subsequent graphs launched using fire and forget mode. Because the work starts running as quickly as possible, this mode suits work dispatched by a scheduler. CUDA introduces a new device-side named stream, cudaStreamGraphFireAndForget, to perform a fire and forget launch of a graph. See below for an example of a simple dispatcher.
enum compressionType {
    zip = 1,
    lzw = 2,
    deflate = 3
};

struct fileData {
    compressionType comprType;
    void *data;
};

__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph)
{
    // Set the data ptr to the current file so the dispatched graph
    // is operating on the correct file data
    // currentFile is a pointer, so dereference it to get the index
    *currentFileData = files[*currentFile].data;

    switch (files[*currentFile].comprType) {
        case zip:
            cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
            break;
        case lzw:
            cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
            break;
        case deflate:
            cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
            break;
        default:
            break;
    }
}
It is also important to note that graph launches can be nested and recursive, so additional device graphs can be dispatched from fire and forget launches. Although not shown in this example, the graphs that are decompressing the file data could dispatch more graphs to do further processing on that data once it is fully decompressed (image processing, for example). Device graph flow is hierarchical, just like graphs themselves.
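The nesting described above can be sketched as a three-level chain: a dispatcher launches a decompression graph, which in turn launches a post-processing graph. All kernel and helper names below (dispatchKernel, decompressKernel, postProcessKernel, finishCapture, runNestedDispatch) are hypothetical, and the kernels only set flags in place of real work. Treat the need to compile with relocatable device code (for example, nvcc -rdc=true) as an assumption to verify against your toolkit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper)
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical follow-up step: marks slot 1 as done
__global__ void postProcessKernel(int *flags) { flags[1] = 1; }

// Hypothetical decompression step: marks slot 0, then dispatches the
// follow-up graph. Run with one thread so the graph is launched exactly once.
__global__ void decompressKernel(int *flags, cudaGraphExec_t postProcessGraph) {
    flags[0] = 1;
    cudaGraphLaunch(postProcessGraph, cudaStreamGraphFireAndForget);
}

// Hypothetical top-level dispatcher
__global__ void dispatchKernel(cudaGraphExec_t decompressGraph) {
    cudaGraphLaunch(decompressGraph, cudaStreamGraphFireAndForget);
}

// Ends the capture on `stream`, then instantiates and uploads the captured
// graph for device launch
static cudaGraphExec_t finishCapture(cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    check(cudaStreamEndCapture(stream, &graph), "end capture");
    check(cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch),
          "cudaGraphInstantiate");
    check(cudaGraphUpload(exec, stream), "cudaGraphUpload");
    check(cudaGraphDestroy(graph), "cudaGraphDestroy");
    return exec;
}

// Returns the number of stages that ran
int runNestedDispatch() {
    cudaStream_t stream;
    int *flags;
    check(cudaStreamCreate(&stream), "cudaStreamCreate");
    check(cudaMalloc(&flags, 2 * sizeof(int)), "cudaMalloc");
    check(cudaMemset(flags, 0, 2 * sizeof(int)), "cudaMemset");

    // Build innermost graph first: each handle is a kernel argument of its parent
    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal), "begin capture");
    postProcessKernel<<<1, 1, 0, stream>>>(flags);
    cudaGraphExec_t postExec = finishCapture(stream);

    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal), "begin capture");
    decompressKernel<<<1, 1, 0, stream>>>(flags, postExec);
    cudaGraphExec_t decompressExec = finishCapture(stream);

    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal), "begin capture");
    dispatchKernel<<<1, 1, 0, stream>>>(decompressExec);
    cudaGraphExec_t dispatchExec = finishCapture(stream);

    // Host launch of the top-level graph; the stream waits for generated work
    check(cudaGraphLaunch(dispatchExec, stream), "cudaGraphLaunch");
    check(cudaStreamSynchronize(stream), "cudaStreamSynchronize");

    int host[2] = {0, 0};
    check(cudaMemcpy(host, flags, sizeof(host), cudaMemcpyDeviceToHost), "cudaMemcpy");

    check(cudaGraphExecDestroy(dispatchExec), "cudaGraphExecDestroy");
    check(cudaGraphExecDestroy(decompressExec), "cudaGraphExecDestroy");
    check(cudaGraphExecDestroy(postExec), "cudaGraphExecDestroy");
    check(cudaFree(flags), "cudaFree");
    check(cudaStreamDestroy(stream), "cudaStreamDestroy");
    return host[0] + host[1];
}
```

Because a fire and forget launch runs independently of its launcher, a real pipeline would issue the follow-up launch from the final node of the decompression graph, so that post-processing starts only after the data is ready.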
Tail launch
CUDA work is launched asynchronously to the GPU, which means the launching thread must explicitly wait for the work to complete before consuming any result or output. This is typically done from a CPU thread using a synchronization operation such as cudaDeviceSynchronize or cudaStreamSynchronize.
It is not possible for a launching thread on the GPU to synchronize on device graph launches through traditional methods such as cudaDeviceSynchronize. Instead, when operation ordering is desired, tail launch should be used.
When a graph is submitted for tail launch, it does not execute immediately, but rather upon completion of the launching graph. CUDA encapsulates all dynamically generated work as part of the parent graph, so a tail launch will also wait for all generated fire and forget work before executing.
This is true whether the tail launch was issued before or after any fire and forget launches. Tail launches themselves execute in the order in which they are enqueued. A special case is self-relaunch, where the currently running device graph is enqueued to relaunch through tail launch. Only one pending self-relaunch is permitted at a time.
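The ordering rules above can be sketched with three small graphs that each append an ID to a log. The parent enqueues tail launch A, then a fire and forget launch, then tail launch B. Based on the semantics described here, the fire and forget entry should be logged first, followed by A and then B. The names recordKernel, parentKernel, finishCapture, makeRecorder, and tailLaunchOrder are hypothetical, and the sketch assumes a CUDA 12 or later toolkit.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper)
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Appends `id` to a log in device memory; `count` holds the next free slot
__global__ void recordKernel(int *log, int *count, int id) {
    log[atomicAdd(count, 1)] = id;
}

// Parent kernel (single thread): enqueues tail A, fire and forget, tail B
__global__ void parentKernel(cudaGraphExec_t fireAndForget,
                             cudaGraphExec_t tailA,
                             cudaGraphExec_t tailB) {
    cudaGraphLaunch(tailA, cudaStreamGraphTailLaunch);
    cudaGraphLaunch(fireAndForget, cudaStreamGraphFireAndForget);
    cudaGraphLaunch(tailB, cudaStreamGraphTailLaunch);
}

// Ends the capture on `stream`, then instantiates and uploads the captured
// graph for device launch
static cudaGraphExec_t finishCapture(cudaStream_t stream) {
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    check(cudaStreamEndCapture(stream, &graph), "end capture");
    check(cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch),
          "cudaGraphInstantiate");
    check(cudaGraphUpload(exec, stream), "cudaGraphUpload");
    check(cudaGraphDestroy(graph), "cudaGraphDestroy");
    return exec;
}

// Captures a one-node graph that records `id`
static cudaGraphExec_t makeRecorder(cudaStream_t stream, int *log, int *count, int id) {
    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal), "begin capture");
    recordKernel<<<1, 1, 0, stream>>>(log, count, id);
    return finishCapture(stream);
}

// Fills `order` with the IDs in the order the graphs actually ran
void tailLaunchOrder(int order[3]) {
    cudaStream_t stream;
    int *log, *count;
    check(cudaStreamCreate(&stream), "cudaStreamCreate");
    check(cudaMalloc(&log, 3 * sizeof(int)), "cudaMalloc");
    check(cudaMalloc(&count, sizeof(int)), "cudaMalloc");
    check(cudaMemset(log, 0, 3 * sizeof(int)), "cudaMemset");
    check(cudaMemset(count, 0, sizeof(int)), "cudaMemset");

    cudaGraphExec_t fireAndForget = makeRecorder(stream, log, count, 1);
    cudaGraphExec_t tailA = makeRecorder(stream, log, count, 2);
    cudaGraphExec_t tailB = makeRecorder(stream, log, count, 3);

    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal), "begin capture");
    parentKernel<<<1, 1, 0, stream>>>(fireAndForget, tailA, tailB);
    cudaGraphExec_t parent = finishCapture(stream);

    check(cudaGraphLaunch(parent, stream), "cudaGraphLaunch");
    check(cudaStreamSynchronize(stream), "cudaStreamSynchronize");
    check(cudaMemcpy(order, log, 3 * sizeof(int), cudaMemcpyDeviceToHost), "cudaMemcpy");

    check(cudaGraphExecDestroy(parent), "cudaGraphExecDestroy");
    check(cudaGraphExecDestroy(tailB), "cudaGraphExecDestroy");
    check(cudaGraphExecDestroy(tailA), "cudaGraphExecDestroy");
    check(cudaGraphExecDestroy(fireAndForget), "cudaGraphExecDestroy");
    check(cudaFree(count), "cudaFree");
    check(cudaFree(log), "cudaFree");
    check(cudaStreamDestroy(stream), "cudaStreamDestroy");
}
```

The tail launches are enqueued around the fire and forget launch on purpose, to show that enqueue position relative to fire and forget work does not change when a tail launch runs.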
Using tail launch, you can upgrade the previous dispatcher to become a full scheduler kernel by having it relaunch itself repeatedly, effectively creating a loop in the execution flow:
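__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph)
{
    // Set the data ptr to the current file so the dispatched graph
    // is operating on the correct file data
    // currentFile is a pointer, so dereference it to get the index
    *currentFileData = files[*currentFile].data;

    switch (files[*currentFile].comprType) {
        case zip:
            cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
            break;
        case lzw:
            cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
            break;
        case deflate:
            cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
            break;
        default:
            break;
    }

    // Advance to the next file before checking, so the relaunched
    // kernel never reads past the end of the files array
    (*currentFile)++;

    // If we have not finished iterating over all the files, relaunch
    if (*currentFile < numFiles) {
        // Query the current graph handle so we can relaunch it
        cudaGraphExec_t currentGraph = cudaGetCurrentGraphExec();
        cudaGraphLaunch(currentGraph, cudaStreamGraphTailLaunch);
    }
}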
Notice how the relaunch operation uses cudaGetCurrentGraphExec to retrieve a handle to the currently executing graph. It can relaunch itself without needing a handle to its own executable graph.
Use of tail launch for the self-relaunch has the added effect of synchronizing on (waiting for) the dispatched fire and forget work before the next scheduler kernel relaunch begins. A device graph can only have one pending launch at a time (plus one self-relaunch). In order to relaunch the graph that was just dispatched, you need to make sure that the previous launch completed first. Performing a self-relaunch accomplishes this goal, so that you can dispatch whatever graph is needed for the next iteration.
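The self-relaunch loop can be reduced to a minimal sketch: a single-thread kernel that increments a counter and relaunches its own graph through tail launch until a limit is reached. The names countingKernel and runSelfRelaunch are hypothetical and stand in for the scheduler. The sketch assumes a CUDA 12 or later toolkit and that the host stream does not proceed until the chain of tail launches has finished.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails (hypothetical helper)
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Increments the counter, then relaunches its own graph until `limit` is
// reached. Run with one thread so only one self-relaunch is ever pending.
__global__ void countingKernel(int *counter, int limit) {
    *counter += 1;
    if (*counter < limit) {
        cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
    }
}

// Runs the loop on the GPU and returns the final counter value
int runSelfRelaunch(int limit) {
    cudaStream_t stream;
    int *counter;
    check(cudaStreamCreate(&stream), "cudaStreamCreate");
    check(cudaMalloc(&counter, sizeof(int)), "cudaMalloc");
    check(cudaMemset(counter, 0, sizeof(int)), "cudaMemset");

    // Capture the kernel into a graph and instantiate it for device launch
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    check(cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal), "begin capture");
    countingKernel<<<1, 1, 0, stream>>>(counter, limit);
    check(cudaStreamEndCapture(stream, &graph), "end capture");
    check(cudaGraphInstantiate(&exec, graph, cudaGraphInstantiateFlagDeviceLaunch),
          "cudaGraphInstantiate");

    // A single host launch starts the loop and performs the implicit upload
    check(cudaGraphLaunch(exec, stream), "cudaGraphLaunch");
    check(cudaStreamSynchronize(stream), "cudaStreamSynchronize");

    int result = 0;
    check(cudaMemcpy(&result, counter, sizeof(int), cudaMemcpyDeviceToHost), "cudaMemcpy");

    check(cudaGraphExecDestroy(exec), "cudaGraphExecDestroy");
    check(cudaGraphDestroy(graph), "cudaGraphDestroy");
    check(cudaFree(counter), "cudaFree");
    check(cudaStreamDestroy(stream), "cudaStreamDestroy");
    return result;
}
```

The host is involved only in the first launch. Every later iteration is enqueued from the device, which is the pattern the scheduler kernel relies on.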
Device compared to host launch performance
How would this example fare against a host-launched graph? Figure 1 compares fire and forget launch, tail launch, and host launch latencies for various topologies.
Figure 1. A comparison of device and host launch latencies for various topologies
This chart shows that device-side launch latency is not only more than 2x lower than that of host launch, but is also unaffected by graph structure. The latency is identical for each of the given topologies.
Device launch also scales much better with the width of the graph, as shown in Figure 2.
Figure 2. A comparison of device and host launch latencies for graphs containing variable amounts of parallel straight-line segments
In contrast to host launch, device launch latency stays almost constant regardless of how much parallelism is in the graph.
Conclusion
CUDA device graph launch offers a performant way to enable dynamic control flow within CUDA kernels. While the example presented in this post provides a means of getting started with the feature, it is but a small representation of the ways this feature can be used.