CUDA Context-Independent Module Loading

Most CUDA developers are familiar with the cuModuleLoad API and its counterparts for loading a module containing device code into a CUDA context. In most cases, you want to load identical device code on all devices. This requires loading device code into each CUDA context explicitly. Moreover, libraries and frameworks that do not control context creation and destruction must track the contexts created by the application to explicitly load and unload modules.

This post discusses context-independent loading introduced in CUDA 12.0, which solves these problems.

Context-dependent loading

Traditionally, module loading has always been associated with a CUDA context. The following code example shows the traditional way of loading identical device code into two devices and then launching kernels on them.

// Device 0
cuDeviceGet(&device0, 0);
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuModuleLoad(&module0, "myModule.cubin");
// Device 1
cuDeviceGet(&device1, 1);
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuModuleLoad(&module1, "myModule.cubin");

Launching a kernel on each of the devices requires you to retrieve a per-module CUfunction as shown in the following code example:

// Device 0
cuModuleGetFunction(&function0, module0, "myKernel");
cuLaunchKernel(function0, …);
// Device 1
cuModuleGetFunction(&function1, module1, "myKernel");
cuLaunchKernel(function1, …);

This increases code complexity in the application as you must retrieve and track the per-context and per-module types. You also have to unload each module explicitly by using the cuModuleUnload API.

The problem is exacerbated when libraries or frameworks primarily use CUDA driver APIs for loading their own modules. They may not have complete control over the lifetime of contexts owned by the application.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx1, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  map<CUcontext, CUmodule> moduleContextMap;
}

libraryFunc() {
  cuCtxGetCurrent(&ctx);
  if (!moduleContextMap.contains(ctx)){
    cuModuleLoad(&module, "myModule.cubin");
    moduleContextMap[ctx] = module;
  }
  else {
    module = moduleContextMap[ctx];
  }

  cuModuleGetFunction(&function, module, "myKernel");
  cuLaunchKernel(function, …);
}

libraryDeinitialize() {
  moduleContextMap.clear();
}

In the code example, the library must check for new contexts and load modules into them explicitly. It also must maintain state to check if the module is already loaded into the context. 

Ideally, the state can be freed after the context is destroyed. However, this is not possible if the library has no control over the lifetime of contexts. 

This means that the freeing of resources must be delayed until library deinitialization. This not only increases code complexity, but it also causes the library to hold on to resources longer than it must, potentially denying another portion of the application from using that memory.

Another alternative is for libraries and frameworks to force additional constraints on the users to ensure that they have sufficient control over resource allocation and cleanup.

Context-independent loading

CUDA 12.0 introduces context-independent loading with the addition of the cuLibrary* and cuKernel* APIs, which solve these problems. With context-independent loading, the loading and unloading of modules into each CUDA context is done automatically by the CUDA driver as contexts are created and destroyed by the application.

// Load library
cuLibraryLoadFromFile(&library, "myModule.cubin", …);
cuLibraryGetKernel(&kernel, library, "myKernel");

// Launch kernel on the primary context of device 0
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuLaunchKernel((CUkernel)kernel, …);

// Launch kernel on the primary context of device 1
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuLaunchKernel((CUkernel)kernel, …);

// Unload library
cuLibraryUnload(library);

As shown in the code example, the cuLibraryLoadFromFile API takes care of loading the module when a context is created or initialized. In the example, this is done during cuDevicePrimaryCtxRetain.

Moreover, you can now launch the kernels using the context-independent handle CUkernel, rather than having to maintain a per-context CUfunction. cuLibraryGetKernel retrieves a context-independent handle to the device function myKernel. The device function can then be launched with cuLaunchKernel by specifying the context-independent handle CUkernel. The CUDA driver takes care of launching the device function in the appropriate context based on the context that is active at that point.

Libraries and frameworks can now simply load and unload modules one time during initialization and deinitialization, respectively.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx1, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  cuLibraryLoadFromFile(&library, "myModule.cubin", …);
  cuLibraryGetKernel(&kernel, library, "myKernel");
}

libraryFunc() {
  cuLaunchKernel((CUkernel)kernel, …);
}

libraryDeinitialize() {
  cuLibraryUnload(library);
}

The library does not have to maintain and track per-context states anymore. The design of context-independent loading enables the CUDA driver to track modules and contexts and carry out the work of loading and unloading modules.

Accessing __managed__ variables

Managed variables can be referenced from both device and host code. For example, the address of a managed variable can be queried, or it can be read or written directly from a device or host function. Unlike __device__ variables, which have the lifetime of the CUDA context in which they are created, __managed__ variables belonging to a module point to the same memory across all CUDA contexts or even devices.

Before CUDA 12.0, there was no way to retrieve a handle through the driver API to a managed variable that would be unique across CUDA contexts. CUDA 12.0 introduces a new driver API, cuLibraryGetManaged, which makes it possible to get a unique handle across CUDA contexts.

Get started with context-independent loading

In this post, we introduced new CUDA driver APIs that provide the ability to load device code independent of a CUDA context. We also discussed context-independent handles to launch kernels. Together, they provide a simpler way to load and execute code on the GPU in comparison to the traditional loading mechanisms, reducing code complexity and avoiding the need for maintaining per-context states. 

To start using these APIs, download the CUDA Driver and Toolkit version 12 or higher. For more information about the cuLibrary* and cuKernel* APIs, see the CUDA Driver API documentation.

Upcoming Workshop: Computer Vision for Industrial Inspection

Learn how to create an end-to-end hardware-accelerated industrial inspection pipeline to automate defect detection in this workshop on January 18 (CET).

Improving Robot Motion Generation with Motion Policy Networks

Collision-free motion generation in unknown environments is a core building block for robotic applications. Generating such motions is challenging. The motion generator must be fast enough for real-time performance and reliable enough for practical deployment. 

Many methods addressing these challenges have been proposed, ranging from using local controllers to global planners. However, these traditional motion planning solutions are unable to overcome shortcomings when the environment is unknown and dynamic. They also require complex visual processing procedures, such as SLAM, to generate obstacle representations by aggregating camera observations from multiple viewpoints. These representations ultimately require costly updates when the objects move and the environment changes.

Motion Policy Networks (MπNets), pronounced “M Pi Nets,” is a new end-to-end neural policy developed by the NVIDIA Robotics research team. MπNets generates collision-free, smooth motion in real time using a continuous stream of data from a single static camera. The technology circumvents the challenges of traditional motion planning and is flexible enough to be applied in unknown environments.

We will be presenting this work on December 18 at the Conference on Robot Learning (CoRL) 2022 in New Zealand.

Large-scale synthetic data generation

To train the MπNets neural policy, we first needed to create a large-scale dataset for learning and benchmarking. We turned to simulation for synthetically generating vast amounts of robot trajectories and camera point cloud data. 

The expert trajectories are generated using a motion planner that creates consistent motion around complex obstacles while accounting for a robot’s physical and geometric constraints. It consists of a pipeline of geometric fabrics from NVIDIA Omniverse, an AIT* global planner, and spline-based temporal resampling. 

MπNets was trained with more than 3 million expert trajectories and 700 million point clouds rendered in 500,000 simulated environments. Training the neural policy on large-scale data was crucial for generalizing to unknown environments in the real world. 

The image shows a variety of examples of MπNets training in a variety of simulated environments.
Figure 1. MπNets is trained with a large-scale dataset consisting of 3.27 million trajectories across 575 K procedurally generated environments

End-to-end architecture for motion planning

An end-to-end neural network policy, MπNets maps directly from camera point cloud observations to robot joint positions. The policy jointly encodes three inputs: a single-view point cloud camera observation of the scene, the robot’s current state configuration, and desired target pose that the user commands the robot to achieve. 

It outputs joint positions to achieve the specified target pose, which we then execute on the robot’s low-level controller.

A workflow of the stages: point cloud image data, encoding, calculating latent space, decoding, and planning
Figure 2. The MπNets workflow, which is an end-to-end neural policy for motion planning in unknown environments

The input point cloud is automatically labeled with three classes: the robot, the obstacles, and the specified target pose of the robot. The target pose is represented as a point cloud of the robot’s gripper.
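
As a rough sketch (not code from the MπNets release), the loop below shows how such a policy could be driven in a closed loop; policy, camera, and robot are hypothetical interfaces standing in for the trained network, the depth camera, and the robot's low-level controller.

import time

def run_policy_loop(policy, camera, robot, target_pose, rate_hz=9):
    # Hypothetical closed-loop driver: the three inputs described above go into the
    # policy, and the predicted joint positions are executed by the low-level controller.
    period = 1.0 / rate_hz
    while not robot.reached(target_pose):
        point_cloud = camera.read()                 # single-view depth point cloud
        joint_state = robot.joint_positions()       # current robot configuration
        next_joints = policy(point_cloud, joint_state, target_pose)
        robot.command_joint_positions(next_joints)  # execute on the robot
        time.sleep(period)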

Sim2Real Transfer to the Real World

MπNets generalizes well to a real robot system with a single, static depth camera. The policy transfers directly to the real world without needing real data, due to the low sim-to-real domain gap of point cloud observations compared to RGB images.

As shown in Figure 3, it reaches into tightly confined spaces without colliding with obstacles such as the plates and mug, scenarios commonplace in human spaces. With its end-to-end policy architecture, MπNets can also be executed in a closed-loop real robot system running at 9 Hz, and react immediately to dynamic scenes, as shown in Figure 3.

Fast, global, and avoids local optima

MπNets solution time is much shorter than that of a state-of-the-art sampling-based planner. It is 46% more likely to find a solution than MPNets, despite not requiring a collision checker. MπNets is also less likely to get stuck in challenging situations, such as tightly confined spaces, because it learns from long-term, global planning information.

Figure 4. Local controllers (right) often get stuck in local optima. MπNets (left) avoids local optima because it is trained with trajectories that carry global information

In Figure 4, both STORM and geometric fabrics get stuck in the first drawer because they cannot figure out how to retract and move to the second drawer. Neither reaches the final target pose.

Getting started with MπNets

When trained on a large dataset of simulated scenes, MπNets is faster than traditional planners, more successful than other local controllers, and transfers well to a real robot system even in dynamic and partially observed scenes.

To help you get started with MπNets, our paper is published on arXiv and the source code is available on the Motion Policy Networks GitHub. You can also load our pre-trained weights and play around using our ROS RViz user interface.

Learn more about neural motion planning, in the context of robot benchmarking, at the Benchmarking workshop during CoRL on December 15.

Explainer: What Is an Autonomous Truck?

Autonomous trucks are commercial vehicles that use AI to automate everything from shipping yard operations to long-haul deliveries.

Just Released: CUDA Toolkit 12.0

CUDA Toolkit 12.0 supports NVIDIA Hopper architecture and many new features to help developers maximize performance on NVIDIA GPU-based products.

Predict Protein Structures and Properties with Biomolecular Large Language Models

The NVIDIA BioNeMo service is now available for early access. At GTC Fall 2022, NVIDIA unveiled BioNeMo, a domain-specific framework and service for training and serving biomolecular large language models (LLMs) for chemistry and biology at supercomputing scale across billions of parameters. 

The BioNeMo service is domain-optimized for chemical, proteomic, and genomic applications, designed to support molecular data represented in the SMILES notation for chemical structures, and FASTA for amino acid and nucleic acid sequences for proteins, DNA, and RNA.

With the BioNeMo service, scientists and researchers now have access to pretrained biomolecular LLMs through a cloud API, enabling them to predict protein structures, develop workflows, and fit downstream task models from LLM embeddings.

The BioNeMo service is a turnkey cloud solution for AI drug discovery pipelines that can be used in your browser or through API endpoints. The service API endpoints offer scientists the ability to get started quickly with AI drug discovery workflows based on large language model architectures. It also provides a UI Playground to easily and quickly try these models through an API, which can be integrated into your applications.

The BioNeMo service contains the following features:

  • Fully managed, browser-based service with API endpoints for protein LLMs
  • Accelerated OpenFold model for fast 3D protein structure predictions
  • ESM-1nv LLM for protein embeddings for downstream tasks
  • Interactive inference and visualization of protein structures through a graphic user interface (GUI)
  • Programmatic access to pretrained models through the API

About the models

ESM-1nv, based on Meta AI’s state-of-the-art ESM-1b, is a large language model for the evolutionary-scale modeling of proteins. It is based on the BERT architecture and trained on millions of protein sequences with a masked language modeling objective. ESM-1nv learns the patterns and dependencies between amino acids that ultimately give rise to protein structure and function.

Embeddings from ESM-1nv can be used to fit downstream task models for protein properties of interest such as subcellular location, thermostability, and protein structure. This is accomplished by training a typically much smaller model with a supervised learning objective to infer a property from ESM-1nv embeddings of protein sequences. Using embeddings from ESM-1nv typically results in far superior accuracy in the final model.
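
As a rough illustration of this pattern, the sketch below fits a small scikit-learn classifier on precomputed per-sequence embeddings; the embedding array, its dimensionality, and the labels are placeholders rather than actual BioNeMo output.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholders: one embedding vector per protein sequence and a binary property label.
embeddings = np.random.rand(1000, 768)
labels = np.random.randint(0, 2, 1000)

X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.2)

# A deliberately small downstream model trained with a supervised objective.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))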

OpenFold is a faithful reproduction of DeepMind’s AlphaFold-2 model for 3D protein structure prediction from a primary amino acid sequence. This long-standing grand challenge in structural biology reached a significant milestone at CASP14, where AlphaFold-2 achieved nearly experimental accuracy for predicted structures. While AlphaFold was developed for a JAX workflow, OpenFold bases its code on PyTorch. 

OpenFold in BioNeMo is also trainable, meaning variants can be created for specialized research. OpenFold achieves accuracy similar to the original model, predicting backbone structures with a median accuracy of 0.96 Å RMSD95, and is up to 6x faster due to changes made in the MSA generation step. This means that drug discovery researchers get 3D protein structure predictions very quickly.

Get early access to the BioNeMo service

Apply for early access to the BioNeMo service. You’ll be asked to join the NVIDIA Developer Program and fill out a short questionnaire to gain early access.

Introducing NVIDIA Riva: A GPU-Accelerated SDK for Developing Speech AI Applications

This post was updated from November 2021. Sign up for the latest Speech AI news from NVIDIA.

Speech AI is used in a variety of applications, including agent assist in contact centers to empower human agents, voice interfaces for intelligent virtual assistants (IVAs), and live captioning in video conferencing. To support these features, speech AI technology includes automatic speech recognition (ASR) and text-to-speech (TTS). The ASR pipeline takes raw audio and converts it to text, and the TTS pipeline takes the text and converts it to audio.

Developing and running real-time speech AI services is complex and difficult. Building speech AI applications requires hundreds of thousands of hours of audio data, tools to build and customize models based on your specific use case, and scalable deployment support.

It also means running in real time, with latency well under 300 ms, to interact naturally with users. NVIDIA Riva streamlines the end-to-end process of developing speech AI services and provides real-time performance for human-like interactions.

NVIDIA Riva SDK

NVIDIA Riva is a GPU-accelerated SDK for building and deploying fully customizable speech AI applications that deliver accurate results in real time. These applications can be deployed on premises, in the cloud, on embedded platforms, and at the edge. NVIDIA Riva is designed to help you access speech AI functionality easily and quickly. With a few commands, you can access the high-performance services through API operations and try demos.

Diagram shows workflow starting from pretrained models in NGC, TAO toolkit for retraining the models and Riva for optimized speech AI skills to generate high-performance inference.
Figure 1. NVIDIA Riva workflow for building speech AI applications

The NVIDIA Riva SDK includes pretrained speech AI models that can be fine-tuned on a custom dataset, and optimized end-to-end skills for automatic speech recognition and speech synthesis.

Using Riva, you can fully customize state-of-the-art models on your data to achieve a deeper understanding of their specific contexts. Optimize for inference to offer services that run in real time (less than 150 ms).

Task-specific AI services and gRPC endpoints provide out-of-the-box, high-performance ASR and TTS. These AI services are trained with thousands of hours of public and internal datasets to reach high accuracy. You can start using the pre-trained models or fine-tune them with your own dataset to further improve model performance.

Riva uses NVIDIA Triton Inference Server to serve multiple models for efficient and robust resource allocation and to achieve high performance in terms of high throughput, low latency, and high accuracy.

Overview of NVIDIA Riva skills

Riva provides highly optimized automatic speech recognition and speech synthesis services for use cases like real-time transcription and intelligent virtual assistants. The automatic speech recognition skill is available in English, Spanish, Mandarin, Hindi, Korean, Portuguese, French, German, and Russian.

It is trained and evaluated on a wide variety of real-world, domain-specific datasets. With telecommunications, podcasting, and healthcare vocabulary, it delivers world-class production accuracy. To learn more, see Exploring Unique Applications of Automatic Speech Recognition Technology.

The Riva text-to-speech or speech synthesis skill generates human-like speech. It uses non-autoregressive models to deliver 12x higher performance on NVIDIA A100 GPUs compared to Tacotron 2 and WaveGlow models on NVIDIA V100 GPUs. Furthermore, with TTS you can create a natural custom voice for every brand and virtual assistant with only 30 minutes of voice data.

Diagram showing the available models, corresponding sample rates, and modes for automatic speech recognition and text-to-speech services
Figure 2. NVIDIA Riva speech AI skills capabilities

To take full advantage of the computational power of the GPUs, Riva skills use NVIDIA Triton Inference Server to serve neural networks and ensemble pipelines, which run efficiently with NVIDIA TensorRT.

Riva services are exposed through API operations accessible by gRPC endpoints that hide all the complexity. Figure 3 shows the server side of the system. The gRPC API operations are exposed by the API server running in a Docker container. They are responsible for processing all the incoming and outgoing speech data.

Diagram shows the Riva client applications such as desktop, mobile, and laptop interactions with Riva speech recognition and speech synthesis pipelines.
Figure 3. NVIDIA Riva service pipelines

The API server sends inference requests to NVIDIA Triton and receives the results.

NVIDIA Triton is the backend server that simultaneously processes multiple inference requests on multiple GPUs for many neural networks or ensemble pipelines.

It is crucial for speech AI applications to keep latency below a given threshold. This latency requirement translates into executing inference requests as soon as they arrive. Making the best use of the GPUs to increase performance, however, requires increasing the batch size by delaying inference execution until more requests are received, forming a bigger batch.

NVIDIA Triton is also responsible for switching the context of stateful networks between one request and the next.

Riva can be installed directly on bare metal through simple scripts that download the appropriate models and containers from NGC, or it can be deployed on Kubernetes through a Helm chart, which is also provided.

Querying NVIDIA Riva services

Here’s a quick look at how you can interact with Riva. A Python interface makes communication with a Riva server easier on the client side through simple Python API operations. For example, here’s how a request for an existing TTS Riva service is created in four steps.

First, import the Riva API and other useful or required libraries:

import numpy as np
import IPython.display as ipd
import riva.client

Next, create a gRPC channel to the Riva endpoint:

auth = riva.client.Auth(uri='localhost:50051')
riva_tts = riva.client.SpeechSynthesisService(auth)

Then, configure the TTS API parameters:

sample_rate_hz = 44100
req = { 
        "language_code"  : "en-US",
        "encoding"       : riva.client.AudioEncoding.LINEAR_PCM,
        "sample_rate_hz" : sample_rate_hz,
        "voice_name"     : "English-US.Female-1"
}

Finally, create a TTS request:

req["text"] = "Is it recognize speech or wreck a nice beach?"
resp = riva_tts.synthesize(**req)
audio_samples = np.frombuffer(resp.audio, dtype=np.int16)
ipd.Audio(audio_samples, rate=sample_rate_hz)

Customizing a model with your data

While Riva’s default models are powerful, engineers might need to customize them in developing speech AI applications. Specific contexts where customizing ASR pipeline components can further optimize the transcription of audio data include: 

  • Different accents, dialects, or even languages from those on which the models were initially trained
  • Domain-specific vocabulary, such as academic, scientific, or business jargon
  • Boosting or de-emphasizing certain words, for example, to account for one word in a set of homophones making more sense in the current context
  • Noisy environments

You might also wish to customize a TTS model, so the synthesized voice assumes a particular pitch or accent, or possibly mimics one’s own voice. 

With NVIDIA NeMo, you can fine-tune ASR, TTS, and NLP models on domain- or application-specific datasets (Figure 4), or even train the models from scratch.

Diagram shows the workflow of TAO toolkit starting from NGC pretrained model to adding your custom data and deploying it as a Riva skill.
Figure 4. NVIDIA NeMo model fine-tuning pipeline

As an example of such customization, to further improve the legibility and accuracy of ASR transcripts, you can add a custom punctuation and capitalization model to an ASR system that otherwise generates text without those features.

Starting from a pretrained BERT model, the first step is to prepare the dataset. For every word in the training dataset, the goal is to predict the following (a toy sketch of building such targets appears after this list):

  • The punctuation mark that should follow the word
  • Whether the word should be capitalized
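
The toy sketch below shows one way such per-word targets could be derived from plain text; the label tags are illustrative and do not reflect NeMo’s actual data format.

def make_targets(sentence):
    # For each word: the punctuation mark that follows it ("O" if none) and a
    # capitalization flag ("U" if the word should be capitalized, "O" otherwise).
    targets = []
    for token in sentence.split():
        punct = token[-1] if token[-1] in ".,?" else "O"
        word = token.rstrip(".,?")
        cap = "U" if word[:1].isupper() else "O"
        targets.append((word.lower(), punct, cap))
    return targets

print(make_targets("Hello, how are you doing today?"))
# [('hello', ',', 'U'), ('how', 'O', 'O'), ..., ('today', '?', 'O')]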

After the dataset is ready, the next step is training by running a previously provided script. When the training is completed and the desired final accuracy is reached, create the model repository for NVIDIA Triton by using an included script.

The NVIDIA Riva Speech Skills documentation contains ASR customization best practices and more details about how to train or fine-tune other models. This post shows only one of the many customization possibilities using NVIDIA NeMo.

Deploying a model in NVIDIA Riva

Riva is designed for speech AI at scale. To help you serve models efficiently and robustly across different servers, NVIDIA provides push-button model deployment using Helm charts (Figure 5).

Diagram shows the workflow for deploying TAO model as a Riva skill and highlights the TensorRT optimizations and Triton Inference Server under the hood.
Figure 5. Models can be deployed in NVIDIA Riva by modifying the available Helm chart

The Helm chart configuration, available from the NGC catalog, can be modified for custom use cases. You can change settings related to which models to deploy, where to store them, and how to expose the services.

Conclusion

NVIDIA Riva is available as a set of containers and pretrained models, free of charge, from NVIDIA NGC to members of the NVIDIA Developer Program. With these resources, you can develop applications with real-time transcription, virtual assistants, or custom voice synthesis. 

You can also get support for large-scale deployments of Riva with NVIDIA AI Enterprise Support. You can try NVIDIA Riva with a free trial on NVIDIA LaunchPad or access ASR and TTS tutorials.

If you are ready to deploy Riva speech AI skills, check out Riva Getting Started to deliver an interactive voice experience for any application.

Building an End-to-End Retail Analytics Application with NVIDIA DeepStream and NVIDIA TAO Toolkit

Retailers today have access to an abundance of video data provided by cameras and sensors installed in stores. With computer vision AI, retailers and software partners can develop applications faster while also delivering greater accuracy. These applications can help retailers:

  • Understand in-store customer behavior and buying preference
  • Reduce shrinkage 
  • Notify associates of low or depleted inventory 
  • Improve merchandising
  • Optimize operations

Building and deploying such highly efficient computer vision AI applications at scale poses many challenges. Traditional techniques are time-consuming, requiring intensive development efforts and AI expertise to map all the complex architectures and options. These can include building customized AI models, deploying high-performance video decoding and AI inference pipelines, and generating an insightful analytics dashboard. 

NVIDIA’s suite of SDKs helps to simplify this workflow. You can create high-quality video analytics with minimal configuration using the NVIDIA DeepStream SDK and train models easily with the NVIDIA TAO Toolkit.

This post provides a tutorial on how to build a sample application that can perform real-time intelligent video analytics (IVA) in the retail domain using NVIDIA DeepStream SDK and NVIDIA TAO Toolkit. 

To create an end-to-end retail vision AI application, follow the steps below:

  1. Use NVIDIA pretrained models for people detection and tracking.
  2. Customize the computer vision models for the specific retail use case using the NVIDIA TAO Toolkit.
  3. Develop an NVIDIA DeepStream pipeline for video analysis and streaming inference outputs using Apache Kafka. Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale. 
  4. Set up a Kafka Consumer to store inference data into a database.
  5. Develop a Django web application to analyze store performance using a variety of metrics.

You can follow along with implementing this sample application using the code on the NVIDIA-AI-IOT/deepstream-retail-analytics GitHub repo.

The end product of this sample is a custom dashboard, as shown in Figure 1. The dashboard provides analytical insights such as trends of the store traffic, counts of customers with shopping baskets, aisle occupancy, and more.

A dashboard with a line graph to track the number of visitors in a day, a pie chart with the ratio of customers with basket and without baskets, a bar graph with the number of customers in each aisle, and a number to depict the number of visitors in the last hour.
Figure 1. Frontend dashboard to visualize inference data

Introduction to the application architecture

Before diving into the detailed workflow, this section provides an overview of the tools that will be used to build this project. 

NVIDIA DeepStream SDK

NVIDIA DeepStream SDK is NVIDIA’s streaming analytics toolkit that enables GPU-accelerated video analytics with support for high-performance AI inference across a variety of hardware platforms. DeepStream includes several reference applications to jumpstart development. These reference apps can be easily modified to suit new use cases and are available inside the DeepStream Docker images and at deepstream_reference_apps on GitHub. 

This retail vision AI application is built on top of two of the reference applications, deepstream-test4 and deepstream-test5. Figure 2 shows the architecture of a typical DeepStream application.

Graphic showing a DeepStream reference architecture workflow, including Video Decoding, Stream Mux, Primary Detector, Object Tracker, Secondary Classifiers, Tiler, On-Screen Display, and Renderer.
Figure 2. NVIDIA DeepStream reference application architecture

NVIDIA TAO Toolkit and pretrained models 

NVIDIA TAO (Train, Adapt, and Optimize) Toolkit enables fine-tuning a variety of AI pretrained models to new domains. The TAO Toolkit is used in concert with the DeepStream application to perform analyses for unique use cases. 

In this project, the model is used to detect whether or not a customer is carrying a shopping basket. DeepStream enables a seamless integration of TAO Toolkit with its existing pipeline without the need for heavy configuration. 

Getting started with TAO Toolkit is easy. TAO Toolkit provides complete Jupyter notebooks for model customization for 100+ combinations of CV architectures and backbones. TAO Toolkit also provides a library of task-specific pretrained models for common retail tasks like people detection, pose estimation, action recognition, and more. To get started, see TAO Toolkit Quick Start.

Retail vision AI application workflow

The retail vision AI application architecture (Figure 3) consists of the following stages:

A DeepStream Pipeline with the following configuration:

  • Primary Detector: Configure PeopleNet pretrained model from NGC to detect ‘Persons’
  • Secondary Detector: Custom classification model trained using the TAO Toolkit for shopping basket detection
  • Object Tracker: NvDCF tracker (in the accuracy configuration) to track the movement in the video stream
  • Message Converter: Message converter to generate custom Kafka streaming payload from inference data
  • Message Broker: Message broker to relay inference data to a Kafka receiver

kSQL Time Series Database: Used to store inference output streams from an edge inference server

Django Web Application: Application to analyze data stored in the kSQL database to generate insights regarding store performance, and serve these metrics as RESTful APIs and a web dashboard

The architectural diagram for retail vision AI, including the workflow from taking the input from in-store cameras as the video feed and using the DeepStream pipeline for the video analytics. This is followed by the Inference in a kSQL Time Series Database (TSDB) and then transferred to the Django App where the Dashboard is created.
Figure 3. Retail vision AI application architecture

Additionally, this app is built for x86 platforms with an NVIDIA GPU. However, it can be easily deployed on NVIDIA Jetson embedded platforms, such as the NVIDIA Jetson AGX Orin. 

The next section walks you through the steps involved in building the application. 

Step 1: Building a custom NVIDIA DeepStream pipeline 

To build the retail data analytics pipeline, start with the NVIDIA DeepStream reference applications deepstream-test4 and deepstream-test5. Code for the pipeline and a detailed description of the process is available in the deepstream-retail-analytics GitHub repo. We recommend using this post as a walkthrough to the code in the repository.

The deepstream-test4 application is a reference DeepStream pipeline that demonstrates adding custom-detected objects as NVDS_EVENT_MSG_META user metadata and attaching it to the buffer to be published. deepstream-test5 is an end-to-end app that demonstrates how to use the nvmsgconv and nvmsgbroker plugins in multistream pipelines, create NVDS_META_EVENT_MSG type of meta, and stream inference outputs using Kafka and other sink types.

This pipeline also integrates a secondary classifier in addition to the primary object detector, which can be useful for detecting shopper attributes once a person is detected in the retail video analytics application. The test4 application is used to modify the nvmsgconv plugin to include retail analytics attributes. Then, refer to the test5 application for secondary classifiers and streaming data from the pipeline using the nvmsgbroker over a Kafka topic. 

Since the first step of the workflow is to identify persons and objects from the video feed, start by using the deepstream-test4 application for primary object detection. This object detection uses the PeopleNet pretrained model, which, by default, takes video input and detects people or their belongings.

For this use case, configure the model to capture only information about people. This can be accomplished easily by storing information only about the subset of frames that contain a person in the dataset.

With the primary person object detection done, use deepstream-test5 to add a secondary object classification model. This object classification shows whether or not a detected person is carrying a basket.

Step 2: Building a custom model for shopping basket detection with NVIDIA TAO Toolkit

This section shows how to use the TAO Toolkit to fine-tune an object classification model and find out whether a person detected in the PeopleNet model is carrying a shopping basket (Figure 4). 

The sample image classifier files where you can see the customers in the ‘hasBasket’ category are clearly carrying shopping baskets. The second category of ‘noBasket’ shows customers without baskets in their hands.
Figure 4. Shoppers classified to have baskets (left) and not to have baskets (right)

To get started, collect and annotate training data from a retail environment for performing object classification. Use the Computer Vision Annotation Tool (CVAT) to annotate persons observed with the following labels:

  • hasBasket: Person is carrying a basket
  • noBasket: Person is not carrying a basket

This annotation is stored as a KITTI-formatted dataset, where each line corresponds to an object in a frame. To make the data compatible for object classification, use the sample kitti_to_classification Python file on GitHub to crop the dataset. You can then perform object classification on it.

Next, use the TAO Toolkit to fine-tune a Resnet34 image classification model to perform classification on the training data. Learn more about the fine-tuning process at deepstream-retail-analytics/tree/main/TAO on GitHub.

After the custom model is created, run inference to validate that the model works as expected.

Step 3: Integrating the Kafka message broker to create a custom frontend dashboard

With the primary object detection and secondary object classification models ready, the DeepStream application needs to relay this inference data to an analytics web server. Use the deepstream-test5 reference application as a template to stream data using Apache Kafka. 

Here, a Kafka adapter that is built into DeepStream is used to publish messages to the Kafka message broker. Once the web server receives the Kafka streams from each camera inside a store, these inference output data are stored in a kSQL time-series database.
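
As a rough sketch of the consumer side, the snippet below uses the kafka-python package to read messages and hand each record to a database writer; the topic name, broker address, and payload fields are assumptions that depend on your configuration rather than values from the repository.

import json
from kafka import KafkaConsumer  # kafka-python package

consumer = KafkaConsumer(
    "retail-analytics",                      # assumed topic name
    bootstrap_servers="localhost:9092",      # assumed broker address
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Insert the inference record into the time-series database here, for example
    # through a REST endpoint or a database client of your choice.
    print(event.get("object", {}))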

DeepStream has a default Kafka messaging shared library object that enables users to perform primary object detection and transmit the data seamlessly. This project further modifies this library to include information about the secondary classifier as well. This helps to stream data about shopping basket use inside the store.

The current DeepStream library includes NvDsPersonObject, which is used to define the persons detected in the primary detector. To ensure that the basket detection is mapped to each person uniquely, modify this class to include a hasBasket attribute in addition to the previously present attributes. Find more details at deepstream-retail-analytics/tree/main/nvmsgconv on GitHub.

After modifying the NvDsPersonObject to include basket detection, use the pipeline shown in Figure 5 to ensure the functionality for basket detection works appropriately.

The application pipeline, which walks through all the steps in the workflow starting from the video source and h264-parser, followed by nvh264-decoder, nvstreammux, the nvinfer primary detector, the nvtracker, nvinfer for secondary detection, nvvidconv, nvosd, and then the tee, which branches to msgconv and msgbroker on one path and nveglglessink on the other.
Figure 5. Retail vision AI application pipeline

As shown in the application pipeline in Figure 5, object detection and classification are performed with the help of pgie and sgie, the primary and secondary inference engines of the nvinfer plugin. The nvtracker plugin tracks the detected objects and passes the data to the nvosd plugin, which is responsible for drawing boxes around the objects detected in the previous stages.

Next, this inference data needs to be converted into a message payload based on a specific schema that can later be consumed by the Kafka message broker to store and analyze the results. Use the NvDsPersonObject (modified previously) for the updated payload in the eventmsg_payload file.

Finally, pass the message payload with the custom schema through the Kafka protocol adapter so that the DeepStream application publishes messages to the Kafka message broker at the specified broker address and topic. At this point, the final message payload is ready.

Now that the DeepStream pipeline is ready, build a web application to store the streaming inference data into a kSQL database. This web app, built using the Django framework, analyzes the inference data to generate metrics regarding store performance discussed earlier. These metrics are available through a RESTful API documented at deepstream-retail-analytics/tree/main/ds-retail-iva-frontend on GitHub. 
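
The sketch below shows the general shape of one such metrics endpoint in Django; the DetectionEvent model, its fields, and the URL wiring are hypothetical and are not taken from the repository.

from datetime import timedelta

from django.http import JsonResponse
from django.utils import timezone

from .models import DetectionEvent  # assumed ORM model over the stored inference records

def visitors_last_hour(request):
    # Count unique tracked persons seen in the last hour (field names are assumptions).
    since = timezone.now() - timedelta(hours=1)
    count = (
        DetectionEvent.objects
        .filter(timestamp__gte=since)
        .values("object_id").distinct()
        .count()
    )
    return JsonResponse({"visitors_last_hour": count})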

To demonstrate the API functionality, we built a frontend web dashboard to visualize the results of the analytics server. This dashboard acts as a template for a storewide analytics system. 

Results

The previous steps demonstrated how to easily develop an end-to-end retail video analytics pipeline using NVIDIA DeepStream and NVIDIA TAO Toolkit. This pipeline helps retail establishments capitalize on pre-existing video feeds and find insightful information they can use to improve profits. 

The workflow culminates in an easy-to-use web dashboard to analyze invaluable storewide data in real time. As shown in Figure 1, the dashboard presents the following information:

  • Number of store visitors throughout the day
  • Information about the proportion of customers shopping with and without baskets
  • Visitor counts per store aisle
  • Store occupancy heatmaps
  • Customer journey visualization

These attributes can be easily amended to include information about specific use cases that are more relevant to each individual store. Stores can use this information to schedule staffing and improve the store layout to maximize efficiency. 

For example, Figure 6 shows the overall distribution of customers in the store throughout the day (left) and the ratio of customers with and without baskets (right). While this sample application supports only a single camera stream, it can be easily modified to support multiple cameras. Scaling this application to multiple stores is equally easy to do.

A figure displaying the number of customers in the store over time as a bar graph with many rows and different distribution throughout the x-axis (left). On the right side, the  pie chart depicts the ratio of customers who have ‘noBasket’ with respect to the customers who fall into the ‘hasBasket’ category. There are 82.6% customers who have ‘noBasket’ vs 17.4% who are in the ‘hasBasket’ category.
Figure 6. Inference data for the number of customers in the store over time (left) and the ratio of customers with baskets to those without baskets (right)

The application uniquely detects person 11 carrying a shopping basket by setting the hasBasket attribute, whereas the other customers who are not carrying baskets are marked with noBasket. Additionally, person 1, who is carrying a cardboard box, is not identified as having a basket. The model is thus robust against false positives, ensuring that it was successfully trained to pick up only relevant information for this use case.

Summary 

This post demonstrated an end-to-end process to develop a vision AI application to perform retail analytics using NVIDIA TAO Toolkit and NVIDIA DeepStream SDK. Retail establishments can use the flux of video data they already have and build state-of-the-art video analytics applications. These apps can be deployed in real time and require minimal configuration to get started. In addition, the high customizability of this application ensures that it can be applied to any use case a store might benefit from.

Get started using the sample deepstream-retail-analytics application on GitHub.

Formation of Robust Bound States of Interacting Photons

When quantum computers were first proposed, they were hoped to be a way to better understand the quantum world. With a so-called “quantum simulator,” one could engineer a quantum computer to investigate how various quantum phenomena arise, including those that are intractable to simulate with a classical computer.

But making a useful quantum simulator has been a challenge. Until now, quantum simulations with superconducting qubits have predominantly been used to verify pre-existing theoretical predictions and have rarely explored or discovered new phenomena. Only a few experiments with trapped ions or cold atoms have revealed new insights. Superconducting qubits, even though they are one of the main candidates for universal quantum computing and have demonstrated computational capabilities beyond classical reach, have so far not delivered on their potential for discovery.

In “Formation of Robust Bound States of Interacting Photons”, published in Nature, we describe a previously unpredicted phenomenon first discovered through experimental investigation. First, we present the experimental confirmation of the theoretical prediction of the existence of a composite particle of interacting photons, or a bound state, using the Google Sycamore quantum processor. Second, while studying this system, we discovered that even though one might guess the bound states to be fragile, they remain robust to perturbations that we expected to have otherwise destroyed them. Not only does this open the possibility of designing systems that leverage interactions between photons, it also marks a step forward in the use of superconducting quantum processors to make new scientific discoveries by simulating non-equilibrium quantum dynamics.

Overview

Photons, or quanta of electromagnetic radiation like light and microwaves, typically don’t interact. For example, two intersecting flashlight beams will pass through one another undisturbed. In many applications, like telecommunications, the weak interaction of photons is a valuable feature. For other applications, such as computers based on light, the lack of interactions between photons is a shortcoming.

In a quantum processor, the qubits host microwave photons, which can be made to interact through two-qubit operations. This allows us to simulate the XXZ model, which describes the behavior of interacting photons. Importantly, this is one of the few examples of integrable models, i.e., one with a high degree of symmetry, which greatly reduces its complexity. When we implement the XXZ model on the Sycamore processor, we observe something striking: the interactions force the photons into bundles known as bound states.

Using this well-understood model as a starting point, we then push the study into a less-understood regime. We break the high level of symmetries displayed in the XXZ model by adding extra sites that can be occupied by the photons, making the system no longer integrable. While this nonintegrable regime is expected to exhibit chaotic behavior where bound states dissolve into their usual, solitary selves, we instead find that they survive!

Bound Photons

To engineer a system that can support the formation of bound states, we study a ring of superconducting qubits that host microwave photons. If a photon is present, the value of the qubit is “1”, and if not, the value is “0”. Through the so-called “fSim” quantum gate, we connect neighboring sites, allowing the photons to hop around and interact with other photons on the nearest-neighboring sites.

Superconducting qubits can be occupied or unoccupied with microwave photons. The “fSim” gate operation allows photons to hop and interact with each other. The corresponding unitary evolution has a hopping term between two sites (orange) and an interaction term corresponding to an added phase when two adjacent sites are occupied by a photon.
We implement the fSim gate between neighboring qubits (left) to effectively form a ring of 24 interconnected qubits on which we simulate the behavior of the interacting photons (right).
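
For reference, the NumPy sketch below builds the fSim(θ, φ) two-qubit unitary as it is commonly defined (for example, in Cirq): the θ block is the hopping term that exchanges a photon between neighboring sites, and the φ term is the interaction phase applied only when both sites are occupied. The specific angles used in the experiment are not reproduced here.

import numpy as np

def fsim(theta, phi):
    # Hopping (theta) block on the {|01>, |10>} subspace; interaction phase (phi) on |11>.
    return np.array([
        [1, 0, 0, 0],
        [0, np.cos(theta), -1j * np.sin(theta), 0],
        [0, -1j * np.sin(theta), np.cos(theta), 0],
        [0, 0, 0, np.exp(-1j * phi)],
    ])

U = fsim(np.pi / 6, np.pi / 3)
print(np.allclose(U @ U.conj().T, np.eye(4)))  # True: the gate is unitary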

The interactions between the photons affect their so-called “phase.” This phase keeps track of the oscillation of the photon’s wavefunction. When the photons are non-interacting, their phase accumulation is rather uninteresting. Like a well-rehearsed choir, they’re all in sync with one another. In this case, a photon that was initially next to another photon can hop away from its neighbor without getting out of sync. Just as every person in the choir contributes to the song, every possible path the photon can take contributes to the photon’s overall wavefunction. A group of photons initially clustered on neighboring sites will evolve into a superposition of all possible paths each photon might have taken.

When photons interact with their neighbors, this is no longer the case. If one photon hops away from its neighbor, its rate of phase accumulation changes, becoming out of sync with its neighbors. All paths in which the photons split apart overlap, leading to destructive interference. It would be like each choir member singing at their own pace — the song itself gets washed out, becoming impossible to discern through the din of the individual singers. Among all the possible configuration paths, the only possible scenario that survives is the configuration in which all photons remain clustered together in a bound state. This is why interaction can enhance and lead to the formation of a bound state: by suppressing all other possibilities in which photons are not bound together.

Left: Evolution of interacting photons forming a bound state. Right: Time goes from left to right, each path represents one of the paths that can break the 2-photon bonded state. Due to interactions, these paths interfere destructively, preventing the photons from splitting apart.
Occupation probability versus gate cycle, or discrete time step, for n-photon bound states. We prepare bound states of varying sizes and watch them evolve. We observe that the majority of the photons (darker colors) remain bound together.

In our processor, we start by putting two to five photons on adjacent sites (i.e., initializing two to five adjacent qubits in “1”, and the remaining qubits in “0”), and then study how they propagate. First, we notice that in the theoretically predicted parameter regime, they remain stuck together. Next, we find that the larger bound states move more slowly around the ring, consistent with the fact that they are “heavier”. This can be seen in the plot above where the lattice sites closest to Site 12, the initial position of the photons, remain darker than the others with increasing number of photons (nph) in the bound state, indicating that with more photons bound together there is less propagation around the ring.

Bound States Behave Like Single Composite Particles

To more rigorously show that the bound states indeed behave as single particles with well-defined physical properties, we devise a method to measure how the energy of the particles changes with momentum, i.e., the energy-momentum dispersion relation.

To measure the energy of the bound state, we use the fact that the energy difference between two states determines how fast their relative phase grows with time. Hence, we prepare the bound state in a superposition with the state that has no photons, and measure their phase difference as a function of time and space. Then, to convert the result of this measurement to a dispersion relation, we utilize a Fourier transform, which translates position and time into momentum and energy, respectively. We’re left with the familiar energy-momentum relationship of excitations in a lattice.

Spectroscopy of bound states. We compare the phase accumulation of an n-photon bound state with that of the vacuum (no photons) as a function of lattice site and time. A 2D Fourier transform yields the dispersion relation of the bound-state quasiparticle.
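
A toy NumPy sketch of this Fourier-transform step (illustrative only, not the experiment’s analysis code) might look like the following, where phase[site, step] stands in for the measured relative phase at each lattice site and time step.

import numpy as np

n_sites, n_steps = 24, 64
phase = np.random.rand(n_sites, n_steps)           # placeholder for measured phase data

signal = np.exp(1j * phase)                        # complex signal carrying the phase
spectrum = np.fft.fftshift(np.fft.fft2(signal))    # (site, time) -> (momentum, energy)
power = np.abs(spectrum) ** 2

momenta = np.fft.fftshift(np.fft.fftfreq(n_sites)) * 2 * np.pi   # lattice momentum axis
energies = np.fft.fftshift(np.fft.fftfreq(n_steps)) * 2 * np.pi  # energy (frequency) axis
# The ridge of maximum power traces out the bound state's dispersion relation.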

Breaking Integrability

The above system is “integrable,” meaning that it has a sufficient number of conserved quantities that its dynamics are constrained to a small part of the available computational space. In such integrable regimes, the appearance of bound states is not that surprising. In fact, bound states in similar systems were predicted in 2012, then observed in 2013. However, these bound states are fragile and their existence is usually thought to derive from integrability. For more complex systems, there is less symmetry and integrability is quickly lost. Our initial idea was to probe how these bound states disappear as we break integrability to better understand their rigidity.

To break integrability, we modify which qubits are connected with fSim gates. We add qubits so that at alternating sites, in addition to hopping to each of its two nearest-neighboring sites, a photon can also hop to a third site oriented radially outward from the ring.

While a bound state is constrained to a very small part of phase space, we expected that the chaotic behavior associated with integrability breaking would allow the system to explore the phase space more freely. This would cause the bound states to break apart. We find that this is not the case. Even when the integrability breaking is so strong that the photons are equally likely to hop to the third site as they are to hop to either of the two adjacent ring sites, the bound state remains intact, up to the decoherence effect that makes them slowly decay (see paper for details).

Top: New geometry to break integrability. Alternating sites are connected to a third site oriented radially outward. This increases the complexity of the system, and allows for potentially chaotic behavior. Bottom: Despite this added complexity pushing the system beyond integrability, we find that the 3-photon bound state remains stable even for a relatively large perturbation. The probability of remaining bound decreases slowly due to decoherence (see paper).

Conclusion

We don’t yet have a satisfying explanation for this unexpected resilience. We speculate that it may be related to a phenomenon called prethermalization, where incommensurate energy scales in the system can prevent a system from reaching thermal equilibrium as quickly as it otherwise would. We believe further investigations will hopefully lead to new insights into many-body quantum physics, including the interplay of prethermalization and integrability.

Acknowledgements

We would like to thank our Quantum Science Communicator Katherine McCormick for her help writing this blog post.

Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions

Dynamic programming (DP) is a well-known algorithmic technique and a mathematical optimization that has been used for several decades to solve groundbreaking problems in computer science.

An example DP use case is route optimization with hundreds or thousands of constraints or weights using the Floyd-Warshall all-pair shortest paths algorithm. Another use case is the alignment of reads for genome sequence alignment using the Needleman-Wunsch or Smith-Waterman algorithms.

NVIDIA Hopper GPU Dynamic Programming X (DPX) instructions accelerate a large class of dynamic programming algorithms used in areas such as genomics, proteomics, and robot path planning. Accelerating these dynamic programming algorithms can help researchers, scientists, and practitioners glean insights much faster about the underlying DNA or protein structures and several other areas.

What is dynamic programming?

DP techniques initially involve expressing the algorithm recursively, where the larger problem is broken down into subproblems that are easier to solve.

A common computational optimization used in DP is to save the results of the subproblems and use them in subsequent steps of the problem, instead of recomputing the solution each time. This step is called memoization. Memoization avoids the recursion steps and instead enables an iterative, look-up-table-based formulation. The previously computed results are stored in the look-up table.
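
As a minimal illustration of this idea, using Fibonacci numbers purely as a stand-in problem, the sketch below contrasts a memoized recursion with the equivalent iterative look-up-table formulation:

from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memoized(n):
    # Each subproblem is solved once; repeated calls hit the cache instead of recursing.
    if n < 2:
        return n
    return fib_memoized(n - 1) + fib_memoized(n - 2)

def fib_table(n):
    # Equivalent iterative formulation: fill the look-up table bottom-up.
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_memoized(40), fib_table(40))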

One of the key observations in many DP problems is that the solution to a larger problem often involves computing the minimum or maximum of the previously computed solutions. The larger problem’s solution is within a delta of that min-max of previous solutions.
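
The Floyd-Warshall all-pairs shortest-paths algorithm mentioned earlier follows exactly this pattern: each entry of the distance table is updated as the minimum over previously computed solutions. A compact sketch:

import math

def floyd_warshall(weights):
    """weights[i][j] is the edge cost from i to j (math.inf if there is no edge)."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Take the minimum of previously computed subproblem solutions.
                dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
    return dist

graph = [
    [0, 3, math.inf, 7],
    [8, 0, 2, math.inf],
    [5, math.inf, 0, 1],
    [2, math.inf, math.inf, 0],
]
print(floyd_warshall(graph))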

DP techniques, in general, achieve the same results as brute force algorithms, but with dramatic reductions in the computational requirements and execution times.
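As a minimal, generic illustration of these two ideas (this example is not from the post), the following coin-change sketch memoizes subproblem results in a look-up table, and its update is an add followed by a min, the same pattern that DPX instructions fuse into a single operation. The coin denominations and target amount are arbitrary.

// Hypothetical table-based DP sketch: minimum number of coins to make each
// amount up to `target`. The per-step update is an add followed by a min.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const std::vector<int> coins = {1, 3, 5};  // assumed denominations
    const int target = 11;                     // assumed amount
    const int INF = 1 << 20;                   // "not yet reachable" marker

    // table[v] = minimum coins needed to make amount v (memoized subproblems)
    std::vector<int> table(target + 1, INF);
    table[0] = 0;

    for (int v = 1; v <= target; ++v) {
        for (int c : coins) {
            if (c <= v) {
                // Reuse the smaller subproblem table[v - c]: add, then min
                table[v] = std::min(table[v], table[v - c] + 1);
            }
        }
    }
    std::printf("Minimum coins for %d: %d\n", target, table[target]);
    return 0;
}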

DP example: Accelerating the Smith-Waterman algorithm

NVIDIA Clara Parabricks is a GPU-accelerated genomic analysis suite of tools that heavily uses the Smith-Waterman algorithm and runs on NVIDIA GPUs: A100, V100, A40, A30, A10, A6000, T4, and soon the newest H100.

Genome sequencing has fundamental applications with universal benefits, with examples that include personalized medicine or tracking disease spread. Every cell in living organisms encodes genetic information in DNA using a sequence of four nucleotides, or bases. The nucleotides are adenine, cytosine, guanine, and thymine, represented by A, C, G, and T.

Simple organisms like viruses have a sequence of 10–100K bases while human DNA consists of about three billion base pairs. There are instruments (chemical– or electrical-signal–based) that sequence the bases of short segments of genetic material, called reads. These reads are typically 100–100K bases long, depending on the sequencer technology used for gathering the reads.

A critical computational step in genome sequence analysis is to align the reads to find the best match among a pair of reads. These reads can be 100-400 base pairs long in second-generation sequencers, and up to 100K bases long in third-generation sequencers. Aligning reads is a computational step that is repeated tens or hundreds of millions of times.

There are challenges in finding the best match that include the following:

  • Naturally occurring variations in genomes that give organisms within a species their specific traits
  • Errors in the reads themselves resulting from the sequencing instrument or underlying chemical processes

The best match between a pair of reads is equivalent to an approximate string match between a pair of strings with steps that reward matches and penalize differences. The differences between the reads could be mismatches, insertions, or deletions.

The Smith-Waterman algorithm proceeds to find the best match between a pair of sequences TGTTACGG and GGTTGACTA.
Figure 1. Smith-Waterman scoring matrix and traceback for a pair of sequences (Source: Wikipedia)

Figure 1 shows that the Smith-Waterman step in genomic sequencing aims to find the best match between the read sequences TGTTACGG and GGTTGACTA. The resulting best match is shown to be GTT-AC (from sequence 1, the “-” representing a deletion) with GTTGAC (from sequence 2). The scoring scheme in each step rewards matches (+3), penalizes mismatches (-3), and penalizes insertions and deletions (see the gap penalty formula in Figure 1).

This is one example formulation of the Smith-Waterman algorithm; implementers are allowed to customize the rewards and penalties.

While computing the best match between TGTTACGG and GGTTGACTA, the Smith-Waterman algorithm also computes the best matches for all prefixes of TGTTACGG with all prefixes of GGTTGACTA. It proceeds from the start of these sequences and uses the results of the smaller prefixes to feed into the problem of finding the solution for the larger prefixes.

Diagram shows a simplified cell calculation step between GA and GT. Only the cell entry for A versus T has to be calculated. This is a mismatch. Take the maximum score after accounting for substitution and gap penalties.
Figure 2. A simplified cell calculation step in the Smith-Waterman algorithm

Figure 3 shows how the algorithm proceeds in terms of calculating the score matrix for matching a pair of reads. This comparative matching is the computationally expensive step of the Smith-Waterman algorithm.

This is just one formulation of how the Smith-Waterman algorithm proceeds. Different formulations can result in the algorithm proceeding row-wise or column-wise, for example.

Each cell calculation requires the cell to the top, left, and top-left to be available. This diagram shows a formulation of Smith-Waterman that proceeds diagonally, starting from the top-left corner of the scoring matrix.
Figure 3. Example steps of the Smith-Waterman algorithm

After the score matrix is computed, the next step involves backtracking from the highest score to the origin of each of these scores. This is a computationally light step given that each cell maintains how it got its score (the source cell for score calculation).

If each cell keeps track of where its score was derived from (top, left, or top-left), tracing back from the highest score to a zero shows the best match(es) between the sequences.
Figure 4. Backtracking from the highest score in the Smith-Waterman algorithm
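To make the fill and traceback steps concrete, here is a minimal CPU reference sketch. It uses the match and mismatch scores from Figure 1 (+3 and -3) but, for brevity, a simple linear gap penalty of -2 per base rather than the affine gap formula in the figure, so its output may differ slightly from the alignment shown there. The variable names and the gap value are illustrative only.

// Minimal CPU reference sketch of the Smith-Waterman fill and traceback steps.
// Assumes +3 match, -3 mismatch (as in Figure 1) and a linear -2 gap penalty.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    const std::string a = "TGTTACGG";   // sequence 1
    const std::string b = "GGTTGACTA";  // sequence 2
    const int MATCH = 3, MISMATCH = -3, GAP = -2;

    const int rows = static_cast<int>(a.size()) + 1;
    const int cols = static_cast<int>(b.size()) + 1;
    std::vector<std::vector<int>> H(rows, std::vector<int>(cols, 0));
    std::vector<std::vector<char>> dir(rows, std::vector<char>(cols, ' '));
    int bestScore = 0, bestI = 0, bestJ = 0;

    // Fill: each cell reuses the top, left, and top-left subproblems.
    for (int i = 1; i < rows; ++i) {
        for (int j = 1; j < cols; ++j) {
            int diag = H[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? MATCH : MISMATCH);
            int up   = H[i - 1][j] + GAP;   // gap in sequence 2
            int left = H[i][j - 1] + GAP;   // gap in sequence 1
            int best = std::max({0, diag, up, left});
            H[i][j] = best;
            dir[i][j] = (best == diag) ? 'D' : (best == up) ? 'U' : 'L';
            if (best > bestScore) { bestScore = best; bestI = i; bestJ = j; }
        }
    }

    // Traceback: follow the stored directions from the highest score to a zero.
    std::string alignedA, alignedB;
    int i = bestI, j = bestJ;
    while (i > 0 && j > 0 && H[i][j] > 0) {
        if (dir[i][j] == 'D')      { alignedA += a[--i]; alignedB += b[--j]; }
        else if (dir[i][j] == 'U') { alignedA += a[--i]; alignedB += '-';    }
        else                       { alignedA += '-';    alignedB += b[--j]; }
    }
    std::reverse(alignedA.begin(), alignedA.end());
    std::reverse(alignedB.begin(), alignedB.end());
    std::printf("Score %d\n%s\n%s\n", bestScore, alignedA.c_str(), alignedB.c_str());
    return 0;
}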

Figure 5 shows the computational efficiency of the Smith-Waterman calculations, where each of the subproblems solved by the algorithm is stored in the result matrix and never recomputed.

For example, in the process of calculating the best match of GGTTGACTA and TGTTACGG, the Smith-Waterman algorithm reuses the best match between GGTT (prefix of GGTTGACTA) and TGTT (prefix of TGTTACGG). In turn, while calculating the best match of GGTT and TGTT, the best matches of all prefixes of these strings are calculated and reused (for example, the best match of GGTT and TGT).

The best match of GGTT with TGTT also yields the best match of GGTT with TGT. In fact, the best match of every prefix of TGTTACGG with every prefix of GGTTGACTA is calculated in the scoring matrix.
Figure 5. Subproblems solved by the Smith-Waterman algorithm

Leveraging DPX instructions for better performance

The inner loop in a real Smith-Waterman implementation involves the following for each cell:

  • Updating the deletion penalties
  • Updating the insertion penalties
  • Updating the score based on the updated insertion and deletion penalties

Equations for the insertion and deletion penalties. Practical implementations of Smith-Waterman must keep track of the insertion and deletion penalties in addition to the scores.
Figure 6. Updating the insertion penalties, deletion penalties, and scores in a Smith-Waterman implementation step
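For reference, a standard affine-gap formulation of these per-cell updates is shown below, where E and F track the running insertion and deletion penalties, H is the score matrix, and s(ai, bj) is the match or mismatch score. The exact equations and sign conventions used in Figure 6 may differ; this is only a common textbook form.

E[i][j] = max(E[i][j-1] - gapExtend, H[i][j-1] - gapOpen)
F[i][j] = max(F[i-1][j] - gapExtend, H[i-1][j] - gapOpen)
H[i][j] = max(0, H[i-1][j-1] + s(ai, bj), E[i][j], F[i][j])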

The NVIDIA Hopper architecture math API provides dramatic acceleration for such calculations. The APIs expose the acceleration provided by the NVIDIA Hopper streaming multiprocessors for an addition followed by a minimum or maximum as a single fused operation (for example, __viaddmin_s16x2_relu, an intrinsic that performs per-halfword max(min(a + b, c), 0)).

Another example of an API that is extensively leveraged by Smith-Waterman software is a three-way min or max followed by a clamp to zero (for example, __vimax3_s16x2_relu, an intrinsic that performs per-halfword max(max(max(a, b), c), 0)).
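As a sketch of how such intrinsics can map onto the cell update, the following device function computes two cells at a time with a simple linear gap penalty. The packed s16x2 data layout, the variable names, and the linear-gap simplification are assumptions for illustration; this is not the implementation measured below. (__vadd2 is the long-standing packed 16-bit add from the CUDA SIMD intrinsics; on pre-Hopper GPUs the DPX intrinsics fall back to a software sequence.)

// Illustrative device-side sketch: a packed, two-cells-at-a-time
// Smith-Waterman score update with a linear gap penalty, built on the
// DPX intrinsic named above. Each unsigned int holds two signed 16-bit lanes.
__device__ unsigned int swCellsLinearGap(
    unsigned int diag,  // H[i-1][j-1] for two cells, packed s16x2
    unsigned int up,    // H[i-1][j]   for two cells, packed s16x2
    unsigned int left,  // H[i][j-1]   for two cells, packed s16x2
    unsigned int sub,   // substitution score (+match / -mismatch), packed s16x2
    unsigned int gap)   // gap penalty (negative), packed s16x2
{
    unsigned int diagCand = __vadd2(diag, sub);   // H[i-1][j-1] + s(ai, bj)
    unsigned int upCand   = __vadd2(up,   gap);   // gap in one sequence
    unsigned int leftCand = __vadd2(left, gap);   // gap in the other sequence
    // Three-way max followed by a clamp to zero, fused into one DPX operation:
    // per halfword, max(max(max(a, b), c), 0).
    return __vimax3_s16x2_relu(diagCand, upCand, leftCand);
}

In an affine-gap formulation like the one in Figure 6, the running insertion and deletion penalties can be updated in the same spirit with the fused add-plus-min or add-plus-max variants, such as __viaddmin_s16x2_relu.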

Our implementation of the Smith-Waterman algorithm using the NVIDIA Hopper DPX instruction math APIs provides a 7.8x speedup over A100.

NVIDIA Hopper DPX instructions enable over a 7x speedup for the Smith-Waterman score matrix calculations.
Figure 7. Smith-Waterman calculations speedup using DPX instructions in H100

Needleman-Wunsch and partial order alignment

In the same way that Smith-Waterman implementations use DPX instructions, a large family of alignment algorithms can benefit because they essentially rely on the same principles.

One example is the Needleman-Wunsch algorithm, whose basic flow closely resembles that of Smith-Waterman. However, the initialization, insertion penalties, and gap penalties are calculated differently between the two approaches.

Algorithms like Partial Order Alignment also rely heavily, in their inner loops, on cell calculations that closely resemble the Smith-Waterman cell calculations.

All-pairs shortest paths

Robotic path planning with thousands or tens of thousands of objects is a common problem in warehouses where the environment is dynamic with many moving objects. These scenarios can involve dynamic replanning every few milliseconds.

The inner loop of most all-pairs shortest paths algorithms is shown in the following Floyd-Warshall example. The pseudocode shows how the algorithm’s inner loop updates the minimum distance between each vertex pair; the densest operation is essentially an add followed by a min.

initialize(dist)  # initialize nearest neighbors to actual distance, all others = infinity
for k in range(V):  # order of visiting k values not important, but each value must be visited

    # pick all vertices as source in parallel
    Parallel for_each i in range(V):

        # pick all vertices as destinations for the
        # above picked source
        Parallel for_each j in range(V):

            # if vertex k is on the shortest path from
            # i to j, then update the value of dist[i][j]
            dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])

            # dist[i][j] calculation can be parallel within each k
            # all dist[i][j] for a single 'k' must be computed
            # before moving to the next 'k'
        Synchronize
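
A minimal CUDA sketch of one k step is shown below, assuming the distance matrix is stored in row-major order and that “infinity” is encoded as a large value that cannot overflow when two entries are added (for example, INT_MAX / 2). The kernel name, launch configuration, and data layout are illustrative only; launching the kernel once per k from the host provides the per-k synchronization noted in the pseudocode.

// Illustrative CUDA kernel for one k step of Floyd-Warshall (not a tuned
// implementation). Each thread relaxes one (i, j) pair of the V x V matrix.
__global__ void fwRelaxK(int *dist, int V, int k)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= V || j >= V) return;

    // Add followed by min: the pattern that DPX instructions fuse on Hopper.
    int viaK = dist[i * V + k] + dist[k * V + j];
    dist[i * V + j] = min(dist[i * V + j], viaK);
}

// Host-side loop (illustrative):
//   for (int k = 0; k < V; ++k)
//       fwRelaxK<<<grid, block>>>(d_dist, V, k);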

The speedup offered by DPX instructions makes it possible to dramatically scale the number of objects analyzed, or to perform the re-optimization in real time with fewer GPUs while still producing optimal results.

Accelerate dynamic programming algorithms with DPX instructions 

Using NVIDIA Hopper DPX instructions demonstrated speedups of up to 7.8x over the A100 GPU for Smith-Waterman, which is key in many genomic sequence alignment and variant calling applications. The exposure of these instructions through math APIs, available in CUDA 12, enables configurable implementations of the Smith-Waterman algorithm to suit different user needs, as well as of algorithms like Needleman-Wunsch.

DPX instructions accelerate a large class of dynamic programming algorithms used in areas such as DNA and protein sequencing and robot path planning. Most importantly, these algorithms can lead to dramatic speedups in disease diagnosis, drug discovery, and robot autonomy, making our everyday lives better.

Acknowledgments

We’d like to thank Bill Dally, Alejandro Cachon, Mehrzad Samadi, Shirish Gadre, Daniel Stiffler, Rob Van der Wijngaart, Joseph Sullivan, Nikita Astafev, Seamus O’Boyle, and many others across NVIDIA.