With the state of the world under constant flux in 2022, some technology trends were put on hold while others were accelerated. Supply chain challenges, labor shortages and economic uncertainty had companies reevaluating their budgets for new technology. For many organizations, AI is viewed as the solution to a lot of the uncertainty bringing improved…
From taking your order and serving you food in a restaurant to playing poker with you, service robots are becoming increasingly prevalent. Globally, you can find these service robots at hospitals, airports, and retail stores.
According to Gartner, by 2030, 80% of humans will engage with smart robots daily, up from less than 10% today, due to smart robot advancements in intelligence, social interactions, and human augmentation capabilities.
An accurate speech AI or voice AI interface that can quickly understand humans and mimic human speech is critical to a service robot’s ease of use. Developers are integrating automatic speech recognition (ASR) and text-to-speech (TTS) with service robots to enable essential skills, such as understanding and responding to human questions in natural language. These voice-based technologies make up speech AI.
This post explains how ASR and TTS can be used in service robot applications. I provide a walkthrough on how to customize them using speech AI software tools for industry-specific jargon, languages, and dialects, depending on where the robot is deployed.
Why add speech AI to service robot applications?
Service robots are like digital humans in the metaverse except that they operate in the physical world. These service robots can help support warehouse workers, perform dangerous tasks while following human instructions, or even assist in activities that require contactless services. For instance, a service robot in the hospitality industry can greet guests, carry bags, and take orders.
For all these service robots to understand and respond in a human-like way, developers must incorporate highly accurate speech AI that runs in real time.
Examples of speech AI-enabled service robot applications
Today, service robots are used in a wide range of industries.
Restaurants
Online food delivery services are growing in popularity worldwide. To handle the increased customer demand without compromising quality, service robots can assist staff with tasks such as order taking or delivering food to in-person customers.
Hospitals
In hospitals, service robots can support and empower patient care teams by handling patient-related tasks. For example, a speech AI-enabled service robot can empathetically converse with patients to provide company or help improve their mental health state.
Ambient assisted living
In ambient assisted living environments, technology is primarily used to support the independence and safety of elderly or vulnerable adults. Service robots can assist with daily activities, such as transporting food trays from one location to another or using a smart robotic pill dispenser to manage medications in a timely manner. With speech AI skills, service robots can also provide emotional support.
Service robot reference architecture
Service robots help businesses improve quality assurance and boost productivity in several ways:
Assisting frontline workers with daily repetitive tasks in restaurants or manufacturing environments
Helping customers find desired items in retail stores
Supporting physicians and nurses with patient healthcare services in hospitals
In these settings, it’s imperative that robots can accurately process and understand what a user is relaying. This is especially true for situations where danger or serious harm is a possibility, such as a hospital. Service robots that can naturally converse with humans also contribute to a positive overall user experience for an application.
Figure 1 shows that service robots use speech recognition to comprehend what users are saying and TTS to respond to users with a synthetic voice. Other components, such as NLP and a dialog manager, help service robots understand context and generate appropriate answers to users’ questions.
Also, the modules under robot tasks such as perception, navigation, and mapping help the robot understand its physical surroundings and move in the right direction.
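To make the flow between these components concrete, here is a minimal sketch of one conversational turn. Every function in it is a hypothetical stand-in that I made up for illustration: the "ASR" simply decodes text bytes, the "NLP" step is keyword matching, and the "TTS" step returns bytes in place of audio. It only shows the order in which the components hand data to each other, not how any real model works.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    intent: str
    text: str


def recognize_speech(audio: bytes) -> str:
    """Stand-in for ASR: here the 'audio' is just UTF-8 text bytes."""
    return audio.decode("utf-8").strip().lower()


def understand(transcript: str) -> str:
    """Stand-in for NLP: keyword matching instead of a real intent model."""
    if "order" in transcript or "menu" in transcript:
        return "take_order"
    if "where" in transcript:
        return "give_directions"
    return "fallback"


def manage_dialog(intent: str) -> Reply:
    """Stand-in for the dialog manager: picks a canned response per intent."""
    responses = {
        "take_order": "Sure, what would you like to order?",
        "give_directions": "Please follow me.",
        "fallback": "Sorry, could you repeat that?",
    }
    return Reply(intent=intent, text=responses[intent])


def synthesize_speech(text: str) -> bytes:
    """Stand-in for TTS: returns bytes in place of synthesized audio."""
    return text.encode("utf-8")


def handle_utterance(audio: bytes) -> bytes:
    """One turn of the voice loop: ASR -> NLP -> dialog manager -> TTS."""
    transcript = recognize_speech(audio)
    intent = understand(transcript)
    reply = manage_dialog(intent)
    return synthesize_speech(reply.text)


print(handle_utterance(b"Can I see the menu?"))
```

In a real robot, each stand-in would be replaced by a call to the corresponding speech AI or NLP service, and the robot task modules would run alongside this loop.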
Voice user interfaces to service robots
Voice user interfaces include two main components: automatic speech recognition and text-to-speech. Automatic speech recognition, also known as speech-to-text, is the process of converting raw speech into text. Text-to-speech, also known as speech synthesis, is the process of converting text into human-like speech.
Developing speech AI pipelines has its own challenges. For example, if a service robot is deployed in restaurants, it should be able to understand words like matcha, cappuccino, and ristretto. It should also transcribe accurately in noisy environments, as most people interacting with these applications are in open spaces.
Not only do the robots have to understand what is being said, but they should also be able to say these words correctly. Similarly, each industry has its own terminology that these robots must understand and respond to in real time.
Automatic speech recognition
The roles of each model or module in the ASR pipeline are as follows:
The feature extractor converts raw audio into spectrograms or mel spectrograms.
The acoustic model takes these spectrograms and generates a matrix that has probabilities of characters or words over each time step.
The decoder and language model put together these characters/words into a transcript.
The punctuation and capitalization model applies things like commas, periods, and question marks in the right places for better readability.
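As an illustration of how a probability matrix becomes a transcript, the following sketch implements the simplest possible decoder: greedy decoding for a CTC-style acoustic model. This is an assumption on my part, as the pipeline above pairs the decoder with a language model, which this sketch leaves out. The alphabet, the blank symbol, and the probabilities are all made up for the example.

```python
BLANK = "_"  # hypothetical blank symbol used by CTC-style models


def greedy_ctc_decode(prob_matrix, alphabet):
    """Pick the most likely symbol per time step, merge repeats, drop blanks."""
    best_path = []
    for step_probs in prob_matrix:
        best_index = max(range(len(step_probs)), key=step_probs.__getitem__)
        best_path.append(alphabet[best_index])

    decoded = []
    previous = None
    for symbol in best_path:
        # A symbol is emitted only when it differs from the previous step
        # and is not the blank symbol.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "c", "a", "t"]
probs = [
    [0.1, 0.7, 0.1, 0.1],    # c
    [0.1, 0.6, 0.2, 0.1],    # c (repeat, merged)
    [0.8, 0.1, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.7, 0.1],    # a
    [0.1, 0.1, 0.1, 0.7],    # t
    [0.1, 0.1, 0.1, 0.7],    # t (repeat, merged)
]
print(greedy_ctc_decode(probs, alphabet))  # cat
```

A decoder that uses a language model would instead score several candidate paths and prefer the ones that form likely words, which matters for domain terms such as ristretto.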
Text-to-speech
The roles of each model or module in the TTS pipeline are as follows:
In the text normalization and preprocessing stage, the text is converted into verbalized form. For instance: “at 10:00” -> “at ten o’clock.”
The text encoding module converts text into an encoded vector.
The pitch predictor predicts how high or low the pitch of certain words should be, while the duration predictor predicts how long it takes to pronounce a character or word.
The spectrogram generator uses an encoded vector and other supporting vectors as input to generate a spectrogram.
The vocoder model takes spectrograms as input and produces a human-like voice as output.
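To give a feel for the text normalization stage, here is a toy sketch that verbalizes on-the-hour times, as in the “at 10:00” example above. It is a deliberately narrow illustration: the function name and the regular expression are my own, and a production normalizer handles many more cases (dates, currency, abbreviations) than this does.

```python
import re

HOUR_WORDS = [
    "twelve", "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
]


def verbalize_on_the_hour_times(text):
    """Rewrite times such as '10:00' as "ten o'clock"; leave other text as-is."""

    def replace(match):
        hour = int(match.group(1))
        return f"{HOUR_WORDS[hour]} o'clock"

    # Matches 0:00 through 12:00 only; other times are left unchanged.
    return re.sub(r"\b(1[0-2]|0?[0-9]):00\b", replace, text)


print(verbalize_on_the_hour_times("The kitchen opens at 10:00."))
```

Times that are not on the hour, such as 10:30, pass through untouched in this sketch, so a later stage would still have to verbalize them.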
Speech AI software suite
NVIDIA provides a variety of datasets, tools, and SDKs to help you build end-to-end speech AI pipelines. Customize the pipelines to your industry’s specific vocabulary, language, and dialects and run in milliseconds for natural and engaging interactions.
Datasets
To democratize and diversify speech AI technology, NVIDIA collaborated with Mozilla Common Voice (MCV). MCV is a crowd-sourced project in which volunteers contribute speech data to a public dataset that anyone can use to train voice-enabled technology. You can download various language audio datasets from MCV to develop ASR and TTS models.
NVIDIA also collaborated with Defined.ai, a one-stop shop for training data. You can download audio and speech training data in multiple domains, languages, and accents for use in speech AI models.
Pretrained models
NGC provides several pretrained models trained on a variety of open and proprietary datasets. All models have been optimized and trained on NVIDIA DGX servers for hundreds of thousands of hours.
You can fine-tune these highly accurate, pretrained models on a relevant dataset to improve accuracy even further.
Open-source tools
If you’re looking for open-source tools, NVIDIA offers NeMo, an open-source framework for building and training state-of-the-art AI speech and language models. NeMo is built on top of PyTorch and PyTorch Lightning, making it easy for you to develop and integrate modules that are already familiar.
Speech AI SDK
Use NVIDIA Riva, a free GPU-accelerated speech AI SDK, to build and deploy fully customizable, real-time AI pipelines. Riva offers state-of-the-art, highly accurate, pretrained models through NGC:
English
Spanish
Mandarin
Hindi
Russian
Korean
German
French
Portuguese
Japanese, Arabic, and Italian are coming soon.
With NeMo, you can fine-tune these pretrained models on industry-specific jargon, languages, dialects, and accents, and optimize speech AI skills to run in real time.
You can deploy Riva skills in streaming or offline mode in all clouds, on-premises, at the edge, and on embedded devices.
Running Riva speech AI skills on embedded for robotics applications
In this section, I show you how to run out-of-the-box ASR and TTS skills with Riva on embedded devices. For better accuracy and performance, Riva also enables you to customize or fine-tune models on domain-specific datasets.
You can run Riva speech AI skills in both streaming and offline modes. First, set up and run the Riva server on embedded.
For more information about customizing Riva ASR models and pipelines for your industry-specific jargon, languages, dialects, and accents, see the instructions on the Model Overview in the Riva documentation.
Running C++ TTS client
For Riva TTS client on embedded, run the following command to synthesize audio files:
riva_tts_client --voice_name=English-US.Female-1 --text="Hello, this is a speech synthesizer." --audio_file=/opt/riva/wav/output.wav
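If the robot application is written in Python, one way to call the client is to build the same command as an argument list. The sketch below only assembles the command and uses the three flags shown above. The helper name, the example sentence, and the output path are hypothetical.

```python
import shlex


def build_tts_command(text, audio_file, voice_name="English-US.Female-1"):
    """Return the argument list for the riva_tts_client call shown above."""
    return [
        "riva_tts_client",
        f"--voice_name={voice_name}",
        f"--text={text}",
        f"--audio_file={audio_file}",
    ]


command = build_tts_command(
    "Your cappuccino is ready.", "/opt/riva/wav/order_ready.wav"
)
# shlex.join quotes the text argument so the line can be pasted into a shell.
print(shlex.join(command))
```

Passing the list to subprocess.run(command, check=True) on a device where the Riva server and client are set up avoids shell quoting problems when the text contains spaces or punctuation.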
For more information about customizing TTS models and pipelines on domain-specific datasets, see Model Overview in the Riva User Guide.
Resource for developing speech AI applications
Speech AI makes it possible for service robots and other interactive applications to comprehend nuanced human language and respond with ease.
It is empowering everything from real people in call centers to service robots in every industry. To understand how speech AI skills were integrated with a robotic dog that can fetch drinks in real life, see Low-code Building Blocks for Speech AI Robotics.
You can also access developer ebooks, such as End-To-End Speech AI pipelines to learn more about models and modules in speech AI pipelines and Building Speech AI Applications to gain insight on how to build and deploy real-time speech AI pipelines for your application.
Speech is one of the primary means to communicate with an AI-powered application. From virtual assistants to digital avatars, voice-based interfaces are changing how we typically interact with smart devices.
Deep learning techniques for speech recognition and speech synthesis are helping improve the user experience—think human-like responses and natural-sounding tones.
If you plan to build and deploy a speech AI-enabled application, this post provides an overview of how automatic speech recognition (ASR) and text-to-speech (TTS) technologies have evolved due to deep learning. I also mention some popular, state-of-the-art ASR and TTS architectures used in today’s modern applications.
Demystifying speech AI
Every day, hundreds of billions of audio minutes are generated, whether you are conversing with digital humans in the metaverse or actual humans in contact centers. Speech AI can assist in automating all these audio minutes.
Speech AI includes technologies like ASR, TTS, and related tasks. Interestingly, these technologies are not new and have existed for more than five decades.
Speech recognition evolution
Today, ASR algorithms developed using deep learning techniques can be customized for domain-specific jargon, languages, accents, and dialects, as well as transcribing in noisy environments.
This level of capability differs significantly from the first ASR system, Audrey, which was invented by Bell Labs in 1952. At the time, Audrey could only transcribe numbers and was not developed using deep learning techniques.
ASR pipeline
A standard ASR deep learning pipeline consists of a feature extractor, acoustic model, decoder and language model, and BERT punctuation and capitalization model.
Text-to-speech evolution
TTS, or speech synthesis, systems that are developed using deep learning techniques sound like real humans and can run in real time to have natural and meaningful discussions. On the other hand, traditional systems like Voder, DECtalk commercial, and concatenative TTS sound robotic and are difficult to run in real time.
Deep learning TTS algorithms are flexible enough so that you can adjust the speed, pitch, and duration at the inference time to generate more expressive TTS voices.
TTS pipeline
A basic TTS pipeline includes the following components: text normalization, text encoding, pitch/duration predictor, spectrogram generator, and vocoder model.
You can learn more about how ASR and TTS have changed over the past few years and about each of the models and modules in ASR and TTS pipelines in the on-demand video, Speech AI Demystified.
Popular ASR and TTS architectures used today
Several state-of-the-art neural network architectures have been created. Some of the most popular ones in use today for ASR are CTC and transducer-based architecture models. For example, you can apply these architecture techniques to models such as CitriNet and Conformer.
For TTS, different types of architectures exist:
Autoregressive or non-autoregressive
Deterministic or generative
Explicit control or non-explicit control
Each of these TTS architectures offers varying capabilities. For example, deterministic models predict the outcome exactly and don’t include randomness. Generative models learn the data distribution itself and can capture different variations of the synthetic voice. To build an end-to-end text-to-speech pipeline, you must combine one architecture from each category.
You can get the latest architecture best practices to build an ASR and TTS pipeline for your voice-enabled application in the on-demand video, Speech AI Demystified.
NVIDIA Speech AI SDK
You can develop deep learning-based ASR and TTS algorithms by leveraging a GPU-accelerated speech AI SDK. NVIDIA Riva helps you build and deploy customizable AI pipelines that deliver world-class accuracy in all clouds, on-premises, at the edge, and on embedded devices.
Riva has state-of-the-art pretrained models on NGC that are trained on multiple open and proprietary datasets. You can use low-coding tools to customize these models to fit your industry and use case with optimized speech AI skills that can run in real time, without sacrificing accuracy.
Build your first speech AI application
Are you looking to add an interactive voice experience to applications? The following free ebooks will guide your journey:
Loading and preprocessing data for running machine learning models at scale often requires seamlessly stitching the data processing framework and inference engine together.
In this post, we walk through the integration of NVIDIA TensorRT with Apache Beam SDK and show how complex inference scenarios can be fully encapsulated within a data processing pipeline. We also demonstrate how terabytes of data can be processed from both batch and streaming sources with a few lines of code for high-throughput and low-latency model inference.
NVIDIA TensorRT is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput while preserving model accuracy through multiple optimizations, including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
Proven with 15+ years in production, Dataflow is a no-ops, serverless data processing platform to process data, in batch or in real time, for analytical, ML, and application use cases. These often include incorporating pretrained models into data pipelines. Whatever the use case may be, the use of Apache Beam as its SDK enables Dataflow to make use of the robust community, simplify your data architectures, and deliver insights with ML.
Build a TensorRT engine for inference
To use TensorRT with Apache Beam, at this stage, you need a converted TensorRT engine file from a trained model. Here’s how to convert a TensorFlow Object Detection SSD MobileNet v2 320×320 model to ONNX, build a TensorRT engine from ONNX, and run the engine locally.
Convert the TF model to ONNX
To convert TensorFlow Object Detection SSD MobileNet v2 320×320 to ONNX, use one of the TensorRT example converters. This can be done on an on-premises system if the system has the same GPU that will be used in Dataflow for inference.
To prepare your environment, follow the instructions under Setup. This post follows this guide up to and including the Create ONNX Graph step. Use --batch_size 1 because the example covered later in this post works with batch size 1 only. You can name the final --onnx file ssd_mobilenet_v2_320x320_coco17_tpu-8.onnx. Building and running is handled in GCP.
Make sure that you set up a GCP project with proper credentials and API access to Dataflow, Google Cloud Storage (GCS), and Google Compute Engine (GCE). For more information, see Create a Dataflow pipeline using Python.
Spin up a GCE VM
You need a machine that contains the following installed resources:
NVIDIA T4 Tensor Core GPU
GPU driver
Docker
NVIDIA container toolkit
You can do this by creating a new GCE VM. Follow the instructions but use the following settings:
Name: tensorrt-demo
GPU type: NVIDIA T4
Number of GPUs: 1
Machine type: n1-standard-2
You may need a more powerful machine if you know that you are working with models that are large.
In the Boot disk section, choose CHANGE, and go to the PUBLIC IMAGES tab. For Operating system, choose Deep Learning on Linux. There are many versions, but make sure you choose one with CUDA. The version Debian 10 based Deep Learning VM with M98 works for this example.
The other settings can be left to their default values.
If you did this locally, follow the next steps. Otherwise, you can skip to the next section.
The following commands are only necessary if you are creating the image in a different machine than the one in which you intend to build the TensorRT engine. For this post, use Google Container Registry. Tag your image to a URI that you use for your project and then push to the registry. Make sure to replace GCP_PROJECT and MY_DIR with the appropriate values.
docker tag tensor_rt us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt
docker push us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt
Creating the TensorRT engine
The following commands are only necessary if you created the image in a different machine than the one in which you intend to build the TensorRT engine. Pull the TensorRT image from the registry:
docker pull us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt
docker tag us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt tensor_rt
If the ONNX model is not in the GCE VM, you can copy it from your local machine to the /models directory:
You should now see the ssd_mobilenet_v2_320x320_coco17_tpu-8.trt file in your /tensorrt_engines directory in the VM.
Upload the TensorRT Engine to GCS
Copy the file to GCS. If you run into issues with gsutil in uploading the file directly from GCE to GCS, you may have to first copy it to your local machine.
Make sure that you have a Beam pipeline that uses TensorRT RunInference. One example is tensorrt_object_detection.py, which you can follow by running the following commands in your GCE VM. Exit the Docker container first by typing Ctrl+D.
You also create a file called image_file_names.txt, which contains paths to the images. The images can be in an object store like GCS, or in the GCE VM.
docker run --rm -it --gpus all -v /home/{username}/:/mnt -w /mnt/beam/sdks/python tensor_rt python -m apache_beam.examples.inference.tensorrt_object_detection --input gs://{GCS_BUCKET}/tensorrt_image_file_names.txt --output /mnt/tensorrt_predictions.csv --engine_path gs://{GCS_BUCKET}/ssd_mobilenet_v2_320x320_coco17_tpu-8.trt
You should now see a file called tensorrt_predictions.csv. Each line has data separated by a semicolon.
The first item is the file name.
The second item is a list of dictionaries, where each dictionary corresponds with a single detection.
A detection contains box coordinates (ymin, xmin, ymax, xmax), score, and class.
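To work with this output, you can split each line on the first semicolon and parse the detections. The sketch below assumes that the list of dictionaries is written as a Python literal and that the keys are named ymin, xmin, ymax, xmax, score, and class. Both are assumptions on my part, so check a line of your own tensorrt_predictions.csv first. The sample line is invented.

```python
import ast


def parse_prediction_line(line):
    """Split one 'file_name;[detections]' line into its two parts."""
    file_name, _, detections_text = line.rstrip("\n").partition(";")
    # Assumption: the detections are serialized as a Python literal.
    detections = ast.literal_eval(detections_text)
    return file_name, detections


def keep_confident(detections, min_score=0.5):
    """Keep detections whose score is at least min_score."""
    return [d for d in detections if d["score"] >= min_score]


# Invented sample line that follows the assumed format.
sample = (
    "images/kitchen.jpg;"
    "[{'ymin': 10.0, 'xmin': 20.0, 'ymax': 110.0, 'xmax': 220.0,"
    " 'score': 0.91, 'class': 'cup'},"
    " {'ymin': 5.0, 'xmin': 8.0, 'ymax': 40.0, 'xmax': 60.0,"
    " 'score': 0.12, 'class': 'spoon'}]"
)
file_name, detections = parse_prediction_line(sample)
print(file_name, keep_confident(detections))
```

Using ast.literal_eval rather than eval keeps the parsing limited to plain literals, which is safer when the file comes from a shared bucket.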
For more information about how to set up and run TensorRT RunInference locally, follow the instructions in the Object Detection section.
The TensorRT Support Guide provides an overview of all the supported NVIDIA TensorRT 8.5.1 samples on GitHub and in the product package. These samples are designed to show how to use TensorRT in numerous use cases while highlighting different capabilities of the interface. These samples specifically help in use cases such as recommenders, machine comprehension, character recognition, image classification, and object detection.
Running TensorRT Engine with Dataflow RunInference
Now that you have the TensorRT engine, you can run a pipeline on Dataflow.
The following code example is a part of the pipeline, where you use TensorRTEngineHandlerNumPy to load the TensorRT engine and set other inference parameters. You then read the images, do preprocessing to attach keys to the images, do the prediction, and then write to a file in GCS.
Depending on the size constraints of the model, you may want to adjust machine_type, the type and count of the GPU, or disk_size_gb. For more information about Beam pipeline options, see Set Dataflow pipeline options.
TensorRT and TensorFlow object detection benchmarking
To benchmark, we decided to do a comparison between the TensorRT and TensorFlow object detection versions of the previously mentioned SSD MobileNet v2 320×320 model.
Every single inference call was timed in both the TensorRT and TensorFlow object detection versions. We calculated an average over 5,000 inference calls, not taking the first 10 images into account due to ramp-up latencies. The SSD model that we used is a small model. You’ll observe even better speedup when your model can make full use of the GPU.
First, we compared the direct performance speedup between TensorFlow and TensorRT with a local benchmark. We aimed to prove the added benefits with reduced precision mode on TensorRT.
Framework and precision | Inference latency (ms)
TensorFlow Object Detection FP32 (end-to-end) | 29.47
TensorRT FP32 (end-to-end) | 3.72
TensorRT FP32 (GPU compute) | 2.39
TensorRT FP16 (GPU compute) | 1.48
TensorRT INT8 (GPU compute) | 1.34
Table 1. Direct performance speedup on TensorRT
The overall speedup with TensorRT FP32 is 7.9x. End-to-end included data copies, while the GPU compute only included actual inference time. We did this separation because the example model is small. End-to-end TensorRT latency in this case is mostly data copies. You see more significant end-to-end performance improvements using different precisions in bigger models, especially in cases where inference compute is the bottleneck, not data copies.
FP16 is 1.6x faster than FP32 and has no accuracy penalty. INT8 is 1.8x faster than FP32, but sometimes comes with accuracy degradation and requires a calibration process. Accuracy degradation is model-specific, so it’s always good to try yours and see the produced accuracy.
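The speedup figures quoted here follow directly from the latencies in Table 1. As a quick check, this short sketch divides the baseline latency by the candidate latency for each comparison. The dictionary and function names are only for illustration.

```python
# Latencies in milliseconds, copied from Table 1.
latency_ms = {
    "TensorFlow Object Detection FP32 (end-to-end)": 29.47,
    "TensorRT FP32 (end-to-end)": 3.72,
    "TensorRT FP32 (GPU compute)": 2.39,
    "TensorRT FP16 (GPU compute)": 1.48,
    "TensorRT INT8 (GPU compute)": 1.34,
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return latency_ms[baseline] / latency_ms[candidate]


end_to_end = speedup(
    "TensorFlow Object Detection FP32 (end-to-end)",
    "TensorRT FP32 (end-to-end)",
)
fp16 = speedup("TensorRT FP32 (GPU compute)", "TensorRT FP16 (GPU compute)")
int8 = speedup("TensorRT FP32 (GPU compute)", "TensorRT INT8 (GPU compute)")
print(f"{end_to_end:.1f}x {fp16:.1f}x {int8:.1f}x")  # 7.9x 1.6x 1.8x
```

Note that the FP16 and INT8 ratios compare GPU compute time only, while the 7.9x figure compares end-to-end latency, which includes data copies.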
In Dataflow, with the TensorRT engine generated in earlier experiments, we ran with the following configurations: n1-standard-4 machine, disk_size_gb=75, and 10 workers.
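As a rough sketch of how such a configuration can be expressed, the following helper assembles worker-related pipeline flags as strings. The flag spellings, and in particular the worker_accelerator service option, are written from my recollection of the Beam and Dataflow option names rather than taken from this post, so treat them as assumptions and verify them against Set Dataflow pipeline options. The helper name and the default GPU values are hypothetical.

```python
def dataflow_worker_flags(machine_type, disk_size_gb, num_workers,
                          gpu_type="nvidia-tesla-t4", gpu_count=1):
    """Build worker-related pipeline flags as a list of strings."""
    # Assumed format of the Dataflow GPU service option.
    accelerator = (
        f"worker_accelerator=type:{gpu_type};"
        f"count:{gpu_count};install-nvidia-driver"
    )
    return [
        f"--machine_type={machine_type}",
        f"--disk_size_gb={disk_size_gb}",
        f"--num_workers={num_workers}",
        f"--dataflow_service_options={accelerator}",
    ]


# Values from the benchmark configuration described above.
flags = dataflow_worker_flags("n1-standard-4", disk_size_gb=75, num_workers=10)
print("\n".join(flags))
```

Keeping these values in one place makes it easier to rerun the same pipeline with a larger machine type or disk when the model grows.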
To simulate a stream of data coming into Dataflow through PubSub, we set batch sizes to 1. This was done by setting ModelHandlers to have min and max batch sizes of 1.
Configuration | Stage with RunInference | Mean inference_batch_latency_micro_secs
TensorFlow with T4 GPU | 12 min 43 sec | 99,242
TensorRT with T4 GPU | 7 min 20 sec | 10,836
Table 2. Dataflow benchmarks
The Dataflow runner decomposes a pipeline into multiple stages. You can get a better picture of the performance of RunInference by looking at the stage that contains the inference call, and not the other stages that read and write data. This is in the Stage with RunInference column.
For this metric, TensorRT spends only about 58% of the runtime of TensorFlow (7 min 20 sec versus 12 min 43 sec). You can expect the acceleration to grow if you adopt a larger model that fully uses GPU processing power.
The metric inference_batch_latency_micro_secs is the time, in microseconds, that it takes to perform the inference on the batch of examples, that is, the time to call model_handler.run_inference. This varies over time depending on the dynamic batching decision of BatchElements, and the particular values or dtype values of the elements. For this metric, you can see that TensorRT is about 9.2x faster than TensorFlow.
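Both comparisons can be recomputed from the values in Table 2. The sketch below converts the stage durations to seconds and then takes the two ratios. The helper name is only for illustration.

```python
import re


def to_seconds(duration):
    """Convert a string such as '12 min 43 sec' to seconds."""
    match = re.fullmatch(r"(\d+) min (\d+) sec", duration)
    minutes, seconds = map(int, match.groups())
    return 60 * minutes + seconds


# Values copied from Table 2.
stage_seconds = {
    "TensorFlow": to_seconds("12 min 43 sec"),
    "TensorRT": to_seconds("7 min 20 sec"),
}
batch_latency_us = {"TensorFlow": 99_242, "TensorRT": 10_836}

runtime_fraction = stage_seconds["TensorRT"] / stage_seconds["TensorFlow"]
latency_speedup = batch_latency_us["TensorFlow"] / batch_latency_us["TensorRT"]
print(f"{runtime_fraction:.3f} {latency_speedup:.1f}x")  # 0.577 9.2x
```

The gap between the two ratios is expected: the stage runtime also covers work around the inference call, while the batch latency metric times only the call itself.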
Conclusion
In this post, we demonstrated how to run machine learning models at scale by seamlessly stitching together a data processing framework (Apache Beam) and inference engine (TensorRT). We presented an end-to-end example of how inference workload can be fully integrated within a data processing pipeline.
This integration enables a new inference pipeline that helps reduce production inference cost with better NVIDIA GPU utilization and much-improved inference latency and throughput. The same approach can be applied to many other inference workloads using many off-shelf TensorRT samples. In the future, we plan to further automate TensorRT engine building and work on deeper integration of TensorRT with Apache Beam.
A pretrained AI model is a deep learning model that’s trained on large datasets to accomplish a specific task, and it can be used as is or customized to suit application requirements across multiple industries.
If AI had a highlight reel, the NVIDIA YouTube channel might just be it. The channel showcases the latest breakthroughs in artificial intelligence, with demos, keynotes and other videos that help viewers see and believe the astonishing ways in which the technology is changing the world. NVIDIA’s most popular videos of 2022 put spotlights on…
To make transportation safer, autonomous vehicles (AVs) must have processes and underlying systems that meet the highest standards. NVIDIA DRIVE OS is the operating system for in-vehicle accelerated computing powered by the NVIDIA DRIVE platform. DRIVE OS 5.2 is now functional safety-certified by TÜV SÜD, one of the most experienced and rigorous assessment bodies in…
A national initiative in semiconductors provides a once-in-a-generation opportunity to energize manufacturing in the U.S. The CHIPS and Science Act includes a $13 billion R&D investment in the chip industry. Done right, it’s a recipe for bringing advanced manufacturing techniques to every industry and cultivating a highly skilled workforce. The semiconductor industry uses the most…
The great thing about the GPU is that it offers tremendous parallelism; it allows you to perform many tasks at the same time. At its most granular level, this comes down to the fact that there are thousands of tiny processing cores that run the same instruction at the same time. But that is not where such parallelism stops. There are other ways that you can leverage parallelism that are often overlooked, particularly when it comes to AI.
When you consider the performance of an AI feature, what exactly do you mean? Are you just considering the time the model itself takes to run or are you considering the time it takes to load the data, preprocess the data, transfer the data, and write back to disk or display?
This question is perhaps best answered by the user who will experience the feature in question. It can often transpire that the actual model execution time is only a small part of that overall experience.
This post is the first in a series that walks you through several use cases that are specific to APIs, including:
ONNX Runtime and Microsoft WinML
NVIDIA TensorRT
NVIDIA cuDNN
Microsoft DirectML
AI on workstation is a relatively new phenomenon. It’s traditionally been the stuff of servers and the cloud, but that is changing, particularly in the content creation space. As such, there are many existing code bases now being complemented with new AI features.
One of the first questions to ask when implementing an AI feature is, how do you run inference? What are the constraints? What platforms do you need to support?
Depending on the constraints that you identify, you may choose a DirectML and WinML–based approach or a CUDA and TensorRT–based approach. Whatever approach you choose, you should also consider how to integrate your feature into an existing workflow or pipeline.
Consider a relatively common workflow for generative AI in the content creation space: a denoise feature. To run this denoiser, the following steps must happen:
Load the model into GPU memory.
Make input data available to the model.
Pass the input data through the model.
Do something with the output data.
There are a lot of ambiguities in this list, so I want to discuss each step.
Load the model into GPU memory
When and how do you do this?
Models come in all sorts of shapes and sizes, from just a few kilobytes to many gigabytes. If your model executes as a part of a long-running pipeline, you may not be able to keep a large model in memory persistently.
Ideally, you would keep the model loading as far from the performance path as possible, but there may be times that this is intractable. You may have to load and unload models as a pipeline runs.
The best-case scenario is to load a model one time and use it as many times as possible. In cases where this can’t be done, most frameworks enable a serialized model to be unloaded and streamed back to the GPU relatively quickly.
Make input data available to the model
This step is where things can get interesting. Usually, this is where there is a lot of low-hanging fruit to improve your performance.
Ultimately, the model expects to consume input data in a specific format. This almost always means a particular scaling and offset, format conversion (for example, UINT8 to FP16), and possibly some layout transformation as well. On NVIDIA hardware, Tensor Cores prefer the NHWC layout.
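To illustrate what scaling, offsetting, and a layout transformation mean in practice, here is a small CPU-side sketch on nested lists. The divisor and offset values are assumptions chosen to map UINT8 values to the range -1 to 1, and the function names are my own. A real pipeline would do this work on the GPU and would also convert to FP16, which is not shown here.

```python
def normalize(pixel, divisor=127.5, offset=-1.0):
    """Map a UINT8 value in [0, 255] to a float in [-1.0, 1.0]."""
    return pixel / divisor + offset


def chw_to_hwc(image_chw):
    """Reorder a [channel][row][col] image to [row][col][channel]."""
    channels = len(image_chw)
    height = len(image_chw[0])
    width = len(image_chw[0][0])
    return [
        [[image_chw[c][y][x] for c in range(channels)] for x in range(width)]
        for y in range(height)
    ]


def preprocess(image_chw):
    """Normalize every pixel, then switch to channels-last layout."""
    normalized = [
        [[normalize(p) for p in row] for row in plane] for plane in image_chw
    ]
    return chw_to_hwc(normalized)


# Two channels, one row, two columns.
tiny_image = [[[0, 255]], [[255, 0]]]
print(preprocess(tiny_image))
```

For a batch of images, applying the same reordering to each image turns NCHW data into the NHWC layout mentioned above.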
Often, there is other preprocessing that must be done. Perhaps there is a conversion from or to frequency space or a decode from some compressed format.
This is all work that the GPU can do effectively so it’s important that you allow the GPU to do it. It can be tempting to either allow the CPU to do this work or offload the work to third-party libraries. The latter is a perfectly sensible way to do this. In either case, you must ensure that you minimize the transfers to and from the GPU and speed up the operations themselves. If you are using third-party GPU solutions for pre- and postprocessing, can you ensure that the data remains on the GPU for as long as possible?
In many cases, there may be solutions to preprocessing and format conversion that can be performed by the model itself using native operators. Conversion to FP16, scaling, and offsetting can be performed in most cases by adding those operators to the beginning and end of the model.
However you do your preprocessing, at some point, you will of course have to transfer your input data to the GPU so that the model can consume it. This raises another important consideration.
When your input data is large, you may have to perform inference in tiles, if the model allows it. This means that you load a batch of one or more tiles and run inference before loading the next batch.
Loading data and running inference can be done in parallel. You can pipeline this work so that by the time batch N has finished inferencing, batch N+1 has finished loading and is ready to be run.
If you are using NVIDIA CUDA or NVIDIA TensorRT, use CUDA streams to facilitate this.
If you are using a DML-based inference solution, use DirectX queues in parallel to keep things moving.
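The overlap of loading and inference can be sketched in a framework-neutral way with a producer thread and a bounded queue. Here, load_batch and run_inference are hypothetical stand-ins for the real upload and inference calls. With CUDA or DirectX, the same pattern would map to separate streams or queues rather than Python threads.

```python
import queue
import threading


def load_batch(index):
    """Hypothetical loader: stands in for a read plus a host-to-device copy."""
    return [index * 10 + i for i in range(4)]


def run_inference(batch):
    """Hypothetical model: stands in for the GPU inference call."""
    return sum(batch)


def pipelined_inference(num_batches, prefetch=1):
    """Load batch N+1 in the background while batch N is being processed."""
    ready = queue.Queue(maxsize=prefetch)

    def producer():
        for n in range(num_batches):
            ready.put(load_batch(n))  # blocks while the prefetch slots are full
        ready.put(None)  # sentinel: no more batches

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        batch = ready.get()
        if batch is None:
            break
        results.append(run_inference(batch))
    worker.join()
    return results
```

The bounded queue limits how far the loader can run ahead, which caps the memory held by batches that are waiting for inference.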
Tiling operations such as this are highly parallelizable and a good candidate for performing on the GPU itself. In cases where it is intractable to deal with an entire image in GPU memory, you can split the image up into sections that can be tiled while the next section is streamed onto the GPU.
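A minimal sketch of the tiling step, assuming non-overlapping square tiles with edge tiles clamped to the image bounds. The function names are illustrative, and real models often need overlapping tiles to avoid seams, which this sketch does not handle.

```python
def tile_rects(width, height, tile):
    """Return (x, y, w, h) rectangles that cover the image without overlap."""
    rects = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            rects.append((x, y, min(tile, width - x), min(tile, height - y)))
    return rects


def batched(items, batch_size):
    """Group tiles into batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


rects = tile_rects(width=5, height=3, tile=2)
tile_batches = batched(rects, batch_size=4)
```

Each entry in tile_batches is one unit of work that can be loaded while the previous one is still being inferenced.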
Pass the input data through the model
To get the best possible performance out of the model by the time that you run the inference itself, make sure that all the following statements are true:
The input data is provided in the fastest device-local memory.
You are making use of the features that NVIDIA hardware provides, such as Tensor Cores.
The GPU is fully saturated, by which I mean that the GPU is given enough work to keep it busy.
Using the right memory
There are several physical heaps that most GPUs can access. The programmable heaps are usually one of the following:
Host-visible: Lives in system memory and is read over the PCI bus on a PCI system. You can write to this memory, but it may not be the fastest for GPU access.
Device-local: Lives in device (GPU) memory. This memory is fast, but you can’t write to it directly.
The general workflow to get the fastest memory access is to write your data to host-visible memory. Then, issue a GPU command to copy the data from host-visible to device-local memory.
If you are using a CUDA-based platform such as TensorRT or cuDNN, then this is relatively easy to manage as the driver does this for you. However, one thing you can do to speed things up is to use pinned memory on the host. That is, when allocating host memory, use cudaHostAlloc (or cudaMallocHost) rather than malloc. This enables the GPU DMA to directly dispatch a memory transfer without having to involve a separate CPU transfer into the DMA memory pool, resulting in lower latencies.
If you are using a DirectML-based approach, then you must manage this transfer to fast memory yourself. It is worth the effort, as it gives you full control over exactly when your data is transferred, as well as the opportunity to perform your transfers in parallel with other work.
Saturating the GPU
One commonly overlooked bottleneck when doing any GPU-related work is not giving the GPU enough work to do. When this happens, you may find that there is not enough work to keep all the streaming multiprocessors (SMs) on the GPU busy.
In such cases, strategies such as increasing the spatial dimensions or batch size can help significantly. You may find that a batch size of eight runs at the same speed as a batch size of one.
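To make the batch-size effect concrete, the following sketch computes throughput from per-call latency and picks the batch size that maximizes it. The latency values are made-up numbers for illustration only, not measurements. Replace them with timings from your own profiling.

```python
def throughput(batch_size, latency_ms):
    """Samples processed per second for one inference call."""
    return batch_size * 1000.0 / latency_ms


def best_batch_size(latencies_ms):
    """Return the batch size with the highest throughput.

    On ties, the first batch size in the mapping wins, so list smaller
    batch sizes first if you prefer lower latency.
    """
    return max(latencies_ms, key=lambda b: throughput(b, latencies_ms[b]))


# Hypothetical latencies in milliseconds, keyed by batch size.
example_latencies = {1: 4.0, 8: 4.0, 16: 8.0, 32: 20.0}
chosen = best_batch_size(example_latencies)
```

In this made-up data, batch sizes 1 and 8 take the same time per call, so the larger batch yields eight times the throughput for no added latency.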
Just as models can vary in size and complexity, so do GPUs. What is an optimum batch size for one GPU may not be optimal for another. Profiling using NVIDIA Nsight Systems can help you identify cases where utilization is low on a given system and help you to design your inferencing strategy accordingly.
Another strategy to keep the GPU busy is to do other compute or even AI work in parallel using multiple CUDA streams or DirectX command queues.
Every case is unique, but CUDA, DirectML, and DirectX all provide you with the means to keep the GPU as busy as possible for a given problem.
Do something with the output data
When inference is complete and you have your output, you can apply similar principles as you did for the input data. That is, you can post-process the data in a similar way to the input data, either by adding nodes to your graph or by employing a custom compute step.
If your data must be read back to host memory, this can also be done in parallel with the next inference batch. If your data must go directly to display, then you should avoid any unnecessary round trip to the CPU by making use of the appropriate interop capabilities of the platforms involved (for example, CUDA to OpenGL).
Conclusion
Remember that every case is different and what works well for one particular use case may not work for another.
This post is the second in a series about optimizing end-to-end AI for workstations. For more information, see part 1, End-to-End AI for Workstation: An…
In this post, I discuss how to use ONNX to transition your AI models from research to production while avoiding common mistakes. Considering that PyTorch has become the most popular machine learning framework, all my examples use it but I also supply references to TensorFlow tutorials.
Interoperability with ONNX
ONNX (Open Neural Network Exchange) is an open standard for describing deep learning models designed to facilitate framework compatibility.
Consider the following scenario: you can train a neural network in PyTorch, then run it through the TensorRT optimizing compiler before deploying it to production. This is just one of many interoperable deep-learning tool combinations, which include visualizations, performance profilers, and optimizers.
Researchers and DevOps no longer have to make do with a single toolchain that is unoptimized for modeling and deployment performance.
To do this, ONNX defines a standard set of operators as well as a standard file format based on the Protocol Buffers serialization format. The model is described as a directed graph with edges indicating data flow between the various node inputs and outputs, and nodes expressing an operator and its parameters.
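The graph description can be pictured with a toy in-memory structure. This is a simplified sketch for intuition only, not the ONNX protobuf schema or the onnx Python API, and the tensor names are invented. The operator names (Conv, BatchNormalization, Relu) and the kernel_shape attribute do exist in the ONNX operator set.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One operator with its parameters and named input/output tensors."""
    op_type: str
    inputs: list
    outputs: list
    attrs: dict = field(default_factory=dict)


# Edges are implied by tensor names: a node consumes what another produces.
nodes = [
    Node("Conv", ["input", "conv.weight"], ["conv_out"], {"kernel_shape": [3, 3]}),
    Node("BatchNormalization", ["conv_out", "bn.scale", "bn.bias"], ["bn_out"]),
    Node("Relu", ["bn_out"], ["output"]),
]


def consumers(graph_nodes, tensor_name):
    """Return the operator types that read the given tensor."""
    return [n.op_type for n in graph_nodes if tensor_name in n.inputs]


def producer(graph_nodes, tensor_name):
    """Return the operator type that writes the given tensor, if any."""
    for n in graph_nodes:
        if tensor_name in n.outputs:
            return n.op_type
    return None
```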
Exporting a model
I defined a simple model consisting of two Convolution-BatchNorm-ReLU blocks for the following cases.
You can use the PyTorch built-in exporter to export this model to ONNX by creating a model instance and calling torch.onnx.export. You must also supply a dummy input with the appropriate input dimensions and data type, as well as symbolic names for the given inputs and outputs.
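A minimal sketch of such an export might look like the following. The layer sizes, input resolution, file name, and the names "input", "output", and "batch" are assumptions for illustration and are not taken from the original model.

```python
import torch
import torch.nn as nn


class Model(nn.Module):
    """Two Convolution-BatchNorm-ReLU blocks (layer sizes are assumed)."""

    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.blocks(x)


def export_model(path="model.onnx"):
    """Export the model to ONNX with a dynamic batch dimension."""
    model = Model().eval()
    # The dummy input fixes the data type and the non-dynamic dimensions.
    dummy_input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(
        model,
        (dummy_input,),
        path,
        input_names=["input"],
        output_names=["output"],
        # Axis 0 (the batch dimension) may vary at runtime.
        dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    )
```

Calling export_model() writes the ONNX file to the given path. The model is switched to eval mode first so that BatchNorm uses its running statistics during export.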
In the code example, I declared index 0 of both the inputs and outputs as dynamic so that the model can run with varying batch sizes at runtime.
Internally, PyTorch calls torch.jit.trace, which executes the models using the given arguments and records all operations during that execution as a directed graph.
Tracing unrolls loops and if statements, producing a static graph identical to the traced run. No data-dependent control flow is captured. This export type is adequate for many use cases, but keep these limitations in mind.
If dynamic behavior is required, you can use scripting. In that case, the model must first be converted to a ScriptModule object before being exported to ONNX, as shown in the following example.
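The difference between tracing and scripting can be seen on a toy function with a data-dependent branch. This sketch uses a hypothetical function rather than the model from this post, and it stops before the ONNX export step.

```python
import warnings

import torch


def branchy(x):
    # Data-dependent branch: the path taken depends on the tensor values.
    if x.sum() > 0:
        return x * 2
    return x + 10


positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing warns that converting a tensor to a Python bool may be incorrect.
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(branchy, positive)

scripted = torch.jit.script(branchy)

# The traced function recorded only the branch taken for the positive input,
# so it still multiplies by 2 when given a negative input.
traced_result = traced(negative)
# The scripted function keeps the if statement and takes the other branch.
scripted_result = scripted(negative)
```

The traced version silently returns the wrong result for inputs that should take the other branch, which is the limitation to keep in mind when exporting by tracing.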
Converting a model to a ScriptModule object is not always trivial and usually necessitates some code changes. For more information, see Avoiding Pitfalls and TorchScript.
Because there are no data dependencies in the forward call, you can convert the model to a scriptable model without making any more changes in the code.
When the model has been exported, you can visualize it using Netron. The default view provides a graph of the model and a properties panel (Figure 2). If you select the input or output, the properties panel displays generic information, such as name, OpSet, and dimensions.
Similarly, selecting a node in the graph reveals the node’s properties. This is an excellent approach to check whether your model was exported correctly and also to debug and analyze problems later on.
Custom operator
ONNX currently defines about 150 operations. They range in complexity from arithmetic addition to a complete long short-term memory (LSTM) implementation. Although this list grows with each new release, you may encounter times when an operator from your research model is not included.
In such a scenario, you can define a torch.autograd.Function subclass that includes the custom functionality in its forward function and a symbolic definition in its symbolic function. In this case, the forward function implements a no-operation by returning its input.
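A minimal sketch of such a function follows. The class name FooOp matches the node type used later in this post, while the domain name "custom_domain" and the wrapper module ModelWithFoo are assumptions made for this example.

```python
import torch


class FooOp(torch.autograd.Function):
    """Custom operator that does nothing in PyTorch but exports as one node."""

    @staticmethod
    def forward(ctx, x):
        # No-operation: return the input unchanged.
        return x

    @staticmethod
    def symbolic(g, x):
        # Emit a single FooOp node in the exported ONNX graph.
        # "custom_domain" is a hypothetical operator domain.
        return g.op("custom_domain::FooOp", x)


class ModelWithFoo(torch.nn.Module):
    """Hypothetical wrapper that routes its input through the custom operator."""

    def forward(self, x):
        return FooOp.apply(x)
```

The symbolic function is only consulted during ONNX export. Regular PyTorch execution uses forward.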
This example demonstrates how to define a symbolic node for exporting your model to ONNX. Although the forward function provides the node’s functionality in PyTorch, that functionality must also be implemented for and provided to the runtime used to infer the ONNX model. This is specific to the execution provider and is addressed later in this post.
Modifying ONNX models
You may want to make changes to your ONNX model without having to export it again. Changes can range from changing names to eliminating entire nodes. Modifying the model directly is difficult because all the information is encoded as protocol buffers. Fortunately, you can simply alter your models using GraphSurgeon.
The following code example shows how to remove the fake FooOp node from the exported model. There are numerous other ways you can use GraphSurgeon to modify and debug the model that I can’t cover here. For more information, see the GitHub repo.
import onnx_graphsurgeon as gs
import onnx
graph = gs.import_onnx(onnx.load("model_foo.onnx"))
fake_node = [node for node in graph.nodes if node.op == "FooOp"][0]
# Get the input node of the fake node
# For example, node.i() is equivalent to node.inputs[0].inputs[0]
inp_node = fake_node.i()
# Reconnect the input node to the output tensors of the fake node, so that the first identity
# node in the example graph now skips over the fake node.
inp_node.outputs = fake_node.outputs
fake_node.outputs.clear()
# Remove the fake node from the graph completely
graph.cleanup()
onnx.save(gs.export_onnx(graph), "removed.onnx")
To remove a node, you must first load the model with the GraphSurgeon API. Next, search the graph for the node to remove by matching the FooOp op type. Replace the output tensors of its input node with its own outputs and then clear its own outputs. The node is then disconnected from the rest of the graph, and the cleanup call removes it.
Figure 4 shows the resulting graph.
Summary
This post walked through running a model with ONNX Runtime, model optimizations, and architecture considerations. If you have any further questions about these topics, reach out on Developer Forums or join NVIDIA Developer Discord.