Categories
Misc

Deploying Diverse AI Model Categories from Public Model Zoo Using NVIDIA Triton Inference Server

Nowadays, a huge number of implementations of state-of-the-art (SOTA) models and modeling solutions are present for different frameworks like TensorFlow, ONNX, PyTorch, Keras, MXNet, and so on. These models can be used for out-of-the-box inference if you are interested in categories already in the datasets, or they can be embedded to custom business scenarios with minor fine-tuning.

This post gives you an overview of prevalent DL model categories and walks you through end-to-end examples of deploying these models using NVIDIA Triton Inference Server. The client applications can be used as is or modified to fit your use case. I walk you through the deployment of public image classification, object detection, and image segmentation models using Triton Inference Server. The steps outlined in this post can also be applied to other open-source models with minor changes.

Deep learning inference challenges

Recent years have seen remarkable advancements in deep learning (DL). By resolving numerous complex and intricate problems that have hampered the AI community for years, it has completely revolutionized the future of AI. It is currently being used with rapidly growing applications in different industries, ranging from healthcare and aerospace engineering to autonomous driving and user authentications. 

Deep learning, however, has various challenges when it comes to inference:

  • Support of multiple frameworks
  • Ease of use
  • Cost of deployment

Support of multiple frameworks

The first key challenge is around supporting multiple different types of model frameworks. 

Developers and data scientists today are using various frameworks for their production models. For instance, there can be difficulties modifying the system for testing and deployment if a machine learning project is written in Keras, but a team member has more experience with TensorFlow. 

Also, converting the models can be expensive and complicated, especially if new data is required for their training. Without conversion, teams need a server application that can support each of those model frameworks.

Ease of use

The next key challenge is to have a serving application that can support different inference queries and use cases. 

In some applications, you’re focused on real-time online inferencing where the priority is to minimize latency as much as possible. On the other hand, there might be use cases that require you to do offline batch inferencing where you’re focused on maximizing throughput. 

It’s essential to have solutions that can support each type of query and use case and optimize for them.

Cost of deployment

The next challenge is managing the cost of deployment and lowering the cost of inference. 

A key part of this is having one serving application that can run on a mixed infrastructure. Otherwise, you might create one serving solution for running on CPU, another for GPU, and yet others for deploying in the cloud, in the data center, and at the edge. That's going to skyrocket costs and lead to a nonscalable implementation.

Triton Inference Server

Triton Inference Server is an open-source inference serving application that allows inference on both CPU and GPU in different environments. It supports various backends, including TensorRT, PyTorch, TensorFlow, ONNX, and Python. To maximize hardware utilization, NVIDIA Triton allows concurrent execution of different models. Furthermore, dynamic batching groups inference queries together to maximize throughput for different types of queries. For more information, see NVIDIA Triton Inference Server.
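To make the idea behind dynamic batching concrete, here is a minimal Python sketch. It is a toy model and not Triton's actual scheduler: the function name `group_into_batches` and the request strings are hypothetical, and a real scheduler also weighs queue delay and preferred batch sizes.

```python
from typing import List, Sequence


def group_into_batches(requests: Sequence[str], max_batch_size: int) -> List[List[str]]:
    """Group queued requests into batches of at most max_batch_size (toy model)."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    return [list(requests[i:i + max_batch_size])
            for i in range(0, len(requests), max_batch_size)]


# Five hypothetical requests waiting in the queue.
queued = ["req-0", "req-1", "req-2", "req-3", "req-4"]
batches = group_into_batches(queued, max_batch_size=2)
print(batches)
```

Instead of running the model five times, the server in this sketch would run it three times, which is where the throughput gain comes from.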

The figure illustrates the Triton architecture. Triton enables high-performance inference and supports multiple frameworks, so teams can deploy models on any GPU- or CPU-based infrastructure.
Figure 2. Triton Inference Server architecture

Quickstart with NVIDIA Triton

The easiest way to install and run NVIDIA Triton is to use the pre-built Docker image available from NGC.

Server: Pull the Docker image

Pull the image using the following command, where <xx.yy> is the version of NVIDIA Triton to pull:

$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3

NVIDIA Triton is optimized to provide the best inferencing performance by using GPUs, but it can also work on CPU-only systems. In both cases, you can use the same Docker image.

Use the following command to run NVIDIA Triton with the example model repository that you just created:

$ docker run --gpus=1 --rm --net=host -v /path/to/the/repo/server/models:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models --exit-on-error=false --repository-poll-secs=10 --model-control-mode="poll"
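After the server starts, it can be useful to confirm that it is ready before running a client. The following Python sketch uses only the standard library and assumes the server is reachable on localhost at port 8000, the default HTTP port, and that it exposes the readiness endpoints of the HTTP/REST inference protocol. The helper names `health_url`, `model_ready_url`, and `is_ready` are hypothetical.

```python
import urllib.error
import urllib.request


def health_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the server readiness URL (assumes the default HTTP port)."""
    return f"http://{host}:{port}/v2/health/ready"


def model_ready_url(model: str, host: str = "localhost", port: int = 8000) -> str:
    """Build the readiness URL for a single model."""
    return f"http://{host}:{port}/v2/models/{model}/ready"


def is_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False
```

Calling `is_ready(health_url())` returns `False` while the server is still loading or unreachable, so a client script can poll it before sending inference requests.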

Client: Get the client libraries

Use docker pull to get the client libraries.

$ docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

In this command, <xx.yy> is the version to pull.

To start the client, run the following command:

$ docker run -it --rm --net=host -v /path/to/the/repo/client/:/python_examples nvcr.io/nvidia/tritonserver:<xx.yy>-py3-sdk

End-to-end model deployment

The NVIDIA Triton project provides several client libraries in C++ and Python to simplify communication. These APIs make communicating with NVIDIA Triton easy. With the help of these APIs, the client applications process the input and communicate with NVIDIA Triton to perform inferencing.

The figure shows a generic workflow of a client application interaction with the Triton Inference Server. Inputs are read and preprocessed, serialized into a message, and sent to the Triton Server. The inference is performed on the Triton Inference Server, and the inferred data is sent back to the client, followed by post-processing. Depending upon the application, the output can be stored, displayed, or passed to the network.
Figure 3. Workflow of client application interaction with Triton Inference Server

In general, the interaction of client applications with NVIDIA Triton can be summarized as follows:

  • Input
  • Preprocess
  • Inference
  • Postprocess
  • Output

Input: Depending upon the application type, one or more inputs are read to be inferred by the neural network.

Preprocess: Preprocessing data is a common first step in the deep learning workflow to prepare raw data in a format the network can accept. Examples include image resizing, normalization, and noise removal from input data.

Inference: For the inference part, a client initially serializes the inference request into a message and sends it to Triton Inference Server. The message travels over the network from the client to the server and gets deserialized. The request is placed on the queue. The request is removed from the queue and computed. The completed request is serialized in a message and sent back to the client. The message travels over the network from the server to the client. The message arrives at the client and is deserialized.

Postprocess: When the message arrives at the client application, it is processed as a completed inference request. Depending upon the network type and application use case, post-processing is applied. For example, in object detection, postprocessing involves suppressing the superfluous boxes, aiding in selecting the best possible boxes, and mapping them back to the input image.

Output: After inference and processing, depending upon the application, the output can be stored, displayed, or passed to the network.
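The five stages above can be sketched as a small Python pipeline. This is an illustration under stated assumptions, not the code of the example clients: `fake_infer` is a hypothetical stand-in for the network round trip to the server, and the pixel values and labels are made up.

```python
from typing import Callable, List


def preprocess(pixels: List[int]) -> List[float]:
    """Scale 8-bit pixel values to the range [0, 1]."""
    return [p / 255.0 for p in pixels]


def fake_infer(tensor: List[float]) -> List[float]:
    """Hypothetical stand-in for the server round trip: returns two made-up scores."""
    mean = sum(tensor) / len(tensor)
    return [1.0 - mean, mean]


def postprocess(scores: List[float], labels: List[str]) -> str:
    """Pick the label with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


def run_pipeline(pixels: List[int],
                 infer: Callable[[List[float]], List[float]],
                 labels: List[str]) -> str:
    tensor = preprocess(pixels)          # Preprocess
    scores = infer(tensor)               # Inference (a remote call in a real client)
    return postprocess(scores, labels)   # Postprocess


result = run_pipeline([0, 51, 102], fake_infer, ["dark", "bright"])  # Input
print(result)                                                      # Output
```

Passing the inference step in as a function keeps preprocessing and postprocessing testable without a running server.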

Image classification

Image classification is the task of comprehending an entire image and assigning it a specific label. Typically in image classification, a single object is present in the image, which is analyzed and comprehended. For more information, see image classification.

Server: Download the model

Download the ResNet-18 image classification model from the ONNX model zoo:

$ cd /path/to/the/repo/server/models/classification/1
$ wget https://github.com/onnx/models/raw/main/vision/classification/resnet/model/resnet18-v1-7.onnx && mv resnet18-v1-7.onnx model.onnx

The following code example shows the model configuration file:

name: "classification"
platform: "onnxruntime_onnx"
max_batch_size : 1
input [
  {
    name: "data"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
    reshape { shape: [ 3, 224, 224 ] }
  }
]
output [
  {
    name: "resnetv15_dense0_fwd"
    data_type: TYPE_FP32
    dims: [  1000 ]
    reshape { shape: [1000] }
    label_filename: "labels.txt"
  }
]

Name, platform, and backend

The name property is optional. If the name of the model is not specified in the configuration, it is assumed to be the same as the model repository directory containing the model. The model is executed by the NVIDIA Triton backend, which is simply a wrapper around the DL frameworks like TensorFlow, PyTorch, TensorRT, and so on. For more information, see backend.
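Because the model name defaults to the directory name, the layout of the model repository matters. The following Python sketch creates the directory structure used in this post in a temporary folder, assuming the usual layout in which each model directory holds a config.pbtxt file and a numbered version subdirectory for the model file. The helper `create_model_dir` is hypothetical, and the config files it creates are empty placeholders.

```python
import tempfile
from pathlib import Path


def create_model_dir(repo: Path, name: str, version: int = 1) -> Path:
    """Create <repo>/<name>/<version>/ and an empty <repo>/<name>/config.pbtxt."""
    version_dir = repo / name / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (repo / name / "config.pbtxt").touch()
    return version_dir


repo = Path(tempfile.mkdtemp())
for model_name in ("classification", "detection", "segmentation"):
    create_model_dir(repo, model_name)

# List everything that was created, relative to the repository root.
layout = sorted(p.relative_to(repo).as_posix() for p in repo.rglob("*"))
print("\n".join(layout))
```

The downloaded model file (model.onnx or model.graphdef in this post) then goes into the version directory, such as classification/1.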

Maximum batch size

The maximum batch size that a model can support is indicated by the max_batch_size property. A value of zero indicates that batching is not supported. For more information, see batch size.
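The effect of max_batch_size on tensor shapes can be sketched in a few lines of Python. The sketch assumes the documented convention that, when max_batch_size is greater than zero, the server adds a variable batch dimension in front of the configured dims; the function name `full_input_shape` is hypothetical.

```python
from typing import List


def full_input_shape(max_batch_size: int, dims: List[int]) -> List[int]:
    """Return the shape a request tensor is expected to have (-1 means variable)."""
    if max_batch_size > 0:
        # Batching enabled: a variable batch dimension is prepended.
        return [-1] + list(dims)
    # Batching disabled: dims describe the full tensor shape.
    return list(dims)


print(full_input_shape(1, [3, 224, 224]))  # classification model in this post
print(full_input_shape(0, [3, -1, -1]))    # segmentation model in this post
```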

Inputs and outputs

For each model, the expected input, output, and data types must be specified in the model configuration file. Based on the input and output tensors, different data types are allowed. For more information, see Datatypes.

The image classification model accepts a single input, and after the inference returns a single output.

In a separate console, launch the image_client example from the NGC NVIDIA Triton container.

Client: Run the image classification client

To run the image classification client, use the following command:

$ python3 /python_examples/examples/classification/classification.py -m classification -s INCEPTION /python_examples/examples/images/tabby.jpg

First, inputs are preprocessed according to the model. For this model, inception scaling is applied, which scales the input as follows:

if scaling == 'INCEPTION':
    scaled = (typed / 127.5) - 1
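To see what this scaling does to individual values, here is a minimal pure-Python sketch of the same formula. The function name `inception_scale` is hypothetical; the client applies the expression to a whole NumPy array at once.

```python
def inception_scale(pixel: float) -> float:
    """Map an 8-bit pixel value in [0, 255] to the range [-1.0, 1.0]."""
    return (pixel / 127.5) - 1


for value in (0, 128, 255):
    print(value, "->", inception_scale(value))
```

Black (0) maps to -1.0, white (255) maps to 1.0, and mid-gray lands close to zero.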

The inference request is sent to NVIDIA Triton, and the responses are appended:

responses.append(
    triton_client.infer(FLAGS.model_name,
                        inputs,
                        request_id=str(sent_count),
                        model_version=FLAGS.model_version,
                        outputs=outputs))

Finally, the responses obtained from the server are post-processed.

postprocess(response, output_name, FLAGS.batch_size, supports_batching)

For the classification case, the model returns a single classification output that comprehends the input image. The class is decoded and printed in the console.

for results in output_array:
    if not supports_batching:
        results = [results]
    for result in results:
        if output_array.dtype.type == np.object_:
            cls = "".join(chr(x) for x in result).split(':')
        else:
            cls = result.split(':')
        print("    {} ({}) = {}".format(cls[0], cls[1], cls[2]))
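The decoding step relies on each result being a colon-separated string with three fields, which the print statement labels as `cls[0]`, `cls[1]`, and `cls[2]`. The following sketch parses such a string into named fields, assuming the fields are the score, the class index, and the label. The `Classification` type, the `parse_classification` helper, and the sample value are hypothetical.

```python
from typing import NamedTuple, Union


class Classification(NamedTuple):
    score: float
    class_index: int
    label: str


def parse_classification(entry: Union[str, bytes]) -> Classification:
    """Parse a 'score:index:label' entry into typed fields."""
    if isinstance(entry, bytes):
        entry = entry.decode("utf-8")
    # Split at most twice so that a label containing ':' stays intact.
    score, class_index, label = entry.split(":", 2)
    return Classification(float(score), int(class_index), label)


# Made-up sample value for illustration only.
parsed = parse_classification(b"0.75:281:TABBY")
print(parsed.label, parsed.score)
```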

For more information, see classification.py.

Figure 4 shows the sample output.

The figure shows the sample output of an image classification model inferred over the Triton Inference Server. A Tabby image is given as an input to the network; the network correctly classifies the image.
Figure 4. Classification label assigned to the image by the classification network

Object detection

The process of finding instances of objects of a particular class within an image is known as object detection. The problem of object detection combines classification with localization. It also examines more plausible scenarios in which an image might contain several objects. For more information, see object detection.

Server: Download the model

Download the faster_rcnn_inception_v2_coco object detection model:

$ cd /path/to/the/repo/server/models/detection/1
$ wget http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_v2_coco_2018_01_28.tar.gz && tar xvf faster_rcnn_inception_v2_coco_2018_01_28.tar.gz && cp faster_rcnn_inception_v2_coco_2018_01_28/frozen_inference_graph.pb ./model.graphdef &&  rm -r faster_rcnn_inception_v2_coco_2018_01_28 faster_rcnn_inception_v2_coco_2018_01_28.tar.gz

The following code example shows the model configuration file for the object detection model:

name: "detection"
platform: "tensorflow_graphdef"
max_batch_size: 1
input [
  {
    name: "image_tensor"
    data_type: TYPE_UINT8
    format: FORMAT_NHWC
    dims: [ 600, 1024, 3 ]
  }
]
output [
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4]
    reshape { shape: [100,4] }
  },
  {
    name: "detection_classes"
    data_type: TYPE_FP32
    dims: [ 100 ]
    reshape { shape: [ 1, 100 ] }  
  },
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]

  },
  {
    name: "num_detections"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape { shape: [] }
  }
]

The detection model accepts a single image as an input and returns four different outputs.

Client: Run the object detection client 

To run the object detection client, use the following command:

$ python3 /python_examples/examples/detection/detection.py -m detection /python_examples/examples/images/car.jpg

The object detection model returns four different outputs, which are decoded in the post-processing step:

detection_boxes = results.as_numpy(output_name[0].name)
detection_classes = results.as_numpy(output_name[1].name)
detection_scores = results.as_numpy(output_name[2].name)
num_detections = results.as_numpy(output_name[3].name)

At the end, the bounding boxes are drawn on the input as follows:

# Boxes are normalized [y_min, x_min, y_max, x_max]; this assumes that w and h
# are the image width and height in pixels.
for idx, detection_box in enumerate(detection_boxes[0, 0:int(num_detections), :]):
    y_min = int(detection_box[0] * h)
    x_min = int(detection_box[1] * w)
    y_max = int(detection_box[2] * h)
    x_max = int(detection_box[3] * w)
    start_point = (x_min, y_min)
    end_point = (x_max, y_max)
    shape = (start_point, end_point)
    draw.rectangle(shape, outline="red")
    draw.text((int((x_min + x_max) / 2), y_min), "class-" + str(int(detection_classes[0, idx])), fill=(0, 0, 0))
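The coordinate mapping in this step can be isolated into a small function. The sketch below assumes that each box is normalized to the range [0, 1] and ordered as [y_min, x_min, y_max, x_max], which matches the indexing in the snippet above. The helper name `to_pixel_box` and the sample box are hypothetical.

```python
from typing import Sequence, Tuple


def to_pixel_box(box: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized [y_min, x_min, y_max, x_max] box to pixel
    coordinates in (x_min, y_min, x_max, y_max) order."""
    y_min, x_min, y_max, x_max = box
    # x coordinates scale with the image width, y coordinates with the height.
    return (int(x_min * width), int(y_min * height),
            int(x_max * width), int(y_max * height))


pixel_box = to_pixel_box([0.25, 0.5, 0.75, 1.0], width=1024, height=600)
print(pixel_box)
```

Returning the corners in (x, y) order matches what drawing functions that take a start point and an end point typically expect.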

For more information, see detection.py.

Figure 5 shows the sample output.

The figure shows the sample output of an object detection model inferred over the Triton Inference Server. A car image is given as an input to the network; the network correctly localizes and classifies the car.
Figure 5. Using object detection to identify and locate vehicles (source: MathWorks.com)

Image segmentation

The process of clustering parts of an image that correspond to the same object class is known as image segmentation. Image segmentation entails splitting images or video frames into multiple objects or segments. For more information, see image segmentation.

Server: Download the model

To download the model, use the following commands:

$ cd /path/to/the/repo/server/models/segmentation/1
$ wget https://github.com/onnx/models/raw/main/vision/object_detection_segmentation/fcn/model/fcn-resnet50-11.onnx &&  mv fcn-resnet50-11.onnx model.onnx

The following code example shows the model configuration file for the image segmentation model:

name: "segmentation"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [  3, -1, -1 ]
    reshape { shape: [ 1, 3, -1, -1 ] }
  }
]
output [
    {
    name: "out"
    data_type: TYPE_FP32
    dims: [  -1, 21, -1, -1 ]
  }
]

Client: Run the image segmentation client

To run the image segmentation client, run the following commands:

$ pip install opencv-python
$ python3 /python_examples/examples/segmentation/segmentation.py -m segmentation -s INCEPTION /python_examples/examples/images/people.jpg

The segmentation model accepts a single input and returns a single output. After inferencing, the model returns the output based on which segmented and blended images are generated.

# generate segmented image
result_img = colorize(raw_labels)
# generate blended image
blended_img = cv2.addWeighted(image[:, :, ::-1], 0.5, result_img, 0.5, 0)
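The two steps can be illustrated without OpenCV. The following pure-Python sketch assumes that the raw labels come from taking, for every pixel, the class with the highest score, and that blending is a 50/50 weighted average of two colors, as in the `cv2.addWeighted` call above. The helper names, the tiny 1x2 score map, and the colors are hypothetical.

```python
from typing import List, Sequence, Tuple

Color = Tuple[int, ...]


def argmax_labels(scores: Sequence[Sequence[Sequence[float]]]) -> List[List[int]]:
    """scores[c][y][x] -> labels[y][x], the index of the highest-scoring class."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(num_classes), key=lambda c: scores[c][y][x])
             for x in range(width)] for y in range(height)]


def blend(a: Color, b: Color, alpha: float = 0.5) -> Color:
    """Weighted average of two colors, channel by channel."""
    return tuple(int(round(alpha * ca + (1 - alpha) * cb)) for ca, cb in zip(a, b))


# Two classes on a 1x2 image: pixel 0 favors class 0, pixel 1 favors class 1.
scores = [[[0.9, 0.2]], [[0.1, 0.8]]]
labels = argmax_labels(scores)

# Hypothetical palette: one color per class.
palette = [(0, 0, 0), (200, 0, 0)]
colored = [[palette[label] for label in row] for row in labels]

print(labels, colored, blend((200, 0, 0), (100, 100, 0)))
```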

For more information, see the segmentation.py file.

Figure 6 shows the sample output.

The figure shows the sample output of a semantic segmentation model inferred over the Triton Inference Server. The network correctly detects and segments the objects in the image.
Figure 6. Annotated image for semantic image segmentation. Source: Visio.ai

Resources

Try Triton Inference Server today on GPU, CPU, or both. The NVIDIA Triton Inference Server container can be downloaded from NGC, and its source code is available on the /triton-inference-server GitHub repo.

Explainer: What Is a Smart City?

Examples of what a smart city is can be found in metro IoT deployments from Singapore to Seat Pleasant, Maryland.

New Course: GPU Acceleration with the C++ Standard Library

Learn how to write simple, portable, parallel-first GPU-accelerated applications using only C++ standard language features in this self-paced course from the NVIDIA Deep Learning Institute

Top 5 Edge AI Trends to Watch in 2023

With the state of the world under constant flux in 2022, some technology trends were put on hold while others were accelerated. Supply chain challenges, labor shortages and economic uncertainty had companies reevaluating their budgets for new technology. For many organizations, AI is viewed as the solution to a lot of the uncertainty bringing improved…

The post Top 5 Edge AI Trends to Watch in 2023 appeared first on NVIDIA Blog.

Speech AI Technology Enables Natural Interactions with Service Robots

From taking your order and serving you food in a restaurant to playing poker with you, service robots are becoming increasingly prevalent. Globally, you can find these service robots at hospitals, airports, and retail stores.

According to Gartner, by 2030, 80% of humans will engage with smart robots daily, due to smart robot advancements in intelligence, social interactions, and human augmentation capabilities, up from less than 10% today.

An accurate speech AI or voice AI interface that can quickly understand humans and mimic human speech is critical to a service robot’s ease of use. Developers are integrating automatic speech recognition (ASR) and text-to-speech (TTS) with service robots to enable essential skills, such as understanding and responding to human questions in natural language. These voice-based technologies make up speech AI.

This post explains how ASR and TTS can be used in service robot applications. I provide a walkthrough on how to customize them using speech AI software tools for industry-specific jargon, languages, and dialects, depending on where the robot is deployed.

Why add speech AI to service robot applications?

Service robots are like digital humans in the metaverse except that they operate in the physical world. These service robots can help support warehouse workers, perform dangerous tasks while following human instructions, or even assist in activities that require contactless services. For instance, a service robot in the hospitality industry can greet guests, carry bags, and take orders.

For all these service robots to understand and respond in a human-like way, developers must incorporate highly accurate speech AI that runs in real time.

Examples of speech AI-enabled service robot applications

Today, service robots are used in a wide range of industries.

Restaurants

Online food delivery services are growing in popularity worldwide. To handle the increased customer demand without compromising quality, service robots can assist staff with tasks such as order taking or delivering food to in-person customers.

Hospitals

In hospitals, service robots can support and empower patient care teams by handling patient-related tasks. For example, a speech AI-enabled service robot can empathetically converse with patients to provide company or help improve their mental health state.

Ambient assisted living

In ambient assisted living environments, technology is primarily used to support the independence and safety of elderly or vulnerable adults. Service robots can assist with daily activities, such as transporting food trays from one location to another or using a smart robotic pill dispenser to manage medications in a timely manner. With speech AI skills, service robots can also provide emotional support.

Service robot reference architecture

Service robots help businesses improve quality assurance and boost productivity in several ways:

  • Assisting frontline workers with daily repetitive tasks in restaurants or manufacturing environments
  • Helping customers find desired items in retail stores
  • Supporting physicians and nurses with patient healthcare services in hospitals

In these settings, it’s imperative that robots can accurately process and understand what a user is relaying. This is especially true for situations where danger or serious harm is a possibility, such as a hospital. Service robots that can naturally converse with humans also contribute to a positive overall user experience for an application.

Workflow architecture diagram showing how speech inputs map to robot tasks through a dialog manager and back out as text converted to speech.
Figure 1. Service robot design review workflow architecture

Figure 1 shows that service robots use speech recognition to comprehend what users are saying and TTS to respond to users with a synthetic voice. Other components, such as NLP and a dialog manager, help service robots understand context and generate appropriate answers to users' questions.

Also, the modules under robot tasks such as perception, navigation, and mapping help the robot understand its physical surroundings and move in the right direction.

Voice user interfaces to service robots

Voice user interfaces include two main components: automatic speech recognition and text-to-speech. Automatic speech recognition, also known as speech-to-text, is the process of converting raw speech into text. Text-to-speech, also known as speech synthesis, is the process of converting text into human-like speech.

Developing speech AI pipelines has its own challenges. For example, if a service robot is deployed in restaurants, it should be able to understand words like matcha, cappuccino, and ristretto. It should also transcribe accurately in noisy environments, as most people interacting with these applications are in open spaces.

Not only do the robots have to understand what is being said, but they should also be able to say these words correctly. Similarly, each industry has its own terminology that these robots must understand and respond to in real time.

Automatic speech recognition

Diagram showing the models and modules of an end-to-end speech-to-text pipeline (all are listed in the post).
Figure 2. Speech-to-text pipeline

The roles of each model or module in the ASR pipeline are as follows:

  • The feature extractor converts raw audio into spectrograms or mel spectrograms.
  • The acoustic model takes these spectrograms and generates a matrix that has probabilities of characters or words over each time step.
  • The decoder and language model put together these characters/words into a transcript.
  • The punctuation and capitalization model applies things like commas, periods, and question marks in the right places for better readability.
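To illustrate how a decoder can turn the acoustic model's probability matrix into text, here is a minimal Python sketch of greedy CTC-style decoding. This is one simple decoding strategy chosen for illustration, not necessarily the decoder used in any particular pipeline, and it does not use a language model. The function name, the two-letter alphabet, and the probabilities are made up.

```python
from typing import List, Sequence


def greedy_ctc_decode(probs: Sequence[Sequence[float]], alphabet: str, blank: int = 0) -> str:
    """Pick the most likely symbol per time step, collapse repeats, drop blanks.

    Column 0 of each row is assumed to be the blank symbol; column i (i > 0)
    corresponds to alphabet[i - 1].
    """
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]
    chars: List[str] = []
    previous = blank
    for index in best_path:
        # Emit a character only when it is not blank and not a repeat.
        if index != blank and index != previous:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)


# Five time steps over the symbols [blank, 'h', 'i'].
probs = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
transcript = greedy_ctc_decode(probs, alphabet="hi")
print(transcript)
```

The blank symbol is what lets the decoder distinguish a held sound from a genuinely repeated character.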

Text-to-speech

Diagram showing the models and modules of an end-to-end text-to-speech pipeline (all are listed in the post).
Figure 3. Text-to-speech pipeline

The roles of each model or module in the TTS pipeline are as follows:

  • In the text normalization and preprocessing stage, the text is converted into verbalized form. For instance: “at 10:00” -> “at ten o’clock.”
  • The text encoding module converts text into an encoded vector.
  • The pitch predictor predicts how high or low the pitch of certain words should be, while the duration predictor predicts how long it takes to pronounce a character or word.
  • The spectrogram generator uses an encoded vector and other supporting vectors as input to generate a spectrogram.
  • The vocoder model takes spectrograms as input and produces a human-like voice as output.
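The first stage of this pipeline, text normalization, can be sketched with a few lines of Python. The example below only handles full hours written as H:00 and is meant to show the idea, not the rules a production normalizer uses. The names `HOUR_WORDS` and `normalize_times` are hypothetical.

```python
import re

HOUR_WORDS = ["one", "two", "three", "four", "five", "six",
              "seven", "eight", "nine", "ten", "eleven", "twelve"]


def normalize_times(text: str) -> str:
    """Verbalize full-hour times such as '10:00' as 'ten o'clock'."""
    def verbalize(match):
        hour = int(match.group(1))
        if 1 <= hour <= 12:
            return f"{HOUR_WORDS[hour - 1]} o'clock"
        # Leave anything this toy rule does not cover unchanged.
        return match.group(0)

    return re.sub(r"\b(\d{1,2}):00\b", verbalize, text)


print(normalize_times("at 10:00"))
```

Real normalizers cover many more classes, such as dates, currencies, and abbreviations, but each follows the same pattern of detecting a written form and replacing it with its spoken form.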

Speech AI software suite

NVIDIA provides a variety of datasets, tools, and SDKs to help you build end-to-end speech AI pipelines. Customize the pipelines to your industry’s specific vocabulary, language, and dialects and run in milliseconds for natural and engaging interactions.

Datasets

To democratize and diversify speech AI technology, NVIDIA collaborated with Mozilla Common Voice (MCV). MCV is a crowd-sourced project in which volunteers contribute speech data to a public dataset that anyone can use to train voice-enabled technology. You can download various language audio datasets from MCV to develop ASR and TTS models.

NVIDIA also collaborated with Defined.ai, a one-stop shop for training data. You can download audio and speech training data in multiple domains, languages, and accents for use in speech AI models.

Pretrained models

NGC provides several pretrained models trained on a variety of open and proprietary datasets. All models have been optimized and trained on NVIDIA DGX servers for hundreds of thousands of hours.

You can fine-tune these highly accurate, pretrained models on a relevant dataset to improve accuracy even further.

Open-source tools

If you’re looking for open-source tools, NVIDIA offers NeMo, an open-source framework for building and training state-of-the-art AI speech and language models. NeMo is built on top of PyTorch and PyTorch Lightning, making it easy for you to develop and integrate modules that are already familiar.

Speech AI SDK

Use NVIDIA Riva, a free GPU-accelerated speech AI SDK, to build and deploy fully customizable, real-time AI pipelines. Riva offers state-of-the-art, highly accurate, pretrained models through NGC for the following languages:

  • English
  • Spanish
  • Mandarin
  • Hindi
  • Russian
  • Korean
  • German
  • French
  • Portuguese

Japanese, Arabic, and Italian are coming soon.

With NeMo, you can fine-tune these pretrained models on industry-specific jargon, languages, dialects, and accents, and optimize speech AI skills to run in real time.

You can deploy Riva skills in streaming or offline mode in all clouds, on-premises, at the edge, and on embedded devices.

Running Riva speech AI skills on embedded for robotics applications

In this section, I show you how to run out-of-the-box ASR and TTS skills with Riva on embedded devices. For better accuracy and performance, Riva also enables you to customize or fine-tune models on domain-specific datasets.

You can run Riva speech AI skills in both streaming and offline modes. First, set up and run the Riva server on embedded.

Prerequisites

  • Access to NGC.
    • Follow all steps to be able to run ngc commands from a command-line interface (CLI).
  • Access to NVIDIA Jetson Orin, NVIDIA Jetson AGX Xavier, or NVIDIA Jetson Xavier NX.
  • NVIDIA JetPack version 5.0.2 on the Jetson platform.

For more information, see the Support Matrix.

Server setup

Download the scripts from NGC by running the following command:

ngc registry resource download-version nvidia/riva/riva_quickstart_arm64:2.7.0

Initialize the Riva server:

bash riva_init.sh

Start the Riva server:

bash riva_start.sh

For more information about the most recent steps, see the Quick Start Guide.

Running C++ ASR client

For embedded, Riva server comes with sample clients that you can seamlessly use to do inference.

Run the following command for streaming ASR:

riva_streaming_asr_client --audio_file=/opt/riva/wav/en-US_sample.wav

For more information about customizing Riva ASR models and pipelines for your industry-specific jargon, languages, dialects, and accents, see the instructions on the Model Overview in the Riva documentation.

Running C++ TTS client

For Riva TTS client on embedded, run the following command to synthesize audio files:

riva_tts_client --voice_name=English-US.Female-1 \
                --text="Hello, this is a speech synthesizer." \
                --audio_file=/opt/riva/wav/output.wav

For more information about customizing TTS models and pipelines on domain-specific datasets, see Model Overview in the Riva User Guide.

Resources for developing speech AI applications

Speech AI makes it possible for service robots and other interactive applications to comprehend nuanced human language and respond with ease.

It is empowering everything from real people in call centers to service robots in every industry. To understand how speech AI skills were integrated with a robotic dog that can fetch drinks in real life, see Low-code Building Blocks for Speech AI Robotics.

Or, browse speech AI posts to learn about speech AI concepts, speech recognition deployment challenges and tips, or unique ASR applications.

You can also access developer ebooks, such as End-To-End Speech AI pipelines, to learn more about models and modules in speech AI pipelines, and Building Speech AI Applications, to gain insight on how to build and deploy real-time speech AI pipelines for your application.

Deep Learning is Transforming ASR and TTS Algorithms

Speech is one of the primary means to communicate with an AI-powered application. From virtual assistants to digital avatars, voice-based interfaces are changing how we typically interact with smart devices.

Deep learning techniques for speech recognition and speech synthesis are helping improve the user experience—think human-like responses and natural-sounding tones.

If you plan to build and deploy a speech AI-enabled application, this post provides an overview of how automatic speech recognition (ASR) and text-to-speech (TTS) technologies have evolved due to deep learning. I also mention some popular, state-of-the-art ASR and TTS architectures used in today’s modern applications.

Demystifying speech AI

Every day, hundreds of billions of audio minutes are generated, whether you are conversing with digital humans in the metaverse or actual humans in contact centers. Speech AI can assist in automating all these audio minutes.

Speech AI includes technologies like ASR, TTS, and related tasks. Interestingly, these technologies are not new and have existed for the last five decades.

Speech recognition evolution

Today, ASR algorithms developed using deep learning techniques can be customized for domain-specific jargon, languages, accents, and dialects, as well as transcribing in noisy environments.

This level of sophistication differs significantly from the first ASR system, Audrey, which was invented by Bell Labs in 1952. At the time, Audrey could only transcribe numbers and was not developed using deep learning techniques.

Infographic showing various automatic speech recognition milestones and inventions from 1952 to the present-day.
Figure 1. Evolution of automatic speech recognition

ASR pipeline

A standard ASR deep learning pipeline consists of a feature extractor, acoustic model, decoder and language model, and BERT punctuation and capitalization model.

Text-to-speech evolution

TTS (speech synthesis) systems developed using deep learning techniques sound like real humans and can run in real time to have natural and meaningful discussions. On the other hand, traditional systems like Voder, DECtalk commercial, and concatenative TTS sound robotic and are difficult to run in real time.

Deep learning TTS algorithms are flexible enough that you can adjust the speed, pitch, and duration at inference time to generate more expressive TTS voices.

TTS pipeline

A basic TTS pipeline includes the following components: text normalization, text encoding, pitch/duration predictor, spectrogram generator, and vocoder model.

You can learn more about how ASR and TTS have changed over the past few years and about each of the models and modules in ASR and TTS pipelines in the on-demand video, Speech AI Demystified.

Popular ASR and TTS architectures used today

Several state-of-the-art neural network architectures have been created. Some of the most popular ones in use today for ASR are CTC and transducer-based architecture models. For example, you can apply these architecture techniques to models such as CitriNet and Conformer.

For TTS, different types of architectures exist:

  • Autoregressive or non-autoregressive
  • Deterministic or generative
  • Explicit control or non-explicit control

Each of these TTS architectures offers different capabilities. For example, deterministic models can predict the outcome exactly and don’t include randomness. Generative models model the data distribution itself and can capture different variations of the synthetic voice. To build an end-to-end text-to-speech pipeline, you must combine one architecture from each category.

You can get the latest architecture best practices to build an ASR and TTS pipeline for your voice-enabled application in the on-demand video, Speech AI Demystified.

NVIDIA Speech AI SDK

You can develop deep learning-based ASR and TTS algorithms by leveraging a GPU-accelerated speech AI SDK. NVIDIA Riva helps you build and deploy customizable AI pipelines that deliver world-class accuracy in all clouds, on-premises, at the edge, and on embedded devices.  

Riva has state-of-the-art pretrained models on NGC that are trained on multiple open and proprietary datasets. You can use low-code tools to customize these models to fit your industry and use case with optimized speech AI skills that can run in real time, without sacrificing accuracy.

Build your first speech AI application

Are you looking to add an interactive voice experience to applications? The following free ebooks will guide your journey:

If you prefer step-by-step instruction, check out a self-paced online course to get started with highly accurate custom ASR for speech AI.


Simplifying and Accelerating Machine Learning Predictions in Apache Beam with NVIDIA TensorRT

Loading and preprocessing data for running machine learning models at scale often requires seamlessly stitching the data processing framework and inference engine together.

In this post, we walk through the integration of NVIDIA TensorRT with Apache Beam SDK and show how complex inference scenarios can be fully encapsulated within a data processing pipeline. We also demonstrate how terabytes of data can be processed from both batch and streaming sources with a few lines of code for high-throughput and low-latency model inference.

  • NVIDIA TensorRT is an SDK that facilitates high-performance machine learning inference. It is designed to work with deep learning frameworks such as TensorFlow, PyTorch, and MXNet. It focuses specifically on optimizing and running a trained neural network for inference efficiently on NVIDIA GPUs. TensorRT can maximize inference throughput while preserving model accuracy through multiple optimizations, including model quantization, layer and tensor fusions, kernel auto-tuning, multi-stream executions, and efficient tensor memory usage.
  • Proven with 15+ years in production, Dataflow is a no-ops, serverless data processing platform to process data, in batch or in real time, for analytical, ML, and application use cases. These often include incorporating pretrained models into data pipelines. Whatever the use case may be, using Apache Beam as its SDK enables Dataflow to draw on the robust community, simplify your data architectures, and deliver insights with ML.

Build a TensorRT engine for inference

To use TensorRT with Apache Beam, at this stage, you need a converted TensorRT engine file from a trained model. Here’s how to convert a TensorFlow Object Detection SSD MobileNet v2 320×320 model to ONNX, build a TensorRT engine from ONNX, and run the engine locally.

Convert the TF model to ONNX

To convert TensorFlow Object Detection SSD MobileNet v2 320×320 to ONNX, use one of the TensorRT example converters. This can be done on an on-premises system if the system has the same GPU that will be used in Dataflow for inference.

To prepare your environment, follow the instructions under Setup. This post follows that guide up to and including the Create ONNX Graph step. Use --batch_size 1, because the example covered in this post works with batch size 1 only. You can name the final --onnx file ssd_mobilenet_v2_320x320_coco17_tpu-8.onnx. Building and running the engine is handled in GCP.

Make sure that you set up a GCP project with proper credentials and API access to Dataflow, Google Cloud Storage (GCS), and Google Compute Engine (GCE). For more information, see Create a Dataflow pipeline using Python.

Spin up a GCE VM

You need a machine that contains the following installed resources:

  • NVIDIA T4 Tensor Core GPU
  • GPU driver
  • Docker
  • NVIDIA container toolkit

You can do this by creating a new GCE VM. Follow the instructions but use the following settings:

  • Name: tensorrt-demo
  • GPU type: NVIDIA T4
  • Number of GPUs: 1
  • Machine type: n1-standard-2

You may need a more powerful machine if you know that you are working with models that are large.

In the Boot disk section, choose CHANGE, and go to the PUBLIC IMAGES tab. For Operating system, choose Deep Learning on Linux. There are many versions, but make sure you choose one with CUDA. The version Debian 10 based Deep Learning VM with M98 works for this example.

The other settings can be left to their default values.

Next, connect to the VM using SSH. Install NVIDIA drivers if you are prompted to do so.

Inside the VM, run the following commands to create a few directories to be used later:

mkdir models
mkdir tensorrt_engines

For more information, see Create a VM with attached GPUs.

Build the image

You need a custom container that contains the necessary dependencies to execute the TensorRT code: CUDA, cuDNN, and TensorRT.

You can copy the following example Dockerfile into a new file and name it tensor_rt.dockerfile.

ARG BUILD_IMAGE=nvcr.io/nvidia/tensorrt:22.09-py3

FROM ${BUILD_IMAGE} 

ENV PATH="/usr/src/tensorrt/bin:${PATH}"

WORKDIR /workspace

RUN pip install --no-cache-dir apache-beam[gcp]==2.42.0
COPY --from=apache/beam_python3.8_sdk:2.42.0 /opt/apache/beam /opt/apache/beam

RUN pip install --upgrade pip \
    && pip install "torch>=1.7.1" \
    && pip install "torchvision>=0.8.2" \
    && pip install "pillow>=8.0.0" \
    && pip install "transformers>=4.18.0" \
    && pip install cuda-python

ENTRYPOINT [ "/opt/apache/beam/boot" ]

View the Docker file used for testing in the Apache Beam repo. Keep in mind that there may be a later version of Beam available than what was used in this post.

Build the image by running the following command, locally or in a GCE VM:

docker build -f tensor_rt.dockerfile -t tensor_rt .

If you did this locally, follow the next steps. Otherwise, you can skip to the next section.

The following commands are only necessary if you are creating the image in a different machine than the one in which you intend to build the TensorRT engine. For this post, use Google Container Registry. Tag your image to a URI that you use for your project and then push to the registry. Make sure to replace GCP_PROJECT and MY_DIR with the appropriate values.

docker tag tensor_rt us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt
docker push us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt

Creating the TensorRT engine

The following commands are only necessary if you created the image in a different machine than the one in which you intend to build the TensorRT engine. Pull the TensorRT image from the registry:

docker pull us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt
docker tag us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt tensor_rt

If the ONNX model is not in the GCE VM, you can copy it from your local machine to the ~/models directory in the VM:

gcloud compute scp ~/Downloads/ssd_mobilenet_v2_320x320_coco17_tpu-8.onnx tensorrt-demo:~/models --zone=us-central1-a

You should now have the ONNX model and the built Docker image in the VM. Now it’s time to use them both.

Launch the Docker container interactively:

docker run --rm -it --gpus all -v /home/{username}/:/mnt tensor_rt bash

Create the TensorRT engine out of the ONNX file:

trtexec --onnx=/mnt/models/ssd_mobilenet_v2_320x320_coco17_tpu-8.onnx --saveEngine=/mnt/tensorrt_engines/ssd_mobilenet_v2_320x320_coco17_tpu-8.trt --useCudaGraph --verbose

You should now see the ssd_mobilenet_v2_320x320_coco17_tpu-8.trt file in your ~/tensorrt_engines directory in the VM.

Upload the TensorRT Engine to GCS

Copy the engine file to GCS. If you run into issues with gsutil when uploading the file directly from GCE to GCS, you may have to first copy it to your local machine.

gcloud compute scp tensorrt-demo:~/tensorrt_engines/ssd_mobilenet_v2_320x320_coco17_tpu-8.trt ~/Downloads/ --zone=us-central1-a

In the GCP console, upload the TensorRT engine file to your chosen GCS bucket:

gs://{GCS_BUCKET}/ssd_mobilenet_v2_320x320_coco17_tpu-8.trt

Testing the TensorRT engine locally

Make sure that you have a Beam pipeline that uses TensorRT RunInference. One example is tensorrt_object_detection.py, which you can follow by running the following commands in your GCE VM. Exit the Docker container first by typing Ctrl+D.

git clone https://github.com/apache/beam.git
cd beam/sdks/python
pip install --upgrade pip setuptools
pip install -r build-requirements.txt
pip install --user -e ."[gcp,test]"

Also create a file called image_file_names.txt, which contains the paths to the images. The images can be in an object store like GCS, or in the GCE VM.

gs://{GCS_BUCKET}/000000289594.jpg
gs://{GCS_BUCKET}/000000000139.jpg
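If you have many images, you could generate this list with a short script instead of writing it by hand. The following is a minimal sketch using only the Python standard library. The helper name write_image_list and the bucket name my-bucket are placeholders introduced for illustration and are not part of the Beam example.

```python
from pathlib import Path


def write_image_list(bucket, image_names, out_path="image_file_names.txt"):
    """Write one gs:// path per line and return the lines that were written."""
    lines = [f"gs://{bucket}/{name}" for name in image_names]
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    # "my-bucket" is a placeholder; substitute your own GCS bucket name.
    write_image_list("my-bucket", ["000000289594.jpg", "000000000139.jpg"])
```

The script only writes the file locally. If the pipeline reads the list from GCS, you still need to upload it to your bucket.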

Then, run the following command:

docker run --rm -it --gpus all -v /home/{username}/:/mnt -w /mnt/beam/sdks/python tensor_rt python -m apache_beam.examples.inference.tensorrt_object_detection --input gs://{GCS_BUCKET}/image_file_names.txt --output /mnt/tensorrt_predictions.csv --engine_path gs://{GCS_BUCKET}/ssd_mobilenet_v2_320x320_coco17_tpu-8.trt

You should now see a file called tensorrt_predictions.csv. Each line has data separated by a semicolon.

  • The first item is the file name.
  • The second item is a list of dictionaries, where each dictionary corresponds with a single detection.
  • A detection contains box coordinates (ymin, xmin, ymax, xmax), score, and class.
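To post-process the predictions file, you can split each line on the first semicolon and parse the remainder. The following sketch assumes that the second item is written as a Python-literal list of dictionaries and that the keys are named ymin, xmin, ymax, xmax, score, and class. Both are assumptions based on the description above, and the sample line is invented for illustration, so check them against your actual output.

```python
import ast


def parse_prediction_line(line):
    """Split 'file_name;[{...}, ...]' into the name and a list of detections."""
    file_name, _, payload = line.rstrip("\n").partition(";")
    # literal_eval parses Python literals only, so it never executes code.
    detections = ast.literal_eval(payload) if payload.strip() else []
    return file_name, detections


# Hypothetical line: the key names and values are illustrative assumptions.
sample = (
    "000000289594.jpg;[{'ymin': '12.5', 'xmin': '30.0', 'ymax': '200.0', "
    "'xmax': '310.5', 'score': '0.91', 'class': 'dog'}]"
)

name, detections = parse_prediction_line(sample)
for det in detections:
    print(name, det["class"], float(det["score"]))
```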

For more information about how to set up and run TensorRT RunInference locally, follow the instructions in the Object Detection section.

The TensorRT Support Guide provides an overview of all the supported NVIDIA TensorRT 8.5.1 samples on GitHub and in the product package. These samples are designed to show how to use TensorRT in numerous use cases while highlighting different capabilities of the interface. These samples specifically help in use cases such as recommenders, machine comprehension, character recognition, image classification, and object detection.

Running TensorRT Engine with Dataflow RunInference

Now that you have the TensorRT engine, you can run a pipeline on Dataflow.

The following code example is a part of the pipeline, where you use TensorRTEngineHandlerNumPy to load the TensorRT engine and set other inference parameters. You then read the images, do preprocessing to attach keys to the images, do the prediction, and then write to a file in GCS.

For more information about the full code example, see tensorrt_object_detection.py.

  engine_handler = KeyedModelHandler(
      TensorRTEngineHandlerNumPy(
          min_batch_size=1,
          max_batch_size=1,
          engine_path=known_args.engine_path))

  with beam.Pipeline(options=pipeline_options) as p:
    filename_value_pair = (
        p
        | 'ReadImageNames' >> beam.io.ReadFromText(known_args.input)
        | 'ReadImageData' >> beam.Map(
            lambda image_name: read_image(
                image_file_name=image_name, path_to_dir=known_args.images_dir))
        | 'AttachImageSizeToKey' >> beam.Map(attach_im_size_to_key)
        | 'PreprocessImages' >> beam.MapTuple(
            lambda file_name, data: (file_name, preprocess_image(data))))
    predictions = (
        filename_value_pair
        | 'TensorRTRunInference' >> RunInference(engine_handler)
        | 'ProcessOutput' >> beam.ParDo(PostProcessor()))

    _ = (
        predictions | "WriteOutputToGCS" >> beam.io.WriteToText(
            known_args.output,
            shard_name_template='',
            append_trailing_newlines=True))

Make sure that you have completed the Google Cloud setup mentioned in the previous section. You also must have the Beam SDK installed.

To run this job on Dataflow, run the following command locally:

python -m apache_beam.examples.inference.tensorrt_object_detection \
--input gs://{GCP_PROJECT}/image_file_names.txt \
--output gs://{GCP_PROJECT}/predictions.txt \
--engine_path gs://{GCP_PROJECT}/ssd_mobilenet_v2_320x320_coco17_tpu-8.trt \
--runner DataflowRunner \
--experiment=use_runner_v2 \
--machine_type=n1-standard-4 \
--experiment="worker_accelerator=type:nvidia-tesla-t4;count:1;install-nvidia-driver" \
--disk_size_gb=75 \
--project {GCP_PROJECT} \
--region us-central1 \
--temp_location gs://{GCP_PROJECT}/tmp/ \
--job_name tensorrt-object-detection \
--sdk_container_image="us.gcr.io/{GCP_PROJECT}/{MY_DIR}/tensor_rt"

Depending on the size constraints of the model, you may want to adjust machine_type, the type and count of the GPU, or disk_size_gb. For more information about Beam pipeline options, see Set Dataflow pipeline options.
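If you adjust these values often, it can help to assemble the resource-related flags in one place. The following is a minimal sketch. The helper build_dataflow_flags is a hypothetical name introduced here, and it only formats the flags already shown in the command above.

```python
def build_dataflow_flags(machine_type="n1-standard-4",
                         gpu_type="nvidia-tesla-t4",
                         gpu_count=1,
                         disk_size_gb=75):
    """Return the resource-related flags from the command above as a list."""
    accelerator = (
        f"type:{gpu_type};count:{gpu_count};install-nvidia-driver"
    )
    return [
        f"--machine_type={machine_type}",
        f"--experiment=worker_accelerator={accelerator}",
        f"--disk_size_gb={disk_size_gb}",
    ]


if __name__ == "__main__":
    # Example: a larger machine and disk for a bigger model.
    for flag in build_dataflow_flags(machine_type="n1-standard-8",
                                     disk_size_gb=100):
        print(flag)
```

Because the result is a list rather than a single string, no shell quoting is needed around the semicolons in the accelerator setting when you pass it to a subprocess.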

TensorRT and TensorFlow object detection benchmarking

To benchmark, we decided to do a comparison between the TensorRT and TensorFlow object detection versions of the previously mentioned SSD MobileNet v2 320×320 model.

Every single inference call was timed in both the TensorRT and TensorFlow object detection versions. We calculated an average of 5000 inference calls, not taking the first 10 images into account due to ramp-up latencies. The SSD model that we used is a small model. You’ll observe even better speedup when your model can make full use of the GPU.
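The timing approach described above can be sketched as follows. This is not the benchmark code used for this post. It is a minimal illustration that assumes a callable named infer wraps one inference call, and fake_infer is a hypothetical stand-in for a real model. It also assumes that 5,000 calls are measured after the 10 warm-up calls, which is one reading of the description.

```python
import time
from statistics import mean


def benchmark(infer, inputs, warmup=10):
    """Time every call to infer and average, skipping the warm-up calls."""
    latencies_ms = []
    for item in inputs:
        start = time.perf_counter()
        infer(item)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    measured = latencies_ms[warmup:]  # drop ramp-up latencies
    if not measured:
        raise ValueError("need more inputs than warm-up calls")
    return mean(measured), len(measured)


def fake_infer(item):
    # Hypothetical stand-in for a model call; replace with real inference.
    return sum(i * i for i in range(1000))


if __name__ == "__main__":
    avg_ms, count = benchmark(fake_infer, range(5010), warmup=10)
    print(f"mean latency over {count} calls: {avg_ms:.4f} ms")
```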

First, we compared the direct performance speedup between TensorFlow and TensorRT with a local benchmark. We aimed to prove the added benefits with reduced precision mode on TensorRT.

Framework and precision | Inference latency (ms)
TensorFlow Object Detection FP32 (end-to-end) | 29.47
TensorRT FP32 (end-to-end) | 3.72
TensorRT FP32 (GPU compute) | 2.39
TensorRT FP16 (GPU compute) | 1.48
TensorRT INT8 (GPU compute) | 1.34
Table 1. Direct performance speedup on TensorRT

The overall speedup with TensorRT FP32 is 7.9x. End-to-end included data copies, while the GPU compute only included actual inference time. We did this separation because the example model is small. End-to-end TensorRT latency in this case is mostly data copies. You see more significant end-to-end performance improvements using different precisions in bigger models, especially in cases where inference compute is the bottleneck, not data copies.
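The speedup figures quoted for Table 1 follow directly from the latency ratios. The following short check recomputes them from the table values. The dictionary keys are labels chosen here for readability.

```python
# Latencies in milliseconds, copied from Table 1.
latency_ms = {
    "tf_fp32_e2e": 29.47,
    "trt_fp32_e2e": 3.72,
    "trt_fp32_gpu": 2.39,
    "trt_fp16_gpu": 1.48,
    "trt_int8_gpu": 1.34,
}


def speedup(baseline, candidate):
    """Return how many times faster the candidate is than the baseline."""
    return latency_ms[baseline] / latency_ms[candidate]


overall = speedup("tf_fp32_e2e", "trt_fp32_e2e")   # end-to-end comparison
fp16 = speedup("trt_fp32_gpu", "trt_fp16_gpu")     # GPU compute only
int8 = speedup("trt_fp32_gpu", "trt_int8_gpu")     # GPU compute only

print(f"{overall:.1f}x {fp16:.1f}x {int8:.1f}x")  # prints "7.9x 1.6x 1.8x"
```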

FP16 is 1.6x faster than FP32 and has no accuracy penalty. INT8 is 1.8x faster than FP32, but sometimes comes with accuracy degradation and requires a calibration process. Accuracy degradation is model-specific, so it’s always good to try yours and see the produced accuracy.
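To build intuition for why INT8 needs calibration and can lose accuracy, consider a simplified symmetric quantization scheme. The following sketch is a generic illustration and not a description of how TensorRT calibrates. It assumes a single per-tensor scale chosen from the maximum absolute value seen in the calibration data, and all names and sample values are hypothetical.

```python
def calibrate_scale(samples):
    """Choose a per-tensor scale from calibration data (max-abs scheme)."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Map floats to integers in [-127, 127], clamping out-of-range values."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in q_values]


calibration = [-2.0, -0.5, 0.0, 0.25, 1.27]   # representative sample data
scale = calibrate_scale(calibration)

activations = [0.013, 0.5, -1.9, 3.0]  # 3.0 lies outside the calibrated range
restored = dequantize(quantize(activations, scale), scale)
errors = [abs(a - r) for a, r in zip(activations, restored)]

for a, r, e in zip(activations, restored, errors):
    print(f"{a:+.3f} -> {r:+.4f} (error {e:.4f})")
```

Values inside the calibrated range lose at most half a quantization step, while the out-of-range value is clamped and loses much more. This is why the calibration data should be representative of the real inputs.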

This issue can also be mitigated using quantized networks with the NVIDIA QAT Toolkit. For more information, see Accelerating Quantized Networks with the NVIDIA QAT Toolkit for TensorFlow and NVIDIA TensorRT and the NVIDIA TensorRT Developer Guide.

Dataflow benchmarking

In Dataflow, with the TensorRT engine generated in earlier experiments, we ran with the following configurations: n1-standard-4 machine, disk_size_gb=75, and 10 workers.

To simulate a stream of data coming into Dataflow through PubSub, we set batch sizes to 1. This was done by setting ModelHandlers to have min and max batch sizes of 1.

Framework and GPU | Stage with RunInference | Mean inference_batch_latency_micro_secs
TensorFlow with T4 GPU | 12 min 43 sec | 99,242
TensorRT with T4 GPU | 7 min 20 sec | 10,836
Table 2. Dataflow benchmarks

The Dataflow runner decomposes a pipeline into multiple stages. You can get a better picture of the performance of RunInference by looking at the stage that contains the inference call, and not the other stages that read and write data. This is in the Stage with RunInference column.

For this metric, TensorRT spends only about 58% of the runtime of TensorFlow (440 seconds versus 763 seconds). You can expect the acceleration to grow if you adopt a larger model that fully uses GPU processing power.

The metric inference_batch_latency_micro_secs is the time, in microseconds, that it takes to perform the inference on the batch of examples, that is, the time to call model_handler.run_inference. This varies over time depending on the dynamic batching decision of BatchElements, and the particular values or dtype values of the elements. For this metric, you can see that TensorRT is about 9.2x faster than TensorFlow.
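Both ratios discussed for Table 2 can be recomputed from the table values. The following check converts the stage runtimes to seconds and divides the mean batch latencies. The variable names are chosen here for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'min sec' duration from Table 2 into seconds."""
    return minutes * 60 + seconds


tf_stage_s = to_seconds(12, 43)    # TensorFlow stage with RunInference
trt_stage_s = to_seconds(7, 20)    # TensorRT stage with RunInference

# Fraction of the TensorFlow stage runtime that TensorRT needs.
runtime_fraction = trt_stage_s / tf_stage_s

# Ratio of mean inference_batch_latency_micro_secs values.
latency_speedup = 99_242 / 10_836

print(f"{runtime_fraction:.1%} {latency_speedup:.1f}x")  # prints "57.7% 9.2x"
```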

Conclusion

In this post, we demonstrated how to run machine learning models at scale by seamlessly stitching together a data processing framework (Apache Beam) and inference engine (TensorRT). We presented an end-to-end example of how inference workload can be fully integrated within a data processing pipeline.

This integration enables a new inference pipeline that helps reduce production inference cost with better NVIDIA GPU utilization and much-improved inference latency and throughput. The same approach can be applied to many other inference workloads using the many off-the-shelf TensorRT samples. In the future, we plan to further automate TensorRT engine building and work on deeper integration of TensorRT with Apache Beam.


What Is a Pretrained AI Model?

A pretrained AI model is a deep learning model that’s trained on large datasets to accomplish a specific task, and it can be used as is or customized to suit application requirements across multiple industries.


AI’s Highlight Reel: Top Five NVIDIA Videos of 2022

If AI had a highlight reel, the NVIDIA YouTube channel might just be it. The channel showcases the latest breakthroughs in artificial intelligence, with demos, keynotes and other videos that help viewers see and believe the astonishing ways in which the technology is changing the world. NVIDIA’s most popular videos of 2022 put spotlights on…

The post AI’s Highlight Reel: Top Five NVIDIA Videos of 2022 appeared first on NVIDIA Blog.


Safe Travels: NVIDIA DRIVE OS Receives Premier Safety Certification

To make transportation safer, autonomous vehicles (AVs) must have processes and underlying systems that meet the highest standards. NVIDIA DRIVE OS is the operating system for in-vehicle accelerated computing powered by the NVIDIA DRIVE platform. DRIVE OS 5.2 is now functional safety-certified by TÜV SÜD, one of the most experienced and rigorous assessment bodies in…

The post Safe Travels: NVIDIA DRIVE OS Receives Premier Safety Certification appeared first on NVIDIA Blog.