This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.
In this post, you learn how to deploy TensorFlow trained deep learning models using the new TensorFlow-ONNX-TensorRT workflow. This tutorial uses NVIDIA TensorRT 8.0.0.3 and provides two code samples, one for TensorFlow v1 and one for TensorFlow v2. TensorRT is an inference accelerator.
First, a network is trained using any framework. After a network is trained, the batch size and precision are fixed (with precision as FP32, FP16, or INT8). The trained model is passed to the TensorRT optimizer, which outputs an optimized runtime also called a plan. The .plan file is a serialized file format of the TensorRT engine. The plan file must be deserialized to run inference using the TensorRT runtime.
To optimize models implemented in TensorFlow, the only thing you have to do is convert models to the ONNX format and use the ONNX parser in TensorRT to parse the model and build the TensorRT engine. Figure 1 shows the high-level ONNX workflow.
Figure 1. ONNX workflow.
In this post, we discuss how to create a TensorRT engine using the ONNX workflow and how to run inference from the TensorRT engine. More specifically, we demonstrate end-to-end inference from a model in Keras or TensorFlow to ONNX, and to the TensorRT engine with ResNet-50, semantic segmentation, and U-Net networks. Finally, we explain how you can use this workflow on other networks.
Download the code examples and unzip them. You can run either the TensorFlow 1 or the TensorFlow 2 code example by following the appropriate README. After downloading the file, you should also download labels.py from the Cityscapes dataset scripts repo and place it in the same folder as the other scripts.
ONNX overview
ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras to a single format.
It defines a common set of operators, common sets of building blocks of deep learning, and a common file format. It provides a definition of a computation graph, as well as built-in operators. The list of ONNX nodes that may have one or more inputs or outputs forms an acyclic graph.
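To make the acyclic-graph idea concrete, here is a minimal sketch in plain Python. It does not use the onnx package; the node dictionaries and tensor names are hypothetical stand-ins for ONNX nodes. It orders nodes so that every producer comes before its consumers, which is only possible when the graph has no cycles.

```python
from collections import deque

# Hypothetical three-node graph: each node lists the tensors it reads and writes.
NODES = [
    {"name": "relu", "op": "Relu", "inputs": ["conv_out"], "outputs": ["relu_out"]},
    {"name": "conv", "op": "Conv", "inputs": ["image", "weights"], "outputs": ["conv_out"]},
    {"name": "pool", "op": "GlobalAveragePool", "inputs": ["relu_out"], "outputs": ["pooled"]},
]

def topological_order(nodes):
    """Return node names so that every producer precedes its consumers."""
    producer = {out: n["name"] for n in nodes for out in n["outputs"]}
    # Per node, count the inputs that are produced by another node.
    pending = {n["name"]: sum(1 for i in n["inputs"] if i in producer) for n in nodes}
    consumers = {}
    for n in nodes:
        for i in n["inputs"]:
            if i in producer:
                consumers.setdefault(producer[i], []).append(n["name"])
    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for consumer in consumers.get(name, []):
            pending[consumer] -= 1
            if pending[consumer] == 0:
                ready.append(consumer)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order

print(topological_order(NODES))
```

Graph inputs and weights (here image and weights) have no producing node, so they do not block any node from running.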
ResNet ONNX workflow example
In this example, we show how to use the ONNX workflow on two different networks and create a TensorRT engine. The first network is ResNet-50.
The first step is to convert the model to a .pb file. The following code example converts the ResNet-50 model to a .pb file:
import tensorflow as tf
import keras
from tensorflow.keras.models import Model
import keras.backend as K
K.set_learning_phase(0)
def keras_to_pb(model, output_filename, output_node_names):
    """ This is the function to convert the Keras model to pb.
    Args:
        model: The Keras model.
        output_filename: The output .pb file name.
        output_node_names: The output nodes of the network. If None, then the function gets the last layer name as the output node.
    """
    # Get the names of the input and output nodes.
    in_name = model.layers[0].get_output_at(0).name.split(':')[0]
    if output_node_names is None:
        output_node_names = [model.layers[-1].get_output_at(0).name.split(':')[0]]
    sess = keras.backend.get_session()
    # The TensorFlow freeze_graph expects a comma-separated string of output node names.
    output_node_names_tf = ','.join(output_node_names)
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess,
        sess.graph_def,
        output_node_names)
    sess.close()
    wkdir = ''
    tf.train.write_graph(frozen_graph_def, wkdir, output_filename, as_text=False)
    return in_name, output_node_names
# load the ResNet-50 model pretrained on imagenet
model = keras.applications.resnet.ResNet50(include_top=True, weights='imagenet', input_tensor=None, input_shape=None, pooling=None, classes=1000)
# Convert the Keras ResNet-50 model to a .pb file
in_tensor_name, out_tensor_names = keras_to_pb(model, "models/resnet50.pb", None)
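The split(':')[0] calls in keras_to_pb turn a TensorFlow tensor name into a node name by dropping the output index. A small sketch of that step in isolation follows; the tensor names are hypothetical examples, not values read from the ResNet-50 model.

```python
def node_name(tensor_name):
    """Strip the output index from a tensor name such as 'probs/Softmax:0'."""
    return tensor_name.split(":")[0]

def joined_node_names(tensor_names):
    """Join node names into a comma-separated string."""
    return ",".join(node_name(t) for t in tensor_names)

# Hypothetical tensor names, for illustration only.
print(node_name("input_1:0"))
print(joined_node_names(["probs/Softmax:0", "aux/Identity:0"]))
```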
In addition to Keras, you can also download ResNet-50 from the following locations:
Deep Learning Examples GitHub repository: Provides the latest deep learning example networks. You can also see the ResNet-50 branch, which contains a script and recipe to train the ResNet-50 v1.5 model.
NVIDIA NGC Models: It has the list of checkpoints for pretrained models. As an example, search on ResNet-50v1.5 for TensorFlow and get the latest checkpoint from the Download page.
Converting the .pb file to ONNX
The second step is to convert the .pb model to the ONNX format. To do this, first install tf2onnx.
After installing tf2onnx, there are two ways of converting the model from a .pb file to the ONNX format. The first is the command line and the second is the Python API. Run the following command:
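As a hedged sketch of the command-line route, the following helper assembles a call to the tf2onnx module entry point (python -m tf2onnx.convert) with its --graphdef, --inputs, --outputs, and --output options. The paths and tensor names are placeholders chosen for illustration, not values taken from this post; check the names in your own frozen graph before running it.

```python
import sys

def tf2onnx_command(pb_path, input_tensors, output_tensors, onnx_path):
    """Build the argument list for converting a frozen graph with tf2onnx."""
    return [
        sys.executable, "-m", "tf2onnx.convert",
        "--graphdef", pb_path,
        "--inputs", ",".join(input_tensors),
        "--outputs", ",".join(output_tensors),
        "--output", onnx_path,
    ]

# Placeholder paths and tensor names (assumptions, adjust to your model).
cmd = tf2onnx_command("models/resnet50.pb", ["input_1:0"], ["probs/Softmax:0"], "resnet50.onnx")
print(" ".join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```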
To create the TensorRT engine from the ONNX file, use the following code:
import tensorrt as trt
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
def build_engine(onnx_path, shape = [1,224,224,3]):
    """ This is the function to create the TensorRT engine
    Args:
        onnx_path : Path to onnx_file.
        shape : Shape of the input of the ONNX file.
    """
    with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, builder.create_builder_config() as config, trt.OnnxParser(network, TRT_LOGGER) as parser:
        config.max_workspace_size = (256 << 20)  # 256 MiB
        with open(onnx_path, 'rb') as model:
            parser.parse(model.read())
        network.get_input(0).shape = shape
        engine = builder.build_engine(network, config)
        return engine
def save_engine(engine, file_name):
    buf = engine.serialize()
    with open(file_name, 'wb') as f:
        f.write(buf)
def load_engine(trt_runtime, plan_path):
    with open(plan_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine
This code should be saved in the engine.py file, and is used later in the post.
This code example contains the following variable:
max_workspace_size: Maximum GPU temporary memory that ICudaEngine can use at execution time.
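The workspace limit is expressed in bytes. As a small sketch, assuming an illustrative budget of 256 MiB, the byte count can be written with a left shift, because shifting by 20 bits multiplies by 2**20. The helper name is hypothetical.

```python
def mib_to_bytes(mib):
    """Convert mebibytes to bytes; a left shift by 20 multiplies by 2**20."""
    return mib << 20

workspace_bytes = mib_to_bytes(256)
print(workspace_bytes)
```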
The builder creates an empty network (builder.create_network()) and the ONNX parser parses the ONNX file into the network (parser.parse(model.read())). You set the input shape for the network (network.get_input(0).shape = shape), after which the builder creates the engine (engine = builder.build_engine(network, config)). To create the engine, run the following code example:
import engine as eng
import argparse
from onnx import ModelProto
import tensorrt as trt
engine_name = "resnet50.plan"
onnx_path = "/path/to/onnx/result/file/"
batch_size = 1
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())
d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size , d0, d1 ,d2]
engine = eng.build_engine(onnx_path, shape= shape)
eng.save_engine(engine, engine_name)
In this code example, you first get the input shape from the ONNX model. Next, create the engine, and then save the engine in a .plan file.
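The shape-building step can be sketched on its own, without the onnx package. The helper below is hypothetical: it takes the dim_value of each input dimension (batch first) and assumes that a value of 0 marks a dimension that is not fixed in the ONNX file, in which case it refuses to build a shape.

```python
def engine_input_shape(dim_values, batch_size):
    """Combine a chosen batch size with the fixed non-batch dimensions.

    dim_values: dim_value of each input dimension, batch dimension first.
    A value of 0 is treated as "not fixed" (an assumption of this sketch).
    """
    rest = list(dim_values[1:])
    if any(d <= 0 for d in rest):
        raise ValueError(f"non-batch dimensions must be fixed, got {rest}")
    return [batch_size] + rest

print(engine_input_shape([0, 224, 224, 3], batch_size=1))
```

Failing early here gives a clearer message than an engine build that fails later on an unknown input size.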
Running inference from the TensorRT engine:
The TensorRT engine runs inference in the following workflow:
Allocate buffers for inputs and outputs in the GPU.
Copy data from the host to the allocated input buffers in the GPU.
Run inference in the GPU.
Copy results from the GPU to the host.
Reshape the results as necessary.
These steps are explained in detail in the following code example. This code should be saved in the inference.py file, and is used later in this post.
import tensorrt as trt
import pycuda.driver as cuda
import numpy as np
import pycuda.autoinit
def allocate_buffers(engine, batch_size, data_type):
    """ This is the function to allocate buffers for input and output in the device
    Args:
        engine : The TensorRT engine.
        batch_size : The batch size for execution time.
        data_type: The type of the data for input and output, for example trt.float32.
    Output:
        h_input_1, d_input_1: Input in the host and in the device.
        h_output, d_output: Output in the host and in the device.
        stream: CUDA stream.
    """
    # Determine dimensions and create page-locked memory buffers (which won't be swapped to disk) to hold host inputs/outputs.
    h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
    h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
    # Allocate device memory for inputs and outputs.
    d_input_1 = cuda.mem_alloc(h_input_1.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input_1, d_input_1, h_output, d_output, stream
def load_images_to_buffer(pics, pagelocked_buffer):
    preprocessed = np.asarray(pics).ravel()
    np.copyto(pagelocked_buffer, preprocessed)
def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
    """ This is the function to run the inference
    Args:
        engine : The TensorRT engine.
        pics_1 : Input images to the model.
        h_input_1, d_input_1: Input in the host and in the device.
        h_output, d_output: Output in the host and in the device.
        stream: CUDA stream.
        batch_size : Batch size for execution time.
        height, width: Height and width of the output image.
    Output: The list of output images.
    """
    load_images_to_buffer(pics_1, h_input_1)
    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input_1, h_input_1, stream)
        # Run inference.
        context.profiler = trt.Profiler()
        context.execute(batch_size=1, bindings=[int(d_input_1), int(d_output)])
        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream
        stream.synchronize()
        # Return the host output.
        out = h_output.reshape((batch_size,-1, height, width))
        return out
In allocate_buffers, the first two lines determine the input and output dimensions and create page-locked memory buffers in the host (h_input_1, h_output). Then, you allocate device memory for the input and output, the same size as the host buffers (d_input_1, d_output). The next step is to create the CUDA stream used for copying data between device and host memory.
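The buffer sizing can be checked with plain arithmetic. This sketch mirrors the batch_size * volume(shape) expression without TensorRT or PyCUDA; the helper name and the example shapes are assumptions, and an item size of 4 bytes corresponds to float32.

```python
import math

def buffer_nbytes(binding_shape, batch_size, itemsize=4):
    """Bytes needed for one buffer: batch size x number of elements x bytes per element."""
    return batch_size * math.prod(binding_shape) * itemsize

# Illustrative shapes: a 224x224 RGB input and a 1000-class output.
print(buffer_nbytes((224, 224, 3), batch_size=1))
print(buffer_nbytes((1000,), batch_size=1))
```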
In this code example, in the do_inference function, the first step is to load images to buffers in the host using the load_images_to_buffer function. Then the input data is transferred to the GPU (cuda.memcpy_htod_async(d_input_1, h_input_1, stream)) and inference is run using context.execute. Finally the results are copied from GPU to the host (cuda.memcpy_dtoh_async(h_output, d_output, stream)).
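The last step, reshaping the flat host buffer, can be tried on the CPU. The following sketch uses a NumPy array as a stand-in for the page-locked output buffer; the small class count and image size are illustrative values so that the example runs quickly.

```python
import numpy as np

def reshape_output(flat_output, batch_size, height, width):
    """View a flat output buffer as (batch, channels, height, width)."""
    return flat_output.reshape((batch_size, -1, height, width))

classes, height, width = 20, 4, 8  # illustrative sizes
flat = np.arange(classes * height * width, dtype=np.float32)
out = reshape_output(flat, 1, height, width)
print(out.shape)
```

The -1 lets NumPy infer the channel count from the buffer length, so the same call works for any number of classes.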
In this post, you use similar networks to run the ONNX workflow for semantic segmentation. The network consists of a VGG16-based encoder and three upsampling layers implemented using a deconvolutional layer. The network is trained in about 40,000 iterations on the Cityscapes Dataset.
There are multiple ways of converting the TensorFlow model to an ONNX file. One way is the one explained in the ResNet-50 section. Keras also has its own Keras-to-ONNX file converter. Sometimes, layers that are not supported by the TensorFlow-to-ONNX converter are supported by the Keras-to-ONNX converter. Depending on the Keras framework and the type of layers used, you may need to choose between converters.
In the following code example, you directly convert the Keras model to ONNX using the Keras-to-ONNX converter. Download the pretrained semantic segmentation file, semantic_segmentation.hdf5.
import keras
import tensorflow as tf
from keras2onnx import convert_keras
def keras_to_onnx(model, output_filename):
    onnx = convert_keras(model, output_filename)
    with open(output_filename, "wb") as f:
        f.write(onnx.SerializeToString())
semantic_model = keras.models.load_model('/path/to/semantic_segmentation.hdf5')
keras_to_onnx(semantic_model, 'semantic_segmentation.onnx')
Figure 2 shows the architecture of the network.
Figure 2. The VGG16-based semantic segmentation model.
As in the previous example, use the following code example to create the engine for semantic segmentation.
import engine as eng
from onnx import ModelProto
import tensorrt as trt
engine_name = 'semantic.plan'
onnx_path = "semantic_segmentation.onnx"  # the file written by keras_to_onnx
batch_size = 1
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())
d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size , d0, d1 ,d2]
engine = eng.build_engine(onnx_path, shape= shape)
eng.save_engine(engine, engine_name)
To test the output of the model, use the Cityscapes Dataset. To work with Cityscapes, you must have the following functions: sub_mean_chw and color_map.
In the following code example, sub_mean_chw is for subtracting the mean value from the image as the preprocessing step and color_map is the mapping from the class ID to a color. The latter is used for visualization.
import numpy as np
from PIL import Image
import tensorrt as trt
import labels  # from cityscapes evaluation script
import skimage.transform
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
MEAN = (71.60167789, 82.09696889, 72.30508881)
CLASSES = 20
HEIGHT = 512
WIDTH = 1024
def sub_mean_chw(data):
    data = data.transpose((1, 2, 0))  # CHW -> HWC
    data -= np.array(MEAN)  # Broadcast subtract
    data = data.transpose((2, 0, 1))  # HWC -> CHW
    return data
def rescale_image(image, output_shape, order=1):
    image = skimage.transform.resize(image, output_shape,
                                     order=order, preserve_range=True, mode='reflect')
    return image
def color_map(output):
    output = output.reshape(CLASSES, HEIGHT, WIDTH)
    out_col = np.zeros(shape=(HEIGHT, WIDTH), dtype=(np.uint8, 3))
    for x in range(WIDTH):
        for y in range(HEIGHT):
            if (np.argmax(output[:, y, x] )== 19):
                out_col[y,x] = (0, 0, 0)
            else:
                out_col[y, x] = labels.id2label[labels.trainId2label[np.argmax(output[:, y, x])].id].color
    return out_col
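The nested loop in color_map calls np.argmax once or twice per pixel, which can be slow for a 512 x 1024 image. A vectorized sketch follows. It assumes a hypothetical palette array of shape (classes, 3) that would be built once from the labels mapping, with a black row for the class the loop special-cases. The three-class palette and the 2 x 2 image here are illustrative only.

```python
import numpy as np

def color_map_vectorized(output, palette, classes, height, width):
    """Map per-class scores to colors with one argmax and one palette lookup."""
    scores = np.asarray(output).reshape(classes, height, width)
    class_ids = scores.argmax(axis=0)  # (height, width) array of class IDs
    return palette[class_ids]          # (height, width, 3) color image

# Hypothetical palette: one RGB row per class; the last class maps to black.
palette = np.array([[128, 64, 128], [244, 35, 232], [0, 0, 0]], dtype=np.uint8)
scores = np.zeros((3, 2, 2), dtype=np.float32)
scores[0, 0, 0] = 1.0  # pixel (0, 0) -> class 0
scores[1, 0, 1] = 1.0  # pixel (0, 1) -> class 1
scores[2, 1, 0] = 1.0  # pixel (1, 0) -> class 2
scores[1, 1, 1] = 1.0  # pixel (1, 1) -> class 1
colored = color_map_vectorized(scores, palette, 3, 2, 2)
print(colored.shape)
```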
The following code example is the rest of the code for the previous example. You must run the previous block first because you need the defined functions. Use the example to compare the output of the Keras model and TensorRT engine semantic .plan file and then visualize both outputs. Replace the placeholders /path/to/semantic_segmentation.hdf5 and input_file_path as appropriate.
import engine as eng
import inference as inf
import keras
import tensorrt as trt
input_file_path = 'munster_000172_000019_leftImg8bit.png'
onnx_file = "semantic_segmentation.onnx"
serialized_plan_fp32 = "semantic.plan"
HEIGHT = 512
WIDTH = 1024
image = np.asarray(Image.open(input_file_path))
img = rescale_image(image, (512, 1024),order=1)
im = np.array(img, dtype=np.float32, order='C')
im = im.transpose((2, 0, 1))
im = sub_mean_chw(im)
engine = eng.load_engine(trt_runtime, serialized_plan_fp32)
h_input, d_input, h_output, d_output, stream = inf.allocate_buffers(engine, 1, trt.float32)
out = inf.do_inference(engine, im, h_input, d_input, h_output, d_output, stream, 1, HEIGHT, WIDTH)
out = color_map(out)
colorImage_trt = Image.fromarray(out.astype(np.uint8))
colorImage_trt.save("trt_output.png")
semantic_model = keras.models.load_model('/path/to/semantic_segmentation.hdf5')
out_keras= semantic_model.predict(im.reshape(-1, 3, HEIGHT, WIDTH))
out_keras = color_map(out_keras)
colorImage_k = Image.fromarray(out_keras.astype(np.uint8))
colorImage_k.save("keras_output.png")
Figure 3 shows the actual image and the ground truth, and the output of Keras versus the output of the TensorRT engine. As you can see, the output for the TensorRT engine is similar to the one for Keras.
Figure 3a. Original image.
Figure 3b. Ground truth label.
Figure 3c. Output of TensorRT.
Figure 3d. Output of Keras.
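A visual comparison can be backed by a number. As a hedged sketch, the helper below (a hypothetical name, not part of the sample code) reports the fraction of pixels for which two score maps of shape (classes, height, width) select the same class. The tiny arrays are made-up inputs that only show the mechanics.

```python
import numpy as np

def pixel_agreement(scores_a, scores_b):
    """Fraction of pixels whose highest-scoring class is the same in both maps."""
    a = np.asarray(scores_a).argmax(axis=0)
    b = np.asarray(scores_b).argmax(axis=0)
    return float((a == b).mean())

# Made-up score maps with 2 classes and a 1x2 image.
first = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])   # classes per pixel: [0, 1]
second = np.array([[[1.0, 1.0]], [[0.0, 0.0]]])  # classes per pixel: [0, 0]
print(pixel_agreement(first, first), pixel_agreement(first, second))
```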
Try it on other networks
Now you can try the ONNX workflow on other networks. For more information about good examples of segmentation networks, see Segmentation models with pretrained backbones on GitHub.
As an example, we show how to use the ONNX workflow with other networks. The network in this example is U-Net from the segmentation_models library. Here, we only loaded the model and did not train it. You may need to train these models on your preferred dataset.
One important point about these networks is that when you load these networks, their input layer sizes are as follows: (None, None, None, 3). To create a TensorRT engine, you need an ONNX file with a known input size. Before you convert this model to ONNX, change the network by assigning the size to its input and then convert it to the ONNX format.
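The shape change itself is simple to state. This sketch, using a hypothetical helper and plain tuples rather than Keras objects, keeps the batch dimension open and fills in the height and width.

```python
def with_fixed_input_size(batch_input_shape, height, width):
    """Replace the unknown height and width of an NHWC input shape."""
    batch, _, _, channels = batch_input_shape
    return (batch, height, width, channels)

print(with_fixed_input_size((None, None, None, 3), 224, 224))
```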
As an example, load the U-Net network from this library (segmentation_models) and assign the size (224, 224, 3) to its input. After creating the TensorRT engine for the inference, do a similar conversion to what you did for semantic segmentation. Depending on the application and dataset, you may need to have a different color mapping.
# Requirement for TensorFlow 2
pip install tensorflow-gpu==2.1.0
# Other requirements
pip install -U segmentation-models
import segmentation_models as sm
import keras
from keras2onnx import convert_keras
from engine import *
onnx_path = 'unet.onnx'
engine_name = 'unet.plan'
batch_size = 1
CHANNEL = 3
HEIGHT = 224
WIDTH = 224
model = sm.Unet()
model._layers[0].batch_input_shape = (None, 224,224,3)
model = keras.models.clone_model(model)
onx = convert_keras(model, onnx_path)
with open(onnx_path, "wb") as f:
    f.write(onx.SerializeToString())
shape = [batch_size , HEIGHT, WIDTH, CHANNEL]
engine = build_engine(onnx_path, shape= shape)
save_engine(engine, engine_name)
As we mentioned earlier in this post, another way of downloading pretrained models is to download them from NVIDIA NGC Models. It has a list of checkpoints for pretrained models. As an example, you can search for UNet for TensorFlow and then go to the Download page to get the latest checkpoint.
Conclusion
In this post, we explained how to deploy deep learning applications using a TensorFlow-to-ONNX-to-TensorRT workflow, with several examples. The first example was ONNX-TensorRT on ResNet-50, and the second example was VGG16-based semantic segmentation that was trained on the Cityscapes Dataset. At the end of the post, we demonstrated how to apply this workflow on other networks. For more information about the best performance of training and inference, see NVIDIA Data Center Deep Learning Product Performance.
Hi all, I am currently trying to implement faster rcnn using tensorflow. Due to the limitations of the machine I am working on I can only use tensorflow 1.15. However, the model that I wanted to originally use as a feature extractor is built with Keras and tf2. Now I’m left wondering what I’m supposed to use to build the model for the feature extractor if I can’t use Keras. Should I just try to implement the inception resnet model by myself? Any help would be appreciated
I’m trying to create a model to predict latitude and longitude from an image. However, my loss does not decrease at all during training, and the predictions are very far off.
Is what I’m doing realistically possible? And if so, how can I make my model more accurate?
Today, NVIDIA announced TensorRT 8.0 which brings BERT-Large inference latency down to 1.2 ms with new optimizations. This version also delivers 2x the accuracy for INT8 precision with Quantization Aware Training, and significantly higher performance through support for Sparsity, which was introduced in Ampere GPUs.
TensorRT is an SDK for high-performance deep learning inference that includes an inference optimizer and runtime that delivers low latency and high throughput. TensorRT is used across industries such as Healthcare, Automotive, Manufacturing, Internet/Telecom services, Financial Services, Energy, and has been downloaded nearly 2.5 million times.
There have been several kinds of new transformer-based models used across conversational AI. New generalized optimizations in TensorRT can accelerate all such models, reducing inference time to half that of TensorRT 7.
Highlights from this version include:
BERT Inference in 1.2 ms with new transformer optimizations
Achieve accuracy equivalent to FP32 with INT8 precision using Quantization Aware Training
Introducing Sparsity support for faster inference on Ampere GPUs
One of the biggest social media platforms in China, WeChat accelerates its search using TensorRT serving 500M users a month.
“We have implemented TensorRT-and-INT8 QAT-based model inference acceleration to accelerate core tasks of WeChat Search such as Query Understanding and Results Ranking. The conventional limitation of NLP model complexity has been broken-through by our solution with GPU + TensorRT, and BERT/Transformer can be fully integrated in our solution. In addition, we have achieved significant reduction (70%) in allocated computational resources using superb performance optimization methods. ” – Huili/Raccoonliu/Dickzhu, WeChat Search
NVIDIA today launched TensorRT™ 8, the eighth generation of the company’s AI software, which slashes inference time in half for language queries — enabling developers to build the world’s best-performing search engines, ad recommendations and chatbots and offer them from the cloud to the edge.
Today, NVIDIA is releasing TensorRT 8.0, which introduces many transformer optimizations. With this post update, we present the latest TensorRT optimized BERT sample and its inference latency benchmark on A30 GPUs. Using the optimized sample, you can execute different batch sizes for BERT-base or BERT-large within the 10 ms latency budget for conversational AI applications.
This post was originally published in August 2019 and has been updated for NVIDIA TensorRT 8.0.
Large-scale language models (LSLMs) such as BERT, GPT-2, and XL-Net have brought exciting leaps in accuracy for many natural language processing (NLP) tasks. Since its release in October 2018, BERT (Bidirectional Encoder Representations from Transformers), with all its many variants, remains one of the most popular language models and still delivers state-of-the-art accuracy.
BERT provided a leap in accuracy for NLP tasks that brought high-quality, language-based services within the reach of companies across many industries. To use the model in production, you must consider factors such as latency and accuracy, which influence end-user satisfaction with a service. BERT requires significant compute during inference due to its 12/24-layer stacked, multihead attention network. This has posed a challenge for companies to deploy BERT as part of real-time applications.
Today, NVIDIA is releasing version 8 of TensorRT, which brings the inference latency of BERT-Large down to 1.2 ms on NVIDIA A100 GPUs with new optimizations on transformer-based networks. New generalized optimizations in TensorRT can accelerate all such models, reducing inference time to half the time compared to TensorRT 7.
TensorRT
TensorRT is a platform for high-performance, deep learning inference, which includes an optimizer and runtime that minimizes latency and maximizes throughput in production. With TensorRT, you can optimize models trained in all major frameworks, calibrate for lower precision with high accuracy, and finally deploy in production.
All the code for achieving this performance with BERT is being released as open source in this NVIDIA/TensorRT GitHub repo. We have optimized the Transformer layer, which is a fundamental building block of the BERT encoder so that you can adapt these optimizations to any BERT-based NLP task. BERT is applied to an expanding set of speech and NLP applications beyond conversational AI, all of which can take advantage of these optimizations.
Question answering (QA) or reading comprehension is a popular way to test the ability of models to understand the context. The SQuAD leaderboard tracks the top performers for this task, for a dataset and test set that they provide. There has been rapid progress in QA ability in the last few years, with global contributions from academia and companies.
In this post, we demonstrate how to create a simple QA application using Python, powered by TensorRT-optimized BERT code that NVIDIA released today. The example provides an API to input passages and questions, and it returns responses generated by the BERT model.
Here’s a brief review of the steps to perform training and inference using TensorRT for BERT.
BERT training and inference pipeline
A major problem faced by NLP researchers and developers is scarcity of high-quality labeled training data for their specific NLP task. To overcome the problem of learning a model for the task from scratch, breakthroughs in NLP use the vast amounts of unlabeled text and break the NLP task into two parts:
Learning to represent the meaning of words and the relationships between them, that is, building up a language model using auxiliary tasks and a large corpus of text.
Specializing the language model to the actual task by augmenting the language model with a relatively small, task-specific network that is trained in a supervised manner.
These two stages are typically referred to as pretraining and fine-tuning. This paradigm enables the use of the pretrained language model for a wide range of tasks without any task-specific change to the model architecture. In this example, BERT provides a high-quality language model that is fine-tuned for QA but suitable for other tasks such as sentence classification and sentiment analysis.
You can either start with the pretrained checkpoints available online or pretrain BERT on your own custom corpus (Figure 1). You can also initialize pretraining from a checkpoint and then continue training on custom data.
Figure 1. Generating BERT TensorRT engine from pretrained checkpoints
Pretraining with custom or domain-specific data may yield interesting results, for example BioBert. However, it is computationally intensive and requires a massively parallel compute infrastructure to complete within a reasonable amount of time. GPU-enabled, multinode training is an ideal solution for such scenarios. For more information about how NVIDIA developers were able to train BERT in less than an hour, see Training BERT with GPUs.
In the fine-tuning step, the task-specific network based on the pretrained BERT language model is trained using the task-specific training data. For QA, this is (paragraph, question, answer) triples. Compared to pretraining, fine-tuning is generally far less computationally demanding.
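A (paragraph, question, answer) triple can be pictured as a small record. The class below is a hypothetical sketch, not the format used by any particular dataset loader; it also locates the answer text inside the paragraph, which is the kind of position information a span-prediction task is trained on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QAExample:
    paragraph: str
    question: str
    answer: str

    def answer_start(self):
        """Character offset of the answer in the paragraph, or -1 if absent."""
        return self.paragraph.find(self.answer)

example = QAExample(
    paragraph="TensorRT is a high performance deep learning inference platform.",
    question="What is TensorRT?",
    answer="a high performance deep learning inference platform",
)
print(example.answer_start())
```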
To perform inference using a QA neural network:
Create a TensorRT engine by passing the fine-tuned weights and network definition to the TensorRT builder.
Start the TensorRT runtime with this engine.
Feed a passage and a question to the TensorRT runtime and receive as output the answer predicted by the network.
Figure 2 shows the entire workflow.
Figure 2. Workflow to perform inference with TensorRT runtime engine for BERT QA task
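To illustrate the last step, here is a minimal sketch of turning network output into an answer. It assumes the fine-tuned network emits one start score and one end score per token, as in the standard BERT QA head, and picks the highest-scoring span. The function, the scores, and the tokens are made up for illustration and are not taken from the sample's inference script.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) token indices maximizing start score + end score."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider spans that end at or after the start and stay short.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best

tokens = ["tensorrt", "is", "a", "fast", "inference", "platform"]
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1, 0.0]  # made-up scores
end_logits = [0.0, 0.1, 0.2, 0.3, 0.4, 2.5]    # made-up scores
start, end = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))
```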
Run the sample!
Set up your environment to perform BERT inference with the following steps:
Create a Docker image with the prerequisites.
Build the TensorRT engine from the fine-tuned weights.
Perform inference given a passage and query.
We use scripts to perform these steps, which you can find in the TensorRT BERT sample repo. While we describe several options that you can pass to each script, to get started quickly, you could also run the following code example:
# Clone the TensorRT repository and navigate to BERT demo directory
git clone --recursive https://github.com/NVIDIA/TensorRT && cd TensorRT
# Create and launch the Docker image
# Here we assume the following:
# - the os being ubuntu-18.04 (see below for other supported versions)
# - cuda version is 11.3.1
bash docker/build.sh --file docker/ubuntu-18.04.Dockerfile --tag tensorrt-ubuntu18.04-cuda11.3 --cuda 11.3.1
# Run the Docker container just created
bash docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.3 --gpus all
# cd into the BERT demo folder
cd $TRT_OSSPATH/demo/BERT
# Download the BERT model fine-tuned checkpoint
bash scripts/download_model.sh
# Build the TensorRT runtime engine.
# To build an engine, use the builder.py script.
mkdir -p engines && python3 builder.py -m models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/model.ckpt -o engines/bert_large_128.engine -b 1 -s 128 --fp16 -c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1
This last command builds an engine with a maximum batch size of 1 (-b 1), and sequence length of 128 (-s 128) using mixed precision (--fp16) and the BERT Large SQuAD v2 FP16 Sequence Length 128 checkpoint (-c models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1).
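A fixed sequence length means every question and passage pair must be packed into exactly that many token slots. The sketch below shows the usual BERT layout ([CLS] question [SEP] passage [SEP], then padding) on pre-split tokens. It is a simplified illustration: the helper is hypothetical, it truncates the passage rather than sliding a window over it, and real inputs would be WordPiece tokens from vocab.txt.

```python
def pack_inputs(question_tokens, passage_tokens, max_seq_len=128):
    """Pack a question and passage into fixed-length tokens, segment IDs, and mask."""
    budget = max_seq_len - len(question_tokens) - 3  # room left after [CLS] and two [SEP]
    if budget < 0:
        raise ValueError("question alone exceeds the sequence length")
    passage = passage_tokens[:budget]  # truncate the passage if it does not fit
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage + ["[SEP]"]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage) + 1)
    mask = [1] * len(tokens)
    pad = max_seq_len - len(tokens)
    return tokens + ["[PAD]"] * pad, segment_ids + [0] * pad, mask + [0] * pad

question = "what is tensorrt ?".split()
tokens, segment_ids, mask = pack_inputs(question, ["word"] * 10)
print(len(tokens), sum(mask))
```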
Now, give it a passage and see how much information it can decipher by asking a few questions.
python3 inference.py -e engines/bert_large_128.engine -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." -q "What is TensorRT?" -v models/fine-tuned/bert_tf_ckpt_large_qa_squad2_amp_128_v19.03.1/vocab.txt
The result of this command should be something similar to the following:
Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open-sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.
Question: What is TensorRT?
Answer: 'a high performance deep learning inference platform'
Given the same passage with a different question, you should get the following result:
Question: What is included in TensorRT?
Answer: 'parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference'
The answers provided by the model are accurate based on the text of the passage that was provided. The sample uses FP16 precision for performing inference with TensorRT. This helps achieve the highest performance possible on Tensor Cores in NVIDIA GPUs. In our tests, we measured the accuracy of TensorRT to be comparable to in-framework inference with FP16 precision.
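For a feel of what FP16 storage does to a value, the following sketch rounds a Python float to half precision with the standard struct module and reports the relative error. It only illustrates the number format; it does not measure the accuracy of TensorRT or of any model.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def relative_error(value):
    """Relative error introduced by storing value in half precision."""
    return abs(to_fp16(value) - value) / abs(value)

print(to_fp16(0.1), relative_error(0.1))
```

Half precision keeps 11 significant bits, so for values in its normal range the relative rounding error stays at or below 2**-11.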
Script options
Here are the options available with the scripts. The docker/build.sh script builds the Docker image using the Dockerfile supplied in the docker folder. It installs all necessary packages for the OS of the Dockerfile that you select. For this post, we used ubuntu-18.04, but Dockerfiles for ubuntu-16.04 and ubuntu-20.04 are also provided.
After creating and running the environment, download fine-tuned weights for BERT. Note that you do not need the pretrained weights to create the TensorRT engine (just the fine-tuned weights). Along with the fine-tuned weights, use the associated configuration file, which specifies parameters such as number of attention heads, number of layers, and the vocab.txt file, which contains the learned vocabulary from the training process. These are packaged with the fine-tuned model downloaded from NGC; download them using the download_model.sh script. As part of this script, you can specify the set of fine-tuned weights for the BERT model to download. The command-line parameters control the exact BERT model to be used later for model building and inference:
sh download_model.sh [tf|pyt] [base|large|megatron-large] [128|384] [v2|v1_1] [sparse] [int8-qat]
tf | pyt - determine whether to download the TensorFlow or PyTorch version
base | large | megatron-large - determine whether to download a BERT-base, BERT-large, or Megatron model to optimize
128 | 384 - determine whether to download a BERT model for sequence length 128 or 384
v2 | v1_1 - determine whether to download a model fine-tuned on SQuAD 2.0 or SQuAD 1.1
sparse - download the sparse version
int8-qat - download the INT8 weights
Examples:
# Running with default parameters
bash download_model.sh
# Running with custom parameters (BERT-large, FP32 fine-tuned weights, 128 sequence length)
sh download_model.sh large tf fp32 128
This script by default downloads fine-tuned TensorFlow BERT-large, with FP16 precision and a sequence length of 128. In addition to the fine-tuned model, you use the configuration file, enumerating model parameters and the vocabulary file used to convert BERT model output to a textual answer.
Next, you can build the TensorRT engine and use it for a QA example, that is, inference. The script builder.py builds the TensorRT engine for inference based on the downloaded BERT fine-tuned model.
Make sure that the sequence length provided to the following script matches the sequence length of the model that was downloaded.
-h, --help show this help message and exit
-m CKPT, --ckpt CKPT The checkpoint file basename, e.g.:
basename(model.ckpt-766908.data-00000-of-00001) is model.ckpt-766908 (default: None)
-x ONNX, --onnx ONNX The ONNX model file path. (default: None)
-pt PYTORCH, --pytorch PYTORCH
The PyTorch checkpoint file path. (default: None)
-o OUTPUT, --output OUTPUT
The bert engine file, ex bert.engine (default: bert_base_384.engine)
-b BATCH_SIZE, --batch-size BATCH_SIZE
Batch size(s) to optimize for.
The engine will be usable with any batch size below this, but may not be optimal for smaller sizes. Can be specified multiple times to optimize for more than one batch size.(default: [])
-s SEQUENCE_LENGTH, --sequence-length SEQUENCE_LENGTH
Sequence length of the BERT model (default: [])
-c CONFIG_DIR, --config-dir CONFIG_DIR
The folder containing the bert_config.json,
which can be downloaded e.g. from https://github.com/google-research/bert#pre-trained-models (default: None)
-f, --fp16 Indicates that inference should be run in FP16 precision
(default: False)
-i, --int8 Indicates that inference should be run in INT8 precision
(default: False)
-t, --strict Indicates that inference should be run in strict precision mode
(default: False)
-w WORKSPACE_SIZE, --workspace-size WORKSPACE_SIZE Workspace size in MiB for
building the BERT engine (default: 1000)
-j SQUAD_JSON, --squad-json SQUAD_JSON
squad json dataset used for int8 calibration (default: squad/dev-v1.1.json)
-v VOCAB_FILE, --vocab-file VOCAB_FILE
Path to file containing entire understandable vocab (default: ./pre-trained_model/uncased_L-24_H-1024_A-16/vocab.txt)
-n CALIB_NUM, --calib-num CALIB_NUM
calibration batch numbers (default: 100)
-p CALIB_PATH, --calib-path CALIB_PATH
calibration cache path (default: None)
-g, --force-fc2-gemm
Force use gemm to implement FC2 layer (default: False)
-iln, --force-int8-skipln
Run skip layernorm with INT8 (FP32 or FP16 by default) inputs and output (default: False)
-imh, --force-int8-multihead
Run multi-head attention with INT8 (FP32 or FP16 by default) input and output (default: False)
-sp, --sparse Indicates that model is sparse (default: False)
-tcf TIMING_CACHE_FILE, --timing-cache-file TIMING_CACHE_FILE
Path to tensorrt build timeing cache file, only available for tensorrt 8.0 and later (default: None)
You should now have a TensorRT engine, engines/bert_large_128.engine, to use in the inference.py script for QA.
Later in this post, we describe the process to build the TensorRT engine. You can now provide a passage and a query to inference.py and see if the model is able to answer your queries correctly.
There are a few ways to interact with the inference script:
The passage and question can be provided as command-line arguments using the --passage and --question flags.
They can be passed in from a given file using the --passage-file and --question-file flags.
If neither of these flags is given during execution, you are prompted to enter the passage and question after the execution has begun.
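The three input modes can be sketched with Python's argparse. This is a minimal illustration, not the code in inference.py: the flag names mirror the help text in this post, but the helper names (build_parser, resolve_text) and the order of precedence (flag first, then file, then prompt) are assumptions.

```python
import argparse


def build_parser():
    """Parser exposing only the passage/question flags discussed above."""
    parser = argparse.ArgumentParser(description="BERT QA input handling sketch")
    parser.add_argument("-p", "--passage", nargs="*", default=[])
    parser.add_argument("-pf", "--passage-file", default=None)
    parser.add_argument("-q", "--question", nargs="*", default=[])
    parser.add_argument("-qf", "--question-file", default=None)
    return parser


def resolve_text(words, file_path, prompt, ask=input):
    """Return text from the flag, else from the file, else by prompting.

    The precedence order is an assumption made for this sketch.
    """
    if words:
        return " ".join(words)
    if file_path:
        with open(file_path, encoding="utf-8") as f:
            return f.read().strip()
    return ask(prompt)


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["-p", "TensorRT", "is", "an", "SDK", "-q", "What is TensorRT?"]
)
passage = resolve_text(args.passage, args.passage_file, "Passage: ")
question = resolve_text(args.question, args.question_file, "Question: ")
print(passage)
print(question)
```

Passing ask as a parameter keeps the interactive fallback easy to replace when the script runs without a terminal.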
Here are the parameters for the inference.py script:
This script uses a prebuilt TensorRT BERT QA engine to answer a question based on the provided passage.
Here are the optional arguments:
-h, --help show this help message and exit
-e ENGINE, --engine ENGINE
Path to BERT TensorRT engine
-b BATCH_SIZE, --batch-size BATCH_SIZE
Batch size for inference.
-p [PASSAGE [PASSAGE ...]], --passage [PASSAGE [PASSAGE ...]]
Text for paragraph/passage for BERT QA
-pf PASSAGE_FILE, --passage-file PASSAGE_FILE
File containing input passage
-q [QUESTION [QUESTION ...]], --question [QUESTION [QUESTION ...]]
Text for query/question for BERT QA
-qf QUESTION_FILE, --question-file QUESTION_FILE
File containing input question
-sq SQUAD_JSON, --squad-json SQUAD_JSON
SQuAD json file
-o OUTPUT_PREDICTION_FILE, --output-prediction-file OUTPUT_PREDICTION_FILE
Output prediction file for SQuAD evaluation
-v VOCAB_FILE, --vocab-file VOCAB_FILE
Path to file containing entire understandable vocab
-s SEQUENCE_LENGTH, --sequence-length SEQUENCE_LENGTH
The sequence length to use. Defaults to 128
--max-query-length MAX_QUERY_LENGTH
The maximum length of a query in number of tokens.
Queries longer than this will be truncated
--max-answer-length MAX_ANSWER_LENGTH
The maximum length of an answer that can be generated
--n-best-size N_BEST_SIZE
Total number of n-best predictions to generate in the
nbest_predictions.json output file
--doc-stride DOC_STRIDE
When splitting up a long document into chunks, what
stride to take between chunks
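To make the --doc-stride option more concrete, here is a rough sketch of how a long passage could be split into overlapping windows. The function name split_with_stride is hypothetical, and the sample's own chunking logic may differ in detail (for example, in how it reserves room for the question tokens).

```python
def split_with_stride(tokens, max_len, doc_stride):
    """Split tokens into overlapping windows of at most max_len tokens.

    Each new window starts doc_stride tokens after the previous one, so
    consecutive windows overlap by max_len - doc_stride tokens.
    """
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("max_len and doc_stride must be positive")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the passage
        start += doc_stride
    return chunks


windows = split_with_stride(list(range(10)), max_len=4, doc_stride=2)
print(windows)
```

A stride smaller than the window length gives every token some surrounding context in at least one window, at the cost of running inference on more windows.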
BERT inference with TensorRT
For a step-by-step description and walkthrough of the inference process, see the Python script inference.py and the detailed Jupyter notebook inference.ipynb in the sample folder. Here are a few key parameters and concepts for performing inference with TensorRT.
BERT, or more specifically, the encoder layer, uses the following parameters to govern its operation:
Batch size
Sequence Length
Number of attention heads
The values of these parameters, which depend on the BERT model chosen, are used to set the configuration parameters for the TensorRT plan file (execution engine).
For each encoder, also specify the number of hidden layers and the attention head size. You can also read all the earlier parameters from the TensorFlow checkpoint file.
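As an illustration, these encoder parameters can be derived from the configuration file that ships with the fine-tuned weights. The sketch below assumes the key names used by the public BERT bert_config.json format (hidden_size, num_attention_heads, num_hidden_layers); the function name encoder_params is hypothetical and is not part of the sample.

```python
import json


def encoder_params(config_text):
    """Extract encoder settings from the text of a BERT config file."""
    config = json.loads(config_text)
    heads = config["num_attention_heads"]
    hidden = config["hidden_size"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        "num_heads": heads,
        "num_layers": config["num_hidden_layers"],
        # Each head works on an equal slice of the hidden dimension.
        "head_size": hidden // heads,
    }


# Values matching a BERT-large configuration.
example = '{"hidden_size": 1024, "num_attention_heads": 16, "num_hidden_layers": 24}'
params = encoder_params(example)
print(params)
```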
As the BERT model we are using has been fine-tuned for a downstream task of QA on the SQuAD dataset, the output for the network (that is, the output fully connected layer) is a span of text where the answer appears in the passage, referred to as h_output in the sample. After you generate the TensorRT engine, you can serialize it and use it later with TensorRT runtime.
During inference, you perform memory copies from CPU to GPU and the reverse asynchronously to get tensors into and out of the GPU memory, respectively. Asynchronous memory copy operation hides the latency of memory transfer by overlapping computations with memory copy operation between device and host. Figure 3 shows the asynchronous memory copies and kernel execution.
Figure 3. TensorRT Runtime process
The inputs to the BERT model (Figure 3) include the following:
input_ids: tensor with token ids of paragraph concatenated along with question that is used as input for inference
segment_ids: distinguishes between passage and question
input_mask: indicates which elements in the sequence are tokens, and which ones are padding elements
The outputs (start_logits and end_logits) represent the span of the answer, which the network predicts inside the passage based on the question.
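The following sketch shows one way these tensors could be assembled and how an answer span could be picked from the two logit vectors. It is a simplified illustration under stated assumptions: the function names are hypothetical, the special token IDs (101 for [CLS], 102 for [SEP], 0 for padding) follow the standard uncased BERT vocabulary, and the question-first ordering follows the usual BERT SQuAD convention. The sample's own pre- and post-processing is more involved.

```python
def build_inputs(question_ids, passage_ids, seq_len,
                 cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, segment_ids, input_mask), each padded to seq_len."""
    input_ids = [cls_id] + question_ids + [sep_id] + passage_ids + [sep_id]
    if len(input_ids) > seq_len:
        raise ValueError("sequence is longer than seq_len")
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers passage + [SEP].
    segment_ids = [0] * (len(question_ids) + 2) + [1] * (len(passage_ids) + 1)
    input_mask = [1] * len(input_ids)
    pad = seq_len - len(input_ids)
    return (input_ids + [pad_id] * pad,
            segment_ids + [0] * pad,
            input_mask + [0] * pad)


def best_span(start_logits, end_logits, max_answer_length):
    """Return the (start, end) pair with the highest summed logit."""
    best, best_score = None, float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_length, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


ids, segments, mask = build_inputs([5, 6], [7, 8, 9], seq_len=10)
span = best_span([0.1, 2.0, 0.3], [0.0, 0.5, 3.0], max_answer_length=2)
print(ids, segments, mask, span)
```

A production implementation would also restrict candidate spans to passage tokens and keep several top candidates, which is what options such as --n-best-size suggest.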
Benchmarking BERT inference performance
BERT can be applied both for online and offline use cases. Online NLP applications, such as conversational AI, place tight latency budgets on inference. Several models need to execute in a sequence in response to a single user query. When used as a service, the total time a customer experiences includes compute time as well as input and output network latency. Longer times lead to sluggish performance and a poor customer experience.
While the exact latency available for a single model can vary by application, several real-time applications need the language model to execute in under 10 ms.
Using an NVIDIA Ampere Architecture A100 GPU, BERT-Large optimized with TensorRT 8 can perform inference in 1.2 ms for a QA task similar to that available in SQuAD with batch size = 1 and sequence length = 128.
Using the TensorRT optimized sample, you can execute different batch sizes for BERT-base or BERT-large within the 10 ms latency budget. For example, the latency for inference on a BERT-Large model with sequence length = 384 and batch size = 1 on an A30 with TensorRT 8 was 3.62 ms. The same model with sequence length = 384 and batch size = 1, running highly optimized code on a CPU-only platform (**), took 76 ms.
Figure 4. Compute latency in milliseconds for executing BERT-large on an NVIDIA A30 GPU vs. a CPU-only server
The performance measures the compute-only latency time for executing the network on a QA task between passing tensors as input and gathering logits as output. You can find the code used to benchmark the sample in the script scripts/inference_benchmark.sh in the repo.
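For readers who want to see the shape of such a measurement, here is a minimal host-side sketch that averages latency over several runs after a warm-up. It is not the repo's benchmark script: measure_latency_ms and the stand-in workload are hypothetical, and for GPU inference the run_once callable would need to wait for the device to finish before returning.

```python
import statistics
import time


def measure_latency_ms(run_once, warmup=3, runs=10):
    """Return (mean latency in ms, list of per-run latencies in ms)."""
    for _ in range(warmup):
        run_once()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), samples


# Stand-in workload; replace with a call that runs one inference.
mean_ms, samples = measure_latency_ms(lambda: sum(range(10000)))
print(f"average over {len(samples)} runs: {mean_ms:.3f} ms")
print("within 10 ms budget:", mean_ms <= 10.0)
```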
Summary
NVIDIA is releasing TensorRT 8.0, which makes it possible to perform BERT inference in 0.74ms on A30 GPUs. The code for benchmarking inference on BERT is available as a sample in the TensorRT open-source repo.
This post gives an overview of how to use the TensorRT sample and performance results. We further describe a workflow of how to use the BERT sample as part of a simple application and Jupyter notebook where you can pass a paragraph and ask questions related to it. The new optimizations and achievable performance make it practical to use BERT in production for applications with tight latency budgets, such as conversational AI.
We are always looking for new ideas for new examples and applications to share. What NLP applications do you use BERT for and what examples would you like to see from us in the future?
CPU-only specifications: Gold 6240@2.60GHz 3.9GHz Turbo (Cascade Lake) HT Off, Single node, Single Socket, Number of CPU Threads = 18, Data=Real, Batch Size=1; Sequence Length=128; nireq=1; Precision=FP32; Data=Real; OpenVINO 2019 R2
GPU-server specification: Gold 6140@2GHz 3.7GHz Turbo (Skylake) HT On, Single node, Dual Socket, Number of CPU Threads = 72, T4 16GB, Driver Version 418.67 (r418_00), BERT-base, Batch Size=1; Number of heads = 12, Size per head = 64; 12 layers; Sequence Length=128; Precision=FP16; XLA=Yes; Data=Real; TensorRT 5.1
(**)
CPU-only specifications: Platinum 8380H@2.90GHz to 4.3 GHz Turbo (Cooper Lake) HT Off, Single node, Single Socket, Number of CPU Threads = 28, BERT-Large, Data=Real, Batch Size=1; Sequence Length=384; nireq=1; Precision=INT8; Data=Real; OpenVINO 2021 R3
GPU-server specification: AMD EPYC 7742@2.25GHz 3.4GHz Turbo (Rome) HT Off, Single node, Number of CPU Threads = 64, A30(GA100) 1*24258 MiB 1*56 SM, Driver Version 470.29 (r470_00), BERT-Large, Batch Size=1; Sequence Length=384; Precision=INT8; TensorRT 8.0
The NVIDIA Merlin and KGMON team earned 1st place in the RecSys Challenge 2021 by effectively predicting the probability of user engagement within a dynamic environment and providing fair recommendations on a multi-million point dataset. Twitter sponsored the RecSys Challenge 2021, curated the challenge’s multi-goal optimization requirements to mirror the real world, and provided multi-million data points each day over the course of the challenge for the teams to work with. NVIDIA’s win for a second year in a row reaffirms NVIDIA’s continued commitment to democratize and streamline recommender workflows.
NVIDIA Merlin Team Weaves Industry Engagement and Learnings Into Software Product
Billions of people are online. Each moment online represents an opportunity for a person to engage with a recommender while reading news, streaming entertainment, shopping, or engaging with social media. Twitter, as a single social media platform, reports an average of 199 million monetizable daily active users that engage within its dynamic environment. The RecSys Challenge 2021 reflects how providing quality recommendations that span millions of data points is extremely challenging. The Merlin team builds open source software designed to help machine learning engineers and data scientists tackle these problems and more. The team also leveraged their skills and experience building Merlin software to win the RecSys Challenge 2021. The challenge also provided insights and opportunities that feed back into Merlin, helping to continuously improve the product. For example, operators used to win last year’s RecSys Challenge 2020 were woven into a product release. This is particularly impactful when working with the Kaggle Grandmasters Of NVIDIA (KGMON) team, who are regular collaborators with the Merlin team on recommendation competitions and who bring insight from hundreds of Kaggle competition wins. The Merlin team’s hands-on engagement, coupled with feedback from Merlin’s early adopters, is vital for reaffirming NVIDIA’s commitment to democratizing the building and accelerating of recommenders.
Editor’s note: Feature image includes just a few of NVIDIA team members that participated in the challenge (clockwise from upper left): Bo Liu, Benedikt Schifferer, Gilberto Titericz and Chris Deotte.
This post was updated July 20, 2021 to reflect NVIDIA TensorRT 8.0 updates.
NVIDIA TensorRT is an SDK for deep learning inference. TensorRT provides APIs and parsers to import trained models from all major deep learning frameworks. It then generates optimized runtime engines deployable in the datacenter as well as in automotive and embedded environments.
This post provides a simple introduction to using TensorRT. You learn how to deploy a deep learning application onto a GPU, increasing throughput and reducing latency during inference. It uses a C++ example to walk you through converting a PyTorch model into an ONNX model and importing it into TensorRT, applying optimizations, and generating a high-performance runtime engine for the datacenter environment.
TensorRT supports both C++ and Python; if you use either, this workflow discussion could be useful. If you prefer to use Python, see Using the Python API in the TensorRT documentation.
Deep learning applies to a wide range of applications such as natural language processing, recommender systems, image, and video analysis. As more applications use deep learning in production, demands on accuracy and performance have led to strong growth in model complexity and size.
Safety-critical applications such as automotive place strict requirements on throughput and latency expected from deep learning models. The same holds true for some consumer applications, including recommendation systems.
TensorRT is designed to help deploy deep learning for these use cases. With support for every major framework, TensorRT helps process large amounts of data with low latency through powerful optimizations, use of reduced precision, and efficient memory use.
To follow along with this post, you need a computer with a CUDA-capable GPU or a cloud instance with GPUs and an installation of TensorRT. On Linux, the easiest place to get started is by downloading the GPU-accelerated PyTorch container with TensorRT integration from the NVIDIA Container Registry (on NGC). The link will have an updated version of the container, but to make sure that this tutorial works properly, we specify the version used for this post:
Because you use TensorRT 8 in this walkthrough, you must upgrade it in the container. The next step is to download the .deb package for TensorRT 8 (CUDA 11.0, Ubuntu 18.04), and install the following requirements:
# Export absolute path to directory hosting TRT8.deb
export TRT_DEB_DIR_PATH=$HOME/trt_release # Change this path to where you’re keeping your .deb file
# Run container
docker run --rm --gpus all -ti --volume $TRT_DEB_DIR_PATH:/workspace/trt_release --net host nvcr.io/nvidia/pytorch:20.07-py3
# Update TensorRT version to 8
dpkg -i nv-tensorrt-repo-ubuntu1804-cuda11.0-trt8.0.0.3-ea-20210423_1-1_amd64.deb
apt-key add /var/nv-tensorrt-repo-ubuntu1804-cuda11.0-trt8.0.0.3-ea-20210423/7fa2af80.pub
apt-get update
apt-get install -y libnvinfer8 libnvinfer-plugin8 libnvparsers8 libnvonnxparsers8
apt-get install -y libnvinfer-bin libnvinfer-dev libnvinfer-plugin-dev libnvparsers-dev
apt-get install -y tensorrt
# Verify TRT 8.0.0 installation
dpkg -l | grep TensorRT
Simple TensorRT example
Following are the four steps for this example application:
Convert the pretrained image segmentation PyTorch model into ONNX.
Import the ONNX model into TensorRT.
Apply optimizations and generate an engine.
Perform inference on the GPU.
Importing the ONNX model includes loading it from a saved file on disk and converting it to a TensorRT network from its native framework or format. ONNX is a standard for representing deep learning models enabling them to be transferred between frameworks.
Many frameworks such as Caffe2, Chainer, CNTK, PaddlePaddle, PyTorch, and MXNet support the ONNX format. Next, an optimized TensorRT engine is built based on the input model, target GPU platform, and other configuration parameters specified. The last step is to provide input data to the TensorRT engine to perform inference.
The application uses the following components in TensorRT:
ONNX parser: Takes a PyTorch-trained model that has been converted into the ONNX format as input and populates a network object in TensorRT.
Builder: Takes a network in TensorRT and generates an engine that is optimized for the target platform.
Engine: Takes input data, performs inferences, and emits inference output.
Logger: Associated with the builder and engine to capture errors, warnings, and other information during the build and inference phases.
Convert the pretrained image segmentation PyTorch model into ONNX
After you have successfully installed the PyTorch container from the NGC registry and upgraded it with TensorRT 8.0, run the following commands to download everything needed to run this sample application (example code, test input data, and reference outputs). Then, update the dependencies and compile the application with the makefile provided.
>> sudo apt-get install libprotobuf-dev protobuf-compiler # protobuf is a prerequisite library
>> git clone --recursive https://github.com/onnx/onnx.git # Pull the ONNX repository from GitHub
>> cd onnx
>> mkdir build && cd build
>> cmake .. # Compile and install ONNX
>> make # Use the ‘-j’ option for parallel jobs, for example, ‘make -j $(nproc)’
>> make install
>> cd ../..
>> git clone https://github.com/parallel-forall/code-samples.git
>> cd code-samples/posts/TensorRT-introduction
Modify $TRT_INSTALL_DIR in the Makefile.
>> make clean && make # Compile the TensorRT C++ code
>> cd ..
>> wget https://developer.download.nvidia.com/devblogs/speeding-up-unet.7z # Get the ONNX model and test data
>> sudo apt install p7zip-full
>> 7z x speeding-up-unet.7z # Unpack the model data into the unet folder
>> cd unet
>> python create_network.py #Inside the unet folder, it creates the unet.onnx file
Convert the PyTorch-trained UNet model into ONNX, as shown in the following code example:
Next, prepare the input data for inference. Download all images from the Kaggle directory. Copy to the /unet directory any three images that don’t have _mask in their filename and the utils.py file from the brain-segmentation-pytorch repository. Prepare three images to be used as input data later in this post. To prepare the input_0.pb and output_0.pb files for use later, run the following code example:
import torch
import argparse
import numpy as np
from torchvision import transforms
from skimage.io import imread
from onnx import numpy_helper
from utils import normalize_volume

def main(args):
    model = torch.hub.load('mateuszbuda/brain-segmentation-pytorch', 'unet',
                           in_channels=3, out_channels=1, init_features=32, pretrained=True)
    model.train(False)
    filename = args.input_image
    input_image = imread(filename)
    input_image = normalize_volume(input_image)
    input_image = np.asarray(input_image, dtype='float32')
    preprocess = transforms.Compose([
        transforms.ToTensor(),
    ])
    input_tensor = preprocess(input_image)
    input_batch = input_tensor.unsqueeze(0)
    tensor1 = numpy_helper.from_array(input_batch.numpy())
    with open(args.input_tensor, 'wb') as f:
        f.write(tensor1.SerializeToString())
    if torch.cuda.is_available():
        input_batch = input_batch.to('cuda')
        model = model.to('cuda')
    with torch.no_grad():
        output = model(input_batch)
    tensor = numpy_helper.from_array(output[0].cpu().numpy())
    with open(args.output_tensor, 'wb') as f:
        f.write(tensor.SerializeToString())

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--input_image', type=str)
    parser.add_argument('--input_tensor', type=str, default='input_0.pb')
    parser.add_argument('--output_tensor', type=str, default='output_0.pb')
    args=parser.parse_args()
    main(args)
To generate processed input data for inference, run the following commands:
>> pip install medpy #dependency for utils.py file
>> mkdir test_data_set_0
>> mkdir test_data_set_1
>> mkdir test_data_set_2
>> python prepareData.py --input_image your_image1 --input_tensor test_data_set_0/input_0.pb --output_tensor test_data_set_0/output_0.pb # This creates input_0.pb and output_0.pb
>> python prepareData.py --input_image your_image2 --input_tensor test_data_set_1/input_0.pb --output_tensor test_data_set_1/output_0.pb # This creates input_0.pb and output_0.pb
>> python prepareData.py --input_image your_image3 --input_tensor test_data_set_2/input_0.pb --output_tensor test_data_set_2/output_0.pb # This creates input_0.pb and output_0.pb
That’s it, you have the input data ready to perform inference.
Import the ONNX model into TensorRT, generate the engine, and perform inference
Run the sample application with the trained model and input data passed as inputs. The data is provided as an ONNX protobuf file. The sample application compares output generated from TensorRT with reference values available as ONNX .pb files in the same folder and summarizes the result on the prompt.
It can take a few seconds to import the UNet ONNX model and generate the engine. It also generates the output image in the portable gray map (PGM) format as output.pgm.
>> cd code-samples/posts/TensorRT-introduction-updated
>> ./simpleOnnx path/to/unet/unet.onnx fp32 path/to/unet/test_data_set_0/input_0.pb # The sample application expects output reference values in path/to/unet/test_data_set_0/output_0.pb
...
...
: --------------- Timing Runner: Conv_40 + Relu_41 (CaskConvolution)
: Conv_40 + Relu_41 Set Tactic Name: volta_scudnn_128x128_relu_exp_medium_nhwc_tn_v1 Tactic: 861694390046228376
: Tactic: 861694390046228376 Time: 0.237568
...
: Conv_40 + Relu_41 Set Tactic Name: volta_scudnn_128x128_relu_exp_large_nhwc_tn_v1 Tactic: -3853827649136781465
: Tactic: -3853827649136781465 Time: 0.237568
: Conv_40 + Relu_41 Set Tactic Name: volta_scudnn_128x64_sliced1x2_ldg4_relu_exp_large_nhwc_tn_v1 Tactic: -3263369460438823196
: Tactic: -3263369460438823196 Time: 0.126976
: Conv_40 + Relu_41 Set Tactic Name: volta_scudnn_128x32_sliced1x4_ldg4_relu_exp_medium_nhwc_tn_v1 Tactic: -423878181466897819
: Tactic: -423878181466897819 Time: 0.131072
: Fastest Tactic: -3263369460438823196 Time: 0.126976
: >>>>>>>>>>>>>>> Chose Runner Type: CaskConvolution Tactic: -3263369460438823196
...
...
INFO: [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1148, GPU 1959 (MiB)
: Total per-runner device memory is 79243264
: Total per-runner host memory is 13840
: Allocated activation device memory of size 1459617792
Inference batch size 1 average over 10 runs is 2.21147ms
Verification: OK
INFO: [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 1149, GPU 3333 (MiB)
And that’s it, you have an application that is optimized with TensorRT and running on your GPU. Figure 2 shows the output of a sample test case.
(2a): Original MRI input image
(2b): Segmented ground truth from test dataset
(2c): Predicted segmented image using TensorRT
Figure 2: Inference using TensorRT on a brain MRI image.
Here are a few key code examples used in the earlier sample application.
The main function in the following code example starts by declaring a CUDA engine to hold the network definition and trained parameters. The engine is generated in the SimpleOnnx::createEngine function that takes the path to the ONNX model as input.
// Declare the CUDA engine
SampleUniquePtr<nvinfer1::ICudaEngine> mEngine{nullptr};
...
// Create the CUDA engine
mEngine = SampleUniquePtr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config));
The SimpleOnnx::createEngine function parses the ONNX model and holds it in the network object. To handle the dynamic input dimensions of input images and shape tensors for the U-Net model, you must create an optimization profile from the builder class, as shown in the following code example.
The optimization profile enables you to set the minimum, optimum, and maximum input dimensions. The builder selects the kernel that results in the lowest runtime for the optimum input tensor dimensions and that is valid for all input tensor dimensions in the range between the minimum and maximum dimensions. It also converts the network object into a TensorRT engine.
The setMaxBatchSize function in the following code example is used to specify the maximum batch size that a TensorRT engine expects. The setMaxWorkspaceSize function allows you to increase the GPU memory footprint during the engine building phase.
bool SimpleOnnx::createEngine(const SampleUniquePtr<nvinfer1::IBuilder>& builder)
{
    // Create a network using the parser.
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));
    ...
    auto parser = SampleUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, gLogger));
    auto parsed = parser->parseFromFile(mParams.onnxFilePath.c_str(), static_cast<int>(nvinfer1::ILogger::Severity::kINFO));
    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    // Describe the range of input shapes the engine must support.
    auto profile = builder->createOptimizationProfile();
    profile->setDimensions("input.1", OptProfileSelector::kMIN, Dims4{1, 3, 256, 256});
    profile->setDimensions("input.1", OptProfileSelector::kOPT, Dims4{1, 3, 256, 256});
    profile->setDimensions("input.1", OptProfileSelector::kMAX, Dims4{32, 3, 256, 256});
    config->addOptimizationProfile(profile);
    ...
    // Setup model precision.
    if (mParams.fp16)
    {
        config->setFlag(BuilderFlag::kFP16);
    }
    // Build the engine.
    mEngine = SampleUniquePtr<nvinfer1::ICudaEngine>(builder->buildEngineWithConfig(*network, *config));
    ...
    return true;
}
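The optimization profile in createEngine defines a range of input shapes, from a batch of 1 up to a batch of 32 images of size 3x256x256. As a rough mental model only, the range check can be written in a few lines of Python. The Profile class here is hypothetical and is not the TensorRT API; it just mirrors the kMIN, kOPT, and kMAX dimensions used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """Stand-in for the min/opt/max dimensions of one network input."""
    min_dims: tuple
    opt_dims: tuple
    max_dims: tuple

    def accepts(self, dims):
        """True if every dimension lies between its minimum and maximum."""
        if len(dims) != len(self.min_dims):
            return False
        return all(lo <= d <= hi
                   for lo, d, hi in zip(self.min_dims, dims, self.max_dims))


profile = Profile(min_dims=(1, 3, 256, 256),
                  opt_dims=(1, 3, 256, 256),
                  max_dims=(32, 3, 256, 256))
print(profile.accepts((3, 3, 256, 256)))   # batch of 3 is inside [1, 32]
print(profile.accepts((64, 3, 256, 256)))  # batch of 64 exceeds the maximum
```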
After an engine has been created, create an execution context to hold intermediate activation values generated during inference. The following code shows how to create the execution context.
// Declare the execution context
SampleUniquePtr<nvinfer1::IExecutionContext> mContext{nullptr};
...
// Create the execution context
mContext = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());
This application places inference requests on the GPU asynchronously in the function launchInference shown in the following code example. Inputs are copied from host (CPU) to device (GPU) within launchInference. The inference is then performed with the enqueueV2 function, and results copied back asynchronously.
The example uses CUDA streams to manage asynchronous work on the GPU. Asynchronous inference execution generally increases performance by overlapping compute with data transfers, which maximizes GPU utilization. The enqueueV2 function places inference requests on CUDA streams and takes as input runtime batch size, pointers to input and output, plus the CUDA stream to be used for kernel execution. Asynchronous data transfers are performed from the host to the device and the reverse using cudaMemcpyAsync.
Using the cudaStreamSynchronize function after calling launchInference ensures GPU computations complete before the results are accessed. The number of inputs and outputs, as well as the value and dimension of each, can be queried using functions from the ICudaEngine class. The sample finally compares reference output with TensorRT-generated inferences and prints discrepancies to the prompt.
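The final comparison step can be pictured as an element-wise check within a tolerance. The sketch below is a simplified Python stand-in for what the C++ sample does: the function name find_discrepancies and the tolerance values are assumptions chosen for illustration, not values taken from the sample.

```python
import math


def find_discrepancies(reference, actual, rel_tol=1e-3, abs_tol=1e-5):
    """Return (index, reference, actual) for each value that differs."""
    if len(reference) != len(actual):
        raise ValueError("outputs have different lengths")
    return [(i, r, a)
            for i, (r, a) in enumerate(zip(reference, actual))
            if not math.isclose(r, a, rel_tol=rel_tol, abs_tol=abs_tol)]


mismatches = find_discrepancies([1.0, 2.0, 3.0], [1.0, 2.0001, 3.5])
print("Verification: OK" if not mismatches else mismatches)
```

A tolerance is needed because reduced precision and different kernel choices can change results slightly without indicating a real error.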
For more information about classes, see the TensorRT Class List. The complete code example is in simpleOnnx_1.cpp.
Batch your inputs
This application example expects a single input and returns output after performing inference on it. Real applications commonly batch inputs to achieve higher performance and efficiency. A batch of inputs that are identical in shape and size can be computed in parallel on different layers of the neural network.
Larger batches generally enable more efficient use of GPU resources. For example, batch sizes using multiples of 32 may be particularly fast and efficient in lower precision on Volta and Turing GPUs because TensorRT can use special kernels for matrix multiply and fully connected layers that leverage Tensor Cores.
Pass the images to the application on the command line using the following code. The number of images (.pb files) passed as input arguments on the command line determines the batch size in this example. Use test_data_set_* to take all the input_0.pb files from all the directories. Instead of reading just one input, the following command reads all inputs available in the folders.
Currently, the downloaded data has three input directories, so the batch size is 3. This version of the example profiles the application and prints the result to the prompt. For more information, see the next section, Profile the application.
>> ./simpleOnnx path/to/unet/unet.onnx fp32 path/to/unet/test_data_set_*/input_0.pb # Use all available test data sets.
...
INFO: [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 1148, GPU 1806 (MiB)
: Total per-runner device memory is 79243264
: Total per-runner host memory is 13840
: Allocated activation device memory of size 1459617792
Inference batch size 3 average over 10 runs is 4.99552ms
To process multiple images in one inference pass, make a couple of changes to the application. First, collect all images (.pb files) in a loop to use as input in the application:
for (int i = 2; i
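As a minimal sketch of that collection step, the following helper gathers every command-line argument from a given index onward into a list of input paths. The name collectInputFiles and the firstInput parameter are assumptions made for illustration and are not part of the sample; the index of the first .pb argument depends on how many fixed arguments precede it.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not from the sample): treats every command-line
// argument from firstInput onward as the path of one input file.
std::vector<std::string> collectInputFiles(int argc, const char* const* argv, int firstInput)
{
    std::vector<std::string> inputFiles;
    for (int i = firstInput; i < argc; ++i)
    {
        inputFiles.emplace_back(argv[i]);
    }
    return inputFiles;
}
```

The size of the returned vector can then serve as the batch size, which matches how this example derives the batch size from the number of .pb files on the command line.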
Next, specify the maximum batch size that a TensorRT engine expects using the setMaxBatchSize function. The builder then generates an engine tuned for that batch size by choosing algorithms that maximize its performance on the target platform. While the engine does not accept larger batch sizes, using smaller batch sizes at runtime is allowed.
The choice of maxBatchSize value depends on the application as well as the expected inference traffic (for example, the number of images) at any given time. A common practice is to build multiple engines optimized for different batch sizes (using different maxBatchSize values), and then choose the most optimized engine at runtime.
When not specified, the default batch size is 1, meaning that the engine does not process batch sizes greater than 1. Set this parameter as shown in the following code example:
builder->setMaxBatchSize(batchSize);
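If you build several engines with different maxBatchSize values, one simple selection rule at run time is to pick the smallest engine that still fits the incoming batch. The sketch below shows only that rule. The helper name pickEngineBatchSize is hypothetical, and the assumption that the tightest fit is the best choice is a heuristic that you should confirm by profiling.

```cpp
#include <vector>

// Hypothetical helper (not part of TensorRT or the sample): given the
// maxBatchSize values of the engines that were built, returns the smallest
// one that can hold requestedBatch, or -1 if none is large enough.
int pickEngineBatchSize(const std::vector<int>& builtMaxBatchSizes, int requestedBatch)
{
    int best = -1;
    for (int maxBatch : builtMaxBatchSizes)
    {
        // An engine cannot accept a batch larger than its maxBatchSize.
        if (maxBatch >= requestedBatch && (best == -1 || maxBatch < best))
        {
            best = maxBatch;
        }
    }
    return best;
}
```

A return value of -1 signals that the batch must be split or a larger engine built, because an engine does not accept batch sizes above its maximum.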
Profile the application
Now that you’ve seen an example, here’s how to measure its performance. The simplest performance measurement for network inference is the time elapsed between an input being presented to the network and an output being returned, referred to as latency.
For many applications on embedded platforms, latency is critical, while consumer applications require quality of service. Lower latencies make these applications better. This example measures the average latency of an application using timestamps on the GPU. There are many ways to profile your application in CUDA. For more information, see How to Implement Performance Metrics in CUDA C/C++.
CUDA offers lightweight event API functions to create, destroy, and record events, as well as calculate the time between them. The application can record events in the CUDA stream, one before initiating inference and another after the inference completes, as shown in the following code example. The function cudaEventElapsedTime measures the time between these two events being encountered in the CUDA stream.
In some cases, you might also want to include the time it takes to transfer data between the GPU and CPU before inference initiates and after inference completes. Techniques exist to pre-fetch data to the GPU and to overlap compute with data transfers, which can hide much of the data transfer overhead.
Use the following code example for latency calculation within SimpleOnnx::infer:
// Number of times to run inference and calculate average time
constexpr int ITERATIONS = 10;
...
bool SimpleOnnx::infer()
{
CudaEvent start;
CudaEvent end;
double totalTime = 0.0;
CudaStream stream;
for (int i = 0; i
Many applications perform inferences on large amounts of input data accumulated and batched for offline processing. The maximum number of inferences possible per second, known as throughput, is a valuable metric for these applications.
You measure throughput by generating optimized engines for larger specific batch sizes, running inference, and measuring the number of batches that can be processed per second. Use the number of batches per second and the batch size to calculate the number of inferences per second. A full throughput benchmark is out of scope for this post.
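The arithmetic behind that last step is short. The following sketch converts an average per-batch latency into inferences per second. The function name inferencesPerSecond is an assumption made for illustration, and the calculation assumes batches run back to back with no overlap between them, so treat the result as a rough estimate rather than a measured throughput.

```cpp
#include <stdexcept>

// Hypothetical helper: estimates inferences per second from the batch size
// and the average time, in milliseconds, to process one batch.
double inferencesPerSecond(int batchSize, double avgBatchLatencyMs)
{
    if (batchSize <= 0 || avgBatchLatencyMs <= 0.0)
    {
        throw std::invalid_argument("batch size and latency must be positive");
    }
    const double batchesPerSecond = 1000.0 / avgBatchLatencyMs;
    return batchesPerSecond * batchSize;
}
```

For the run shown earlier, with a batch size of 3 and an average of 4.99552 ms per batch, this estimate comes to roughly 600 inferences per second.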
Optimize your application
Now that you know how to run inference in batches and profile your application, optimize it. The key strength of TensorRT is its flexibility and use of techniques including mixed precision, efficient optimizations on all GPU platforms, and the ability to optimize across a wide range of model types.
In this section, we describe a few techniques to increase the throughput and reduce the latency of applications. For more information, see Best Practices for TensorRT Performance.
Here are a few common techniques:
Use mixed precision computation
Change the workspace size
Reuse the TensorRT engine
Use mixed precision computation
By default, TensorRT uses FP32 algorithms for performing inference to obtain the highest possible inference accuracy. However, in many cases you can use FP16 and INT8 precision for inference with minimal impact on the accuracy of results.
Using reduced precision to represent models enables you to fit larger models in memory and achieve higher performance, given the lower data transfer requirements of reduced precision. You can also mix computations in FP32 and FP16 precision with TensorRT, referred to as mixed precision, or use INT8 quantized precision for weights and activations to execute layers.
Enable FP16 kernels by calling setFlag(BuilderFlag::kFP16) on the builder configuration for devices that support fast FP16 math.
if (mParams.fp16)
{
config->setFlag(BuilderFlag::kFP16);
}
The BuilderFlag::kFP16 flag indicates to the builder that a lower precision is acceptable for computations. TensorRT uses FP16 optimized kernels if they perform better with the chosen configuration and target platform.
With this mode turned on, weights can be specified in FP16 or FP32, and are converted automatically to the appropriate precision for the computation. You also have the flexibility of specifying 16-bit floating point data type for input and output tensors, which is out of scope for this post.
Change the workspace size
TensorRT allows you to increase the GPU memory footprint during the engine building phase with the setMaxWorkspaceSize function. Setting this limit too low may filter out several algorithms and create a suboptimal engine, while increasing it may affect the number of applications that can share the GPU at the same time. TensorRT allocates only the memory it requires, even if the amount set with setMaxWorkspaceSize is much higher, so it never allocates more than the limit and typically uses less. Applications should therefore allow the TensorRT builder as much workspace as they can afford.
This example uses 1 GB, which lets TensorRT pick any algorithm available.
// Allow TensorRT to use up to 1 GB of GPU memory for tactic selection
constexpr size_t MAX_WORKSPACE_SIZE = 1ULL << 30; // 1 GB
...
config->setMaxWorkspaceSize(MAX_WORKSPACE_SIZE);
Reuse the TensorRT engine
When building the engine, the builder object selects the most optimized kernels for the chosen platform and configuration. Building the engine from a network definition file can be time-consuming and should not be repeated each time you perform inference, unless the model, platform, or configuration changes.
Figure 3 shows that you can transform the format of the engine after generation and store it on disk for later reuse, known as serializing the engine. Deserializing occurs when you load the engine from disk into memory and continue to use it for inference.
Figure 3. Serializing and deserializing the TensorRT engine.
The runtime object deserializes the engine.
The SimpleOnnx::buildEngine function first tries to load and use a serialized engine if one exists in the current directory. If no engine is available, it builds one and saves it in the current directory with a name that encodes the model, batch size, and precision, for example unet_batch4_fp32.engine.
To force a new engine to be built with updated configuration and parameters, use the make clean_engines command to delete all existing serialized engines stored on disk before re-running the code example.
bool SimpleOnnx::buildEngine()
{
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(gLogger));
    // The engine file name encodes the batch size and precision it was built for.
    string precision = mParams.fp16 ? "fp16" : "fp32";
    string enginePath{getBasename(mParams.onnxFilePath) + "_batch" + to_string(mParams.batchSize)
        + "_" + precision + ".engine"};
    string buffer = readBuffer(enginePath);
    if (buffer.size())
    {
        // Try to deserialize engine.
        SampleUniquePtr<nvinfer1::IRuntime> runtime{nvinfer1::createInferRuntime(gLogger)};
        mEngine = SampleUniquePtr<nvinfer1::ICudaEngine>(runtime->deserializeCudaEngine(buffer.data(), buffer.size(), nullptr));
    }
    if (!mEngine)
    {
        // Fallback to creating engine from scratch.
        createEngine(builder);
        if (mEngine)
        {
            SampleUniquePtr<nvinfer1::IHostMemory> engine_plan{mEngine->serialize()};
            // Try to save engine for future uses.
            writeBuffer(engine_plan->data(), engine_plan->size(), enginePath);
        }
    }
    // Report failure if no engine could be loaded or built.
    return mEngine != nullptr;
}
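The readBuffer and writeBuffer helpers used above are not shown in this post. As a rough sketch of what such helpers might look like, the following versions read and write a binary file with standard C++ streams. The names readBufferSketch and writeBufferSketch are hypothetical stand-ins, and the sample's own helpers may differ in signature and error handling.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical stand-in for readBuffer: returns the raw bytes of the file,
// or an empty string if the file cannot be opened.
std::string readBufferSketch(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
    {
        return {};
    }
    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}

// Hypothetical stand-in for writeBuffer: writes size bytes starting at data
// to path and returns true if the write succeeded.
bool writeBufferSketch(const void* data, std::size_t size, const std::string& path)
{
    std::ofstream out(path, std::ios::binary);
    if (!out)
    {
        return false;
    }
    out.write(static_cast<const char*>(data), static_cast<std::streamsize>(size));
    return out.good();
}
```

Returning an empty buffer for a missing file fits the pattern in buildEngine, where an empty buffer means no serialized engine exists yet and the engine must be built from scratch.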
You’ve now learned how to speed up inference of a simple application using TensorRT. The performance numbers throughout this post were measured on NVIDIA TITAN V GPUs with TensorRT 8.
Next steps
Real-world applications have much higher computing demands with larger deep learning models, more data processing needs, and tighter latency bounds. TensorRT offers high-performance optimizations for compute-heavy deep learning applications and is an invaluable tool for inference.
Hopefully, this post has familiarized you with the key concepts needed to get amazing performance with TensorRT. Here are some ideas to apply what you have learned, use other models, and explore the impact of design and performance tradeoffs by changing parameters introduced in this post.
The TensorRT support matrix provides a look into supported features and software for TensorRT APIs, parsers, and layers. While this example used C++, TensorRT provides both C++ and Python APIs. To run the sample application included in this post, see the APIs and Python and C++ code examples in the TensorRT Developer Guide.
Change the allowable precision by setting or clearing the BuilderFlag::kFP16 flag for the models and profile the applications to see the difference in performance.
Change the batch size used at run time for inference and see how that impacts the performance (latency, throughput) of your model and dataset.
Change the maxBatchSize parameter from 64 to 4 and see how different kernels get selected among the top five. Use nvprof to see the kernels in the profiling results.
One topic not covered in this post is performing inference accurately in TensorRT with INT8 precision. TensorRT can convert an FP32 network for deployment with INT8 reduced precision while minimizing accuracy loss. To achieve this goal, models can be quantized using post-training quantization and quantization-aware training with TensorRT. For more information, see Achieving FP32 Accuracy for INT8 Inference using Quantization Aware Training with TensorRT.
There are numerous resources to help you accelerate applications for image/video, speech apps, and recommendation systems. These range from code samples, self-paced Deep Learning Institute labs and tutorials to developer tools for profiling and debugging applications.
If you have issues with TensorRT, check the NVIDIA TensorRT Developer Forum to see if other members of the TensorRT community have a resolution first. NVIDIA Registered Developers can also file bugs on the Developer Program page.