Today, NVIDIA announced TensorRT 8.0, which brings BERT-Large inference latency down to 1.2 ms with new optimizations. This version also delivers 2x the accuracy for INT8 precision with Quantization Aware Training, and significantly higher performance through support for Sparsity, which was introduced in Ampere GPUs.
TensorRT is an SDK for high-performance deep learning inference that includes an inference optimizer and a runtime delivering low latency and high throughput. TensorRT is used across industries such as Healthcare, Automotive, Manufacturing, Internet/Telecom Services, Financial Services, and Energy, and has been downloaded nearly 2.5 million times.
Several new kinds of transformer-based models are now used across conversational AI. New generalized optimizations in TensorRT can accelerate all such models, halving inference time compared with TensorRT 7.
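To try these optimizations on your own model, one common starting point is TensorRT's `trtexec` tool with an ONNX export. The following is a hedged sketch, not an official recipe: it assumes TensorRT 8 is installed and uses `model.onnx` as a placeholder file name.

```shell
# Build and time a TensorRT engine from an ONNX export with FP16 enabled.
# (model.onnx is a placeholder path for your own exported model.)
trtexec --onnx=model.onnx --fp16

# TensorRT 8 adds a sparsity switch for Ampere structured sparsity.
trtexec --onnx=model.onnx --fp16 --sparsity=enable
```

`trtexec` reports per-iteration latency and throughput after building the engine, which makes it convenient for comparing the effect of each flag.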
Highlights from this version include:
- BERT Inference in 1.2 ms with new transformer optimizations
- Achieve accuracy equivalent to FP32 with INT8 precision using Quantization Aware Training
- Introducing Sparsity support for faster inference on Ampere GPUs
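Quantization Aware Training works by simulating INT8 quantization during training, so the network learns weights that survive the rounding. As an illustration only (not TensorRT's implementation), here is a minimal NumPy sketch of the symmetric quantize-dequantize step that QAT inserts into the forward pass:

```python
import numpy as np

def fake_quantize(x, num_bits=8):
    # Symmetric per-tensor quantization: snap floats onto the INT8 grid,
    # then map them back to floats so training still sees FP values.
    qmax = 2 ** (num_bits - 1) - 1          # 127 for INT8
    scale = np.abs(x).max() / qmax           # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return (q * scale).astype(np.float32)    # dequantize

w = np.array([0.1, -0.5, 0.25, 1.0], dtype=np.float32)
wq = fake_quantize(w)  # each entry moves by at most half a quantization step
```

Because the rounding happens inside the training loop, gradient descent compensates for the quantization error, which is how QAT recovers FP32-level accuracy at INT8 precision.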
You can learn more about Sparsity here.
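Ampere's structured sparsity requires that, in every contiguous group of four weights, at most two are nonzero; Sparse Tensor Cores can then skip the zeros. As a rough illustration (not NVIDIA's pruning tooling), this NumPy sketch projects a weight vector onto the 2:4 pattern by zeroing the two smallest-magnitude entries in each group:

```python
import numpy as np

def prune_2_4(w):
    # Enforce the 2:4 structured-sparsity pattern along the last axis:
    # in every contiguous group of 4 weights, keep the 2 largest magnitudes.
    w = np.asarray(w, dtype=np.float32)
    groups = w.reshape(-1, 4)
    # indices of the 2 smallest-magnitude entries in each group of 4
    idx = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, idx, 0.0, axis=1)
    return pruned.reshape(w.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.01, 0.6], dtype=np.float32)
p = prune_2_4(w)  # → [0.9, 0, 0.4, 0, -0.7, 0, 0, 0.6]
```

In practice the pruned model is fine-tuned afterward so accuracy recovers while the hardware-friendly 2:4 pattern is preserved.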
WeChat, one of the biggest social media platforms in China, accelerates its search with TensorRT, serving 500 million users a month.
“We have implemented model inference acceleration based on TensorRT and INT8 QAT to speed up core tasks of WeChat Search, such as Query Understanding and Results Ranking. Our GPU + TensorRT solution breaks through the conventional limits on NLP model complexity, so BERT/Transformer models can be fully integrated into it. In addition, we have achieved a significant (70%) reduction in allocated computational resources through these performance optimizations.” – Huili/Raccoonliu/Dickzhu, WeChat Search
To learn more about TensorRT 8 and its features:
- Real-Time Natural Language Understanding with BERT Using TensorRT
- Achieving FP32 Accuracy for INT8 Inference using Quantization Aware Training with TensorRT
- Accelerating Inference with Sparsity using Ampere Architecture and TensorRT
- Speeding Up Deep Learning Inference Using TensorRT
- Importing models from TensorFlow and ONNX
- TensorRT Quick Start Guide
- Notebook: Optimize Object Detection with EfficientDet and TensorRT
- Notebook: BERT with QAT and Sparsity
Watch these GTC sessions to get familiar with the technologies:
- GTC Session S31876: Accelerate Deep Learning Inference with TensorRT 8.0
- GTC Session S31552: Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture
- GTC Session S31653: Quantization Aware Training in PyTorch with TensorRT 8.0
- GTC Session S32224: Accelerating Deep Learning Inference with OnnxRuntime-TensorRT
- GTC Session S31732: Inference with TensorFlow 2 Integrated with TensorRT
- GTC Session S31828: TensorRT Quick Start Guide
NVIDIA TensorRT is freely available to members of the NVIDIA Developer Program. To learn more, visit the TensorRT product page.