
Accelerating Quantized Networks with the NVIDIA QAT Toolkit for TensorFlow and NVIDIA TensorRT

An introduction to the NVIDIA Quantization-Aware Training (QAT) Toolkit for TensorFlow 2, which quantizes models for TensorRT-accelerated inference on NVIDIA GPUs.

We’re excited to announce the NVIDIA Quantization-Aware Training (QAT) Toolkit for TensorFlow 2 with the goal of accelerating quantized networks with NVIDIA TensorRT on NVIDIA GPUs. This toolkit provides you with an easy-to-use API to quantize networks in a way that is optimized for TensorRT inference with just a few additional lines of code.

This post is accompanied by the Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT GTC session. For the PyTorch quantization toolkit equivalent, see PyTorch Quantization.

Background

Accelerating deep neural network (DNN) inference is an important step in realizing latency-critical deployments of real-world applications such as image classification, image segmentation, and natural language processing.

The need to improve DNN inference latency has sparked interest in running these models in lower precisions, such as FP16 and INT8. Running DNNs in INT8 precision can offer faster inference and a much lower memory footprint than their floating-point counterparts. NVIDIA TensorRT supports both post-training quantization (PTQ) and QAT techniques to convert floating-point DNN models to INT8 precision.

In this post, we discuss these techniques, introduce the NVIDIA QAT toolkit for TensorFlow, and demonstrate an end-to-end workflow to design quantized networks optimal for TensorRT deployment.

Quantization-aware training

The main idea behind QAT is to simulate lower-precision behavior during training so that quantization error is minimized as part of the optimization. To do that, you modify the DNN graph by adding quantize and de-quantize (QDQ) nodes around desired layers. Because the model's weights and quantization parameters are fine-tuned together, QAT typically loses less accuracy than PTQ.

PTQ, on the other hand, performs model quantization using a calibration dataset after the model has already been trained. This can result in accuracy degradation because quantization is not reflected in the training process. Figure 1 shows both processes.

Block diagrams of the quantization steps in PTQ (which uses a calibration dataset to compute q-parameters) and QAT (which simulates quantization via QDQ nodes and fine-tuning).
Figure 1. Quantization workflows through PTQ and QAT

For more information about quantization, quantization methods (PTQ compared to QAT), and quantization in TensorRT, see Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT.
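
To make the simulated-quantization idea concrete, the following is a minimal sketch (plain NumPy, not toolkit code) of what a QDQ pair computes under symmetric, per-tensor INT8 quantization. During QAT, the forward pass carries this rounding error so the network can learn to compensate for it; the amax-based scale shown here is one common calibration choice and may differ from what the toolkit or TensorRT computes.

import numpy as np

def fake_quantize(x, num_bits=8):
    # Symmetric per-tensor quantization: derive the scale from the max absolute value
    qmax = 2 ** (num_bits - 1) - 1                   # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)    # "Q": snap to the integer grid
    return q * scale                                 # "DQ": map back to float

x = np.random.randn(64, 64).astype(np.float32)
x_qdq = fake_quantize(x)
print("max simulated quantization error:", np.abs(x - x_qdq).max())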

NVIDIA QAT Toolkit for TensorFlow

The goal of this toolkit is to enable you to easily quantize networks in a way that is optimal for TensorRT deployment.

Currently, TensorFlow offers asymmetric quantization in its open-source Model Optimization Toolkit (TFMOT). Its quantization recipe consists of inserting QDQ nodes at the outputs and weights (if applicable) of desired layers, and it supports quantization of the full model or partial quantization by layer class type. This is optimized for TFLite deployment, not TensorRT deployment.

This toolkit is needed to obtain a quantized model that is ideal for TensorRT deployment. The TensorRT optimizer propagates Q and DQ nodes and fuses them with floating-point operations across the network to maximize the proportion of the graph that can be processed in INT8, which leads to optimal model acceleration on NVIDIA GPUs. Our quantization recipe consists of inserting QDQ nodes at the inputs and weights (if applicable) of desired layers.

We also perform symmetric quantization (used by TensorRT) and offer extended quantization support with partial quantization by layer name and pattern-based layer quantization.

Table 1 summarizes the differences between TFMOT and the NVIDIA QAT Toolkit for TensorFlow.

  • QDQ node placements: TFMOT places QDQ nodes at layer outputs and weights; the NVIDIA QAT Toolkit places them at layer inputs and weights.
  • Quantization support: TFMOT quantizes the whole model (full) or some layers (partial, by layer class); the NVIDIA QAT Toolkit extends TF quantization support with partial quantization by layer name and pattern-based layer quantization by extending CustomQDQInsertionCase.
  • Quantization op used: TFMOT uses asymmetric quantization (tf.quantization.fake_quant_with_min_max_vars); the NVIDIA QAT Toolkit uses symmetric quantization, needed for TensorRT compatibility (tf.quantization.quantize_and_dequantize_v2).
Table 1. Differences between the NVIDIA QAT Toolkit and TensorFlow Model Optimization Toolkit
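
To illustrate the difference between the two quantization ops, here is a minimal sketch. The min/max values in the asymmetric call are arbitrary examples, and the symmetric call derives a range that is centered on zero, as TensorRT-style quantization requires.

import tensorflow as tf

x = tf.random.normal([4, 8])

# TFMOT-style asymmetric fake quantization (example min/max chosen arbitrarily)
y_tfmot = tf.quantization.fake_quant_with_min_max_vars(
    x, min=-1.2, max=0.8, num_bits=8)

# TensorRT-compatible symmetric quantization used by the NVIDIA QAT Toolkit:
# signed INT8 over a range that is symmetric around zero
amax = tf.reduce_max(tf.abs(x))
y_trt = tf.quantization.quantize_and_dequantize_v2(
    x, input_min=-amax, input_max=amax,
    signed_input=True, num_bits=8, range_given=True, narrow_range=True)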

Figure 2 shows a before/after example of a simple model, visualized with Netron. The QDQ nodes are placed at the inputs and weights (if applicable) of desired layers, namely convolution (Conv) and fully connected (MatMul) layers.

Contains two images, one before QAT (no QDQ nodes), and one after QAT (with QDQ nodes before Conv and MatMul layers).
Figure 2. Example of a model before and after quantization (baseline and QAT model, respectively)

Workflow for deploying QAT models in TensorRT

Figure 3 shows the full workflow to deploy a QAT model, obtained with the QAT Toolkit, in TensorRT.

Block diagram with steps for model quantization, conversion to ONNX, and TensorRT deployment.
Figure 3. TensorRT deployment workflow for QAT models obtained with the QAT Toolkit
  • Assume a pretrained TensorFlow 2 model in SavedModel format, also referred to as the baseline model.
  • Quantize that model using the quantize_model function, which clones and wraps each desired layer with QDQ nodes.
  • Fine-tune the obtained quantized model, simulating quantization during training, and save it in SavedModel format.
  • Convert it to ONNX.

The ONNX graph is then consumed by TensorRT to perform layer fusions and other graph optimizations, such as dedicated QDQ optimizations, and generate an engine for faster inference.

Example with ResNet-50v1

In this example, we show how to quantize and fine-tune a model with the QAT Toolkit for TensorFlow 2 and how to deploy the quantized model in TensorRT. For more information, see the full example_resnet50v1.ipynb Jupyter notebook.

Requirements

To follow along, you need the following resources:

  • Python 3.8
  • TensorFlow 2.8
  • NVIDIA TF-QAT Toolkit
  • TensorRT 8.4

Prepare the data

For this example, use the ImageNet 2012 dataset for image classification (task 1), which requires manual downloads due to the terms of the access agreement. This dataset is needed for the QAT model fine-tuning, and it is also used to evaluate the baseline and QAT models.

Log in or sign up on the linked website and download the train/validation data. You should have at least 155 GB of free space.

The workflow supports the TFRecord format, so use the following instructions (modified from the TensorFlow instructions) to convert the downloaded .tar ImageNet files to the required format:

  1. Set IMAGENET_HOME=/path/to/imagenet/tar/files in data/imagenet_data_setup.sh.
  2. Download imagenet_to_gcs.py to $IMAGENET_HOME.
  3. Run ./data/imagenet_data_setup.sh.

You should now see the compatible dataset in $IMAGENET_HOME.
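
The fine-tuning code in the next section consumes train_batches and val_batches tf.data pipelines. The accompanying notebook builds these for you; the following is a rough, hypothetical sketch of such a pipeline, assuming the feature keys and 1-based labels produced by the imagenet_to_gcs.py conversion (adjust the paths and preprocessing to match your setup).

import tensorflow as tf

IMAGENET_HOME = "/path/to/imagenet"  # hypothetical location of the TFRecords

def parse_example(serialized):
    # Feature keys follow the imagenet_to_gcs.py TFRecord layout
    features = tf.io.parse_single_example(serialized, {
        "image/encoded": tf.io.FixedLenFeature([], tf.string),
        "image/class/label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.io.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.keras.applications.resnet50.preprocess_input(image)
    label = features["image/class/label"] - 1  # records store 1-based labels
    return image, label

def make_dataset(pattern, batch_size=64):
    files = tf.data.Dataset.list_files(pattern)
    records = files.interleave(
        tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    return (records
            .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
            .batch(batch_size)
            .prefetch(tf.data.AUTOTUNE))

# Adjust the globs to wherever the setup script places the converted records
train_batches = make_dataset(IMAGENET_HOME + "/train/train-*")
val_batches = make_dataset(IMAGENET_HOME + "/validation/validation-*")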

Quantize and fine-tune the model

import tensorflow as tf

from tensorflow_quantization import quantize_model
from tensorflow_quantization.custom_qdq_cases import ResNetV1QDQCase

# Create baseline model
model = tf.keras.applications.ResNet50(weights="imagenet", classifier_activation="softmax")

# Quantize model
q_model = quantize_model(model, custom_qdq_cases=[ResNetV1QDQCase()])

# Fine-tune
q_model.compile(
    optimizer="sgd",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"]
)
q_model.fit(
    train_batches, validation_data=val_batches,
    steps_per_epoch=500, epochs=2
)  # batching is handled by the tf.data pipelines, so no batch_size here

# Save as TF 2 SavedModel
q_model.save("saved_model_qat")
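
Optionally, before exporting, you can sanity-check that fine-tuning has preserved accuracy by evaluating both models on the validation pipeline. This is a sketch; it assumes the val_batches pipeline from the data-preparation step.

# Compile the baseline model with the same loss/metric so it can be evaluated
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=["accuracy"])

baseline_acc = model.evaluate(val_batches, verbose=0)[1]
qat_acc = q_model.evaluate(val_batches, verbose=0)[1]
print(f"baseline top-1: {baseline_acc:.4f}, QAT top-1: {qat_acc:.4f}")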

Convert SavedModel to ONNX

$ python -m tf2onnx.convert --saved-model= --output=  --opset 13

Deploy the TensorRT engine

Convert the ONNX model into a TensorRT engine (also obtains latency measurements):

$ trtexec --onnx= --int8 --saveEngine= -v
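
If you prefer the TensorRT Python API over trtexec, a minimal sketch of building the INT8 engine from the exported ONNX file might look like the following. The file names are hypothetical placeholders; because the graph carries explicit Q/DQ nodes, no INT8 calibrator is needed.

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("resnet50_qat.onnx", "rb") as f:       # hypothetical ONNX file name
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("Failed to parse ONNX model")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)            # scales come from the Q/DQ nodes

engine_bytes = builder.build_serialized_network(network, config)
with open("resnet50_qat.engine", "wb") as f:     # hypothetical engine file name
    f.write(engine_bytes)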

Obtain accuracy results on the validation dataset:

$ python infer_engine.py --engine= --data_dir= -b=

Results

In this section, we report accuracy and latency performance numbers for various models in the ResNet and EfficientNet families:

  • ResNet-50v1
  • ResNet-50v2
  • ResNet-101v1
  • ResNet-101v2
  • EfficientNet-B0
  • EfficientNet-B3

All results were obtained on the NVIDIA A100 GPU with batch size 1 using TensorRT 8.4 (EA for ResNet and GA for EfficientNet).

Figure 4 shows the accuracy comparison between the baseline FP32 models and their quantized equivalents (PTQ and QAT). As you can see, there is little to no accuracy loss between the baseline and QAT models; in some cases the QAT model is even slightly more accurate, thanks to the additional fine-tuning. QAT also achieves higher accuracy than PTQ overall, because the model parameters are fine-tuned during QAT.

Bar plot graph comparing the FP32 baseline, and INT8 PTQ and QAT models. The graph shows similar accuracies in all models.
Figure 4. Accuracy of ResNet and EfficientNet models in FP32 (baseline), INT8 with PTQ, and INT8 with QAT

ResNet, as a network structure, is generally stable under quantization, so the gap between PTQ and QAT is small. EfficientNet, however, benefits greatly from QAT, showing a much smaller accuracy drop from the baseline model than with PTQ.

For more information about how different models may benefit from QAT, see Table 7 in Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation (quantization whitepaper).

Figure 5 shows that PTQ and QAT achieve similar latencies and deliver speedups of up to 19x over their respective baseline models.

Bar plot with FP32 and INT8 latency: 17x speedup in ResNet-50v1, 11x in ResNet-50v2, 19x in ResNet-101v1, 13x in ResNet-101v2, 10x in EfficientNet-B0, and 8x in EfficientNet-B3.
Figure 5. Latency performance evaluation on various models in the ResNet and EfficientNet families

PTQ can sometimes be slightly faster than QAT as it tries to quantize all layers in the model, which usually results in faster inference, whereas QAT only quantizes the layers wrapped with QDQ nodes.

For more information about how TensorRT works with QDQ nodes, see Working with INT8 in the TensorRT documentation and the Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT GTC session.

For more information about performance numbers on various supported models, see the model zoo.

Conclusion

In this post, we introduced the NVIDIA QAT Toolkit for TensorFlow 2 and discussed its advantages in the context of TensorRT inference acceleration. We then demonstrated how to use the toolkit with ResNet-50 and performed accuracy and latency evaluations on several ResNet and EfficientNet models.

Experimental results show that the accuracy of INT8 models trained with QAT is within about 1% of their FP32 counterparts, while achieving up to a 19x speedup in latency.

For more information, see the following resources:


Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin

Accounting for nearly half of global vehicle sales in 2021, SUVs have grown in popularity given their versatility. Now, NIO aims to amp up the volume further. This week, the electric automaker unveiled the ES7 SUV, purpose-built for the intelligent vehicle era. Its sporty yet elegant body houses an array of cutting-edge technology, including the Read article >



AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases

At a time when much about COVID-19 remained a mystery, U.K.-based PrecisionLife used AI and combinatorial analytics to discover new genes associated with severe symptoms and hospitalizations for patients. The techbio company’s study, published in June 2020, pinpoints 68 novel genes associated with individuals who experienced severe disease from the virus. Over 70 percent of Read article >



Get Your Wish: Genshin Impact Coming to GeForce NOW

Greetings, Traveler. Prepare for adventure. Genshin Impact, the popular open-world action role-playing game, is leaving limited beta and launching for all GeForce NOW members next week. Gamers can get their game on today with the six total games joining the GeForce NOW library. As announced last week, Warhammer 40,000: Darktide is coming to the cloud Read article >



Use copy-paste on TensorFlow object detection API

Is there a way to use copy-paste augmentation with the TensorFlow object detection API?

submitted by /u/giakou4


GPU memory full error

My laptop has an RTX 3050 GPU (4 GB VRAM), but training the model gives a GPU memory-full error.

How can I run my model?

submitted by /u/NCEnvironmental772


Weight decay TensorFlow object detection API

How can I add weight decay to the optimizer (for example, Adam) in the TensorFlow object detection API?

When setting the optimizer, the options are:

optimizer {
  adam_optimizer: {
    epsilon: 1e-7  # Match tf.keras.optimizers.Adam's default.
    learning_rate: {
      manual_step_learning_rate {
        initial_learning_rate: 1e-3
        schedule {
          step: 90000
          learning_rate: 1e-4
        }
        schedule {
          step: 120000
          learning_rate: 1e-5
        }
      }
    }
  }
}

submitted by /u/giakou4


Headless VM or not for ML?

Is there any advantage to having a GUI? I work primarily over SSH anyway.

submitted by /u/AwardPsychological38


How can I set the max_split_size_mb ?

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 4.00 GiB total capacity; 3.42 GiB already allocated; 0 bytes free; 3.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

submitted by /u/AwardPsychological38


Spark-NLP 4.0.0: New modern extractive Question answering (QA) annotators for ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa, official support for Apple silicon M1, support oneDNN to improve CPU up to 97%, improved transformers on GPU up to +700%, 1000+ SOTA models

submitted by /u/dark-night-rises