Categories
Misc

How to make tensorflow lite available for the entire system

My tflite directory is as follows:

/home/me/tensorflow_src/tensorflow/lite/ 

However, I fail to import it in my C++ project:

#include "tensorflow/lite/interpreter.h" // getting a not found error 

How can I resolve this error? My assumption is that I’d need to add tflite to my bash configuration to make it available for all of my projects. How can I add tflite to the bash file?

submitted by /u/janissary2016
[visit reddit] [comments]

Categories
Misc

Finding "look_back" & "look_ahead" hyper-parameters for Seq2Seq models

For Seq2Seq deep learning architectures, viz., LSTM/GRU, and multivariate, multistep time series forecasting, it's important to reshape the data to 3D: (batch_size, look_back, number_features). Here, _look_back_ decides the number of past data points/samples to consider, using _number_features_ features from your training dataset. Similarly, _look_ahead_ defines the number of steps into the future that you want your model to forecast.

I have written a function to help achieve this:

import numpy as np

def split_series_multivariate(data, n_past, n_future):
    '''
    Create training and testing splits required by Seq2Seq
    architecture(s) for multivariate, multistep and multivariate
    output time-series modeling.
    '''
    X, y = list(), list()
    for window_start in range(len(data)):
        past_end = window_start + n_past
        future_end = past_end + n_future
        if future_end > len(data):
            break
        # slice past and future parts of window-
        past, future = data[window_start: past_end, :], data[past_end: future_end, :]
        # past, future = data[window_start: past_end, :], data[past_end: future_end, 4]
        X.append(past)
        y.append(future)
    return np.array(X), np.array(y)

But, _look_back_ and _look_ahead_ are hyper-parameters which need to be tuned for a given dataset.

# Define hyper-parameters for Seq2Seq modeling:
# look-back window size-
n_past = 30
# number of future steps to predict for-
n_future = 10
# number of features used
n_features = 8
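For example, here is a minimal usage sketch of the function above, with synthetic data standing in for the real training array of shape (num_samples, n_features):

# Synthetic stand-in for the real (num_samples, n_features) training array
data = np.random.rand(1000, n_features)

X, y = split_series_multivariate(data, n_past, n_future)
print(X.shape)  # (961, 30, 8) -> (batch_size, look_back, number_features)
print(y.shape)  # (961, 10, 8) -> (batch_size, look_ahead, number_features)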

What is the _best practice_ for choosing/finding _look_back_ and _look_ahead_ hyper-parameters?

submitted by /u/grid_world
[visit reddit] [comments]

Categories
Misc

Minimizing Storage Usage on Jetson

Some NVIDIA Jetson modules have limited storage space, which imposes a challenge in packing applications and libraries. Here are ways to cut down on disk usage.

NVIDIA Jetson provides flexible storage options/configurations for development, but some of the Jetson modules are equipped with a limited eMMC flash memory storage size for more cost-conscious, large-scale product deployment.

It may initially seem impossible to fit your applications and necessary libraries in the limited storage space, especially with the full set of NVIDIA JetPack, BSP, and all the development software that NVIDIA has prepackaged for Jetson.

Configuration | NVIDIA JetPack 5.0.1 DP (Rel 34.1.1), Jetson AGX Orin Developer Kit | NVIDIA JetPack 4.6.2 (Rel 32.7.2), Jetson AGX Xavier Developer Kit | NVIDIA JetPack 4.6.1 (Rel 32.7.1), Jetson Xavier NX Developer Kit
(Original) Regular L4T ([a]) | 6.1 GB | 5.5 GB |
(Original) Full JetPack ([A]) | 16.6 GB | 11.6 GB | 11.6 GB
Table 1. Disk usage in the original configurations

However, you can cut down on disk usage by removing unnecessary packages, libraries, and other assets. Table 2 shows how you can reclaim more than 8 GB of storage space on some of the latest NVIDIA JetPack versions.

Configuration | NVIDIA JetPack 5.0.1 DP (Rel 34.1.1), Jetson AGX Orin Developer Kit | NVIDIA JetPack 4.6.2 (Rel 32.7.2), Jetson AGX Xavier Developer Kit | NVIDIA JetPack 4.6.1 (Rel 32.7.1), Jetson Xavier NX Developer Kit
Example deployment configuration ([D]) | 8.3 GB | 5.2 GB | 5.3 GB
Table 2. Disk usage in an optimized deployment configuration

In this post, I present simplified steps to minimize the disk usage on your Jetson device, while sharing tips on ways to analyze disk usage, actual commands, and the example outputs on different versions of JetPack. I also show how to check if an AI application is still working functionally under the slimmed-down configuration.

Identifying what takes up space

The jetson-min-disk documentation shows how to analyze the current disk usage, identify what files and directories take up the space, and clarify the package dependencies. It also shows example command outputs on NVIDIA JetPack 4.6.x and NVIDIA JetPack 5.0.x, so that you can assess how much you may be able to cut down for your application.

Minimized configurations

Figure 1 shows an overview of the minimal configurations. The jetson-min-disk documentation introduces multiple configurations ([A] to [D]) for different development and productization needs.

Overview of the minimal configurations.
Figure 1. Different minimal configurations

You can take the following actions to regain disk space.

  • Remove the desktop user interface.
  • Remove the documentation and samples package.
  • Remove dev packages.

Remove the desktop graphical user interface

You can remove ubuntu-desktop if you know your system does not require a graphical user interface on the NVIDIA Jetson native display output through HDMI, DP/eDP, or LVDS. For more information, see Removing GUI.

Action | NVIDIA JetPack 5.0.1 DP (Rel 34.1.1), Jetson AGX Orin Developer Kit | NVIDIA JetPack 4.6.2 (Rel 32.7.2), Jetson AGX Xavier Developer Kit | NVIDIA JetPack 4.6.1 (Rel 32.7.1), Jetson Xavier NX Developer Kit
Removing the graphical user interface | 3.4 GB | 5.5 GB | 4.2 GB
Table 3. Disk space regained in a typical setup by removing the desktop graphical user interface
$ sudo apt-get update
$ sudo apt-get purge $(cat apt-packages-only-in-full.txt)
$ sudo apt-get install network-manager
$ sudo reboot

Remove the documentation and samples package

If you have installed the full set of JetPack components (libraries and SDKs) either with the sudo apt install nvidia-jetpack command or by using SDK Manager to install all, you may have packages that you do not need for your application.

Documentation and samples packages are some of the safest to remove, so you can start by uninstalling them. For more information, see Removing docs/sample.

Action | NVIDIA JetPack 5.0.1 DP (Rel 34.1.1), Jetson AGX Orin Developer Kit | NVIDIA JetPack 4.6.2 (Rel 32.7.2), Jetson AGX Xavier Developer Kit | NVIDIA JetPack 4.6.1 (Rel 32.7.1), Jetson Xavier NX Developer Kit
Removing docs and samples | 0.8 GB | 1.2 GB | 1.1 GB
Table 4. Disk space regained in a typical setup by removing documentation and sample packages
$ sudo dpkg -r --force-depends "cuda-documentation-10-2" "cuda-samples-10-2" "libnvinfer-samples" "libvisionworks-samples" "libnvinfer-doc" "vpi1-samples"

Remove dev packages and static libraries

When you are done with building your applications, you do not need dev packages that provide header files and static libraries. You can remove them after checking how much disk space each package takes up. For more information, see Removing dev packages.

Action | NVIDIA JetPack 5.0.1 DP (Rel 34.1.1), Jetson AGX Orin Developer Kit | NVIDIA JetPack 4.6.2 (Rel 32.7.2), Jetson AGX Xavier Developer Kit | NVIDIA JetPack 4.6.1 (Rel 32.7.1), Jetson Xavier NX Developer Kit
Removing static libraries | 4.8 GB | 2.1 GB | 2.2 GB
Table 5. Disk space regained in a typical setup by removing dev packages
$ sudo dpkg -r --force-depends $(dpkg-query -Wf '${Package}\n' | grep -E "(cuda[^ ]+dev|libcu[^ ]+dev|libnv[^ ]+dev|vpi[^ ]+dev)")

Starting with minimal L4T BSP

If you are using an x86-64 Linux host machine to flash your Jetson, you can create a minimal-configuration RootFS and flash that image onto the Jetson.

For more information about building the minimal L4T RootFS image, see Option: Minimal L4T.

Verification

The guide introduces ways to use the NVIDIA DeepStream reference app as a typical AI application to verify the minimally configured Jetson environment. For more information, see Verification.

git clone https://github.com/NVIDIA-AI-IOT/jetson-min-disk/
cd jetson-min-disk
cd test-docker
./docker-run-deepstream-app-overlay.sh
Figure 2. DeepStream reference app with overlay output config in action

Conclusion

In this post, I demonstrated ways to work within NVIDIA Jetson storage space limitations while identifying and keeping the essential runtime libraries for AI applications.

The documentation cross-referenced in this post provides commands and tips for different NVIDIA JetPack versions. They can be great tools if you are interested in optimizing the storage usage, especially on NVIDIA Jetson production modules.

Categories
Misc

Improving GPU Utilization in Kubernetes

To improve NVIDIA GPU utilization in K8s clusters, we offer new GPU time-slicing APIs, enabling multiple GPU-accelerated workloads to time-slice and run on a single NVIDIA GPU.

For scalable data center performance, NVIDIA GPUs have become a must-have. 

NVIDIA GPU parallel processing capabilities, supported by thousands of computing cores, are essential to accelerating a wide variety of applications across different industries. The most compute-intensive applications across diverse industries use GPUs today:

  • High-performance computing, such as aerospace, bioscience research, or weather forecasting
  • Consumer applications that use AI to improve search, recommendations, language translation, or transportation, such as autonomous driving
  • Healthcare, such as enhanced medical imaging
  • Financial, such as fraud detection
  • Entertainment, such as visual effects

Different applications across this spectrum can have different computational requirements. Training giant AI models, where the GPUs batch-process hundreds of data samples in parallel, keeps the GPUs fully utilized during the training process. However, many other application types may only require a fraction of the GPU compute, thereby resulting in underutilization of the massive computational power.

In such cases, provisioning the right-sized GPU acceleration for each workload is key to improving utilization and reducing the operational costs of deployment, whether on-premises or in the cloud.

To address the challenge of GPU utilization in Kubernetes (K8s) clusters, NVIDIA offers multiple GPU concurrency and sharing mechanisms to suit a broad range of use cases. The latest addition is the new GPU time-slicing APIs, now broadly available in Kubernetes with NVIDIA K8s Device Plugin 0.12.0 and the NVIDIA GPU Operator 1.11. Together, they enable multiple GPU-accelerated workloads to time-slice and run on a single NVIDIA GPU.

Before diving into this new feature, here’s some background on use cases where you should consider sharing GPUs and an overview of all the technologies available to do that.

When to share NVIDIA GPUs 

Here are some example workloads that can benefit from sharing GPU resources for better utilization: 

  • Low-batch inference serving, which may only process one input sample on the GPU
  • High-performance computing (HPC) applications, such as simulating photon propagation, that balance computation between the CPU (to read and process inputs) and GPU (to perform computation). Some HPC applications may not achieve high throughput on the GPU portion due to bottlenecks on the CPU core performance.
  • Interactive development for ML model exploration using Jupyter notebooks 
  • Spark-based data analytics applications, where some tasks, or the smallest units of work, are run concurrently and benefit from better GPU utilization
  • Visualization or offline rendering applications that may be bursty in nature
  • Continuous integration/continuous delivery (CICD) pipelines that want to use any available GPUs for testing

In this post, we explore the various technologies available for sharing access to NVIDIA GPUs in a Kubernetes cluster, including how to use them and the tradeoffs to consider while choosing the right approach.

GPU concurrency mechanisms

The NVIDIA GPU hardware, in conjunction with the CUDA programming model, provides a number of different concurrency mechanisms for improving GPU utilization. The mechanisms range from programming model APIs, where the applications need code changes to take advantage of concurrency, to system software and hardware partitioning including virtualization, which are transparent to applications (Figure 1).

 Figure showing the various concurrency mechanisms supported by NVIDIA GPUs, ranging from programming model APIs, CUDA MPS, time-slicing, MIG and NVIDIA vGPU.
Figure 1. GPU concurrency mechanisms

CUDA streams

The asynchronous model of CUDA means that you can perform a number of operations concurrently by a single CUDA context, analogous to a host process on the GPU side, using CUDA streams.

A stream is a software abstraction that represents a sequence of commands, which may be a combination of computation kernels, memory copies, and so on, that all execute in order. Work launched in two different streams can execute simultaneously, allowing for coarse-grained parallelism. The application can manage parallelism using CUDA streams and stream priorities.

CUDA streams maximize GPU utilization for inference serving, for example, by using streams to run multiple models in parallel. You can either scale the same model or serve different models. For more information, see Asynchronous Concurrent Execution.

The tradeoff with streams is that the APIs can only be used within a single application, thus offering limited hardware isolation (as all resources are shared) and limited error isolation between streams.

Time-slicing

When dealing with multiple CUDA applications, each of which may not fully utilize the GPU’s resources, you can use a simple oversubscription strategy to leverage the GPU’s time-slicing scheduler. This is supported by compute preemption starting with the Pascal architecture. This technique, sometimes called temporal GPU sharing, does carry a cost for context-switching between the different CUDA applications, but some underutilized applications can still benefit from this strategy. 

Since CUDA 11.1 (R455+ drivers), the time-slice duration for CUDA applications is configurable through the nvidia-smi utility: 

$ nvidia-smi compute-policy --help

    Compute Policy -- Control and list compute policies.

    Usage: nvidia-smi compute-policy [options]

    Options include:
    [-i | --id]: GPU device ID's. Provide comma
                 separated values for more than one device

    [-l | --list]: List all compute policies

    [ | --set-timeslice]: Set timeslice config for a GPU:
                          0=DEFAULT, 1=SHORT, 2=MEDIUM, 3=LONG

    [-h | --help]: Display help information

The tradeoffs with time-slicing are increased latency, jitter, and potential out-of-memory (OOM) conditions when many different applications are time-slicing on the GPU. This mechanism is what we focus on in the second part of this post.

CUDA Multi-Process Service

You can take the oversubscription strategy described earlier a step further with CUDA MPS. MPS enables CUDA kernels from different processes, typically MPI ranks, to be processed concurrently on the GPU when each process is too small to saturate the GPU’s compute resources. Unlike time-slicing, MPS enables the CUDA kernels from different processes to execute in parallel on the GPU. 

Newer releases of CUDA (since CUDA 11.4+) have added more fine-grained resource provisioning: you can specify limits on the amount of allocatable memory (CUDA_MPS_PINNED_DEVICE_MEM_LIMIT) and on the compute available to MPS clients (CUDA_MPS_ACTIVE_THREAD_PERCENTAGE). For more information about the usage of these tuning knobs, see the Volta MPS Execution Resource Provisioning documentation.

The tradeoffs with MPS are the limitations with error isolation, memory protection, and quality of service (QoS). The GPU hardware resources are still shared among all MPS clients. You can use CUDA MPS with Kubernetes today, but NVIDIA plans to improve support for MPS over the coming months.

Multi-instance GPU (MIG)

The mechanisms discussed so far rely either on changes to the application using the CUDA programming model APIs, such as CUDA streams, or CUDA system software, such as time-slicing or MPS. 

With MIG, GPUs based on the NVIDIA Ampere Architecture, such as NVIDIA A100, can be securely partitioned into up to seven separate GPU instances for CUDA applications, providing multiple applications with dedicated GPU resources. These include streaming multiprocessors (SMs) and GPU engines, such as copy engines or decoders, to provide a defined QoS with fault isolation for different clients such as processes, containers, or virtual machines (VMs).

When the GPU is partitioned, you can use the prior mechanisms of CUDA streams, CUDA MPS, and time-slicing within a single MIG instance. 

For more information, see the MIG user guide and MIG Support in Kubernetes.

Virtualization with vGPU

NVIDIA vGPU enables virtual machines with full input-output memory management unit (IOMMU) protection to have simultaneous, direct access to a single physical GPU. Apart from security, NVIDIA vGPU brings in other benefits such as VM management with live VM migration and the ability to run mixed VDI and compute workloads, as well as integration with a number of industry hypervisors.

On GPUs that support MIG, each GPU partition is exposed as a single-root I/O virtualization (SR-IOV) virtual function for a VM. All VMs can run in parallel as opposed to being time-sliced (on GPUs that do not support MIG). 

Table 1 summarizes these technologies including when to consider these concurrency mechanisms. 

Feature | Streams | MPS | Time-Slicing | MIG | vGPU
Partition Type | Single process | Logical | Temporal (single process) | Physical | Temporal and physical (VMs)
Max Partitions | Unlimited | 48 | Unlimited | 7 | Variable
SM Performance Isolation | No | Yes (by percentage, not partitioning) | Yes | Yes | Yes
Memory Protection | No | Yes | Yes | Yes | Yes
Memory Bandwidth QoS | No | No | No | Yes | Yes
Error Isolation | No | No | Yes | Yes | Yes
Cross-Partition Interop | Always | IPC | Limited IPC | Limited IPC | No
Reconfigure | Dynamic | At process launch | N/A | When idle | N/A
GPU Management (telemetry) | N/A | Limited GPU metrics | N/A | Yes (GPU metrics, support for containers) | Yes (live migration and other industry virtualization tools)
Target use cases (and when to use each) | Optimize for concurrency within a single application | Run multiple applications in parallel but can deal with limited resiliency | Run multiple applications that are not latency-sensitive or can tolerate jitter | Run multiple applications in parallel but need resiliency and QoS | Support multi-tenancy on the GPU through virtualization and need VM management benefits
Table 1. Comparison of GPU concurrency mechanisms

With this background, the rest of the post focuses on oversubscribing GPUs using the new time-slicing APIs in Kubernetes. 

Time-slicing support in Kubernetes

NVIDIA GPUs are advertised as schedulable resources in Kubernetes through the device plugin framework. However, this framework only allows for devices, including GPUs (as nvidia.com/gpu), to be advertised as integer resources and thus does not allow for oversubscription. In this section, we discuss a new method for oversubscribing GPUs in Kubernetes using time-slicing.

Before we discuss the new APIs, we introduce a new mechanism for configuring the NVIDIA Kubernetes device plugin using a configuration file. 

New configuration file support

The Kubernetes device plugin offers a number of configuration options, which are set either as command-line options or environment variables, such as setting the MIG strategy, device enumeration, and so on. gpu-feature-discovery (GFD) uses similar options for generating labels to describe GPU nodes.

As configuration options become more complex, you use a configuration file to express these options to the Kubernetes device plugin and GFD, which is then deployed as a configmap object and applied to the plugin and the GFD pods during startup. 

The configuration options are expressed in a YAML file. In the following example, you record the various options in a file called dp-example-config.yaml, created under /tmp:

$ cat << EOF > /tmp/dp-example-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
  gfd:
    oneshot: false
    noTimestamp: false
    outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
    sleepInterval: 60s
EOF

Then, start the Kubernetes device plugin by specifying the location of the config file and using the gfd.enabled=true option to start GFD as well:

$ helm install nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set-file config.map.config=/tmp/dp-example-config.yaml

Dynamic configuration changes

The configuration is applied to all GPUs on all nodes by default. The Kubernetes device plugin enables multiple configuration files to be specified. You can override the configuration on a node-by-node basis by overwriting a label on the node.

The Kubernetes device plugin uses a sidecar container that detects changes in desired node configurations and reloads the device plugin so that new configurations can take effect. In the following example, you create two configurations for the device plugin: a default that is applied to all nodes and another that you can apply to A100 GPU nodes on demand.

$ helm install nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set-file config.map.default=/tmp/dp-example-config-default.yaml \
    --set-file config.map.a100-80gb-config=/tmp/dp-example-config-a100.yaml

The Kubernetes device plugin then enables dynamic changes to the configuration whenever the node label is overwritten, allowing for configuration on a per-node basis if so desired:

$ kubectl label node \
    --overwrite \
    --selector=nvidia.com/gpu.product=A100-SXM4-80GB \
    nvidia.com/device-plugin.config=a100-80gb-config

Time-slicing APIs

To support time-slicing of GPUs, you extend the definition of the configuration file with the following fields:

version: v1
sharing:
  timeSlicing:
    renameByDefault: <bool>
    failRequestsGreaterThanOne: <bool>
    resources:
    - name: <resource-name>
      replicas: <num-replicas>
    ...

That is, for each named resource under sharing.timeSlicing.resources, a number of replicas can now be specified for that resource type.

Moreover, if renameByDefault=true, then each resource is advertised under the name <resource-name>.shared instead of simply <resource-name>.

The failRequestsGreaterThanOne flag is false by default for backward compatibility. It controls whether pods can request more than one GPU resource. A request of more than one GPU does not imply that the pod gets proportionally more time slices, as the GPU scheduler currently gives an equal share of time to all processes running on the GPU.

The failRequestsGreaterThanOne flag configures the behavior of the plugin to treat a request of one GPU as an access request rather than an exclusive resource request. 

As the new oversubscribed resources are created, the Kubernetes device plugin assigns these resources to the requesting jobs. When two or more jobs land on the same GPU, the jobs automatically use the GPU’s time-slicing mechanism. The plugin does not offer any other additional isolation benefits. 

Labels applied by GFD

For GFD, the labels that get applied depend on whether renameByDefault=true. Regardless of the setting for renameByDefault,  the following label is always applied:

nvidia.com/<resource-name>.replicas = <num-replicas>

However, when renameByDefault=false, the following suffix is also added to the nvidia.com/<resource-name>.product label:

nvidia.com/gpu.product = <product name>-SHARED

Using these labels, you have a way of selecting a shared or non-shared GPU in the same way you would traditionally select one GPU model over another. That is, the SHARED annotation ensures that you can use a nodeSelector object to attract pods to nodes that have shared GPUs on them. Moreover, the pods can ensure that they land on a node that is dividing a GPU into their desired proportions using the new replicas label.

Oversubscribing example

Here’s a complete example of oversubscribing GPU resources using the time-slicing APIs. In this example, you walk through the additional configuration settings for the Kubernetes device plugin and GFD to set up GPU oversubscription and launch a workload using the specified resources.

Consider the following configuration file:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 5
    ...

If this configuration were applied to a node with eight GPUs on it, the plugin would now advertise 40 nvidia.com/gpu resources to Kubernetes instead of eight. If the renameByDefault: true option was set, then 40 nvidia.com/gpu.shared resources would be advertised instead of eight nvidia.com/gpu resources. 

You enable time-slicing in the following example configuration. In this example, you oversubscribe the GPUs by 2x:

$ cat << EOF > /tmp/dp-example-config.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
  gfd:
    oneshot: false
    noTimestamp: false
    outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
    sleepInterval: 60s
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 2
EOF

Set up the Helm chart repository:

$ helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
   && helm repo update

Now, deploy the Kubernetes device plugin by specifying the location to the config file created earlier: 

$ helm install nvdp nvdp/nvidia-device-plugin \
    --version=0.12.2 \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --set gfd.enabled=true \
    --set-file config.map.config=/tmp/dp-example-config.yaml

As the node only has a single physical GPU, you can now see that the device plugin advertises two GPUs as allocatable:

$ kubectl describe node
...
Capacity:
  cpu:                4
  ephemeral-storage:  32461564Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             16084408Ki
  nvidia.com/gpu:     2
  pods:               110
Allocatable:
  cpu:                4
  ephemeral-storage:  29916577333
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             15982008Ki
  nvidia.com/gpu:     2
  pods:               110

Next, deploy two applications (in this case, an FP16 CUDA GEMM workload) with each requesting one GPU. Observe that the applications context switch on the GPU and thus only achieve approximately half the peak FP16 bandwidth on a T4. 

(The pod manifests are not reproduced here; they create two pods, dcgmproftester-1 and dcgmproftester-2, each running the dcgmproftester FP16 GEMM workload and requesting one nvidia.com/gpu resource.)
You can now see the two containers deployed and running on a single physical GPU, which would not have been possible in Kubernetes without the new time-slicing APIs:

$ kubectl get pods -A
NAMESPACE              NAME                                                              READY   STATUS    RESTARTS   AGE
default                dcgmproftester-1                                                  1/1     Running   0          45s
default                dcgmproftester-2                                                  1/1     Running   0          45s
kube-system            calico-kube-controllers-6fcb5c5bcf-cl5h5                          1/1     Running   3          32d

You can use nvidia-smi on the host to see that the two containers are scheduled on the same physical GPU by the plugin and context switch on the GPU:

$ nvidia-smi -L
GPU 0: Tesla T4 (UUID: GPU-491287c9-bc95-b926-a488-9503064e72a1)

$ nvidia-smi
......

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    466420      C   /usr/bin/dcgmproftester11         315MiB |
|    0   N/A  N/A    466421      C   /usr/bin/dcgmproftester11         315MiB |
+-----------------------------------------------------------------------------+

Summary

Get started with leveraging the new GPU oversubscription support in Kubernetes today. Helm charts for the new release of the Kubernetes device plugin make it easy to start using the feature right away.

The short-term roadmap includes integration with the NVIDIA GPU Operator so that you can get access to the feature, whether that is with Red Hat’s OpenShift, VMware Tanzu, or provisioned environments such as NVIDIA Cloud Native Core on NVIDIA LaunchPad. NVIDIA is also working on improving support for CUDA MPS in the Kubernetes device plugin so that you can take advantage of other GPU concurrency mechanisms within Kubernetes.

If you have questions or comments, please leave them in the comments section. For technical questions about installation and usage, we recommend filing an issue on the NVIDIA/k8s-device-plugin GitHub repo. We appreciate your feedback! 

Categories
Misc

Just Released: TensorRT 8.4

Today NVIDIA released TensorRT 8.4, which includes new tools to explore TensorRT optimized engines and quantize the TensorFlow models with QAT.

Categories
Misc

Exploring NVIDIA TensorRT Engines with TREx

This walkthrough summarizes the TREx workflow and highlights API features for examining data and TensorRT engines.

The primary function of NVIDIA TensorRT is the acceleration of deep-learning inference, achieved by processing a network definition and converting it into an optimized engine execution plan. TensorRT Engine Explorer (TREx) is a Python library and a set of Jupyter notebooks for exploring a TensorRT engine plan and its associated inference profiling data.

TREx provides visibility into the generated engine, empowering you with new insights through summarized statistics, charting utilities, and engine graph visualization. TREx is useful for high-level network performance optimization and debugging, such as comparing the performance of two versions of a network. For in-depth performance analysis, NVIDIA Nsight Systems is the recommended performance analysis tool.

In this post, I summarize the TREx workflow and highlight API features for examining data and TensorRT engines. To see TREx in action, I walk through the process of optimizing the performance of a quantized ResNet18 engine.

How TREx works

The main TREx abstraction is trex.EnginePlan, which encapsulates all the information related to an engine. An EnginePlan is constructed from several input JSON files, each of which describes a different aspect of the engine, such as its data-dependency graph and its profiling data. The information in an EnginePlan is accessible through a Pandas DataFrame, which is a familiar, powerful, and convenient data structure.

Before using TREx, you must build and profile your engine. TREx provides a simple utility script, process_engine.py, to do this. The script is provided as a reference and you may collect this information in any way you choose.

This script uses trtexec to build an engine from an ONNX model and profile the engine. It also creates several JSON files that capture various aspects of the engine building and profiling session:

Plan-graph JSON file

A plan-graph JSON file describes the engine data-flow graph in a JSON format.

A TensorRT engine plan is a serialized format of a TensorRT engine. It contains information about the final inference graph and can be deserialized for inference runtime execution. 

TensorRT 8.2 introduced the IEngineInspector API, which provides the ability to examine an engine’s layers, their configuration, and their data dependencies. IEngineInspector provides this information using a simple JSON formatting schema. This JSON file is the primary input to a TREx trex.EnginePlan object and is mandatory.

Profiling JSON file

A profiling JSON file provides profiling information for each engine layer.

The trtexec command-line application implements the IProfiler interface and generates a JSON file containing a profiling record for each layer. This file is optional if you only want to investigate the structure of an engine without its associated profiling information.

Timing records JSON file

A JSON file contains timing records for each profiling iteration.

To profile an engine, trtexec executes the engine many times to smooth measurement noise. The timing information of each engine execution may be recorded as a separate record in a timing JSON file and the average measurement is reported as the engine latency. This file is optional and generally useful when assessing the quality of a profiling session.

If you see excessive variance in the engine timing information, you may want to ensure that you are using the GPU exclusively and the compute and memory clocks are locked.
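As a quick sanity check, you can load the timing JSON into Pandas and look at the spread of the per-iteration latencies. The following is a minimal sketch; the file name is a placeholder and the record field name (assumed here to be latencyMs) depends on the trtexec version that produced the file:

import json
import pandas as pd

# Placeholder file name; use the timing JSON produced by your profiling run
with open("my-engine.timing.json") as f:
    timing = pd.DataFrame(json.load(f))

# Assumed field name; check the keys of your JSON records
lat = timing["latencyMs"]
print(lat.describe())  # mean, std, min, max of the per-iteration latency
print("coefficient of variation: {:.2%}".format(lat.std() / lat.mean()))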

Metadata JSON file

A metadata JSON file describes the engine’s builder configuration and information about the GPU used to build the engine. This information provides a more meaningful context to the engine profiling session and is particularly useful when you are comparing two or more engines.

TREx workflow

Figure 1 summarizes the TREx workflow:

  • Start by converting your deep-learning model to a TensorRT network.
  • Build and profile an engine while also producing collateral JSON files.
  • Spin up TREx to explore the contents of the files.
Workflow diagram shows that TREx uses JSON files to capture metadata from the engine building and profiling stages.
Figure 1. TensorRT Engine Explorer workflow

TREx features and API

After collecting all profiling data, you can create an EnginePlan instance:

plan = EnginePlan(
    "my-engine.graph.json",
    "my-engine.profile.json",
    "my-engine.profile.metadata.json")

With a trex.EnginePlan instance, you can access most of the information through a Pandas DataFrame object. Each row in the DataFrame represents one layer in the plan file, including its name, tactic, inputs, outputs, and other attributes describing the layer.

# Print layer names
plan = EnginePlan("my-engine.graph.json")
df = plan.df
print(df['Name'])

Abstracting the engine information using a DataFrame is convenient as it is both an API that many Python developers know and love and a powerful API with facilities for slicing, dicing, exporting, graphing, and printing data.

For example, listing the three slowest layers in an engine is straightforward:

# Print the 3 slowest layers
top3 = plan.df.nlargest(3, 'latency.pct_time')
for i in range(len(top3)):
    layer = top3.iloc[i]
    print("%s: %s" % (layer["Name"], layer["type"]))
features.16.conv.2.weight + QuantizeLinear_771 + Conv_775 + Add_777: Convolution
features.15.conv.2.weight + QuantizeLinear_722 + Conv_726 + Add_728: Convolution
features.12.conv.2.weight + QuantizeLinear_576 + Conv_580 + Add_582: Convolution

We often want to group information. For example, you may want to know the total latency consumed by each layer type:

# Print the latency of each layer type
plan.df.groupby(["type"]).sum()[["latency.avg_time"]]
Chart of latency time results by convolution, pooling, reformat, and scale.
Figure 2. Total latency results

Pandas mixes well with other libraries such as dtale, a convenient library for viewing and analyzing dataframes, and Plotly, a graphing library with interactive plots. Both libraries are integrated with the sample TREx notebooks, but there are many user-friendly alternatives such as qgrid, matplotlib, and Seaborn.
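For example, here is a minimal sketch that assumes the plan.df columns shown elsewhere in this post (Name and latency.avg_time):

import dtale
import plotly.express as px

# Browse the layer table interactively
dtale.show(plan.df)

# Interactive bar chart of average latency per layer
fig = px.bar(plan.df, x="Name", y="latency.avg_time")
fig.show()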

There are also convenience APIs that are thin wrappers for Pandas, Plotly, and dtale:

  • Plotting data (plotting.py)
  • Visualizing an engine graph (graphing.py)
  • Interactive notebooks (interactive.py and notebook.py)
  • Reporting (report_card.py and compare_engines.py)

Finally, the linting API (lint.py) uses static analysis to flag performance hazards, akin to a software linter. Ideally, the layer linters provide expert performance feedback that you can act on to improve your engine’s performance, for example, when you are using suboptimal convolution input shapes or suboptimal placement of quantization layers. The linting feature is in an early development state and NVIDIA plans to improve it.

TREx also comes with a couple of tutorial notebooks and two workflow notebooks: one for analyzing a single engine and another for comparing two or more engines.

With the TREx API you can code new ways to explore, extract, and display TensorRT engines, which you can share with the community.

Example TREx walkthrough

Now that you know how TREx operates, here’s an example that shows TREx in action.

In this example, you create an optimized TensorRT engine of a quantized ResNet18 PyTorch model, profile it, and finally inspect the engine plan using TREx. You then adjust the model, based on your learnings, to improve its performance. The code for this example is available in the TREx GitHub repository.

Start by exporting the PyTorch ResNet model to an ONNX format. Use the NVIDIA PyTorch Quantization Toolkit for adding quantization layers in the model, but you don’t perform calibration and fine-tuning as you are concentrating on performance, not accuracy.

In a real use case, you should follow the full quantization-aware training (QAT) recipe. The QAT Toolkit automatically inserts fake-quantization operations into the Torch model. These operations are exported as the QuantizeLinear and DequantizeLinear ONNX operators:

import torch
import torchvision.models as models
# For QAT
from pytorch_quantization import quant_modules
quant_modules.initialize()
from pytorch_quantization import nn as quant_nn
quant_nn.TensorQuantizer.use_fb_fake_quant = True

resnet = models.resnet18(pretrained=True).eval()
# Export to ONNX, with dynamic batch-size
with torch.no_grad():
    input = torch.randn(1, 3, 224, 224)
    torch.onnx.export(
        resnet, input, "/tmp/resnet/resnet-qat.onnx",
        input_names=["input.1"],
        opset_version=13,
        dynamic_axes={"input.1": {0: "batch_size"}})

Next, use the TREx utility process_engine.py script to do the following:

  1. Build an engine from the ONNX model.
  2. Create an engine-plan JSON file.
  3. Profile the engine execution and store the results in a profiling JSON file. You also record the timing results in a timing JSON file.
python3 /utils/process_engine.py /tmp/resnet/resnet-qat.onnx /tmp/resnet/qat int8 fp16 shapes=input.1:32x3x224x224

The script process_engine.py uses trtexec to do the heavy lifting. You can transparently pass arguments to trtexec from the process_engine.py command line by simply listing them without the -- prefix.

In the example, the arguments int8, fp16, and shapes=input.1:32x3x224x224 are forwarded to trtexec, instructing it to optimize for FP16 and INT8 precisions and set the input batch-size to 32. The first script parameter is the input ONNX file (/tmp/resnet/resnet-qat.onnx), and the second parameter (/tmp/resnet/qat) points to the directory to contain the generated JSON files.

You are now ready to examine the optimized engine plan, so open the TREx Engine Report Card notebook. I won’t go through the entire notebook in this post, just a few cells useful for this example.

The first cell sets the engine file and creates a trex.EnginePlan instance from the various JSON files:

engine_name = "/tmp/resnet/qat/resnet-qat.onnx.engine"
plan = EnginePlan(
    f"{engine_name}.graph.json",
    f"{engine_name}.profile.json",
    f"{engine_name}.profile.metadata.json")

The next cell creates a visualization of the engine’s data-dependency graph, which is most useful to understanding the transformation of the original network to an engine. TensorRT executes the engine as a topologically sorted layer list, and not as a parallelizable graph.

The default rendering format is SVG, which is searchable, stays sharp at different scales, and supports hover-text for providing additional information without taking up a lot of space.

graph = to_dot(plan, layer_type_formatter)
svg_name = render_dot(graph, engine_name, 'svg')

The function creates an SVG file and prints its name. Rendering inside the notebook is cumbersome even for small networks, so you can open the SVG file in a separate browser window instead.

The TREx graphing API is configurable, allowing for various coloring and formatting, and the available formatters are packed with information. With the default formatter, for example, layers are colored according to their operation and are labeled by name, type, and profiled latency. Tensors are depicted as edges connecting the layers and are colored according to their precision and labeled with their shape and memory layout information.

In the generated ResNet QAT engine graph (Figure 3), you see some FP32 tensors (in red). Investigate further because you want to have as many layers as possible executing using INT8 precision. Using INT8 data and compute precision increases throughput and lowers latency and power.

Animated view of a ResNet18 engine graph.
Figure 3. A data-dependency graph of the QAT ResNet18 engine

The Performance cell provides various views of performance data, and specifically the precision-per-layer view (Figure 4) shows several layers computing using FP32 and FP16.

report_card_perf_overview(plan)
Graph of precision per layer view of latency average time vs name for ResNet18 QAT
Figure 4. Precision per layer view, with ResNet18 QAT (TREx uses red for FP32, orange for FP16, and Nvidia-Green for INT8 precisions)

When examining the latency-per-layer-type view, there are 12 reformatting nodes that account for about 26.5% of the runtime. That’s quite a lot. Reformatting nodes are inserted in the engine graph during optimization, but they are also inserted to convert precisions. Each reformat layer has an origin attribute describing the reason for its existence.

If you see too many precision conversions, you should see if there’s something you can do to reduce these conversions. In TensorRT 8.2, you see scale layers, instead of reformatting layers for Q/DQ operations. This is due to the different graph optimization strategies used in TensorRT 8.2 and 8.4.

Figure 5. Count and latency per layer-type views, ResNet18 QAT

To dig deeper, turn to the engine linting API available in the linting cells. You see that both the Convolution and Q/DQ linters flag some potential problems.

The Convolution linter flags 13 convolutions having INT8 inputs and FP32 outputs. Ideally, you want convolutions to output INT8 data if they are followed by INT8 precision layers. The linter suggests adding a quantization operation following the convolution. Why are the outputs of these convolutions not quantized?

Figure 6. Output of the convolution linter, warning about INT8 convolutions with float outputs

Take a closer look. To look up a convolution in the engine graph, copy the name of the convolution from the linter table and search for it in the graph SVG browser tab. It turns out that these convolutions are involved in residual-add operations.

After consulting Q/DQ Layer-Placement Recommendations, you might conclude that you must add Q/DQ layers to the residual-connections in the PyTorch model. Unfortunately, the QAT Toolkit cannot perform this automatically and you must manually intervene in the PyTorch model code. For more information, see the example in the TensorRT QAT Toolkit (resnet.py).

The following code example shows the BasicBlock.forward method, with the new quantization code in the self._quantize branch.

def forward(self, x: Tensor) -> Tensor:
    identity = x
    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)
    out = self.conv2(out)
    out = self.bn2(out)
    if self.downsample is not None:
        identity = self.downsample(x)
 
    if self._quantize:
        out += self.residual_quantizer(identity)
    else:
        out += identity
    out = self.relu(out)
 
    return out

After you change the PyTorch code, you must regenerate the model and iterate again through the notebook cells using the revised model. You’re now down to three reformatting layers consuming about 20.5% of the total latency (down from 26.5%), and most of the layers now execute in INT8 precision.

Interactive views of QAT ResNet18
Figure 7. QAT ResNet18 mode, after adding Q/DQ on residual-connections

The remaining FP32 layers surround the global average pooling (GAP) layer at the end of the network. Modify the model again to quantize the GAP layer.

def _forward_impl(self, x: Tensor) -> Tensor:
        # See note [TorchScript super()]
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        if self._quantize_gap:
            x = self.gap_quantizer(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

Iterate one final time through the notebook cells using the new model. Now you have only a single reformatting layer and all other layers are executing in INT8. Nailed it!

Precision per layer view for ResNet QAT with quantized residual connection and GAP layers.
Figure 8. Precision per layer view, after adding Q/DQ on residual-connections and quantizing the GAP layer

Now that you are done optimizing, you can use the Engine Comparison notebook to compare the two engines. This notebook is useful not only when you are actively optimizing your network’s performance as you’re doing here, but also in the following situations:

  • When you want to compare engines built for different GPU HW platforms or different TensorRT versions.
  • When you want to assess how layers’ performance scales across different batch sizes.
  • To understand if accuracy disagreement between engines is due to different TensorRT layer precision choices.

The Engine Comparison notebook provides both tabular and graphical views to compare engines and both are applicable, depending on the level of detail that you need. Figure 9 shows the stacked latencies of five engines that we’ve built for the PyTorch ResNet18 model. For brevity, I didn’t discuss creating the FP32 and FP16 engines, but these are available in the TREx GitHub repository.

Bar graph of stacked latencies of five engines in the same ResNet18 network illustrating.
Figure 9. Stacked latencies of five engines of the same ResNet18 network

The engine optimized for FP16 precision is about 2x faster than the FP32 engine, but it is also faster than our first attempt at an INT8 QAT engine. As I analyzed earlier, this is due to the many INT8 convolutions that output FP16 data and then require reformat layers to quantize explicitly back to INT8.

If you concentrate only on the three QAT engines optimized in this post, you can see how you eliminated 11 FP16 engine layers when you added Q/DQ to the residual connections. You eliminated another two FP32 layers when you quantized the GAP layer.

Q/DQ placement decisions affect the number of layers executed in INT8 precision compared to floating-point precision.
Figure 10. Precision counts for the three engines optimized

You can also look at how the optimizations affected the latencies of the three engines (Figure 11).

At each Q/DQ placement iteration, we’ve reduced the time consumed to execute the convolution and reformat layers.
Figure 11. Latencies of our three engines, grouped by layer types

You may notice a couple of odd-looking, pooling-layer, latency results: the total pooling latency drops 10x when you quantize the residual connection, and then goes up 70% when you quantize the GAP layer.

Both results are counterintuitive, so look at them more closely. There are two pooling layers: a large one after the first convolution, and a tiny one before the last convolution. After you quantized the residual connections, the first pooling and convolution layers could execute in INT8 precision. They are fused with the sandwiched ReLU into a ConvActPool layer, but this fusion is not supported for floating-point types.

Why did the GAP layer increase in latency when it was quantized? Well, the activation size of this layer is small and each INT8 input coefficient is converted to FP32 for averaging using high precision. Finally, the result is converted back to INT8.

The layer’s data size is also small and resides in the fast L2 cache, and thus the extra precision-conversion computation is relatively expensive. Nonetheless, because you could get rid of the two reformat layers surrounding the GAP layer, the total engine latency (which is what you really care about) is reduced.

Summary

In this post, I introduced the TensorRT Engine Explorer, briefly reviewed its APIs and features, and walked through an example showing how TREx can help when optimizing the performance of a TensorRT engine. TREx is available in TensorRT’s GitHub repository, under the experimental tools directory.

I encourage you to try the APIs and to build new workflows beyond the two workflow notebooks.

Categories
Misc

Accelerating Quantized Networks with the NVIDIA QAT Toolkit for TensorFlow and NVIDIA TensorRT

Introduction to the NVIDIA Quantization-Aware Training toolkit for TensorFlow 2 for model quantization for TensorRT acceleration on NVIDIA GPUs.

We’re excited to announce the NVIDIA Quantization-Aware Training (QAT) Toolkit for TensorFlow 2 with the goal of accelerating the quantized networks with NVIDIA TensorRT on NVIDIA GPUs. This toolkit provides you with an easy-to-use API to quantize networks in a way that is optimized for TensorRT inference with just a few additional lines of code.

This post is accompanied by the Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT GTC session. For the PyTorch quantization toolkit equivalent, see PyTorch Quantization.

Background

Accelerating deep neural networks (DNN) inference is an important step in realizing latency-critical deployment of real-world applications such as image classification, image segmentation, natural language processing, and so on.

The need for improving DNN inference latency has sparked interest in running those models in lower precisions, such as FP16 and INT8. Running DNNs in INT8 precision can offer faster inference and a much lower memory footprint than its floating-point counterpart. NVIDIA TensorRT supports post-training quantization (PTQ) and QAT techniques to convert floating-point DNN models to INT8 precision.

In this post, we discuss these techniques, introduce the NVIDIA QAT toolkit for TensorFlow, and demonstrate an end-to-end workflow to design quantized networks optimal for TensorRT deployment.

Quantization-aware training

The main idea behind QAT is to simulate lower precision behavior by minimizing quantization errors during training. To do that, you modify the DNN graph by adding quantize and de-quantize (QDQ) nodes around desired layers. This enables the quantized networks to minimize accuracy loss over PTQ due to the fine-tuning of the model’s quantization and hyperparameters.

PTQ, on the other hand, performs model quantization using a calibration dataset after that model has already been trained. This can result in accuracy degradation due to the quantization not being reflected in the training process. Figure 1 shows both processes.

Block diagrams with quantization steps via PTQ (uses a calibration data to calculate q-parameters) and QAT (simulates quantization via QDQ nodes and fine-tuning).
Figure 1. Quantization workflows through PTQ and QAT

For more information about quantization, quantization methods (PTQ compared to QAT), and quantization in TensorRT, see Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT.

NVIDIA QAT Toolkit for TensorFlow

The goal of this toolkit is to enable you to easily quantize networks in a way that is optimal for TensorRT deployment.

Currently, TensorFlow offers asymmetric quantization in their open-source Model Optimization Toolkit. Their quantization recipe consists of inserting QDQ nodes at the outputs and weights (if applicable) of desired layers, and they offer quantization of the full model or partial by layer class type. This is optimized for TFLite deployment, not TensorRT deployment.

This toolkit is needed for obtaining a quantized model that is ideal for TensorRT deployment. TensorRT optimizer propagates Q and DQ nodes and fuses them with floating-point operations across the network to maximize the proportion of the graph that can be processed in INT8. This leads to optimal model acceleration on NVIDIA GPUs. Our quantization recipe consists of inserting QDQ nodes at the inputs and weights (if applicable) of desired layers.

We also perform symmetric quantization (used by TensorRT) and offer extended quantization support with partial quantization by layer name and pattern-based layer quantization.

Table 1 summarizes the differences between TFMOT and the NVIDIA QAT Toolkit for TensorFlow.

Feature | TFMOT | NVIDIA QAT Toolkit
QDQ node placements | Outputs and weights | Inputs and weights
Quantization support | Whole model (full) and some layers (partial, by layer class) | Extends TF quantization support: partial quantization by layer name and pattern-based layer quantization by extending CustomQDQInsertionCase
Quantization op used | Asymmetric quantization (tf.quantization.fake_quant_with_min_max_vars) | Symmetric quantization, needed for TensorRT compatibility (tf.quantization.quantize_and_dequantize_v2)
Table 1. Differences between the NVIDIA QAT Toolkit and TensorFlow Model Optimization Toolkit
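To make the difference concrete, here is a minimal, hypothetical sketch of symmetric fake quantization using the op listed in the table (tf.quantization.quantize_and_dequantize_v2); the arguments that the toolkit actually passes internally may differ:

import tensorflow as tf

x = tf.constant([-1.5, -0.3, 0.0, 0.7, 2.0])

# Symmetric range around zero, derived from the tensor's absolute maximum
amax = tf.reduce_max(tf.abs(x))
x_qdq = tf.quantization.quantize_and_dequantize_v2(
    x,
    input_min=-amax,
    input_max=amax,
    signed_input=True,
    num_bits=8,
    range_given=True,
    narrow_range=True,
)
print(x_qdq.numpy())  # x after an INT8 quantize/dequantize round trip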

Figure 2 shows a before/after example of a simple model, visualized with Netron. The QDQ nodes are placed in the inputs and weights (if applicable) of desired layers, namely convolution (Conv) and fully connected (MatMul).

Contains two images, one before QAT (no QDQ nodes), and one after QAT (with QDQ nodes before Conv and MatMul layers).
Figure 2. Example of a model before and after quantization (baseline and QAT model, respectively)

Workflow for deploying QAT models in TensorRT

Figure 3 shows the full workflow to deploy a QAT model, obtained with the QAT Toolkit, in TensorRT.

Block diagram with steps for model quantization, conversion to ONNX, and TensorRT deployment.
Figure 3. TensorRT deployment workflow for QAT models obtained with the QAT Toolkit
  • Assume a pretrained TensorFlow 2 model in SavedModel format, also referred to as the baseline model.
  • Quantize that model using the quantize_model function, which clones and wraps each desired layer with QDQ nodes.
  • Fine-tune the obtained quantized model, simulating quantization during training, and save it in SavedModel format.
  • Convert it to ONNX.

The ONNX graph is then consumed by TensorRT to perform layer fusions and other graph optimizations, such as dedicated QDQ optimizations, and generate an engine for faster inference.
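As a quick sanity check before building the engine, you can confirm that the exported graph contains QuantizeLinear and DequantizeLinear nodes. Here is a minimal sketch using the onnx Python package (the file name is a placeholder for your exported model):

import onnx

model = onnx.load("model_qat.onnx")  # placeholder path
op_types = {node.op_type for node in model.graph.node}
print("QuantizeLinear present:", "QuantizeLinear" in op_types)
print("DequantizeLinear present:", "DequantizeLinear" in op_types)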

Example with ResNet-50v1

In this example, we show you how to quantize and fine-tune a QAT model with the TensorFlow 2 toolkit and how to deploy that quantized model in TensorRT. For more information, see the full example_resnet50v1.ipynb Jupyter notebook.

Requirements

To follow along, you need the following resources:

  • Python 3.8
  • TensorFlow 2.8
  • NVIDIA TF-QAT Toolkit
  • TensorRT 8.4

Prepare the data

For this example, use the ImageNet 2012 dataset for image classification (task 1), which requires manual downloads due to the terms of the access agreement. This dataset is needed for the QAT model fine-tuning, and it is also used to evaluate the baseline and QAT models.

Log in or sign up on the linked website and download the train/validation data. You should have at least 155 GB of free space.

The workflow supports the TFRecord format, so use the following instructions (modified from the TensorFlow instructions) to convert the downloaded .tar ImageNet files to the required format:

  1. Set IMAGENET_HOME=/path/to/imagenet/tar/files in data/imagenet_data_setup.sh.
  2. Download imagenet_to_gcs.py to $IMAGENET_HOME.
  3. Run ./data/imagenet_data_setup.sh.

You should now see the compatible dataset in $IMAGENET_HOME.

Quantize and fine-tune the model

from tensorflow_quantization import quantize_model
from tensorflow_quantization.custom_qdq_cases import ResNetV1QDQCase

# Create baseline model
model = tf.keras.applications.ResNet50(weights="imagenet", classifier_activation="softmax")

# Quantize model
q_model = quantize_model(model, custom_qdq_cases=[ResNetV1QDQCase()])

# Fine-tune
q_model.compile(
    optimizer="sgd",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"]
)
q_model.fit(
    train_batches, validation_data=val_batches,
    batch_size=64, steps_per_epoch=500, epochs=2
)

# Save as TF 2 SavedModel
q_model.save("saved_model_qat")
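Optionally, you can run a quick validation pass on the fine-tuned model before exporting it. This is just a sketch; val_batches is the validation pipeline prepared in the data step:

# Optional: quick accuracy check of the fine-tuned QAT model
results = q_model.evaluate(val_batches, return_dict=True)
print("QAT validation accuracy:", results["accuracy"])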

Convert SavedModel to ONNX

$ python -m tf2onnx.convert --saved-model=<saved model directory> --output=<onnx file name> --opset 13

Deploy the TensorRT engine

Convert the ONNX model into a TensorRT engine (also obtains latency measurements):

$ trtexec --onnx=<onnx model file> --int8 --saveEngine=<tensorrt engine file> -v

Obtain accuracy results on the validation dataset:

$ python infer_engine.py --engine=<tensorrt engine file> --data_dir=<imagenet data directory> -b=<batch size>

Results

In this section, we report accuracy and latency performance numbers for various models in the ResNet and EfficientNet families:

  • ResNet-50v1
  • ResNet-50v2
  • ResNet-101v1
  • ResNet-101v2
  • EfficientNet-B0
  • EfficientNet-B3

All results were obtained on the NVIDIA A100 GPU with batch size 1 using TensorRT 8.4 (EA for ResNet and GA for EfficientNet).

Figure 4 shows the accuracy comparison between baseline FP32 models and their quantized equivalent models (PTQ and QAT). As you can see, there’s little to no loss in accuracy between the baseline and QAT models. Sometimes there’s even better accuracy due to further overall fine-tuning of the model. There’s also overall higher accuracy in QAT over PTQ due to the fine-tuning of the model parameters in QAT.

Bar plot graph comparing the FP32 baseline, and INT8 PTQ and QAT models. The graph shows similar accuracies in all models.
Figure 4. Accuracy of ResNet and EfficientNet models in FP32 (baseline), INT8 with PTQ, and INT8 with QAT

ResNet, as a network structure, is stable for quantization in general, so the gap between PTQ and QAT is small. However, EfficientNet greatly benefits from QAT, noted by reduced accuracy loss from the baseline model when compared to PTQ.

For more information about how different models may benefit from QAT, see Table 7 in Integer Quantization for Deep Learning Inference: Principles and Empirical Evaluation (quantization whitepaper).

Figure 5 shows that PTQ and QAT have similar times and introduce an up to 19x speedup compared to their respective baseline model.

Bar plot with FP32 and INT8 latency: 17x speed-up in ResNet-50v1, 11x in 50v2, 19x in 101v1, and 13x in 101v2, and 10x in EfficientNet-B0 and 8x in B3.
Figure 5. Latency performance evaluation on various models in the ResNet and EfficientNet families

PTQ can sometimes be slightly faster than QAT as it tries to quantize all layers in the model, which usually results in faster inference, whereas QAT only quantizes the layers wrapped with QDQ nodes.

For more information about how TensorRT works with QDQ nodes, see Working with INT8 in the TensorRT documentation and the Toward INT8 Inference: An End-to-End Workflow for Deploying Quantization-Aware Trained Networks Using TensorRT GTC session.

For more information about performance numbers on various supported models, see the model zoo.

Conclusion

In this post, we introduced the NVIDIA QAT Toolkit for TensorFlow 2. We discussed the advantages of using the toolkit in the context of TensorRT inference acceleration. We then demonstrated how to use the toolkit with ResNet-50v1 and performed accuracy and latency evaluations on models in the ResNet and EfficientNet families.

Experimental results show that the accuracy of INT8 models trained with QAT is within around a 1% difference compared to FP32 models, achieving up to 19x speedup in latency.

For more information, see the resources linked throughout this post.

Categories
Misc

Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin

Accounting for nearly half of global vehicle sales in 2021, SUVs have grown in popularity given their versatility. Now, NIO aims to amp up the volume further. This week, the electric automaker unveiled the ES7 SUV, purpose-built for the intelligent vehicle era. Its sporty yet elegant body houses an array of cutting-edge technology, including the Read article >

The post Smart Utility Vehicle: NIO ES7 Redefines Category with Intelligent, Versatile EV Powered by NVIDIA DRIVE Orin appeared first on NVIDIA Blog.

Categories
Misc

AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases

At a time when much about COVID-19 remained a mystery, U.K.-based PrecisionLife used AI and combinatorial analytics to discover new genes associated with severe symptoms and hospitalizations for patients. The techbio company’s study, published in June 2020, pinpoints 68 novel genes associated with individuals who experienced severe disease from the virus. Over 70 percent of Read article >

The post AI for Personalized Health: Startup Advances Precision Medicine for COVID-19, Chronic Diseases appeared first on NVIDIA Blog.

Categories
Misc

Get Your Wish: Genshin Impact Coming to GeForce NOW

Greetings, Traveler. Prepare for adventure. Genshin Impact, the popular open-world action role-playing game, is leaving limited beta and launching for all GeForce NOW members next week. Gamers can get their game on today with the six total games joining the GeForce NOW library. As announced last week, Warhammer 40,000: Darktide is coming to the cloud Read article >

The post Get Your Wish: Genshin Impact Coming to GeForce NOW appeared first on NVIDIA Blog.