Optimizing Data Movement in GPU Applications with the NVIDIA Magnum IO Developer Environment

Magnum IO is the collection of IO technologies from NVIDIA and Mellanox that make up the IO subsystem of the modern data center and enable applications at scale. If you are trying to scale up your application to multiple GPUs, or scaling it out across multiple nodes, you are probably using some of the libraries in Magnum IO. NVIDIA is now publishing the Magnum IO Developer Environment 21.04 as an NGC container, containing a comprehensive set of tools to scale IO. This allows you to begin scaling your applications on a laptop, desktop, workstation, or in the cloud.

Magnum IO brings the ability to improve end-to-end wall clock time of IO bound applications. Imagine a workflow with three stages:

  • ETL (extract, transform, load)
  • Scaled-out compute
  • Post-processing and output

The first-stage ETL jobs are dominated by reading large amounts of data into GPUs and can achieve optimal performance by using Magnum IO GPUDirect Storage (GDS) for directly copying data from storage into GPU memory. This also helps reduce the CPU utilization and improves overall data center utilization.

The second stage, a distributed job with intense GPU-to-GPU communication, can benefit from optimizing that communication with the NCCL message-passing or NVSHMEM shared-memory models on low-latency InfiniBand networks.

The final post-processing and output stages, as well as the checkpointing and temporary storage during the workflow, can again improve performance with GDS. Magnum IO management layers also enable monitoring, troubleshooting, and detecting anomalies in all stages of the workflow.

The principles of Magnum IO architecture are based on flexibility, concurrency, asynchrony, hierarchy, and telemetry to enable you to balance concurrency and locality.
Figure 1. Architectural principles of Magnum IO.

Scaling applications to run efficiently is often a complex and time-consuming task. We understand that the changes to code to adopt the Magnum IO technologies can be invasive, and any changes require development, debugging, testing, and benchmarking. Magnum IO libraries also work alongside the profilers, logging, and monitoring tools needed to observe what’s happening, locate bottlenecks, and address them. You should understand the performance tradeoffs of each stage of the computation and the relationships between the hardware components in the system.

Magnum IO libraries provide APIs that manage the underlying hardware, allowing you to focus on the algorithmic aspects of your applications. The APIs are designed to be high-level so that they are easy to integrate with, but also to expose finer controls for when you start fine-tuning performance, after the behaviors and tradeoffs of the application running at scale are understood.

The high bandwidth and low latency offered by NVLink operating at 300 GB/s, and by InfiniBand in NVIDIA DGX A100 systems, also open new possibilities for algorithms. The NVLink bandwidth between GPUs now makes remote memory almost local. The total number of PCIe lanes from remote storage may exceed those from local storage. Magnum IO libraries on NVIDIA hardware allow algorithms to take full advantage of the GPU memory across all nodes, rather than sacrificing efficiency to avoid the high-latency IO bottlenecks of the past.

Magnum IO technologies are grouped under Network IO, Storage IO, In-Network Compute, and Management.
Figure 2. Magnum IO technologies.

Magnum IO: GPU-to-GPU communications

Core to Magnum IO are libraries that allow GPUs to talk directly to each other over the fastest links available.

NCCL

The NVIDIA Collective Communications Library (NCCL, pronounced “nickel”) is a library providing inter-GPU communication primitives that are topology-aware and can be easily integrated into applications.

NCCL is smart about IO on systems with complex topology: systems with multiple CPUs, GPUs, PCI busses, and network interfaces. It can selectively use NVLink, Ethernet, and InfiniBand, using multiple links when possible. Consider using NCCL APIs whenever you plan your application or library to run on a mix of multi-GPU multi-node systems in a data center, cloud, or hybrid system. At runtime, NCCL determines the topology and optimizes layout and communication methods.
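To make the idea of a collective communication primitive concrete, here is a minimal pure-Python simulation of an all-reduce, where every participant ends up with the element-wise sum of all buffers. It uses the classic ring algorithm purely for illustration and does not call NCCL. The function name ring_allreduce and the one-element-per-chunk layout are assumptions of this sketch; NCCL selects its own algorithms and transports at runtime.

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce (sum) across len(buffers) ranks.

    Each rank's buffer must hold exactly one element per rank (one chunk each).
    Returns the buffers every rank would hold after the collective.
    """
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("this sketch expects one element per rank in every buffer")
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: after n-1 steps rank r owns the full sum
    # of chunk (r + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so that sends happen "simultaneously".
        messages = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for sender, chunk, value in messages:
            data[(sender + 1) % n][chunk] += value

    # Phase 2, all-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        messages = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for sender, chunk, value in messages:
            data[(sender + 1) % n][chunk] = value

    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(result)
```

Each rank only ever exchanges data with its neighbor, which is why this pattern maps well onto point-to-point links between GPUs.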

NVSHMEM

NVSHMEM creates a global address space for data that spans the memory of multiple GPUs and can be accessed with fine-grained GPU-initiated operations, CPU-initiated operations, and operations on CUDA streams.

In many HPC workflows, models and simulations are run that far exceed the size of a single GPU or node. NVSHMEM allows for a simpler asynchronous communication model in a shared address space that spans GPUs within or across nodes, with lower overheads, possibly resulting in stronger scaling compared to a traditional Message Passing Interface (MPI).

UCX

Unified Communication X (UCX) uses high-speed networks, including InfiniBand, for inter-node communication and shared memory mechanisms for efficient intra-node communication. If you need standard CPU-driven MPI, PGAS OpenSHMEM libraries, or RPC, GPU-aware communication is layered on top of UCX.

UCX is appropriate when driving IO from the CPU, or when system memory is being shared. UCX enables offloading the IO operations to both host adapter (HCA) and switch, which reduces CPU load. UCX simplifies the portability of many peer-to-peer operations in MPI systems.

Magnum IO: Storage-to-GPU communications

Magnum IO also addresses the need to move data between GPUs and storage systems, both local and remote, with as little overhead as possible along the way.

GDS

NVIDIA GPUDirect Storage (GDS) enables a direct data path for Remote Direct Memory Access (RDMA) transfers between GPU memory and storage, which avoids a bounce buffer and management by the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.

GDS and the cuFile APIs should be used whenever data needs to move directly between storage and the GPU. With storage systems that support GDS, significant increases in performance on clients are observed when IO is a bottleneck. In cases where the storage system does not support GDS, IO transparently falls back to normal file reads and writes.

Moving the IO decode/encode from the CPU to GPU creates new opportunities for direct data transfers between storage and GPU memory which can benefit from GDS performance. An increasing number of data formats are supported in CUDA.

Magnum IO: Profiling and optimization

NVIDIA Nsight Systems lets you see what’s happening in the system and NVIDIA Cumulus NetQ allows you to analyze what’s happening on the NICs and switches. This is critical to finding some causes of bottlenecks in multi-node applications.

Nsight Systems

Nsight Systems is a low-overhead performance analysis tool designed to provide insights that you need to optimize your software. It provides everything that you would expect from a profiler for a GPU. Nsight Systems has a tight integration with many core CUDA libraries, giving you detailed information on what is happening.

Nsight Systems allows you to see exactly what’s happening on the system, what code is taking a long time, and when algorithms are waiting on GPU/CPU compute, or device IO. Nsight Systems is relevant to Magnum IO and included in the Magnum IO container for convenience, but its scope spans well outside of Magnum IO to monitoring compute that’s unrelated to IO.

Nsight Systems trace of a NCCL application that has a command stall. Nsight Systems reveals where time is being spent and where there are idle resources.
Figure 4. Diagram of a NCCL application trace with Nsight Systems.

NetQ

NetQ is a highly scalable, modern, network operations tool set that provides visibility, troubleshooting and lifecycle management of your open networks in real time. It enables network profiling functionality that can be used along with Nsight Systems or application logs to observe the network’s behavior while the application is running.

NetQ is part of Magnum IO itself, given its integral involvement in managing IO in addition to profiling it. 

Getting started with the Magnum IO Developer Environment

We are launching the Magnum IO Developer Environment as a container hosted on NVIDIA NGC for GTC 21. A bare-metal installer for Ubuntu and RHEL may be coming soon. The container provides a sealed environment with the latest versions of the libraries compatible with each other. It makes it easy for you to begin optimizing your application’s IO. Installing and working with the container does not interfere with any existing system setup, which may have different versions of the components.

The Magnum IO components included in the 21.04 container are as follows:

  • Ubuntu 20.04
  • CUDA
  • Nsight Systems CLI
  • GDS
  • GPUDirect RDMA
  • GPUDirect P2P
  • NCCL
  • UCX
  • NVSHMEM

The first step is to profile the application and find the bottlenecks, then evaluate which of the Magnum IO tools, libraries, or algorithm changes are appropriate for removing those bottlenecks and optimizing the application.

Download the developer environment today: Magnum IO SDK.

Creating Medical Imaging Models with NVIDIA Clara Train 4.0

In the field of medicine, advancements in artificial intelligence are constantly evolving. To keep up with the pace of innovation means adapting and providing the best experience to researchers, clinicians, and data scientists. NVIDIA Clara Train, an application framework for training medical imaging models, has undergone significant changes for its upcoming release at the beginning of May, with product enhancements for better AI model training.

Two drawings of people, labeled data scientists and developers. Clara Train helps data scientists eliminate mundane tasks, standardize workflows, and focus on domain research. Clara Train helps devs speed up development and reduce technical debt.
Figure 1. Get the benefits of using Clara Train whether you’re a researcher or application developer.

In this post, I cover three new major features introduced in Clara Train 4.0:

  • Upgrade of the underlying infrastructure of Clara Train based on MONAI.
  • Expansion into digital pathology, with a training pipeline to help you get started.
  • Update of the DeepGrow model to annotate organs effectively in 3D images. 

The Clara Train Early Access program gives you access to all features:  Sign up today!

First, Clara Train has updated its backend infrastructure to use MONAI, the Medical Open Network for AI. MONAI is an open-source, PyTorch-based framework that provides domain-optimized foundational capabilities for healthcare. This community-led library helps create reproducible experiments by reducing the need for duplication or re-implementation. Figure 2 shows the three layers that make up Clara Train.

A 3-tier diagram showing PyTorch and Triton at the bottom representing the base of Clara Train.  A middle layer with Data Loaders and Transforms, Network Architectures, and Training and Evaluation Engines which are built in to MONAI.  A top layer with Pretrained models, AI-Assisted Annotation, Training Pipelines and Deployment Pipelines shows all the Clara Train features built using the underlying technologies.
Figure 2. Clara Train stack, built from the ground up using PyTorch, MONAI, and NVIDIA Technologies.

The top layer includes pretrained models that can be downloaded from the NGC catalog and which are now updated to work with MONAI. You can also continue to use all the features already in Clara Train, like AI-assisted annotation, federated learning, and training and deployment pipelines.

Specialized for training medical imaging models, the middle layer showcases MONAI components. These include data loaders and transforms, network architectures, and training and evaluation engines. MONAI aims to provide a comprehensive list of medical image–specific transformations and reference networks that provide flexibility and code readability.

The bottom layer highlights the two base frameworks that make up the foundation of MONAI and Clara Train. By being built on top of PyTorch, you receive all the benefits of using one of the most widely used machine learning frameworks, as well as the community support. For inference, Clara Train uses NVIDIA Triton, which simplifies the deployment of AI models and maximizes GPU utilization.

Second, Clara Train is expanding into digital pathology. Although digital pathology is an imaging workload, it differs significantly from radiology in its details and challenges. To help address these challenges, we’ve created a digital pathology pipeline.

This pipeline includes optimized data loading using cuCIM, which can tile large datasets on-demand and process them through a CUDA-enabled pipeline. It also includes training optimizations like Smart Cache, which re-uses a portion of data in memory at each epoch and produces a more efficient training workflow.  Last, it includes a fully convolutional classification network that works with whole-slide images.  All these features provide you with up to a 10x speedup in training, compared to other pathology pipelines.
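The caching idea can be illustrated with a small pure-Python simulation. This is only a sketch of a general replace-a-fraction-per-epoch strategy, not the Smart Cache implementation itself; the function name smart_cache_epochs and the parameters cache_num and replace_rate are names chosen here for illustration.

```python
from collections import deque


def smart_cache_epochs(items, cache_num, replace_rate, epochs):
    """Return the cache contents seen at each epoch.

    A fixed-size cache is kept in memory. After every epoch, a fraction of the
    oldest cached items is dropped and replaced with the next items from the
    full dataset, so most of the cache is reused from one epoch to the next.
    """
    n_replace = max(1, int(cache_num * replace_rate))
    cache = deque(items[:cache_num])
    next_index = cache_num % len(items)   # next dataset item to bring in
    history = []
    for _ in range(epochs):
        history.append(list(cache))
        for _ in range(n_replace):
            cache.popleft()                      # evict the oldest item
            cache.append(items[next_index])      # load one new item
            next_index = (next_index + 1) % len(items)
    return history


epochs_seen = smart_cache_epochs([1, 2, 3, 4, 5], cache_num=4, replace_rate=0.25, epochs=3)
print(epochs_seen)
```

With a cache of four items and a 25% replacement rate, only one item has to be loaded and preprocessed between epochs while the other three are reused.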

A diagram of a fully convolutional network architecture that uses whole-slide images and breaks the images into a grid of patches for training
Figure 3. New digital pathology pipeline architecture.

To use cuCIM outside Clara Train, you can install it using pip by issuing the following command:

pip install cucim

We’ve also included a pretrained model that detects tumors in lymph nodes using whole-slide histopathology images.  You can use this model to create your own digital pathology model.

A sample image from the CAMELYON-16 dataset that is segmented, and it being zoomed in on over four frames. The final frame shows the most zoomed in picture with text signifying it will classify the image as Tumor or Not Tumor
Figure 4. The pretrained model shows how it uses classification to help segment whole-slide images.

Last, we’ve updated the DeepGrow model to work on 3D CT images. This updated model gives you the ability to segment an organ in 3D with only a few clicks across the organ. If you’re looking to create an organ-specific, DeepGrow 3D model, we’ve provided a pipeline to help you get started quickly.

Federated learning with homomorphic encryption

In Clara Train 4.0, we also added homomorphic encryption tools for federated learning. Homomorphic encryption allows you to compute on data while the data is still encrypted.

Two images are connected by a line.  The first image is a side-by-side image of a brain showing how differential privacy affects the image. The second image shows a central hospital securely communicating with three edge-node hospitals and aggregating the encrypted weights.
Figure 5. Homomorphic encryption helped in preserving privacy while using federated learning.

In Clara Train 3.1, all clients used certified SSL channels to communicate their local model updates with the server. The SSL certificates are needed to establish trusted communication channels and are provided through a third party that runs the provisioning tool and securely distributes them to the hospitals. This secures the communication to the server, but the server can still see the raw (unencrypted) model updates to do aggregation.

With Clara Train 4.0, the communication channels are still established using SSL certificates and the provisioning tool. However, each client optionally also receives additional keys to homomorphically encrypt their model updates before sending them to the server. The server doesn’t own a key and only sees the encrypted model updates. With homomorphic encryption, the server can aggregate these encrypted weights and then send the updated model back to the client. The clients can decrypt the model weights because they have the keys and can then continue with the next round of training.
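The following toy example sketches the flow described above: clients encrypt their updates, the server aggregates ciphertexts without a key, and clients decrypt the result. It uses a textbook Paillier scheme, which is additively homomorphic, with tiny hard-coded primes. This is an assumption made for illustration only: it is not the scheme or library that Clara Train uses, and it is not secure. The names encrypt, add_encrypted, and decrypt are hypothetical, and pow(x, -1, n) requires Python 3.8 or later.

```python
import math
import random

# Toy key material: two well-known Mersenne primes. Far too small and too
# structured for real use; chosen only so that the example runs instantly.
P = 2**31 - 1
Q = 2**61 - 1
N = P * Q                                            # public modulus
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)    # private: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)                                 # private: inverse of LAM mod N
SCALE = 10**6                                        # fixed-point scale for floats


def encrypt(weight):
    """Client side: encrypt one float weight using only the public modulus."""
    m = round(weight * SCALE) % N                    # negatives wrap modulo N
    r = random.randrange(1, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(1, N)
    return (pow(N + 1, m, N2) * pow(r, N, N2)) % N2


def add_encrypted(c1, c2):
    """Server side: multiplying ciphertexts adds the hidden plaintexts."""
    return (c1 * c2) % N2


def decrypt(cipher):
    """Client side: recover the (possibly negative) fixed-point value."""
    m = ((pow(cipher, LAM, N2) - 1) // N) * MU % N
    if m > N // 2:
        m -= N
    return m / SCALE


# Three clients, each with a two-element model update.
client_updates = [[0.5, -1.25], [1.5, 0.25], [1.0, 1.0]]
encrypted = [[encrypt(w) for w in update] for update in client_updates]

# The server sums ciphertexts position by position without holding LAM or MU.
aggregated = encrypted[0]
for update in encrypted[1:]:
    aggregated = [add_encrypted(a, b) for a, b in zip(aggregated, update)]

# Clients decrypt the aggregate and divide by the number of participants.
averaged = [decrypt(c) / len(client_updates) for c in aggregated]
print(averaged)
```

The server-side work consists of big-integer multiplications on ciphertexts that are much larger than the plaintext weights, which gives a feel for where the extra computational cost comes from.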

Homomorphic encryption ensures that each client’s changes to the global model stay hidden by preventing the server from reverse-engineering the submitted weights and discovering any training data. This added security comes at a computational cost on the server. However, it can play an important role in healthcare in making sure that patient data stays secure at each hospital while still benefiting from using federated learning with other institutions.

Bring your own components to Clara Train

MONAI provides a lot of domain-specific functionality directly through its transformations, loss, and metric functions. These core components are independent modules and can be integrated into any PyTorch program. However, if you’re a researcher developing state-of-the-art models, these components might not be sufficient.

When this is the case, you can include your own custom functions directly into Clara Train through the bring your own components (BYOC) functionality. By writing your components modularly and in Python, you can add them to the training configuration file.

Before getting started, you must define the Medical Model Archive (MMAR).  In Clara Train, an MMAR defines a standard structure for organizing all artifacts produced during the model development life cycle and defining your training workflow. You modify these configuration files to add in your custom functions.

Here’s an example of how to do this by adding your own custom network architecture and loss function.  First, start by defining these functions in their own Python file. For this example, assume that your custom functions are in a BYOC folder to make sure that you keep everything organized. This also allows you to see how the pathing works for calling out to your custom function from within the MMAR config file.

{
  "epochs": 10,
  "use_gpu": true,
  "multi_gpu": false,
  "amp": true,
  "determinism": {    },
  "train": {
    "loss": {    },
    "optimizer": {    },
    "lr_scheduler": {    },
    "model": {    },
    "pre_transforms": [    ],
    "dataset": {    },
    "dataloader": {    },
    "inferer": {    },
    "handlers": [    ],
    "post_transforms": [    ],
    "metrics": [    ],
    "trainer": {    }
  },
  "validate": {    }
}

For this post, we’re not including all the functions needed to create this network in the code examples. To see all the code required, see the complete example at NVIDIA/clara-train-examples in the BYOC Jupyter notebook.

The following code example defines your custom MyBasicUNet class, a UNet implementation with 1D, 2D, and 3D support, defined in a file labeled myNetworkArch.py:

from typing import Sequence, Union

import torch
import torch.nn as nn

from monai.networks.blocks import Convolution, UpSample
from monai.networks.layers.factories import Conv, Pool
from monai.utils import ensure_tuple_rep

class MyBasicUNet(nn.Module):
    def __init__(
        self,
        dimensions: int = 3,
        in_channels: int = 1,
        out_channels: int = 2,
        features: Sequence[int] = (32, 32, 64, 128, 256, 32),
        act: Union[str, tuple] = ("LeakyReLU", {"negative_slope": 0.1, "inplace": True}),
        norm: Union[str, tuple] = ("instance", {"affine": True}),
        dropout: Union[float, tuple] = 0.0,
        upsample: str = "deconv",
    ):
        super().__init__()
…

Next, you define the custom loss function that computes the average dice loss between two tensors. The following code example is a section of the MyDiceLoss class defined in a file labeled myLoss.py:

from typing import Callable, Optional, Union
 
import torch
from torch.nn.modules.loss import _Loss
 
from monai.networks import one_hot
from monai.utils import LossReduction, Weight

class MyDiceLoss(_Loss):
    def __init__(
        self,
        include_background: bool = True,
        to_onehot_y: bool = False,
        sigmoid: bool = False,
        softmax: bool = False,
        other_act: Optional[Callable] = None,
        squared_pred: bool = False,
        jaccard: bool = False,
        reduction: Union[LossReduction, str] = LossReduction.MEAN,
        smooth_nr: float = 1e-5,
        smooth_dr: float = 1e-5,
        batch: bool = False,
    ) -> None:
        super().__init__()
…
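Because the body of MyDiceLoss is elided above, the following is a minimal worked example of what a Dice loss computes, written on plain Python lists so that it runs without PyTorch. It assumes the common formulation 1 - (2 * intersection + smooth_nr) / (sum(pred) + sum(target) + smooth_dr). It is not the MyDiceLoss implementation, and options such as include_background, softmax, and reduction are left out.

```python
def dice_loss(pred, target, smooth_nr=1e-5, smooth_dr=1e-5):
    """Dice loss between two equally sized sequences of values in [0, 1].

    pred holds predicted probabilities, target holds 0/1 ground-truth labels.
    The smoothing terms avoid division by zero when both inputs are empty.
    """
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    return 1.0 - (2.0 * intersection + smooth_nr) / (denominator + smooth_dr)


perfect = dice_loss([1, 0, 1, 1], [1, 0, 1, 1])              # full overlap
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])             # no overlap
partial = dice_loss([0.8, 0.2, 0.6, 0.0], [1, 0, 1, 0])      # soft predictions
print(perfect, disjoint, partial)
```

A loss near 0 means the prediction overlaps the target almost perfectly, and a loss near 1 means almost no overlap.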

Now that you’ve defined the custom network and loss functions, here’s how to add them to the MMAR configuration. The configuration file for this run is labeled trn_BYOC_arch_loss.json, and you focus on two different sections of the JSON file.

First, add the custom network to the config by defining a train section and within that section a model field. This is where you add a reference to the custom model. When using an MMAR, you can also define arguments to pass to the function.  

The following code example shows the train -> model section of the config file:

"model": {
  "path": "BYOC.myNetworkArch.MyBasicUNet",
  "args": {
    "dimensions": 3,
    "in_channels": 1,
    "out_channels": 2,
    "features": [16, 32, 64, 128, 256, 16],
    "norm": "batch"
  }
}

It has two arguments:

  • path—Set to BYOC.myNetworkArch.MyBasicUNet.
  • args—Passes parameters to the custom model function.

The path description for the model is defined by the path from the root directory to your custom network file and then the file and class name for the network. The following breakdown shows how to determine each part of the path parameter.

"path": "BYOC.myNetworkArch.MyBasicUNet"

  • BYOC—Folder where the myNetworkArch.py file is located. 
  • myNetworkArch—Name of the Python file that contains the custom network. 
  • MyBasicUNet—Class that you instantiate to call the custom network.
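To show how a dotted path like this can map to a Python class, here is a minimal sketch that splits the path into a module part and a class name, imports the module, and instantiates the class with the args dictionary. This is an assumption about how such a loader could work, not Clara Train's actual implementation; instantiate_from_config is a hypothetical helper, and a standard-library class stands in for the custom network so that the example runs anywhere.

```python
import importlib


def instantiate_from_config(component):
    """Build an object from a {"path": "pkg.module.Class", "args": {...}} entry."""
    # Everything before the last dot is the module; the rest is the class name.
    module_path, _, class_name = component["path"].rpartition(".")
    module = importlib.import_module(module_path)
    cls = getattr(module, class_name)
    return cls(**component.get("args", {}))


# Stand-in example: "fractions" plays the role of BYOC.myNetworkArch and
# "Fraction" plays the role of MyBasicUNet.
frac = instantiate_from_config(
    {"path": "fractions.Fraction", "args": {"numerator": 3, "denominator": 4}}
)
print(frac)
```

For the import to succeed with a real custom component, the folder that contains BYOC would need to be importable from where training runs.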

You use a similar structure to add in the custom loss function, but this time you place it in train -> loss within the config file. The following code example shows what the train -> loss section of the config should look like:

"loss": {
  "path": "BYOC.myLoss.MyDiceLoss",
  "args": {
    "to_onehot_y": true,
    "softmax": true
  }
}

It has two arguments:

  • path—Set to BYOC.myLoss.MyDiceLoss.
  • args—Passes parameters to the custom loss function.

This config section follows the same rules as earlier for the path argument.  
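As a quick way to see where both custom sections sit inside the training configuration, the following sketch assembles a reduced config in Python and serializes it as JSON. Only the values already shown in this post are used; the optimizer, dataset, handlers, and other sections of a real MMAR config are omitted, so the output is not a complete trn_BYOC_arch_loss.json.

```python
import json

config = {
    "epochs": 10,
    "use_gpu": True,
    "multi_gpu": False,
    "amp": True,
    "train": {
        # Custom loss function, referenced by folder.file.class
        "loss": {
            "path": "BYOC.myLoss.MyDiceLoss",
            "args": {"to_onehot_y": True, "softmax": True},
        },
        # Custom network architecture, referenced the same way
        "model": {
            "path": "BYOC.myNetworkArch.MyBasicUNet",
            "args": {
                "dimensions": 3,
                "in_channels": 1,
                "out_channels": 2,
                "features": [16, 32, 64, 128, 256, 16],
                "norm": "batch",
            },
        },
    },
}

# json.dumps writes straight double quotes and lowercase true/false,
# which is what a JSON parser expects.
config_text = json.dumps(config, indent=2)
custom_paths = {name: section["path"] for name, section in config["train"].items()}
print(config_text)
```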

To start training, I’ve included a bash script called train_W_Config.sh. Pass the config file as the first argument when calling the script. The following code example shows the training script that calls out to the relevant Clara Train module, along with all the parameters.

python3 -u -m medl.apps.train \
    -m $MMAR_ROOT \
    -c $CONFIG_FILE \
    -e $ENVIRONMENT_FILE \
    --write_train_stats \
    --set \
    print_conf=True \
    MMAR_CKPT_DIR=$MMAR_CKPT_DIR

Now you’re ready to start training! Run the following command, which calls the training script and passes in the configuration file: 

 $MMAR_ROOT/commands/train_W_Config.sh trn_BYOC_arch_loss.json 

Summary

You’ve now added your own custom functions to an MMAR training pipeline. You can add any other custom function to the MMAR in a similar way as either of the functions walked through earlier. To find the complete example of this BYOC Jupyter notebook, along with additional notebooks on AI-assisted annotation, AutoML, digital pathology, and federated learning, see the NVIDIA/clara-train-examples GitHub repository.

You can also sign up today to get access to the Clara Train 4.0 Early Access program. We’re only a few weeks away from general availability, so check back soon for the full release!

Fast-Tracking Hand Gesture Recognition AI Applications with Pretrained Models from NGC

One of the main challenges and goals when creating an AI application is producing a robust model that is performant with high accuracy. Building such a deep learning model is time consuming. It can take weeks or months of retraining, fine-tuning, and optimizing until the model satisfies the necessary requirements. For many developers, building a deep learning AI pipeline from scratch is not a viable option, which is why we built the NVIDIA NGC catalog.

The NGC catalog is the NVIDIA GPU-optimized hub of AI and HPC containers, pretrained models, SDKs, and Helm charts. It is designed to simplify and accelerate end-to-end workflows.

The NGC catalog also hosts a rich variety of task-specific, pretrained models for a variety of domains, such as healthcare, retail, and manufacturing, and across AI tasks, such as computer vision and speech and language understanding. In this post, we discuss the benefits of using pretrained models from the NGC catalog. We then show how you can use pretrained models for computer vision to build a hand gesture recognition AI application.

Why use a pretrained model?

To build an AI model from scratch, you often need access to large, high-quality datasets. In many instances, you may not have access to such datasets and may have to acquire the data yourself or use third-party resources. Even then, the data might require restructuring and preparation for training. This becomes a bottleneck for data scientists who must spend a lot of time labeling, annotating, and transforming the data instead of designing AI models.

Other typical development steps involve building a deep learning model from an open-source framework, then training, refining, and retraining it over several iterations to reach the desired accuracy. The size and complexity of deep learning models is another challenge. Over the last five years, the demand for computational resources has increased by ~30,000 times, from ResNet-50 to BERT-Megatron today. Coping with such large models requires access to large-scale clusters to take advantage of the scalability offered by multi-node systems.

A pretrained model, as the name suggests, is a model that has been previously trained on a particular representative dataset. It contains the weights and biases fine-tuned for this representation. To accelerate development, you can initialize your own models with pretrained ones. This typically helps you save time and allows you to run more iterations to refine the model. This technique is called transfer learning.

Pretrained models for various use cases and domains

The NGC catalog hosts models specific to certain industries, such as automotive, healthcare, manufacturing, retail, and so on. The catalog also provides models for the following use cases:

Computer vision

  • Detection: SSD PyTorch
  • Classification: ResNet50 v1.5, resnext101-32x4d
  • Segmentation: MaskRCNN, UNET Industrial

Speech

  • Automatic speech recognition: Jasper
  • Speech synthesis: FastPitch, Tacotron 2, WaveGlow
  • Translation: GNMT, Transformer

Understanding

  • Language modeling: BERT, Electra
  • Recommender systems: Wide and Deep, VAE

These pretrained models are developed directly by NVIDIA Research and by NVIDIA partners. You can readily integrate these pretrained models into existing industry SDKs, such as NVIDIA Clara for healthcare, NVIDIA Jarvis for conversational AI, NVIDIA Merlin for deep learning recommender systems, and NVIDIA DRIVE for autonomous vehicles, allowing you to get to production faster.

Model credentials

Models now include credentials that help you quickly identify the right model to deploy for your AI software development. These credentials provide a report card for the model, showing the training configurations, performance metrics, and other key parameters. They list important attributes such as model accuracy, epochs, batch size, precision, training dataset, and throughput, which help you judge the usability of the models and give you the confidence to deploy them.

The model credential is a scorecard of the model itself and  shows many key parameters associated with the model: architecture, performance, batch size, epoch, type of dataset, and more
Figure 1. Model credentials for the BERT PyTorch Checkpoint pretrained model on the NGC catalog.

These model credentials enable you to quickly identify the right models and deploy them faster in production. The scorecard metrics are customizable so that you can use the appropriate attributes to better describe the models. For example, a computer vision model would better describe the inference performance with the images-per-second metric, while sentences-per-second is suitable for NLP models.

Transfer learning

After you select the model, you might need to train with a custom dataset for a different task. NVIDIA Transfer Learning Toolkit (TLT) is a Python-based AI toolkit for taking purpose-built pretrained AI models and customizing them with your own data. TLT adapts popular network architectures and backbones to your data, allowing you to train, fine-tune, prune, and export highly optimized and accurate AI models for edge deployment. TLT is a standard feature for all pretrained models. Fine-tune these pretrained models with your own data. This can considerably speed up model development time by nearly 10X: from around 80 weeks to about eight weeks.

Figure 2. The overall stack of the TLT that you can apply on NGC pretrained models with your own data.

Accelerating training performance

In this section, we highlight the breakthroughs in key technologies implemented across the pretrained models: automatic mixed precision and multi-GPU training.

Automatic mixed precision

Deep neural networks can often be trained with a mixed-precision strategy, employing FP16 and FP32 precision. This results in a significant reduction in computation time and memory bandwidth requirements, while preserving model accuracy. For more information, see the Mixed Precision Training paper from NVIDIA Research. With automatic mixed precision (AMP), you can enable mixed precision with either no code changes or only minimal changes.
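One reason mixed precision needs care is that very small gradient values fall below the range FP16 can represent. The Mixed Precision Training paper addresses this with loss scaling, which AMP applies automatically. The sketch below demonstrates the underflow and the effect of scaling using only the Python standard library; the gradient value and the scale factor of 65536 are arbitrary numbers chosen for illustration, and to_fp16 is a hypothetical helper.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


gradient = 1e-8          # a small gradient, below the FP16 representable range
loss_scale = 65536.0     # 2**16, an illustrative scale factor

naive = to_fp16(gradient)                  # stored directly in FP16
scaled = to_fp16(gradient * loss_scale)    # scaled first, then stored in FP16
recovered = scaled / loss_scale            # unscaled again in higher precision

print(naive, recovered)
```

Stored directly, the gradient is flushed to zero and the update is lost; scaled before the FP16 conversion and unscaled afterward, it survives with only a small rounding error.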

AMP is a standard feature across all NGC models. It automatically uses the Tensor Cores on NVIDIA Volta, NVIDIA Turing, and NVIDIA Ampere Architectures. You can get up to 3x faster training with Tensor Cores.

Multi-GPU training

Multi-GPU training is a standard feature implemented on all NGC models. Under the hood, the Horovod and NCCL libraries are employed for distributed training and efficient communication. For most of the models, multi-GPU training on a set of homogeneous GPUs is enabled by setting the number of GPUs.

Hand gesture recognition AI application

In this example, you start with a pretrained detection model, repurpose it for hand detection using TLT 3.0, and use it together with the purpose-built gesture recognition model. After it’s trained, you deploy this model on NVIDIA Jetson.

Setting up the environment

  • Ubuntu 18.04 LTS
  • python >=3.6.9
  • docker-ce >= 19.03.5
  • docker-API 1.40
  • nvidia-container-toolkit >= 1.3.0-1
  • nvidia-container-runtime >= 3.4.0-1
  • nvidia-docker2 >= 2.5.0-1
  • nvidia-driver >= 455.xx

You must also have an NGC account and API key. It’s free to use. When you’re registered, open the setup page for further instructions. For hardware requirements, see Requirements and Installation.

Set up your Python environment using virtualenv and virtualenvwrapper:

pip3 install virtualenv
pip3 install virtualenvwrapper

Add the following lines to your shell startup file (.bashrc, .profile, and so on) to set the location where the virtual environments should live, the location of your development project directories, and the location of the script installed with this package:

export WORKON_HOME=$HOME/.virtualenvs
export PROJECT_HOME=$HOME/Devel
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3
export VIRTUALENVWRAPPER_VIRTUALENV=/home/USER_NAME/.local/bin/virtualenv
source ~/.local/bin/virtualenvwrapper.sh

Create a virtual environment:

mkvirtualenv tlt_gesture_demo

Activate your virtual environment:

workon tlt_gesture_demo

If you forget the virtualenv name, type workon with no arguments to list the available environments.

For more information, see virtualenvwrapper 5.0.1.dev2.

Setting up TLT 3.0

In TLT 3.0, we have created an abstraction above the container. You launch all your training jobs from the launcher. There’s no need to manually pull the appropriate container, as tlt-launcher handles that. You can install the launcher using pip with the following commands:

pip3 install nvidia-pyindex
pip3 install nvidia-tlt

You also need to install Jupyter notebook to work with this demo.

pip install notebook

Preparing the EgoHands dataset

To train the hand-detection model, we used the publicly available dataset EgoHands, provided by IU Computer Vision Lab, Indiana University. EgoHands contains 48 different videos of egocentric interactions with pixel-level, ground-truth annotations for 4,800 frames and more than 15,000 hands. To use it with TLT, the dataset must be converted into KITTI format. For this example, we adapted the open-source script by JK Jung.

Apply a small change to the original script to make it compatible with TLT. In the function box_to_line(box), remove the score component by replacing the return statement with the following:

return ' '.join(['hand',
                     '0',
                     '0',
                     '0',
                     '{} {} {} {}'.format(*box),
                     '0 0 0',
                     '0 0 0',
                     '0'])
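
For context, here is a self-contained sketch of the modified function with each KITTI label field annotated. It assumes that box is a tuple in (x_min, y_min, x_max, y_max) order; check the original script for the actual ordering. The sample coordinates are made up.

```python
def box_to_line(box):
    """Format one bounding box as a KITTI label line for the 'hand' class."""
    return ' '.join(['hand',                       # object class
                     '0',                          # truncated
                     '0',                          # occluded
                     '0',                          # alpha (observation angle)
                     '{} {} {} {}'.format(*box),   # 2D bounding box
                     '0 0 0',                      # 3D dimensions (unused)
                     '0 0 0',                      # 3D location (unused)
                     '0'])                         # rotation_y (unused)


# Hypothetical box, assumed to be (x_min, y_min, x_max, y_max).
line = box_to_line((10, 20, 110, 220))
print(line)
```

Each label line then has 15 space-separated fields, with no trailing score.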

To convert the dataset, download the prepare_egohands.py file, apply the above-mentioned modification, set the correct paths, and follow the instructions in egohands_dataset/kitti_conversion.ipynb in this project. In addition to calling the original conversion script, this notebook converts your dataset into training and testing sets, as required by TLT.
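
The train/test split can be sketched as follows. This is only an illustration of a deterministic split: the function name, the 20% test fraction, the seed, and the sample names are assumptions, and the notebook may split the data differently.

```python
import random


def split_dataset(names, test_fraction=0.2, seed=42):
    """Shuffle sample names deterministically and split them into train and test lists."""
    shuffled = sorted(names)
    random.Random(seed).shuffle(shuffled)
    num_test = int(len(shuffled) * test_fraction)
    return shuffled[num_test:], shuffled[:num_test]


# Hypothetical sample names standing in for image/label file stems.
samples = ["frame_{:04d}".format(i) for i in range(10)]
train, test = split_dataset(samples)
print(len(train), len(test))
```

Sorting before shuffling with a fixed seed keeps the split reproducible regardless of the order in which files are listed.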

Training the detection model

As part of TLT 3.0, we provide a set of Jupyter notebooks demonstrating training workflows for various models. The notebook for this demo can be found in the training_tlt directory in the gesture_recognition_tlt_deepstream GitHub repository.

After activating the virtual environment, navigate to the directory, start the notebook, and follow the instructions in the training_tlt/handdetect_training.ipynb notebook in your browser.

cd training_tlt
jupyter notebook

Running TLT training on DetectNet V2 initialized with PeopleNet

Start with fine-tuning the pretrained PeopleNet model from the NGC catalog. PeopleNet model is a DetectNet V2 model trained to recognize objects of three classes: persons, bags, and faces. After the training, these categories are overwritten by a single category: hands. We chose to initialize our model with PeopleNet, as hands are elements found in humans and the network should already have learned some representation of this category.

The initial model can detect one or more physical objects from three categories within an image and return a box around each object, as well as a category label for each object. Three categories of objects detected by this model are persons, bags, and faces.
Figure 3. The PeopleNet pretrained model available on the NGC catalog.

DetectNet V2 now supports restart from checkpoints. If the training job is killed prematurely, you may resume training from the closest checkpoint by re-running the same command. Make sure to use the checkpoint when restarting the training.

After training, the model should be evaluated to see if you were able to achieve the desired performance in terms of accuracy. For that, TLT provides an evaluation tool, which can be executed in the notebook by running the following command:

!tlt detectnet_v2 evaluate -e $SPECS_DIR/egohands_train_resnet34_kitti.txt \
                           -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned_peoplenet/weights/resnet34_detector.tlt \
                           -k $KEY

The parameters are as follows:

  • The configuration file (the same one used for training)
  • The trained model file
  • The unique key used for training
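
Detection metrics such as mAP are commonly built on intersection over union (IoU), which measures how well a predicted box overlaps a ground-truth box. The following is a generic sketch of that overlap measure, not TLT's evaluation code; the box layout (x_min, y_min, x_max, y_max) is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are zero when the boxes do not intersect.
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```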

Pruning the model

After you have fine-tuned the model, prune it so that you can build a model smaller in size for inference. Pruning is a process where you remove unnecessary connections in a neural network so that the corresponding computation does not need to be executed, freeing up memory and accelerating the model.

However, pruning brings about a loss in accuracy of the model. Usually, you adjust a threshold to trade off accuracy against model size. A higher threshold value gives you a smaller model but with lower accuracy. The threshold to use is dependent on the dataset. Pick some value as a starting point. If the retrain accuracy is good, you can increase this value to get a smaller model size. Otherwise, lower this value to get better accuracy. In some internal studies, we noticed that a threshold value of 0.01 is a good starting point for DetectNet V2 models.
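
The effect of the threshold can be sketched in a few lines. This illustration assumes that channels are ranked by the L2 norm of their weights, normalized by the largest norm in the layer, and that channels at or below the threshold are dropped. That is an assumption about the mechanism made for illustration, not a description of the TLT implementation, and the weights are made up.

```python
import math


def channels_to_keep(kernels, threshold):
    """Return indices of channels whose normalized L2 norm exceeds the threshold."""
    norms = [math.sqrt(sum(w * w for w in kernel)) for kernel in kernels]
    largest = max(norms)
    return [i for i, norm in enumerate(norms) if norm / largest > threshold]


# Hypothetical per-channel weights for a single layer.
kernels = [[0.9, -0.8], [0.01, 0.02], [0.3, 0.4], [0.001, 0.0]]
for pth in (0.01, 0.1, 0.6):
    print(pth, channels_to_keep(kernels, pth))
```

Raising the threshold keeps fewer channels, which gives a smaller model and usually lower accuracy before retraining.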

# Create an output directory if it doesn't exist.
!mkdir -p $LOCAL_EXPERIMENT_DIR/experiment_dir_pruned

!tlt detectnet_v2 prune \
                  -m $USER_EXPERIMENT_DIR/experiment_dir_unpruned_peoplenet/weights/resnet34_detector.tlt \
                  -o $USER_EXPERIMENT_DIR/experiment_dir_pruned/resnet34_nopool_bn_detectnet_v2_pruned.tlt \
                  -eq union \
                  -pth 0.0000052 \
                  -k $KEY

The model must be retrained to recover accuracy after pruning. Create a retraining specification that uses the pruned model as the pretrained weights.

To retrain from the pruned model, set the load_graph option to true in the model_config so that the pruned model graph is loaded. If the retrained model shows a drop in mAP, the original model may have been pruned too aggressively. In that case, lower the pruning threshold to reduce the pruning ratio, prune again, and retrain with the new model.

DetectNet V2 now supports quantization aware training (QAT) to optimize the model even more. This step is usually performed during retraining after pruning.
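
To see why quantization can cost a little accuracy, consider this minimal sketch of symmetric INT8 "fake quantization": values are mapped to 8-bit integers and back, so the network sees the rounding error during training. The function name, the per-tensor symmetric scheme, and the sample weights are assumptions for illustration; TLT's QAT implementation may differ.

```python
def fake_quantize(values, num_bits=8):
    """Quantize to signed integers with one symmetric scale, then dequantize."""
    q_max = 2 ** (num_bits - 1) - 1              # 127 for INT8
    scale = max(abs(v) for v in values) / q_max
    if scale == 0:
        return list(values)                      # all zeros: nothing to quantize
    quantized = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return [q * scale for q in quantized]


# Hypothetical weights for illustration.
weights = [0.5, -1.27, 0.003, 0.8]
print(fake_quantize(weights))
```

Each value moves by at most half a quantization step, and values much smaller than the step collapse to zero, which is the kind of error that retraining with QAT lets the model adapt to.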

Retraining after pruning with QAT

All DetectNet models, unpruned and pruned, can be converted to QAT models by setting the enable_qat parameter in the training_config component of the spec file to true.

!tlt detectnet_v2 train -e $SPECS_DIR/egohands_retrain_resnet34_kitti_qat.txt \
                        -r $USER_EXPERIMENT_DIR/experiment_dir_retrain_qat \
                        -k $KEY \
                        -n resnet34_detector_pruned_qat \
                        --gpus $NUM_GPUS

Evaluating the QAT-converted model

This section evaluates a QAT-enabled, pruned, retrained model. The mAP of this model should be comparable to that of the pruned retrained model without QAT. However, due to quantization, it is possible sometimes to see a drop in the mAP value for certain datasets. To evaluate the new model, execute the following command in the notebook:

!tlt detectnet_v2 evaluate -e $SPECS_DIR/egohands_retrain_resnet34_kitti_qat.txt \
                           -m $USER_EXPERIMENT_DIR/experiment_dir_retrain_qat/weights/resnet34_detector_pruned_qat.tlt \
                           -k $KEY \
                           -f tlt

Acquiring the gesture recognition model from the NGC catalog

For this application, use the trained hand-detection model cascaded with GestureNet, a gesture recognition model from the NGC catalog.

Model card for GestureNet
Figure 4. The GestureNet model from the NGC catalog.

You can download the GestureNet model from the NGC catalog using wget:

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/tlt_gesturenet/versions/deployable_v1.0/zip -O tlt_gesturenet_deployable_v1.0.zip

Or you can use the CLI command:

ngc registry model download-version "nvidia/tlt_gesturenet:deployable_v1.0"

As you are not retraining this model and using it directly for deployment, select the deployable_v1.0 version.

Deploying on NVIDIA Jetson with the DeepStream SDK

Now that you have fine-tuned a detection network and downloaded a gesture recognition model, deploy these models on Jetson, the target edge device. In this section, we show you how to deploy a model using the DeepStream SDK, a multi-platform scalable framework for video analytics applications.

Prerequisites

These prerequisites are specific for Jetson deployment. To repurpose this solution to run on a discrete GPU, see DeepStream Getting Started.

  • CUDA 10.2
  • cuDNN 8.0.0
  • TensorRT 7.1.0
  • JetPack >= 4.4

If you don’t have the DeepStream SDK installed with your JetPack version, follow the Jetson setup instructions from the DeepStream Quick Start Guide.

Preparing the models for deployment

There are two ways of deploying a model with the DeepStream SDK. The first relies on the TensorRT runtime and requires the model to be converted into a TensorRT engine. The second relies on Triton Inference Server, which can run as a standalone server or be integrated into the DeepStream app. This setup is highly flexible because it accepts models in various formats that do not have to be converted into TensorRT format. In this post, we show both approaches by deploying the hand detector using the TensorRT runtime and the gesture recognition model using Triton Server.

To deploy a model to DeepStream using the TensorRT runtime, make sure that the model is convertible into TensorRT. All layers and operations in the model should be supported by TensorRT. For more information about supported layers and operations, see the TensorRT support matrix.

Converting the models to TensorRT

To take advantage of the hardware and software accelerations on the target edge device, you must convert the .etlt models into NVIDIA TensorRT engines. TensorRT is an SDK for high-performance, deep learning inference. It includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications.

There are two ways to convert your model into a TensorRT engine. You can do it either directly using DeepStream, or by using the tlt-converter utility. We show you both ways.

The trained detector model is converted automatically by DeepStream during the first run. For the following runs, you can specify the path to the produced engine in the corresponding DeepStream config. We are providing our DeepStream configs with this project.

Because the GestureNet model is quite new, the 5.0 version of DeepStream used for this demo does not support its conversion. However, you can convert it using the updated tlt-converter. To download it, choose your JetPack version:

For more information about using tlt-converter with different hardware and software, see Transfer Learning Toolkit Get Started.

When you have the tlt-converter installed on Jetson, convert the GestureNet model using the following command:

./tlt-converter -k nvidia_tlt \
    -t fp16 \
    -p input_1,1x3x160x160,1x3x160x160,2x3x160x160 \
    -e /EXPORT/PATH/model.plan \
    /PATH/TO/YOUR/GESTURENET/model.etlt

Because you didn’t change the model and are using it as-is, the model key remains the same (nvidia_tlt) as specified on NGC.

You are converting the model into FP16 format, as there isn’t any INT8 calibration file for the deployable model. Make sure to provide correct values for your model path as well as for the export path.
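
The value passed to -p in the command above reads as an input name followed by three shapes, which we take to be the minimum, optimal, and maximum input shapes in batch x channels x height x width order. The following sketch parses that string so the three shapes are easier to compare; the function name parse_profile is hypothetical and is not part of tlt-converter.

```python
def parse_profile(profile):
    """Split '<name>,<min>,<opt>,<max>' into a name and three shape tuples."""
    name, *shapes = profile.split(",")
    if len(shapes) != 3:
        raise ValueError("expected min, opt, and max shapes")
    min_shape, opt_shape, max_shape = (
        tuple(int(dim) for dim in shape.split("x")) for shape in shapes
    )
    return name, min_shape, opt_shape, max_shape


name, min_shape, opt_shape, max_shape = parse_profile(
    "input_1,1x3x160x160,1x3x160x160,2x3x160x160"
)
print(name, min_shape, opt_shape, max_shape)
```

Read this way, the engine accepts 3x160x160 inputs with a batch size between 1 and 2 and is tuned for a batch size of 1.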

Configuring GestureNet for Triton Inference Server

To deploy your model using Triton Inference Server, prepare a model repository in a specified format. It should have the following structure:

└── trtis_model_repo
    └── hcgesture_tlt
        ├── 1
        │   └── model.plan
        └── config.pbtxt

In this structure, model.plan is the .plan file generated with the tlt-converter, and config.pbtxt has the following content:

name: "hcgesture_tlt"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "input_1"
    data_type: TYPE_FP32
    dims: [ 3, 160, 160 ]
  }
]
output [
  {
    name: "activation_18"
    data_type: TYPE_FP32
    dims: [ 6 ]
  }
]
dynamic_batching { }

For more information about configuring the Triton Server repository, see Model Repository.
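
The repository layout can also be created with a short script. This is a minimal sketch: the function name create_model_repo is hypothetical, the empty model.plan is only a placeholder for the engine produced by tlt-converter, and the one-line config text passed in the demo stands in for the full config.pbtxt content shown earlier.

```python
import tempfile
from pathlib import Path


def create_model_repo(root, config_text, plan_bytes=b""):
    """Create trtis_model_repo/hcgesture_tlt with version directory 1 under root."""
    model_dir = Path(root) / "trtis_model_repo" / "hcgesture_tlt"
    version_dir = model_dir / "1"
    version_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder engine file; copy the real model.plan here instead.
    (version_dir / "model.plan").write_bytes(plan_bytes)
    (model_dir / "config.pbtxt").write_text(config_text)
    return model_dir


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in config; use the full config.pbtxt content in practice.
    create_model_repo(tmp, 'name: "hcgesture_tlt"\n')
    print(sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*")))
```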

Customizing deepstream-app

You can configure the sample deepstream-app in a flexible way: as a primary detector, a classifier, or even cascading several models such as a detector and a classifier. In such a case, the detector passes cropped objects of interest to the classifier. This process happens in a DeepStream pipeline where each component takes advantage of the hardware components in Jetson devices.

Figure 5 shows the pipeline for the application.

The pipeline includes capturing the video, batching, decoding, scaling the data, applying the fine-tuned hand detector model, and finally using the gesture recognition model to follow the hand gestures.
Figure 5. Application pipeline, from capturing the live video feed to recognizing gestures.

The GestureNet model you are using in this post was trained on images with a big margin around the region of interest (ROI). At the same time, the trained detector model produces narrow boxes around objects of interest (the hand, in this case). As a result, the crops passed to the classifier look different from the images that the classifier was trained on. There are two ways to solve this problem:

  • Retrain the GestureNet model with a new dataset reflecting the setup.
  • Extend the cropped ROIs by some suitable margin.

As we wanted to use the GestureNet model as-is, we chose the second path, which required modification of the original app.
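
The idea of extending an ROI by a margin while staying inside the frame can be sketched independently of DeepStream. In this sketch, the function names and the sample frame size are assumptions; the expanded edges are computed from the original box first, and the new width and height are then derived from the clipped edges.

```python
def clip(value, low, high):
    """Clamp value to the inclusive range [low, high]."""
    return max(low, min(value, high))


def expand_box(left, top, width, height, margin, frame_width, frame_height):
    """Grow a box by margin on every side, clipped to the frame boundaries."""
    new_left = clip(left - margin, 0, frame_width - 1)
    new_top = clip(top - margin, 0, frame_height - 1)
    new_right = clip(left + width + margin, 0, frame_width - 1)
    new_bottom = clip(top + height + margin, 0, frame_height - 1)
    return new_left, new_top, new_right - new_left, new_bottom - new_top


# Hypothetical 1280x720 frame with a 200-pixel margin.
print(expand_box(300, 250, 100, 120, 200, 1280, 720))
print(expand_box(50, 30, 100, 100, 200, 1280, 720))
```

The second call shows a box near the top-left corner: the margin is cut off at the frame edge instead of producing negative coordinates.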

To modify the metadata returned by the detector to crop bigger bounding boxes, implement the following function:

#define CLIP(a,min,max) (MAX(MIN(a, max), min))

int MARGIN = 200;

static void
modify_meta (GstBuffer * buf, AppCtx * appCtx)
{
  int frame_width;
  int frame_height;
  get_frame_width_and_height(&frame_width, &frame_height, buf);

  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);
  NvDsFrameMetaList *frame_meta_list = batch_meta->frame_meta_list;
  NvDsObjectMeta *object_meta;
  NvDsFrameMeta *frame_meta;
  NvDsObjectMetaList *obj_meta_list;
  while (frame_meta_list != NULL) {
     frame_meta = (NvDsFrameMeta *) frame_meta_list->data;
     obj_meta_list = frame_meta->obj_meta_list;
     while (obj_meta_list != NULL) {
       object_meta = (NvDsObjectMeta *) obj_meta_list->data;
       // Compute the expanded edges from the original box before overwriting any field.
       float left = CLIP(object_meta->rect_params.left - MARGIN, 0, frame_width - 1);
       float top = CLIP(object_meta->rect_params.top - MARGIN, 0, frame_height - 1);
       float right = CLIP(object_meta->rect_params.left + object_meta->rect_params.width + MARGIN, 0, frame_width - 1);
       float bottom = CLIP(object_meta->rect_params.top + object_meta->rect_params.height + MARGIN, 0, frame_height - 1);
       object_meta->rect_params.left = left;
       object_meta->rect_params.top = top;
       object_meta->rect_params.width = right - left;
       object_meta->rect_params.height = bottom - top;
       obj_meta_list = obj_meta_list->next;
     }
     frame_meta_list = frame_meta_list->next;
  }
}

To display the original bounding boxes, implement the following function, which restores the meta bounding boxes to their original size:

static void
restore_meta (GstBuffer * buf, AppCtx * appCtx)
{

  int frame_width;
  int frame_height;
  get_frame_width_and_height(&frame_width, &frame_height, buf);

  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);
  NvDsFrameMetaList *frame_meta_list = batch_meta->frame_meta_list;
  NvDsObjectMeta *object_meta;
  NvDsFrameMeta *frame_meta;
  NvDsObjectMetaList *obj_meta_list;
  while (frame_meta_list != NULL) {
     frame_meta = (NvDsFrameMeta *) frame_meta_list->data;
     obj_meta_list = frame_meta->obj_meta_list;
     while (obj_meta_list != NULL) {
       object_meta = (NvDsObjectMeta *) obj_meta_list->data;

       // Restore the original bounding boxes for output (copy the values preserved in detector_bbox_info)
       object_meta->rect_params.left = object_meta->detector_bbox_info.org_bbox_coords.left;
       object_meta->rect_params.top = object_meta->detector_bbox_info.org_bbox_coords.top;
       object_meta->rect_params.width = object_meta->detector_bbox_info.org_bbox_coords.width;
       object_meta->rect_params.height = object_meta->detector_bbox_info.org_bbox_coords.height;

       obj_meta_list = obj_meta_list->next;
     }
     frame_meta_list = frame_meta_list->next;
  }
}

Also, implement this helper function to get frame width and height from the buffer.

static void
get_frame_width_and_height (int * frame_width, int * frame_height, GstBuffer * buf) {
    GstMapInfo map_info;
    memset(&map_info, 0, sizeof(map_info));
    if (!gst_buffer_map (buf, &map_info, GST_MAP_READ)){
      g_print("Error: Failed to map GST buffer");
    } else {
      NvBufSurface *surface = NULL;
      surface = (NvBufSurface *) map_info.data;
      *frame_width = surface->surfaceList[0].width;
      *frame_height = surface->surfaceList[0].height;
      gst_buffer_unmap(buf, &map_info);
    }
}

Building the application

To build the custom app, copy deployment_deepstream/deepstream-app-bbox to /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps.

Install the required dependencies:

sudo apt-get install libgstreamer-plugins-base1.0-dev libgstreamer1.0-dev \
   libgstrtspserver-1.0-dev libx11-dev libjson-glib-dev

Build an executable:

cd /opt/nvidia/deepstream/deepstream-5.0/sources/apps/sample_apps/deepstream-app-bbox
make

Configuring the DeepStream pipeline

Before executing the app, you must provide configuration files. For more information about configuration parameters, see Application Architecture. You can find the configuration files of this demo under deployment_deepstream/egohands-deepstream-app-trtis/. In the same directory, you can also find the label files required by the models.

Finally, you must make your models discoverable by the app. According to the configs, the directory structure under deployment_deepstream/egohands-deepstream-app-trtis/ for model storage looks like the following:

├── tlt_models
│   ├── tlt_egohands_qat
│   │   ├── calibration_qat.bin
│   │   └── resnet34_detector_qat.etlt
└── trtis_model_repo
    └── hcgesture_tlt
        ├── 1
        │   └── model.plan
        └── config.pbtxt

You may notice that the file resnet34_detector_qat.etlt_b16_gpu0_int8.engine specified in the config config_infer_primary_peoplenet_qat.txt is missing in the current setup. It is generated upon the first execution and used directly in the following runs.
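
The engine file name appears to combine the model file name, batch size, GPU ID, and precision. The helper below reproduces the name from the text under that assumption, which can be handy when you set the engine path in a config for later runs. The function name and the inferred pattern are assumptions based only on the file name shown above.

```python
def engine_file_name(model_file, batch_size, gpu_id, precision):
    """Build an engine file name in the pattern inferred from the example above."""
    return "{}_b{}_gpu{}_{}.engine".format(model_file, batch_size, gpu_id, precision)


print(engine_file_name("resnet34_detector_qat.etlt", 16, 0, "int8"))
```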

Executing the app

In general, the execution command looks like the following:

./deepstream-app-bbox -c <config-file>

In this case, with the configs provided, it looks like the following:

./deepstream-app-bbox -c source1_primary_detector_qat.txt

The app should be running now.

Summary

In this post, we demonstrated how you can use a pretrained model from the NGC catalog to fine-tune, optimize, and deploy a gesture recognition application using the DeepStream SDK.

In addition to PeopleNet and GestureNet models used for this example, you can also find models in the NGC catalog for other use cases, such as conversational AI, speech, and language understanding. For more information, see the following resources:

NVIDIA Omniverse Machinima Releasing in Open Beta

Technical artists, developers and content creators can now take 3D storytelling to the next level: NVIDIA Omniverse Machinima will be made available in open beta at the end of GTC.

Omniverse Machinima offers a suite of tools and extensions that enable users to render realistic graphics and animation using scenes and characters from games.

The app includes premade assets from NVIDIA and games such as Squad from Offworld Industries, and Mount & Blade Warband by TaleWorlds Entertainment, with more to come. 

Through Omniverse Machinima, users can:

  • Render scenes with materials, surfaces and textures from the NVIDIA MDL library or imported from third-party asset libraries.
  • Animate characters’ faces using a simple voice recording through NVIDIA Audio2Face technology.
  • Create realistic visuals with physically accurate materials through NVIDIA PhysX 5, Blast and Flow extensions.
  • Capture human motion through a video feed using wrnch’s AI pose estimation technology.
  • Leverage the built-in Omniverse RTX Renderer to produce an output with the highest fidelity.

NVIDIA and wrnch Inc., the leading provider of computer vision software, are collaborating to deliver AI-powered human pose estimation capabilities in Omniverse Machinima. The extension created by wrnch Inc. includes:

  • wrnch CaptureStream, a free downloadable tool that enables creators to use a mobile device’s camera to capture the human motion that they’d like to reproduce in an application.
  • wrnch AI Pose Estimator, an Omniverse extension that enables creators to detect and connect to the wrnch CaptureStream application running on the local network.

Omniverse users can leverage the wrnch Engine, which extracts human motion from video feeds and uses pose estimation algorithms to track skeletal joints and mimic the movements on the 3D character. 

Learn more about NVIDIA Omniverse Machinima and download the open beta today.

NVIDIA Omniverse Audio2Face Available Later This Week in Open Beta

NVIDIA Omniverse Audio2Face will be available later this week in open beta. With the Audio2Face app, Omniverse users can generate AI-driven facial animation from audio sources.

The demand for digital humans is increasing across industries, from game development and visual effects to conversational AI and healthcare. But the animation process is tedious, manual and complex, plus existing tools and technologies can be difficult to use or implement into existing workflows. 

With Omniverse Audio2Face, anyone can now create realistic facial expressions and motions to match any voice-over track. The technology feeds the audio input into a pre-trained deep neural network from NVIDIA, and the output of the network drives the facial animation of 3D characters in real time.

Video 1. NVIDIA Omniverse Audio2Face – Multi-Instance Character Animation.

The open beta release includes:

  • Audio player and recorder: record and play back vocal audio tracks, then input the file to the neural network for immediate animation results.
  • Live mode: use a microphone to drive Audio2Face in real time.
  • Character transfer: retarget generated motions to any 3D character’s face, whether realistic or stylized.
  • Multiple instances: run multiple instances of Audio2Face with multiple characters in the same scene.

Learn more about NVIDIA Omniverse Audio2Face and join the open beta today.

ICYMI: New AI Tools and Technologies Announced at GTC 2021 Keynote

At GTC 2021, NVIDIA announced new software tools to help developers build optimized conversational AI, recommender, and video solutions. Watch the keynote from CEO Jensen Huang for insights on all of the latest GPU technologies.

Announcing Availability of NVIDIA Jarvis

Today NVIDIA announced major conversational AI capabilities in NVIDIA Jarvis that will help enterprises build engaging and accurate applications for their customers. These include highly accurate automatic speech recognition, real-time translation for multiple languages and text-to-speech capabilities to create expressive conversational AI agents.

Highlights include:

  • Out-of-the-box speech recognition model trained on multiple large corpora with greater than 90% accuracy
  • Transfer Learning Toolkit in TAO to fine-tune models on any domain
  • Real-time translation for 5 languages that runs in under 100 ms latency per sentence
  • Expressive Text-To-Speech that delivers 30x higher throughput compared with Tacotron2

The new capabilities are planned for release in Q2 2021 as part of the NVIDIA Jarvis open beta program.

Resources:

 > NVIDIA Jarvis Developer Blogs – includes introduction to Jarvis and tutorials for building conversational AI apps.

Add this GTC session to your calendar to learn more:

 > Building and Deploying a Custom Conversational AI App with NVIDIA Transfer Learning Toolkit and Jarvis


Announcing NVIDIA TAO Framework – Early Access

Today NVIDIA announced NVIDIA Train, Adapt, and Optimize (TAO), a GUI-based, workflow-driven framework that simplifies and accelerates the creation of enterprise AI applications and services. By fine-tuning pretrained models, enterprises can produce domain specific models in hours rather than months, eliminating the need for large training runs and deep AI expertise.  

NVIDIA TAO simplifies the time-consuming parts of a deep learning workflow, from data preparation to training to optimization, shortening the time to value. 

Highlights include:

  • Access a diverse set of pre-trained models including speech, vision, natural language understanding and more
  • Speed up your AI development by over 10X with NVIDIA pre-trained models and TLT
  • Increase model performance with federated learning while preserving data privacy
  • Optimize models for high-throughput, low-latency inference with NVIDIA TensorRT
  • Deploy an optimal configuration for any model architecture on a CPU or GPU with NVIDIA Triton Inference Server
  • Seamlessly deploy and orchestrate AI applications with NVIDIA Fleet Command

Apply for early access to NVIDIA TAO here


Announcing NVIDIA Maxine – Available for Download Now

Today NVIDIA announced availability for NVIDIA Maxine SDKs, which are used by developers to build innovative virtual collaboration and content creation applications such as video conferencing and live streaming. Maxine’s state-of-the-art AI technologies are highly optimized and deliver the highest performance possible on GPUs, both on PCs and in data centers.

Highlights from this release include:

  • Video Effects SDK: super resolution, video noise removal, virtual background
  • Augmented Reality SDK: 3D effects such as face tracking and body pose estimation
  • Audio Effects SDK: high quality noise removal and room echo removal

In addition, we announced AI Face Codec, a novel AI-based method from NVIDIA Research to compress videos and render human faces for video conferencing. It can deliver up to 10x reduction in bandwidth compared with H.264.

Developers building Maxine-based apps can use Jarvis for real-time transcription, translation and virtual assistant capabilities.

Get started with Maxine here.

Resources:

  > Reinvent Video Conferencing, Content Creation & Streaming with AI Using NVIDIA Maxine

Add these GTC sessions to your calendar to learn more:

 > NVIDIA Maxine: An Accelerated Platform SDK for Developers of Video Conferencing Services

 > How to Process Live Video Streams on Cloud GPUs Using NVIDIA Maxine SDK

 > Real-time AI for Video-Conferencing with Maxine


Announcing NVIDIA Triton Inference Server 2.9

Today NVIDIA announced the latest version of the Triton Inference Server. Triton is open-source inference serving software that maximizes performance and simplifies production deployment at scale.

Highlights from this release include:

  • Model Navigator, a new tool in Triton (alpha), automatically converts TensorFlow and PyTorch models to TensorRT plan, validates accuracy, and sets up a deployment environment.
  • Model Analyzer now automatically determines optimal batch size and number of concurrent model instances to maximize performance, based on latency or throughput targets.
  • Support for OpenVINO backend (beta) for high performance inferencing on CPU, Windows Triton build (alpha), and integration with MLOps platforms: Seldon and Allegro.

Download Triton from NGC here.  Access code and documentation at GitHub.

Add this GTC session to your calendar to learn more:

 > Easily Deploy AI Deep Learning Models at Scale with Triton Inference Server


Announcing TensorRT 8.0

Today NVIDIA announced TensorRT 8.0, the latest version of its high-performance deep learning inference SDK. TensorRT includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. With the new features and optimizations, inference applications can now run up to 2x faster with INT8 precision, with accuracy similar to FP32.

Highlights from this release include:

  • Quantization Aware Training to experience FP32 accuracy with INT8 precision 
  • Support for Sparsity delivers up to 50% higher throughput on Ampere GPUs
  • Up to 2x faster inference for transformer-based networks like BERT with new compiler optimizations

TensorRT 8 will be available in Q2 2021 from the TensorRT page. The latest versions of samples, parsers and notebooks are always available in the TensorRT open source repo.

Add these GTC sessions to your calendar to learn more:

 > Accelerate Deep Learning Inference with TensorRT 8.0

 > Quantization Aware Training in PyTorch with TensorRT 8.0


Announcing NVIDIA Merlin End-to-End Accelerated Recommender System

Today NVIDIA announced the latest release of NVIDIA Merlin, an open beta application framework that enables the end-to-end development of deep learning recommender systems, from data preprocessing to model training and inference, all accelerated on NVIDIA GPUs. With this release, Merlin delivers a new API and inference support that streamlines the recommender workflow. 

Highlights from this release include:

  • New Merlin API makes it easier to define workflows and training pipelines
  • Deepened support for inference and integration with Triton Inference Server
  • Scales transparently to larger datasets and more complex models 

Add these GTC sessions to your calendar to learn more: 

 > End-2-end Deployment of GPU Accelerated Recommender Systems: From ETL to Training to Inference (Training Session)

 > Accelerated ETL, Training and Inference of Recommender Systems on the GPU with Merlin, HugeCTR, NVTabular, and Triton


Announcing Data Labeling & Annotation Partner Services For Transfer Learning Toolkit

Today NVIDIA announced that it is working with six leading NVIDIA partners to provide solutions for data labeling, making it easy to adapt pre-trained models to specific domain data and train quickly and efficiently. These companies are AI Reverie, Appen, Hasty,ai, Labelbox, Sama, and Sky Engine.

Training reliable AI and machine learning models requires vast amounts of accurately labeled data, and acquiring labeled and annotated data at scale is a challenge for many enterprises. With these integrations, developers can use the partner services and platforms with NVIDIA Transfer Learning Toolkit (TLT) to perform annotation, use partners’ synthetic data with TLT, or use external annotation tools and then import the data into TLT for training and model optimization.

To learn more about the integration, read the developer blog: 

 > Integrating with Data Generation and Labelling Tools for Accurate AI Training

Download Transfer Learning Toolkit and get started here

Add these GTC sessions to your calendar to learn more:

 > Train Smarter not Harder with NVIDIA Pre-trained models and Transfer Learning Toolkit 3.0

 > Connect with the Experts: Transfer Learning Toolkit and DeepStream SDK for Vision AI/Intelligent Video Analytics


Announcing DeepStream 6.0 

NVIDIA DeepStream SDK is the AI streaming analytics toolkit for building high-performance, low-latency, complex video analytics apps and services. Today NVIDIA announced DeepStream 6.0. This latest version brings a new graphical user interface to help developers build reliable AI applications faster and fast-track the entire workflow from prototyping to deployment across the edge and cloud. With the new GUI and a suite of productivity tools, you can build AI apps in days rather than weeks.

Sign up to be notified about the early access program here.

Add these GTC sessions to your calendar to learn more: 

 > Bringing Scale and Optimization to Video Analytics Pipelines with NVIDIA Deepstream SDK

 > Connect with the Experts: Transfer Learning Toolkit and DeepStream SDK for Vision AI/Intelligent Video Analytics

 > Full list of intelligent video analytics talks at GTC

Register for GTC this week for more on the latest GPU-accelerated AI technologies.


Announcing Megatron for Training Trillion Parameter Models & NVIDIA Jarvis Availability

Conversational AI is opening new ways for enterprises to interact with customers in every industry using applications like real-time transcription, translation, chatbots, and virtual assistants. Building domain-specific interactive applications requires state-of-the-art models, optimizations for real-time performance, and tools to adapt those models with your data. This week at GTC, NVIDIA announced several major breakthroughs in conversational AI that will bring in a new wave of conversational AI applications.

MEGATRON

NVIDIA Megatron is a PyTorch-based framework for training giant language models based on the transformer architecture. Larger language models are helping produce superhuman-like responses and are being used in applications such as email phrase completion, document summarization and live sports commentary. The Megatron framework has also been harnessed by the University of Florida to develop GatorTron, the world’s largest clinical language model.

Highlights include:

  • Linearly scale training up to 1 trillion parameters on DGX SuperPOD with advanced optimizations and parallelization algorithms
  • Built on cuBLAS, NCCL, NVLink, and InfiniBand to train a language model on multi-GPU, multi-node systems
  • More than 100x improvement in throughput when moving from a 1-billion-parameter model on 32 A100 GPUs to a 1-trillion-parameter model on 3072 A100 GPUs
  • Achieve sustained 50% utilization of Tensor Cores
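
The throughput figure in the highlights above can be related to the GPU counts with a short calculation. The sketch below assumes that "throughput" means the aggregate throughput of the whole training job, and it treats "more than 100x" as a lower bound of exactly 100; the helper name `scaling_efficiency` is hypothetical. Because the two runs train models of different sizes, this is a rough comparison, not a strict strong- or weak-scaling measurement.

```python
def scaling_efficiency(base_gpus, scaled_gpus, throughput_ratio):
    """Return (GPU count ratio, throughput gain per unit of added hardware)."""
    gpu_ratio = scaled_gpus / base_gpus
    efficiency = throughput_ratio / gpu_ratio   # 1.0 would be exactly linear
    return gpu_ratio, efficiency


# Figures quoted in the highlights: 32 -> 3072 A100 GPUs, >100x throughput.
gpu_ratio, efficiency = scaling_efficiency(32, 3072, 100.0)
print(f"GPU count grew {gpu_ratio:.0f}x; scaling efficiency is at least {efficiency:.2f}")
```

Under these assumptions, the GPU count grows 96x while throughput grows by more than 100x, so per-GPU throughput does not drop as the job scales out, which is consistent with the linear scaling claim.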

Read the technical blog post for more details.
Megatron is available on GitHub.

JARVIS

NVIDIA also announced new achievements for Jarvis, a fully accelerated conversational AI framework, including highly accurate automatic speech recognition, real-time translation for multiple languages and text-to-speech capabilities to create expressive conversational AI agents.

Highlights include:

  • Out-of-the-box speech recognition model trained on multiple large corpora, with greater than 90% accuracy
  • Transfer Learning Toolkit in TAO to fine-tune models on any domain
  • Real-time translation for 5 languages that runs at under 100 ms latency per sentence
  • Expressive text-to-speech that delivers 30x higher throughput compared with Tacotron2

These new capabilities will be available in Q2 2021 as part of the ongoing beta program.

Jarvis beta currently includes state-of-the-art models pre-trained for thousands of hours on NVIDIA DGX; Transfer Learning Toolkit for adapting those models to your domain with zero coding; and optimized end-to-end speech, vision, and language pipelines that run in real time.

To get started with Jarvis, read this introductory blog on building and deploying custom conversational AI models using Jarvis and NVIDIA Transfer Learning Toolkit. Read the technical blog post >

Next, try these sample applications for ideas on what you can build with Jarvis out-of-the-box:

  1. Jarvis Rasa assistant: End-to-end voice-enabled AI assistant demonstrating integration of Jarvis Speech and Rasa
  2. Jarvis Contact App: Peer-to-peer video chat with streaming transcription and named entity recognition
  3. Question Answering: Build a QA system with a few lines of Python code using the ready-to-use Jarvis NLP service

Join us at NVIDIA GTC for free on April 13th for our session “Building and Deploying a Custom Conversational AI App with NVIDIA Transfer Learning Toolkit and Jarvis” to learn more.


NVIDIA Announces CPU for Giant AI and High Performance Computing Workloads

‘Grace’ CPU delivers 10x performance leap for systems training giant AI models, using energy-efficient Arm cores

Swiss Supercomputing Center and US Department of Energy’s Los Alamos National …


NVIDIA and Partners Collaborate on Arm Computing for Cloud, HPC, Edge, PC

NVIDIA GPU + AWS Graviton2-Based Amazon EC2 Instances, HPC Developer Kit with Ampere Computing CPU and Dual GPUs, More Initiatives Help Expand Opportunities for Arm-Based Solutions

SANTA CLARA, …


Swiss National Supercomputing Centre, Hewlett Packard Enterprise and NVIDIA Announce World’s Most Powerful AI-Capable Supercomputer

‘Alps’ system to advance research across climate, physics, life sciences with 7x more powerful AI capabilities than current world-leading system for AI on MLPerf

LUGANO, Switzerland, April 12, …