Deploying GPT-J and T5 with FasterTransformer and Triton Inference Server

This is the second part of a two-part series about NVIDIA tools that allow you to run large transformer models for accelerated inference. For an introduction to the NVIDIA FasterTransformer library (Part 1), see Accelerated Inference for Large Transformer Models Using FasterTransformer and Triton Inference Server.

Introduction

This post is a guide to optimized inference of large transformer models such as EleutherAI’s GPT-J 6B and Google’s T5-3B. Both of these models demonstrate good results in many downstream tasks and are among the most widely available models for researchers and data scientists.

NVIDIA FasterTransformer (FT) in NVIDIA Triton allows you to run both of these models in a similar and simple manner while providing enough flexibility to integrate/combine with other inference or training pipelines. The same NVIDIA software stack can be used for inference of trillion-parameter models by combining tensor parallelism (TP) and pipeline parallelism (PP) techniques across multiple nodes.

Transformer models are increasingly used in numerous domains and demonstrate outstanding accuracy. More importantly, the size of the model directly affects its quality. Beyond NLP, this applies to other domains as well.

Researchers from Google demonstrated that the scaling of the transformer-based text encoder was crucial for the whole image generation pipeline in their Imagen model, the latest and one of the most promising generative text-to-image models. Scaling the transformers leads to outstanding results in both single and multi-domain pipelines. This guide uses transformer-based models of the same structure and a similar size.

Overview and walkthrough of main steps

This section presents the main steps for running T5 and GPT-J in optimized inference using FasterTransformer and Triton Inference Server. Figure 1 demonstrates the overall process for one neural network.

You can reproduce all steps using the step-by-step fastertransformer_backend notebook on GitHub.

It is highly recommended to do all the steps in a Docker container to reproduce the results. Instructions about preparing a FasterTransformer Docker container are available at the beginning of the same notebook.

If you have pretrained one of these models, you will have to convert the weights from your framework saved-model files into the binary format recognizable by the FT. Scripts for conversion are provided in the FasterTransformer repository.

Figure 1. An overall pipeline of the transformer neural network with FasterTransformer and Triton

Steps 1 and 2: Build Docker container with Triton inference server and FasterTransformer backend. Use the Triton inference server as the main serving tool proxying requests to the FasterTransformer backend. 

Steps 3 and 4: Build the FasterTransformer library. This library contains many useful tools for inference preparation as well as bindings for multiple languages and examples of how to do inference in C++ and Python.

Steps 5 and 6: Download weights of the pretrained models (T5-3B and GPT-J) and prepare them for the inference with FT by converting into binary format and splitting them into multiple partitions for parallelism and accelerated inference. Code from the FasterTransformer library will be used in this step.

Step 7: Use code from the FasterTransformer library to find optimal low-level kernels for the NN.

Step 8: Start the Triton server that uses all artifacts from previous steps and run the Python client code to send requests to the server with accelerated models. 

Step 1: Clone fastertransformer_backend from the Triton GitHub repository

Clone the fastertransformer_backend repo from GitHub:

git clone https://github.com/triton-inference-server/fastertransformer_backend.git
cd fastertransformer_backend && git checkout -b t5_gptj_blog remotes/origin/dev/t5_gptj_blog

Step 2: Build Docker container with Triton and FasterTransformer libraries

Build the Docker image using this file:

docker build --rm --build-arg TRITON_VERSION=22.03 -t triton_with_ft:22.03 \
             -f docker/Dockerfile .
cd ../

Run the Docker container and start an interactive bash session with this code:

docker run -it --rm --gpus=all --shm-size=4G -v $(pwd):/ft_workspace \
           -p 8888:8888 triton_with_ft:22.03 bash

All further steps need to be run inside the Docker container interactive session. Jupyter Lab is also needed in this container to work with the notebook provided.

pip install jupyterlab && jupyter lab --ip=0.0.0.0 --allow-root

At this point, the Docker container has been built with Triton and FasterTransformer and started with the fastertransformer_backend source code inside.

Steps 3 and 4: Clone FasterTransformer source codes and build the library

The FasterTransformer library is pre-built and placed into our container during the Docker build process.

Download the FasterTransformer source code from GitHub to use the additional scripts that convert the pretrained model files of GPT-J or T5 into the FT binary format used at inference time.

git clone https://github.com/NVIDIA/FasterTransformer.git

Building the library also produces the binaries used later for kernel autotuning:

mkdir -p FasterTransformer/build && cd FasterTransformer/build
git submodule init && git submodule update
# Replace xx in -DSM=xx with the compute capability of your GPU (for example, 80 for A100)
cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
make -j32

GPT-J inference

GPT-J is a decoder model that was developed by EleutherAI and trained on The Pile, an 825GB dataset curated from multiple sources. With 6 billion parameters, GPT-J is one of the largest GPT-like publicly-released models. 

FasterTransformer backend has a config for the GPT-J model under fastertransformer_backend/all_models/gptj. This config is a perfect demonstration of a Triton ensemble. Triton allows you to run a single model inference, as well as construct complex pipes/pipelines comprising many models required for an inference task. 

You can also add additional Python/C++ scripts before and/or after any neural network for pre/post processing steps that could transform your data/results into the final form.

The GPT-J inference pipeline includes three different sequential steps at the server side:

pre-processing -> FasterTransformer -> post-processing

The config file combines all three stages into a single pipeline. Figure 2 illustrates the client-server inference scheme.

Figure 2. GPT-J inference with FasterTransformer and Triton. Scheme of the ensemble with all pre- and post-processing steps happening on the server side

Steps 5-8 follow the same pattern for both GPT-J and T5 and are detailed below (GPT-J first, followed by T5).

Step 5 (GPT-J): Download and prepare weights of the GPT-J model

wget https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.zstd
tar -axf step_383500_slim.tar.zstd -C ./models/  

These weights need to be converted into the binary format recognized by the C++ FasterTransformer backend. FasterTransformer provides the tools/scripts for different pretrained neural networks. 

For the GPT-J weights, use the script FasterTransformer/examples/pytorch/gptj/utils/gptj_ckpt_convert.py to convert the checkpoint as follows:

Step 6 (GPT-J): Convert weights into FT format

python3 ./FasterTransformer/examples/pytorch/gptj/utils/gptj_ckpt_convert.py \
          --output-dir ./models/j6b_ckpt \
          --ckpt-dir ./step_383500/ \
          --n-inference-gpus 2

The --n-inference-gpus argument specifies the number of GPUs for tensor parallelism. The script creates the ./models/j6b_ckpt/2-gpu directory and automatically writes the prepared weights there. These weights are ready for inference with a tensor-parallel degree of 2. Using this parameter, you can split your weights across a larger number of GPUs to achieve even higher speed with the TP technique.
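
To build intuition for what this conversion does, the following toy NumPy snippet splits a single weight matrix column-wise into per-GPU shards, which is the basic idea behind tensor parallelism. It is only an illustration; the real layout produced by gptj_ckpt_convert.py follows FasterTransformer's own binary format.

import numpy as np

# Toy example: one hidden-to-hidden weight matrix of a transformer layer
hidden_size = 4096
weight = np.random.randn(hidden_size, hidden_size).astype(np.float32)

# Tensor parallelism splits the matrix into shards, one per GPU,
# so each GPU stores and multiplies only its own slice of the weights.
n_inference_gpus = 2
shards = np.split(weight, n_inference_gpus, axis=1)

for rank, shard in enumerate(shards):
    # In FT, each shard ends up in a rank-specific weight file
    print(f"GPU {rank}: shard shape {shard.shape}")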

Step 7 (GPT-J): Kernel-autotuning for the GPT-J inference

The next step is kernel-autotuning. Matrix multiplication is the main and the heaviest operation in transformer-based neural networks. FT uses functionality from the cuBLAS and CUTLASS libraries to execute this type of operation. It is important to note that the MatMul operation can be executed in tens of different ways using different low-level algorithms at the “hardware” level.

The FasterTransformer library has a script that allows real-time benchmarking of all low-level algorithms and selection of the best one for the parameters of the model (size of the attention layers, number of attention heads, size of the hidden layer) and for your input data. This step is optional but achieves a higher inference speed.

Run the ./FasterTransformer/build/bin/gpt_gemm binary file that was built when building the FasterTransformer library. Arguments for the script can be found in the GitHub documentation or by using the --help argument.

./FasterTransformer/build/bin/gpt_gemm 8 1 32 12 128 6144 51200 1 2

Step 8 (GPT-J): Prepare the Triton config and serve the model

With the weights ready, the next step is to prepare a Triton config file for the GPT-J model. Open the main Triton config for the GPT-J model at fastertransformer_backend/all_models/gptj/fastertransformer/config.pbtxt for editing. Only two mandatory parameters need to be changed there to start inference.

Update tensor_para_size. Weights were prepared for two GPUs, so set it equal to 2.

parameters {
  key: "tensor_para_size"
  value: {
    string_value: "2"
  }
}

Update the path to the checkpoint folder from the previous step:

parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./models/j6b_ckpt/2-gpu/"
  }
}

Now start the Triton inference server with the FasterTransformer backend and GPT-J:

CUDA_VISIBLE_DEVICES=0,1 /opt/tritonserver/bin/tritonserver  --model-repository=./triton-model-store/gptj/ &

If Triton starts successfully, you will see output lines informing you that the models are loaded by Triton and the server is listening on the designated ports for incoming requests:

# Info about the GPT-J model that was found by Triton in our directory:

+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

# Info that Triton started successfully and is waiting for HTTP/GRPC requests:

I0503 17:26:25.226719 1668 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001
I0503 17:26:25.227017 1668 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000
I0503 17:26:25.283046 1668 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
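
Before sending inference requests, you can optionally confirm from Python that the server and model are ready. This short check uses the tritonclient HTTP client; the model name fastertransformer matches the status table above, but adjust it if your model repository uses different names.

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Both calls return True once Triton has finished loading the models
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("fastertransformer"))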

Next, send inference requests to the server. On the client side, the tritonclient Python library allows communication with the server from any Python app.

This example with GPT-J sends textual data straight to the Triton server and all preprocessing and postprocessing will happen on the server side. The full client script can be found at fastertransformer_backend/tools/end_to_end_test.py or in the Jupyter notebook provided.

The main parts include:

# Import libraries
import tritonclient.http as httpclient

# Initialize client
client = httpclient.InferenceServerClient("localhost:8000",
                                           concurrency=1,
                                           verbose=False)
# ...

# Request a text prompt from the user
print("Write any input prompt for the model and press ENTER:")
# Prepare tokens for sending to the server
inputs = prepare_inputs([[input()]])
# Send the request
result = client.infer(MODEL_GPTJ_FASTERTRANSFORMER, inputs)
print(result.as_numpy("OUTPUT_0"))
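
The prepare_inputs helper is elided above. A minimal sketch of what such a helper can look like is shown below; it assumes the ensemble exposes a string tensor named INPUT_0 and a requested-output-length tensor named INPUT_1, so check the ensemble config in fastertransformer_backend for the exact tensor names and dtypes in your setup.

import numpy as np
import tritonclient.http as httpclient

def prepare_inputs(prompts, output_len=32):
    # prompts is a nested list of strings, for example [["Hello, my name is"]]
    text = np.array([[p.encode("utf-8") for p in row] for row in prompts], dtype=object)
    lengths = np.full((len(prompts), 1), output_len, dtype=np.uint32)

    # Tensor names and datatypes here are assumptions, not the exact config
    inputs = [
        httpclient.InferInput("INPUT_0", list(text.shape), "BYTES"),
        httpclient.InferInput("INPUT_1", list(lengths.shape), "UINT32"),
    ]
    inputs[0].set_data_from_numpy(text)
    inputs[1].set_data_from_numpy(lengths)
    return inputs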

T5 inference

T5 (Text-to-Text Transfer Transformer) is a recent architecture created by Google. It consists of encoder and decoder parts and is an instance of a full transformer architecture. It reframes all the natural language processing (NLP) tasks into a unified text-to-text format where the input and output are always text strings.

The T5 inference pipeline prepared in this section differs from the GPT-J pipeline in that only the NN inference stage runs on the server side, rather than a full pipeline with data preprocessing and results postprocessing. All computations for the pre- and post-processing stages happen on the client side.

Triton allows you to configure your inference flexibly so it is possible to build a full pipeline on the server side too, but other configurations are also possible. 

First, do a conversion from text into tokens in Python using the Huggingface library on the client side. Next, send an inference request to the server. Finally, after getting a response from the server, convert the generated tokens into text on the client side.

Figure 3 illustrates the client-server inference scheme.

Figure 3. T5 inference with FasterTransformer and Triton. All pre- and post-processing steps happen on the client side and only the heavy inference part is computed on the server.

Preparation steps for T5 are the same as for GPT-J. Details for steps 5-8 are provided below for T5:

Step 5 (T5): Download weights of the T5-3B

First, download the weights of the 3B variant of T5. You will have to install git-lfs to download the weights successfully.

git clone https://huggingface.co/t5-3b

Step 6 (T5): Convert weights into FT format

Again, the weights need to be converted into the binary format recognized by the C++ FasterTransformer backend. For the T5 weights, use the script at FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py to convert the checkpoint.

The converter requires the following arguments. They are quite similar to GPT-J, but the -i_g parameter specifies the number of GPUs to be used for inference in the TP regime, so set it to 2:

python3 FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -i t5-3b/ \
        -o ./models/t5-3b/ \
        -i_g 2

Step 7 (T5): Kernel-autotuning for the T5-3B inference

The next step is kernel-autotuning for T5 using the t5_gemm binary file, which runs experiments to benchmark the heaviest parts of the T5 model and finds the best low-level kernels. Run the ./FasterTransformer/build/bin/t5_gemm binary file that was built when building the FasterTransformer library (Steps 3 and 4). This step is optional, but including it achieves a higher inference speed. Again, the arguments for the script can be found in the GitHub documentation or by using the --help argument.

./FasterTransformer/build/bin/t5_gemm 1 1 32 1024 32 128 16384 1024 32 128 16384 32128 1 2 1 1

Step 8 (T5): Prepare the Triton config of the T5 model

Open the copied Triton config for the T5 model at triton-model-store/t5/fastertransformer/config.pbtxt for editing. Only two mandatory parameters need to be changed there to start the inference.

Then update tensor_para_size. Weights were prepared for two GPUs, so set it to 2.

parameters {
  key: "tensor_para_size"
  value: {
    string_value: "2"
  }
}

Next, update the path to the folder with weights:

parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "./models/t5-3b/2-gpu/"
  }
}

Start the Triton inference server. Update the path to the converted model prepared in the previous step:

CUDA_VISIBLE_DEVICES=0,1 /opt/tritonserver/bin/tritonserver  --model-repository=./triton-model-store/t5/  

If Triton starts successfully, you will see these lines in the output:

# Info about the T5 model that was found by Triton in our directory:

+-------------------+---------+--------+
| Model             | Version | Status |
+-------------------+---------+--------+
| fastertransformer | 1       | READY  |
+-------------------+---------+--------+

# Info that Triton started successfully and is waiting for HTTP/GRPC requests:

I0503 17:26:25.226719 1668 grpc_server.cc:4421] Started GRPCInferenceService at 0.0.0.0:8001
I0503 17:26:25.227017 1668 http_server.cc:3113] Started HTTPService at 0.0.0.0:8000
I0503 17:26:25.283046 1668 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002

Now run the client script. On the client side, transform the textual input into tokens using the Huggingface library and only then send a request to the server using Python’s tritonclient library. Implement a preprocessing function for this purpose.

Then use an instance of the tritonclient HTTP client class, which sends requests to port 8000 on the server (“localhost” if deployed locally) to deliver the tokens to the model over HTTP.

After receiving the response containing the tokens, transform the tokens back into text form using a postprocessing helper function.

# Import libraries
from transformers import (
    T5Tokenizer,
    T5TokenizerFast
)
import tritonclient.http as httpclient

# Initialize client
client = httpclient.InferenceServerClient(
    URL, concurrency=request_parallelism, verbose=verbose
)

# Initialize tokenizers from HuggingFace to do pre- and post-processing
# (convert text into tokens and back) on the client side
tokenizer = T5Tokenizer.from_pretrained(MODEL_T5_HUGGINGFACE, model_max_length=1024)
fast_tokenizer = T5TokenizerFast.from_pretrained(MODEL_T5_HUGGINGFACE, model_max_length=1024)

# Implement the function that takes text, converts it into tokens using
# the HF tokenizer, and prepares tensors for sending to Triton
def preprocess(t5_task_input):
    ...

# Implement the function that takes tokens from Triton's response and converts
# them into text
def postprocess(result):
    ...

# Run a translation task with T5
text = "Translate English to German: He swung back the fishing pole and cast the line."
inputs = preprocess(text)
result = client.infer(MODEL_T5_FASTERTRANSFORMER, inputs)
postprocess(result)
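
The preprocess and postprocess bodies are elided above. The sketch below shows one plausible way to implement them, reusing the tokenizer and fast_tokenizer objects created earlier; it assumes the FasterTransformer T5 model exposes input tensors named input_ids and sequence_length and an output tensor named output_ids, so verify the names and dtypes against your config.pbtxt before using it.

import numpy as np
import tritonclient.http as httpclient

def preprocess(t5_task_input):
    # Tokenize the text on the client side with the HuggingFace tokenizer
    encoded = tokenizer(t5_task_input, return_tensors="np")
    input_ids = encoded["input_ids"].astype(np.uint32)
    seq_len = np.array([[input_ids.shape[1]]], dtype=np.uint32)

    # Tensor names and datatypes here are assumptions, not the exact config
    inputs = [
        httpclient.InferInput("input_ids", list(input_ids.shape), "UINT32"),
        httpclient.InferInput("sequence_length", list(seq_len.shape), "UINT32"),
    ]
    inputs[0].set_data_from_numpy(input_ids)
    inputs[1].set_data_from_numpy(seq_len)
    return inputs

def postprocess(result):
    # Decode the generated token IDs back into text on the client side
    output_ids = result.as_numpy("output_ids")
    text = fast_tokenizer.decode(output_ids[0].flatten().tolist(), skip_special_tokens=True)
    print(text)
    return text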

Adding custom layers and new NN architectures

If you have a custom neural network with transformer blocks inside, or you have added custom layers to the default NNs supported by FT (T5, GPT), that NN won’t be supported by FT out-of-the-box. You can either change the FT source code to add support for the new layers, or use FT blocks and the C++, PyTorch, and TensorFlow APIs to integrate fast transformer blocks from FT into your custom inference script/pipeline.

Results

The optimizations carried out by FasterTransformer achieved up to 6x speed-up over native PyTorch GPU inference in FP16 mode and up to 33x speedup over PyTorch CPU inference for GPT-J and T5-3B.

Figure 4 shows the inference results for the GPT-J, and Figure 5 shows the inference results for T5-3B model at batch size 1 for the translation task. 

Figure 4. GPT-J 6B model inference speed-up comparison
Figure 5. T5-3B model inference speed-up comparison

The smaller the model and the bigger the batch size, the better the optimization FasterTransformer demonstrates, due to the increasing computational bandwidth. Figure 6 shows the T5-small model, tests of which can be found on the FasterTransformer GitHub. It demonstrates a ~22x throughput increase in comparison with GPU PyTorch inference. Similar results can be found on GitHub for the T5-base model.

Figure 6. T5-small model inference comparison

Conclusion

The code example demonstrated here uses FasterTransformer and Triton inference server to run inference of the GPT-J-6B and T5-3B models. It achieved up to 33x acceleration in comparison with CPU and up to 22x in comparison with native PyTorch backend on GPU.

The same approach can be used for small transformer models like T5-small and BERT as well as huge models with trillions of parameters like GPT-3. Triton with FasterTransformer uses techniques like tensor and pipeline parallelism to provide optimized and highly accelerated inference to achieve low latency and high throughput for all of them. 

Read more about Triton and FasterTransformer or access the fastertransformer_backend example used in this post.

Training and inference of large models is a non-trivial task on the edge between AI and HPC. If you are interested in huge neural networks, NVIDIA released multiple tools that can help you make the most of them in the easiest, most efficient way. 

NeMo Megatron is a new capability in the NeMo framework that allows developers to effectively train and scale language models to billions of parameters. The inferencing relies on the same tools presented in this post. Learn more with Model Parallelism: Building and Deploying Large Neural Networks, a hands-on, interactive live course on the theoretical and practical aspects of training and inference for large models.

Efficient Sequence Modeling for On-Device ML

The increasing demand for machine learning (ML) model inference on-device (for mobile devices, tablets, etc.) is driven by the rise of compute-intensive applications, the need to keep certain data on device for privacy and security reasons, and the desire to provide services when a network connection may not be available. However, on-device inference introduces a myriad of challenges, ranging from modeling to platform support requirements. These challenges relate to how different architectures are designed to optimize memory and computation, while still trying to maintain the quality of the model. From a platform perspective, the issue is identifying operations and building on top of them in a way that can generalize well across different product use cases.

In previous research, we combined a novel technique for generating embeddings (called projection-based embeddings) with efficient architectures like QRNN (pQRNN) and proved them to be competent for a number of classification problems. Augmenting these with distillation techniques provides an additional bump in end-to-end quality. Although this is an effective approach, it is not scalable to bigger and more extensive vocabularies (i.e., all possible Unicode or word tokens that can be fed to the model). Additionally, the output from the projection operation itself doesn’t contain trainable weights to take advantage of pre-training the model.

Token-free models presented in ByT5 are a good starting point for on-device modeling that can address pre-training and scalability issues without the need to increase the size of the model. This is possible because these approaches treat text inputs as a stream of bytes (each byte has a value that ranges from 0 to 255) that can reduce the vocabulary size for the embedding tables from ~30,000 to 256. Although ByT5 presents a compelling alternative for on-device modeling, going from word-level representation to byte stream representation increases the sequence lengths linearly; with an average word length of four characters and a single character having up to four bytes, the byte sequence length increases proportionally to the word length. This can lead to a significant increase in inference latency and computational costs.
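
To make the byte-stream idea concrete, the short snippet below (illustrative only, not from the SeqFlowLite library) shows how text maps to byte values in the 0-255 range and how multi-byte characters lengthen the sequence.

# Each character becomes one or more bytes with values in [0, 255],
# so the embedding vocabulary only needs 256 entries (plus a few meta tokens).
for text in ["hello", "héllo"]:
    byte_ids = list(text.encode("utf-8"))
    print(text, "->", byte_ids, "length:", len(byte_ids))
# "héllo" is 5 characters but 6 bytes, because "é" encodes to 2 bytes in UTF-8.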

We address this problem by developing and releasing three novel byte-stream sequence models for the SeqFlowLite library (ByteQRNN, ByteTransformer and ByteFunnelTransformer), all of which can be pre-trained on unsupervised data and can be fine-tuned for specific tasks. These models leverage recent innovations introduced by Charformer, including a fast character Transformer-based model that uses a gradient-based subword tokenization (GBST) approach to operate directly at the byte level, as well as a “soft” tokenization approach, which allows us to learn token boundaries and reduce sequence lengths. In this post, we focus on ByteQRNN and demonstrate that the performance of a pre-trained ByteQRNN model is comparable to BERT, despite being 300x smaller.

Sequence Model Architecture
We leverage pQRNN, ByT5 and Charformer along with platform optimizations, such as in-training quantization (which tracks minimum and maximum float values for model activations and weights for quantizing the inference model) that reduces model sizes to one-fourth, to develop an end-to-end model called ByteQRNN (shown below). First, we use a ByteSplitter operation to split the input string into a byte stream and feed it to a smaller embedding table that has a vocabulary size of 259 (256 + 3 additional meta tokens).

The output from the embedding layer is fed to the GBST layer, which is equipped with in-training quantization and combines byte-level representations with the efficiency of subword tokenization while enabling end-to-end learning of latent subwords. We “soft” tokenize the byte stream sequences by enumerating and combining each subword block length with scores (computed with a quantized dense layer) at each strided token position (i.e., at token positions that are selected at regular intervals). Next, we downsample the byte stream to a manageable sequence length and feed it to the encoder layer.

The output from the GBST layer can be downsampled to a lower sequence length for efficient encoder computation or can be used by an encoder, like Funnel Transformer, which pools the query length and reduces the self-attention computation to create the ByteFunnelTransformer model. The encoder in the end-to-end model can be replaced with any other encoder layer, such as the Transformer from the SeqFlowLite library, to create a ByteTransformer model.

A diagram of a generic end-to-end sequence model using byte stream input. The ByteQRNN model uses a QRNN encoder from the SeqFlowLite library.

In addition to the input embeddings (i.e., the output from the embedding layer described above), we go a step further to build an effective sequence-to-sequence (seq2seq) model. We do so by taking ByteQRNN and adding a Transformer-based decoder model along with a quantized beam search (or tree exploration) to go with it. The quantized beam search module reduces the inference latency when generating decoder outputs by computing the most likely beams (i.e., possible output sequences) using the logarithmic sum of previous and current probabilities and returns the resulting top beams. Here the system uses a more efficient 8-bit integer (uint8) format, compared to a typical single-precision floating-point format (float32) model.
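
As a generic illustration of the beam-search idea described above (not the quantized SeqFlowLite implementation), the toy snippet below keeps the top-k partial sequences by accumulating log-probabilities at each decoding step.

import numpy as np

def toy_beam_search(step_log_probs, beam_width=2):
    """step_log_probs: array of shape [steps, vocab_size] of log-probabilities."""
    beams = [(0.0, [])]  # each beam is (accumulated log-probability, token sequence)
    for log_probs in step_log_probs:
        candidates = []
        for score, seq in beams:
            for token, lp in enumerate(log_probs):
                # Logarithmic sum of previous and current probabilities
                candidates.append((score + lp, seq + [token]))
        # Keep only the top `beam_width` scoring beams
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))                       # 3 decoding steps, vocab of 5
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
print(toy_beam_search(log_probs))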

The decoder Transformer model uses a merged attention sublayer (MAtt) to reduce the complexity of the decoder self-attention from quadratic to linear, thereby lowering the end-to-end latency. For each decoding step, MAtt uses a fixed-size cache for decoder self-attention compared to the increasing cache size of a traditional transformer decoder. The following figure illustrates how the beam search module interacts with the decoder layer to generate output tokens on-device using an edge device (e.g., mobile phones, tablets, etc.).

A comparison of cloud server decoding and on-device (edge device) implementation. Left: Cloud server beam search employs a Transformer-based decoder model with quadratic time self-attention in float32, which has an increasing cache size for each decoding step. Right: The edge device implementation employs a quantized beam search module along with a fixed-size cache and a linear time self-attention computation.

Evaluation
After developing ByteQRNN, we evaluate its performance on the civil_comments dataset using the area under the curve (AUC) metric and compare it to a pre-trained ByteQRNN and BERT (shown below). We demonstrate that the fine-tuned ByteQRNN improves the overall quality and brings its performance closer to the BERT models, despite being 300x smaller. Since SeqFlowLite models support in-training quantization that reduces model sizes to one-fourth, the resulting models scale well to low-compute devices. We chose multilingual data sources that related to the task for pre-training both BERT and byte stream models to achieve the best possible performance.

Comparison of ByteQRNN with fine-tuned ByteQRNN and BERT on the civil_comments dataset.

Conclusion
Following up on our previous work with pQRNN, we evaluate byte stream models for on-device use to enable pre-training and thereby improve model performance for on-device deployment. We present an evaluation for ByteQRNN with and without pre-training and demonstrate that the performance of the pre-trained ByteQRNN is comparable to BERT, despite being 300x smaller. In addition to ByteQRNN, we are also releasing ByteTransformer and ByteFunnelTransformer, two models which use different encoders, along with the merged attention decoder model and the beam search driver to run the inference through the SeqFlowLite library. We hope these models will provide researchers and product developers with valuable resources for future on-device deployments.

Acknowledgements
We would like to thank Khoa Trinh, Jeongwoo Ko, Peter Young and Yicheng Fan for helping with open-sourcing and evaluating the model. Thanks to Prabhu Kaliamoorthi for all the brainstorming and ideation. Thanks to Vinh Tran, Jai Gupta and Yi Tay for their help with pre-training byte stream models. Thanks to Ruoxin Sang, Haoyu Zhang, Ce Zheng, Chuanhao Zhuge and Jieying Luo for helping with the TPU training. Many thanks to Erik Vee, Ravi Kumar and the Learn2Compress leadership for sponsoring the project and their support and encouragement. Finally, we would like to thank Tom Small for the animated figure used in this post.

Research Neural Fields Your Way with NVIDIA Kaolin Wisp

Research on neural fields has been an increasingly hot topic in computer graphics and computer vision in recent years. Neural fields can represent 3D data like shape, appearance, motion, and other physical quantities by using a neural network that takes coordinates as input and outputs the corresponding data at that location. 
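
As a minimal, framework-agnostic illustration of this idea (not Kaolin Wisp code), the PyTorch sketch below defines a small coordinate network that maps 3D positions to a scalar value such as a signed distance.

import torch
import torch.nn as nn

# A neural field: a network that maps coordinates to the quantity at that location
class CoordinateField(nn.Module):
    def __init__(self, in_dim=3, hidden=128, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, coords):
        # coords: [N, 3] points in space -> [N, 1] predicted values (e.g., SDF)
        return self.net(coords)

field = CoordinateField()
points = torch.rand(1024, 3) * 2 - 1   # random query points in [-1, 1]^3
values = field(points)                 # field values predicted at those points
print(values.shape)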

These representations have been proven to be useful in various applications like generative modeling and 3D reconstruction. NVIDIA projects such as NGLOD, GANcraft, NeRF-Tex, EG3D, Instant-NGP, and Variable Bitrate Neural Fields, are advancing state-of-the-art technology in neural fields, computer graphics, and computer vision in various ways.

Research challenges 

Research on neural fields is moving fast, which means that standards and software often lag behind. Implementation differences can cause large variations in quality metrics and performance. The ramp-up cost for new projects can be considerable, with the components of neural fields increasing in complexity. Work is often duplicated among research groups; for example, each builds its own interactive application to visualize neural field outputs.

One important milestone is NVIDIA Instant-NGP, which has recently attracted much attention from the research community due to its ability to fit various signals like neural radiance fields (NeRFs), signed distance fields (SDFs), and images at near-instant speeds. It unlocks a new frontier of practical applications and research directions due to its computational efficiency. However, this computational efficiency can also be a barrier for research due to the highly specialized and optimized code that can be difficult to adapt and extend.

NVIDIA Kaolin Wisp

NVIDIA Kaolin Wisp was developed as a fast-paced research-oriented library for neural fields to support researchers navigating the challenges of a growing discipline. It is built on top of the core Kaolin Library functionality, which includes more general and stable components for 3D deep learning research. 

The goal of Wisp is to provide a common core library and framework for research on neural fields. The library consists of modular building blocks that can be used to create complex neural fields and an interactive app to train and visualize the neural fields.

Figure 1. A screenshot of the NVIDIA Kaolin Wisp interactive renderer, showing an optimization of a neural field in progress. The cameras and the state of the occupancy structure are visualized on top. The properties inspector on the right allows users to derive more information about the scene and manipulate it.

Rather than providing a specific implementation, Wisp supplies the building blocks for neural fields. The framework is easily extensible for research purposes and consists of a modular pipeline where each pipeline component can be easily interchanged to provide a plug-and-play configuration for standard training. 

Wisp does not aim to provide production-ready code, but to ship novel modules fast, staying at the leading edge of this technology. It also provides a rich set of examples that showcase the Kaolin Core framework and how Kaolin Core can be used to accelerate research.

Figure 2. NVIDIA Kaolin Wisp provides a variety of configurations and building blocks for researchers

NVIDIA Kaolin Wisp feature highlights

Kaolin Wisp uses a Python-based API, which builds on PyTorch, enabling users to develop a project quickly. Compatible with many other public PyTorch-based projects, Kaolin Wisp is easily customizable with PyTorch / CUDA-based building blocks. 

While Wisp is designed for developer speed over compute performance, the building blocks that are provided in the library are optimized to train neural fields within minutes and visualize them interactively. 

Kaolin Wisp is packed with building blocks to compose neural field pipelines with a mix-and-match approach. Notable examples are feature grids, which include: 

  • Hierarchical Octrees: From NGLOD for learning features on spatial subdivision trees. The Octree also supports ray tracing operations, which allows for training a multiview image-based NGLOD-NeRF variant in addition to SDFs.
  • Triplanar Features: Used in EG3D and Convolutional Occupancy Networks papers to learn volumetric features on triplanar texture maps. The triplanes also support multiple levels of detail (LODs) in a multi-resolution pyramid structure.
  • Codebooks: From Variable Bitrate Neural Fields, to learn compressed feature codebooks with differentiable learnable keys.
  • Hash Grids: From the Instant-NGP paper for learning compact cache-friendly feature codebooks with performant memory access.
Figure 3. NVIDIA Kaolin Wisp architecture and building blocks

NVIDIA Kaolin Wisp is paired with an interactive renderer that supports flexible rendering of neural primitives pipelines, like variants of NeRF and neural SDFs. It allows the integration of new representations. 

OpenGL style rasterized primitives can be mixed and matched with neural representations to add visualizations of more data layers, such as camera and occupancy structures. It also allows for easy-to-build customizable apps by supporting custom widgets on the GUI that can interact with the training and rendering. 

Other useful features include property viewers, optimization controls, custom output render buffers, and a camera object that allows for easy manipulation of scene cameras.

To learn more about Kaolin Wisp and other libraries, visit NVIDIA Research. You can access the kaolin-wisp project on GitHub. 

Join NVIDIA 3D deep learning researchers and Kaolin library developers at SIGGRAPH 2022 for a session on Illuminating the Future of Graphics. Ask questions, watch demos, and learn how Kaolin Wisp can accelerate your neural network research.

NVIDIA Jetson AGX Orin 32GB Production Modules Now Available; Partner Ecosystem Appliances and Servers Arrive

Bringing new AI and robotics applications and products to market, or supporting existing ones, can be challenging for developers and enterprises. The NVIDIA Jetson AGX Orin 32GB production module — available now — is here to help. Nearly three dozen technology providers in the NVIDIA Partner Network worldwide are offering commercially available products powered by…

Music to the Gears: NVIDIA’s Clément Farabet on Orchestrating AI Training for Autonomous Vehicles

Autonomous vehicles are one of the most complex AI challenges of our time. For AVs to operate safely in the real world, the networks running within them must come together as an intricate symphony, which requires intensive training, testing and validation on massive amounts of data. Clément Farabet, vice president of AI infrastructure at NVIDIA,…

How MONAI Fuels Open Research for Medical AI Workflows

It’s never been more important to put powerful AI tools in the hands of the world’s leading medical researchers. That’s why NVIDIA has invested in building a collaborative open-source foundation with MONAI, the Medical Open Network for AI. MONAI is fueling open innovation for medical imaging by providing tools that accelerate image annotation, train state-of-the-art deep learning models, and create AI applications that help drive research breakthroughs.

Developing domain-specific AI can be challenging, as a lack of best practices and open blueprints creates various impediments from research and development to clinical evaluation and deployment. Researchers needed a common foundation to accelerate the pace of medical AI research innovation.

The core principle behind creating Project MONAI is to unite doctors with data scientists to unlock the power of medical data. MONAI is a collaborative open-source initiative built by academic and industry leaders to establish and standardize the best practices for deep learning in healthcare imaging. Created by the imaging research community, for the imaging research community, MONAI is accelerating innovation in deep learning models and deployable applications for medical AI workflows.

Helping guide MONAI’s vision and mission are an Advisory Board and nine working groups, led by thought leaders throughout the medical research community. These focused working groups allow leaders in those fields to concentrate their efforts and bring effective contributions to the community. The working groups are open for anyone to attend.

MONAI is an open-source PyTorch-based framework for building, training, deploying, and optimizing AI workflows in healthcare. It focuses on providing high-quality, user-friendly software that facilitates reproducibility and easy integration. With these tenets, researchers can share their results and build upon each other’s work, fostering collaboration among academic and industry researchers.

The suite of libraries, tools, and SDKs within MONAI provide a robust and common foundation that covers the end-to-end medical AI life cycle, from annotation through deployment.

Medical imaging annotation and segmentation

MONAI Label is an intelligent image labeling and learning tool that uses AI assistance to reduce the time and effort of annotating new datasets. Utilizing user interactions, MONAI Label trains an AI model for a specific task, continuously learns, and updates the model as it receives additional annotated images. 

MONAI Label provides multiple sample applications that include state-of-the-art interactive segmentation approaches like DeepGrow and DeepEdit. These sample applications are ready to use out of the box to quickly get started on annotating with minimal effort. Developers can also build their own MONAI Label applications with creative algorithms.

Client integrations help clinicians, radiologists, and pathologists interact with MONAI Label applications in their typical workflow. These clinical interactions are not dormant, as experts can correct annotations and immediately trigger training loops to adapt the model to input on the fly.

MONAI Label has integrations for 3D Slicer and OHIF for radiology, and QuPath and Digital Slide Archive for pathology. Developers can also integrate MONAI Label into their custom viewers by using the server and client APIs, which are well abstracted and documented for seamless integration.

Figure 1. MONAI Label architecture

Domain-specific algorithms and research pipelines

MONAI Core is the flagship library of Project MONAI and provides domain-specific capabilities for training AI models for healthcare imaging. These capabilities include medical-specific image transforms, state-of-the-art transformer-based 3D Segmentation algorithms like UNETR, and an AutoML framework named DiNTS.

With these foundational components, users can integrate MONAI’s domain-specialized components into their existing PyTorch programs. Users can also interface with MONAI at the workflow level for ease of robust training and research experiments. A rich set of functional examples demonstrates the capabilities and integration with other open-source packages like PyTorch Lightning, PyTorch Ignite, and NVIDIA FLARE. Finally, state-of-the-art reproducible research pipelines are included for self-supervised learning, AutoML, vision transformers for 3D, and 3D segmentation.
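
As a small sketch of what plugging MONAI Core components into a plain PyTorch training step can look like (the network, loss, and hyperparameters below are illustrative choices, not a prescribed recipe):

import torch
from monai.losses import DiceLoss
from monai.networks.nets import UNet

# A 3D segmentation network and a segmentation-specific loss from MONAI Core
model = UNet(
    spatial_dims=3, in_channels=1, out_channels=2,
    channels=(16, 32, 64, 128), strides=(2, 2, 2),
)
loss_fn = DiceLoss(to_onehot_y=True, softmax=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Synthetic stand-ins for a preprocessed volume and its label map
image = torch.rand(1, 1, 64, 64, 64)
label = torch.randint(0, 2, (1, 1, 64, 64, 64))

# A standard PyTorch training step built from MONAI components
optimizer.zero_grad()
loss = loss_fn(model(image), label)
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")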

Figure 2. State-of-the-art research pipelines available on MONAI Core

Deploying medical AI to clinical production

87% of data science projects never make it into production. Several steps are involved in crossing the chasm between a model and a deployable app. These include selecting the correct DICOM datasets, preprocessing input images, performing inference, exporting the results, visualizing the results, and further applying optimizations.

MONAI Deploy aims to become the de-facto standard for developing, packaging, testing, deploying, and running medical AI applications in clinical production. MONAI Deploy creates a set of intermediate steps where researchers and physicians can build confidence in the techniques and approaches used with AI. This creates an iterative workflow that continues until the AI inference infrastructure is ready for clinical environments.

The MONAI Deploy App SDK enables developers to take an AI model and turn it into an AI application. Available on GitHub, MONAI Deploy is also building open reference implementations of an inference orchestration engine, informatics gateway, and a workflow manager to help drive clinical integration.

Figure 3. MONAI Deploy’s modular and open reference deployment framework

Advancing medical AI

The world’s leading research centers, including King’s College London, NIH National Cancer Institute, NHS Guy’s and St. Thomas’ Trust, Stanford University, Mass General Brigham, and Mayo Clinic are building and publishing using MONAI. Integration partners like AWS, Google Cloud, and Microsoft are all standing up MONAI on their platforms. To date, MONAI has surpassed 425,000 downloads and has a community of over 190 contributors who have published over 140 research papers.

The groundbreaking research using MONAI is fueled by the growth of its open community of contributors. Together, these researchers and innovators are collaborating on AI best practices in a platform that spans the full medical AI project lifecycle. From training to deployment, MONAI is bringing the healthcare community together to unlock the power of medical data and accelerate AI into clinical impact.

To learn more about MONAI and get started today, visit MONAI.io. A library of tutorials and recordings of MONAI bootcamps are also available for MONAI users on the MONAI YouTube channel.

Sensational Surrealism Astonishes This Week ‘In the NVIDIA Studio’

3D phenom FESQ joins us ‘In the NVIDIA Studio’ this week to share his sensational and surreal animation ‘Double/Sided’ as well as an inside look into his creative workflow. ‘Double/Sided’ is deeply personal to FESQ, who said the piece “translates really well to a certain period of my life when I was juggling both a programmer career and an artist career.”

Meet the Omnivore: Developer Builds Bots With NVIDIA Omniverse and Isaac Sim

While still in grad school, Antonio Serrano-Muñoz has helped author papers spanning planetary gravities, AI-powered diagnosis of rheumatoid arthritis and robots that precisely track millimetric-sized walkers, like ants.

Turbocharging Multi-Cloud Security and Application Delivery with VirtIO Offloading

By accelerating VirtIO-net in hardware, poor network performance can be avoided while maintaining a transparent software implementation, including full support for VM live migration.

The incredible increase in traffic within data centers, along with the increased adoption of virtualization, is placing strain on traditional data centers.

Customarily, virtual machines rely on software interfaces such as VirtIO to connect with the hypervisor. Although VirtIO is significantly more flexible compared to SR-IOV, it can use up to 50% more compute power in the host, thus reducing the servers’ overall efficiency.

Similarly, the adoption of software-defined data centers is on the rise. Both virtualization and software-defined workloads are extremely CPU-intensive. This creates inefficiencies that reduce overall performance system-wide. Furthermore, infrastructure security is potentially compromised as the application domain and networking domain are not separated.

F5 and NVIDIA recently presented how to solve these challenges at NVIDIA GTC. F5 discussed accelerating its BIG-IP Virtual Edition (VE) virtualized appliance portfolio by offloading VirtIO to the NVIDIA BlueField-2 data processing unit (DPU) and ConnectX-6 Dx SmartNIC. In the session, they discuss how the DPU provides optimal acceleration and offload due to its onboard networking ASIC and Arm processor cores, freeing CPU cores to focus on application workloads.

Offloading to the DPU also provides domain isolation to secure resources more tightly. Support for VirtIO also enables dynamic composability, creating a software-defined, hardware-accelerated solution that significantly decreases reliance on the CPU while maintaining the flexibility that VirtIO offers.

Virtual switching acceleration

Figure 1. Offloading VirtIO moves the virtual datapath out of software and into the hardware of the SmartNIC or DPU where it can be accelerated

Virtual switching was born as a consequence of server virtualization. Hypervisors need the ability to enable transparent traffic switching between VMs and with the outside world.

One of the most commonly used virtual switching software solutions is Open vSwitch (OVS). NVIDIA Accelerated Switching and Packet Processing (ASAP2) technology accelerates virtual switching to improve performance in software-defined networking environments.

ASAP2 supports using vDPA to offload virtual switching (the OVS data plane) from the control plane. This permits flow rules to be programmed into the eSwitch within the network adapter or DPU and allows the use of standard APIs and common libraries such as DPDK to provide significantly higher OVS performance without the associated CPU load.

ASAP2 also supports SR-IOV for hardware acceleration of the data plane. The combination of the two capabilities provides a software-defined and hardware-accelerated solution that resolves the performance issues associated with virtual SDN vSwitching solutions.

Accelerated networking

Earlier this year, NVIDIA released NVIDIA DOCA, a framework that simplifies application development for BlueField DPUs. DOCA makes it easier to program and manage the BlueField DPU. Applications developed using DOCA for BlueField will also run without changes on future versions, ensuring forward compatibility.

DOCA consists of industry-standard APIs, libraries, and drivers. One of these drivers is DOCA VirtIO-net, which provides VirtIO interface acceleration. When using BlueField, the VirtIO interface runs on the DPU hardware. This reduces the CPU’s involvement and accelerates VirtIO’s performance while enabling features such as live migration.

Figure 2. Performance advantages available with VirtIO offloading

BIG-IP VE results

During the joint GTC session, F5 demonstrated the advantages of hardware acceleration versus running without hardware acceleration. The demonstration showed BIG-IP VE performing SSL termination for NGINX. The TSUNG traffic generator is used to send 512K byte packets through multiple instances of BIG-IP VE.

With VirtIO running on the host, the max throughput reached only 5 Gbps and took 187 seconds to complete, with only 80% of all packets processed.

The same scenario using hardware acceleration resulted in 16 Gbps of throughput in only 62 seconds and 100% of the packets were processed.

Summary

Increasing network speeds, virtualization, and software-defined networking are adding strain on data center systems and creating a need for efficiency improvements.

VirtIO is a well-established I/O virtualization interface but has a software-only framework. SR-IOV technology was developed precisely to support high performance and efficient offload and acceleration of network functionality, but it requires a specific driver in each VM. By accelerating VirtIO-net in hardware, you can avoid poor network performance while maintaining transparent software implementation, including full support for VM live migration.

The demonstration with F5 Networks showed a 3.2x increase in throughput, a 66% reduction in processing time, and 100% of packets processed. This is evidence that the evolving way forward is through hardware vDPA, which combines the out-of-the-box availability of VirtIO drivers with the performance gains of DPU hardware acceleration.

This session was presented simulive at NVIDIA GTC and can be replayed. For more information about the F5-NVIDIA joint solution that demonstrates the benefits of reduced CPU utilization while achieving high performance using VirtIO, see the GTC session titled Multi-cloud Security and Application Delivery with VirtIO.

Enhancing Backpropagation via Local Loss Optimization

While model design and training data are key ingredients in a deep neural network’s (DNN’s) success, less-often discussed is the specific optimization method used for updating the model parameters (weights). Training DNNs involves minimizing a loss function that measures the discrepancy between the ground truth labels and the model’s predictions. Training is carried out by backpropagation, which adjusts the model weights via gradient descent steps. Gradient descent, in turn, updates the weights by using the gradient (i.e., derivative) of the loss with respect to the weights.

The simplest weight update corresponds to stochastic gradient descent, which, in every step, moves the weights in the negative direction with respect to the gradients (with an appropriate step size, a.k.a. the learning rate). More advanced optimization methods modify the direction of the negative gradient before updating the weights by using information from the past steps and/or the local properties (such as the curvature information) of the loss function around the current weights. For instance, a momentum optimizer encourages moving along the average direction of past updates, and the AdaGrad optimizer scales each coordinate based on the past gradients. These optimizers are commonly known as first-order methods since they generally modify the update direction using only information from the first-order derivative (i.e., gradient). More importantly, the components of the weight parameters are treated independently from each other.

More advanced optimization methods, such as Shampoo and K-FAC, capture the correlations between gradients of parameters and have been shown to improve convergence, reducing the number of iterations and improving the quality of the solution. These methods capture information about the local changes of the derivatives of the loss, i.e., changes in gradients. Using this additional information, higher-order optimizers can discover much more efficient update directions for training models by taking into account the correlations between different groups of parameters. On the downside, calculating higher-order update directions is computationally more expensive than first-order updates. The operation uses more memory for storing statistics and involves matrix inversion, thus hindering the applicability of higher-order optimizers in practice.

In “LocoProp: Enhancing BackProp via Local Loss Optimization”, we introduce a new framework for training DNN models. Our new framework, LocoProp, conceives neural networks as a modular composition of layers. Generally, each layer in a neural network applies a linear transformation on its inputs, followed by a non-linear activation function. In the new construction, each layer is allotted its own weight regularizer, output target, and loss function. The loss function of each layer is designed to match the activation function of the layer. Using this formulation, training minimizes the local losses for a given mini-batch of examples, iteratively and in parallel across layers. Our method performs multiple local updates per batch of examples using a first-order optimizer (like RMSProp), which avoids computationally expensive operations such as the matrix inversions required for higher-order optimizers. However, we show that the combined local updates look rather like a higher-order update. Empirically, we show that LocoProp outperforms first-order methods on a deep autoencoder benchmark and performs comparably to higher-order optimizers, such as Shampoo and K-FAC, without the high memory and computation requirements.

Method
Neural networks are generally viewed as composite functions that transform model inputs into output representations, layer by layer. LocoProp adopts this view while decomposing the network into layers. In particular, instead of updating the weights of the layer to minimize the loss function at the output, LocoProp applies pre-defined local loss functions specific to each layer. For a given layer, the loss function is selected to match the activation function, e.g., a tanh loss would be selected for a layer with a tanh activation. Each layerwise loss measures the discrepancy between the layer’s output (for a given mini-batch of examples) and a notion of a target output for that layer. Additionally, a regularizer term ensures that the updated weights do not drift too far from the current values. The combined layerwise loss function (with a local target) plus regularizer is used as the new objective function for each layer.

Similar to backpropagation, LocoProp applies a forward pass to compute the activations. In the backward pass, LocoProp sets per neuron “targets” for each layer. Finally, LocoProp splits model training into independent problems across layers where several local updates can be applied to each layer’s weights in parallel.

Perhaps the simplest loss function one can think of for a layer is the squared loss. While the squared loss is a valid choice of a loss function, LocoProp takes into account the possible non-linearity of the activation functions of the layers and applies layerwise losses tailored to the activation function of each layer. This enables the model to emphasize regions at the input that are more important for the model prediction while deemphasizing the regions that do not affect the output as much. Below we show examples of tailored losses for the tanh and ReLU activation functions.

Loss functions induced by the (left) tanh and (right) ReLU activation functions. Each loss is more sensitive to the regions affecting the output prediction. For instance, ReLU loss is zero as long as both the prediction (â) and the target (a) are negative. This is because the ReLU function applied to any negative number equals zero.

After forming the objective in each layer, LocoProp updates the layer weights by repeatedly applying gradient descent steps on its objective. The update typically uses a first-order optimizer (like RMSProp). However, we show that the overall behavior of the combined updates closely resembles higher-order updates (shown below). Thus, LocoProp provides training performance close to what higher-order optimizers achieve without the high memory or computation needed for higher-order methods, such as matrix inverse operations. We show that LocoProp is a flexible framework that allows the recovery of well-known algorithms and enables the construction of new algorithms via different choices of losses, targets, and regularizers. LocoProp’s layerwise view of neural networks also allows updating the weights in parallel across layers.
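
To make the layerwise procedure concrete, here is a toy PyTorch sketch of the general idea: each layer gets a local target, and its weights are refined with several first-order steps on a local loss plus a proximity regularizer. This is a simplified illustration under assumptions of our own (squared local losses, gradient-based targets), not the exact LocoProp formulation from the paper.

import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny network viewed as a composition of layers
layers = nn.ModuleList([nn.Linear(8, 16), nn.Linear(16, 4)])
x, y = torch.randn(32, 8), torch.randn(32, 4)

# Forward pass: record each layer's input and activation
acts = [x]
for layer in layers:
    acts.append(torch.tanh(layer(acts[-1])))

# Backward pass: set a local target for each layer's output. Here the targets
# are the activations nudged along the negative gradient of the final loss,
# which is one simplistic choice of target.
final_loss = ((acts[-1] - y) ** 2).mean()
grads = torch.autograd.grad(final_loss, acts[1:])
targets = [a.detach() - 0.5 * g for a, g in zip(acts[1:], grads)]

# Local optimization: several first-order steps per layer, independently,
# minimizing a local loss plus a regularizer that keeps the weights close
# to their current values.
for layer, inp, tgt in zip(layers, acts[:-1], targets):
    opt = torch.optim.RMSprop(layer.parameters(), lr=1e-3)
    theta0 = [p.detach().clone() for p in layer.parameters()]
    for _ in range(5):  # multiple local updates per mini-batch
        opt.zero_grad()
        local = ((torch.tanh(layer(inp.detach())) - tgt) ** 2).mean()
        reg = sum(((p - p0) ** 2).sum() for p, p0 in zip(layer.parameters(), theta0))
        (local + 1e-3 * reg).backward()
        opt.step()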

Experiments
In our paper, we describe experiments on the deep autoencoder model, which is a commonly used baseline for evaluating the performance of optimization algorithms. We perform extensive tuning on multiple commonly used first-order optimizers, including SGD, SGD with momentum, AdaGrad, RMSProp, and Adam, as well as the higher-order Shampoo and K-FAC optimizers, and compare the results with LocoProp. Our findings indicate that the LocoProp method performs significantly better than first-order optimizers and is comparable to those of higher-order, while being significantly faster when run on a single GPU.

Train loss vs. number of epochs (left) and wall-clock time, i.e., the real time that passes during training, (right) for RMSProp, Shampoo, K-FAC, and LocoProp on the deep autoencoder model.

Summary and Future Directions
We introduced a new framework, called LocoProp, for optimizing deep neural networks more efficiently. LocoProp decomposes neural networks into separate layers with their own regularizer, output target, and loss function and applies local updates in parallel to minimize the local objectives. While using first-order updates for the local optimization problems, the combined updates closely resemble higher-order update directions, both theoretically and empirically.

LocoProp provides flexibility to choose the layerwise regularizers, targets, and loss functions. Thus, it allows the development of new update rules based on these choices. Our code for LocoProp is available online on GitHub. We are currently working on scaling up ideas induced by LocoProp to much larger scale models; stay tuned!

Acknowledgments
We would like to thank our co-author, Manfred K. Warmuth, for his critical contributions and inspiring vision. We would like to thank Sameer Agarwal for discussions looking at this work from a composite functions perspective, Vineet Gupta for discussions and development of Shampoo, Zachary Nado on K-FAC, Tom Small for development of the animation used in this blogpost and finally, Yonghui Wu and Zoubin Ghahramani for providing us with a nurturing research environment in the Google Brain Team.