
Simplifying Model Development and Building Models at Scale with PyTorch Lightning and NGC

PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research. Organizing PyTorch code with Lightning enables seamless training on multiple GPUs and the use of best practices such as checkpointing, logging, sharding, and mixed precision. In this post, we walk you through building speech models with PyTorch Lightning on NVIDIA GPU-powered AWS instances managed by the Grid.ai platform.

AI is driving the fourth Industrial Revolution with machines that can hear, see, understand, analyze, and then make smart decisions at superhuman levels. However, the effectiveness of AI depends on the quality of the underlying models. So, whether you’re an academic researcher or a data scientist, you want to quickly build models with a variety of parameters and identify the most effective ones for your solutions.

In this post, we walk you through building speech models with PyTorch Lightning on NVIDIA GPU-powered AWS instances.

PyTorch Lightning + Grid.ai: Build models faster, at scale

PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research. Organizing PyTorch code with Lightning enables seamless training on multiple GPUs, TPUs, and CPUs, and the use of difficult-to-implement best practices such as checkpointing, logging, sharding, and mixed precision. A PyTorch Lightning container and developer environment are available on the NGC catalog.

Grid enables you to scale training from your laptop to the cloud without having to modify your code. Running on cloud providers such as AWS, Grid supports Lightning as well as classic machine learning frameworks such as scikit-learn, TensorFlow, Keras, PyTorch, and more. With Grid, you can scale the training of models from the NGC catalog.

NGC: The hub for GPU-optimized AI software

The NGC catalog is the hub for GPU-optimized software including AI/ML containers, pretrained models, and SDKs that can be easily deployed across on-premises, cloud, edge, and hybrid environments. NGC offers NVIDIA TAO Toolkit that enables retraining models with custom data and NVIDIA Triton Inference Server to run predictions on CPU and GPU-powered systems.

The rest of this post walks you through how to leverage models from the NGC catalog and the NVIDIA NeMo framework to train an automatic speech recognition (ASR) model with PyTorch Lightning, following a workflow based on the ASR with NeMo tutorial.

Figure 1. The AI model training process: raw data is trained, evaluated, iterated, and retrained to produce highly accurate, performant models

Training NGC models with Grid sessions, PyTorch Lightning, and NVIDIA NeMo

ASR is the task of transcribing spoken language to text and is a critical component of speech-to-text systems. When training ASR models, your goal is to generate text from a given audio input that minimizes the word error rate (WER) on human-transcribed speech. The NGC catalog contains state-of-the-art pretrained models for ASR.

In the remainder of this post, we show you how to use Grid sessions, NVIDIA NeMo, and PyTorch Lightning to fine-tune these models on the AN4 dataset.

The AN4 dataset, also known as the Alphanumeric dataset, was collected and published by Carnegie Mellon University. It consists of recordings of people spelling out addresses, names, telephone numbers, and so on, one letter or number at a time, as well as their corresponding transcripts.

Step 1: Create a Grid session optimized for Lightning and pretrained NGC models

Grid sessions run on the same hardware that you need to scale while providing you with preconfigured environments to iterate the research phase of the machine learning process faster than before. Sessions are linked to GitHub, loaded with JupyterHub, and can be accessed through SSH and your IDE of choice without having to do any setup yourself.

With sessions, you pay only for the compute that you need to get a baseline operational, and then you can scale your work to the cloud with Grid runs. Grid sessions are optimized for PyTorch Lightning and models hosted on the NGC catalog. They even provide specialized Spot pricing.

For an in-depth walkthrough, see the Grid Session tour (requires a Grid.ai account).

Figure 2. Workflow to create a Grid session: choose New Session, pick a machine, attach a datastore, start the session, log in, and pause it to stop paying without losing progress

Step 2: Clone the ASR demo repo and open the tutorial notebook

Now that you have a developer environment optimized for PyTorch Lightning, the next step is to clone the NGC-Lightning-Grid-Workshop repo.

You can do this directly from a terminal in your Grid Session with the following command:

git clone https://github.com/aribornstein/NGC-Lightning-Grid-Workshop.git

After you’ve cloned the repo, you can open the notebook used to fine-tune the NGC-hosted model with NeMo and PyTorch Lightning.

Step 3: Install NeMo ASR dependencies

First, install all the session dependencies, including tools such as PyTorch Lightning and NeMo and the packages needed to process the AN4 dataset. Run the first cell in the tutorial notebook, which executes the following bash commands to install the dependencies.

## Install dependencies
!pip install wget
!sudo apt-get install sox libsndfile1 ffmpeg -y
!pip install unidecode
!pip install "matplotlib>=3.3.2"
## Install NeMo
BRANCH = 'main'
!python -m pip install --user git+https://github.com/NVIDIA/NeMo.git@$BRANCH#egg=nemo_toolkit[all]
## Grab the config we'll use in this example
!mkdir configs
!wget -P configs/ https://raw.githubusercontent.com/NVIDIA/NeMo/$BRANCH/examples/asr/conf/config.yaml

Step 4: Convert and visualize the AN4 dataset

The AN4 dataset ships as raw Sph (Sphere) audio files, but the models in this post operate on mel spectrograms computed from WAV audio. Convert the Sph files to the WAV format so that you can use NeMo audio processing.

import librosa
import IPython.display as ipd
import glob
import os
import subprocess
import tarfile
import wget

# Working directory for the dataset; adjust this path as needed.
data_dir = '.'

# Download the dataset. This will take a few moments...
print("******")
if not os.path.exists(data_dir + '/an4_sphere.tar.gz'):
    an4_url = 'http://www.speech.cs.cmu.edu/databases/an4/an4_sphere.tar.gz'
    an4_path = wget.download(an4_url, data_dir)
    print(f"Dataset downloaded at: {an4_path}")
else:
    print("Tarfile already exists.")
    an4_path = data_dir + '/an4_sphere.tar.gz'

if not os.path.exists(data_dir + '/an4/'):
    # Untar and convert .sph to .wav (using sox)
    tar = tarfile.open(an4_path)
    tar.extractall(path=data_dir)

    print("Converting .sph to .wav...")
    sph_list = glob.glob(data_dir + '/an4/**/*.sph', recursive=True)
    for sph_path in sph_list:
        wav_path = sph_path[:-4] + '.wav'
        cmd = ["sox", sph_path, wav_path]
        subprocess.run(cmd)
print("Finished conversion.n******")
# Load and listen to the audio file
example_file = data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav'
audio, sample_rate = librosa.load(example_file)
ipd.Audio(example_file, rate=sample_rate)

You can then visualize the audio example as images of the audio waveform. Figure 3 shows the activity in the waveform that corresponds to each letter in the audio, as your speaker here enunciates quite clearly!

Figure 3. Audio waveform of the sample, with distinct activity for each of the five spoken letters

Each spoken letter has a different “shape.” It’s interesting to note that the last two blobs look relatively similar, which is expected because they are both the letter N.

Spectrograms

Modeling audio is easier in the context of frequencies of sound over time. You can get a better representation than this raw sequence of 57,330 values. A spectrogram is a good way of visualizing how the strengths of various frequencies in the audio vary over time. It is obtained by breaking up the signal into smaller, usually overlapping chunks, and performing a short-time Fourier transform (STFT) on each.
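
If you want to reproduce this view yourself, the following is a minimal sketch using librosa and Matplotlib, assuming the audio and sample_rate variables from the loading snippet above; the window and hop sizes are illustrative choices, not necessarily the ones used to produce Figure 4.

import numpy as np
import librosa
import matplotlib.pyplot as plt

# Short-time Fourier transform: split the signal into short, overlapping
# windows and compute the spectrum of each one.
spec = np.abs(librosa.stft(audio, n_fft=512, hop_length=128))

# Convert amplitudes to decibels so quieter frequency content stays visible.
spec_db = librosa.amplitude_to_db(spec, ref=np.max)

plt.figure(figsize=(10, 4))
plt.imshow(spec_db, origin='lower', aspect='auto')
plt.xlabel('Time frame')
plt.ylabel('Frequency bin')
plt.title('Spectrogram of the sample')
plt.colorbar(format='%+2.0f dB')
plt.show()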

Figure 4 shows what the spectrogram of the sample looks like.

Figure 4. Audio spectrogram of the sample, with each pronounced letter visible

As in the earlier waveform, you see each letter being pronounced. How do you interpret these shapes and colors? Just as in the earlier waveform plot, you see time passing on the x-axis (all 2.6s of audio). However, now the y-axis represents different frequencies (on a log scale), and the color on the plot shows the strength of a frequency at a particular point in time.

Mel spectrograms

You’re still not done, as you can make one more potentially useful tweak by visualizing the data using the mel spectrogram. Change the frequency scale from linear (or logarithmic) to the mel scale, which better represents the pitches that are perceivable to the human ear. Mel spectrograms are intuitively useful for ASR. Because you are processing and transcribing human speech, mel spectrograms reduce background noise that can affect the model.
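
As a rough sketch of that tweak, again assuming the audio and sample_rate variables from earlier (the number of mel bands here is an arbitrary illustrative choice):

import numpy as np
import librosa

# Pool the STFT bins into mel bands, which are spaced according to how the
# human ear perceives pitch, then move to a decibel scale.
mel_spec = librosa.feature.melspectrogram(y=audio, sr=sample_rate, n_mels=64)
mel_db = librosa.power_to_db(mel_spec, ref=np.max)
print(mel_db.shape)  # (n_mels, number of time frames)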

Figure 5. Mel spectrogram of the sample

Step 5: Load and run inference with a pretrained QuartzNet model from NGC

Now that you’ve loaded and properly understood the AN4 dataset, look at how to use NGC to load an ASR model to be fine-tuned with PyTorch Lightning. NeMo’s ASR collection comes with many building blocks and even complete models that you can use for training and evaluation. Moreover, several models come with pretrained weights.

To model the data for this post, you use a Jasper architecture called QuartzNet from the NGC Model Hub. The Jasper architecture consists of repeated block structures that use 1D convolutions to model spectrogram data (Figure 6).

Figure 6. The Jasper/QuartzNet architecture, built from repeated blocks of 1D convolutions that model spectrogram data

QuartzNet is an improved variant of Jasper; the key difference is that it uses time-channel separable 1D convolutions, which dramatically reduces the number of weights while maintaining similar accuracy.

The following command downloads the pretrained QuartzNet15x5 model from the NGC catalog and instantiates it for you.

import nemo.collections.asr as nemo_asr

quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En")

Step 6: Fine-tune the model with Lightning

When you have a model, you can fine-tune it with PyTorch Lightning, as follows.

import pytorch_lightning as pl
from omegaconf import DictConfig
from ruamel.yaml import YAML

# Load the model configuration downloaded in Step 3.
with open('configs/config.yaml') as f:
    params = YAML(typ='safe').load(f)

trainer = pl.Trainer(gpus=1, max_epochs=10)

# train_manifest and test_manifest are the manifest JSON paths built from the AN4 data earlier in the notebook.
params['model']['train_ds']['manifest_filepath'] = train_manifest
params['model']['validation_ds']['manifest_filepath'] = test_manifest
first_asr_model = nemo_asr.models.EncDecCTCModel(cfg=DictConfig(params['model']), trainer=trainer)

# Start training!
trainer.fit(first_asr_model)

Because you are using the Lightning Trainer, you get some key advantages by default, such as model checkpointing and logging. You can also use 50+ best-practice tactics without needing to modify the model code, including multi-GPU training, model sharding, DeepSpeed, quantization-aware training, early stopping, mixed precision, gradient clipping, and profiling.

Figure 7. Example Trainer flags for fine-tuning tactics such as early stopping, model checkpointing, and experiment managers

Step 7: Inference and deployment

Now that you have a baseline model, run inference with it.
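
Here is a minimal sketch of what that can look like with the NeMo API, assuming the fine-tuned first_asr_model from the previous step and the sample WAV file used earlier; the exact transcribe signature can vary between NeMo releases.

# Transcribe an AN4 recording with the fine-tuned model.
files = [data_dir + '/an4/wav/an4_clstk/mgah/cen2-mgah-b.wav']
first_asr_model.eval()
transcripts = first_asr_model.transcribe(paths2audio_files=files, batch_size=1)
print(transcripts)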

Figure 8. Run inference

Step 8: Pause session

Now that you have trained the model, you can pause the session and all the files that you need are persisted.

Figure 9. Monitoring the paused Grid session

Paused sessions are free of charge and can be resumed as needed.

Conclusion

Now, you should have a better understanding of PyTorch Lightning, NGC, and Grid. You’ve fine-tuned your first NGC NeMo model and optimized it with Grid runs. We are excited to see what you do next with Grid and NGC.


Ray Tracing Gems II Available Today in Hardcover

Ray Tracing Gems II is now available as a hardcover on Apress and Amazon.

Ray Tracing Gems II is now available to download for free and to purchase as a hardcover from Apress and Amazon. For those who love books as a physical medium, we recommend purchasing a copy for your home library, while also downloading the free PDF version for easy digital access on the go.

This Open Access book is a must-have for anyone interested in real-time rendering. Ray tracing is the holy grail of gaming graphics, simulating the physical behavior of light to bring real-time, cinematic-quality rendering to even the most visually intense games. Ray tracing is also a fundamental algorithm used for architecture applications, visualization, sound simulation, deep learning, and more.

We’ve collaborated with our partners to make four limited edition versions of the book, featuring custom covers that highlight real-time ray tracing in Fortnite, Control, Watch Dogs: Legion, and Quake II RTX.

To win a limited edition print copy of Ray Tracing Gems II, enter the giveaway contest here: https://developer.nvidia.com/ray-tracing-gems-ii


Detecting Abnormal Chest X-rays using Deep Learning

The adoption of machine learning (ML) for medical imaging applications presents an exciting opportunity to improve the availability, latency, accuracy, and consistency of chest X-ray (CXR) image interpretation. Indeed, a plethora of algorithms have already been developed to detect specific conditions, such as lung cancer, tuberculosis and pneumothorax. By virtue of being trained to detect a specific disease, however, the utility of these algorithms may be limited in a general clinical setting, where a wide variety of abnormalities could surface. For example, a pneumothorax detector is not expected to highlight nodules suggestive of cancer, and a tuberculosis detector may not identify findings specific to pneumonia. Since an initial triaging step is to determine whether a CXR contains any concerning abnormalities, a general-purpose algorithm that identifies X-rays containing any sort of abnormality could significantly facilitate the workflow. However, developing a classifier to detect any abnormality is challenging due to the wide variety of abnormal findings that present on CXRs.

In “Deep Learning for Distinguishing Normal versus Abnormal Chest Radiographs and Generalization to Two Unseen Diseases Tuberculosis and COVID-19”, published in Scientific Reports, we present a model that can distinguish between normal and abnormal CXRs across multiple de-identified datasets and settings. We find that the model performs well on general abnormalities, as well as unseen examples of tuberculosis and COVID-19. We are also releasing our set of radiologists’ labels1 for the test set used in this study for the publicly available ChestX-ray14 dataset.

A Deep Learning System for Detecting Abnormal Chest X-rays
The deep learning system we used is based on the EfficientNet-B7 architecture, pre-trained on ImageNet. We trained the model using over 200,000 de-identified CXRs from the Apollo Hospitals in India. Each CXR was assigned a label of either “normal” or “abnormal” using a regular expression–based natural language processing approach on the associated radiology reports.
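
The actual rule set is specific to the study's radiology reports, but a toy sketch of what regular expression–based labeling can look like is shown below; the patterns and report text here are hypothetical illustrations, not the rules used in the paper.

import re

# Toy normal/abnormal labeler. Patterns are illustrative placeholders only.
FINDING = re.compile(r'\b(opacity|effusion|consolidation|nodule|pneumothorax)\b', re.I)
NORMAL = re.compile(r'\bno (acute )?(cardiopulmonary )?abnormalit(y|ies)\b', re.I)

def label_report(report_text: str) -> str:
    # Explicit "no ... abnormality" statements override keyword matches.
    if NORMAL.search(report_text):
        return 'normal'
    return 'abnormal' if FINDING.search(report_text) else 'normal'

print(label_report('No acute cardiopulmonary abnormality.'))     # normal
print(label_report('Small right pleural effusion is present.'))  # abnormal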

To evaluate how well the system generalizes to new patient populations, we compared its performance on two datasets consisting of a wide spectrum of abnormalities: the test split from the Apollo Hospitals dataset (DS-1), and the publicly available ChestX-ray14 (CXR-14). The labels for these two test sets were annotated for the purposes of this project by a group of US board-certified radiologists. The system achieved areas under the receiver operating characteristic curve (AUROC) of 0.87 on DS-1 and 0.94 on CXR-14 (higher is better).

Though the evaluations on DS-1 and CXR-14 contained a wide range of abnormalities, a possible use-case would be to utilize such an abnormality detector in novel or unforeseen settings with diseases that it had not encountered before. To evaluate the generalizability of the system to new patient populations and in the presence of diseases not seen in the training set, we used four de-identified datasets from three countries, including two publicly available tuberculosis datasets and two COVID-19 datasets from Northwestern Medicine. The system achieved AUCs of 0.95-0.97 in detecting tuberculosis, and 0.65-0.68 in detecting COVID-19. Because CXRs that are negative for these diseases could still contain other concerning abnormalities, we further evaluated the system for its ability to detect abnormalities more broadly (instead of disease positive vs. negative), finding AUCs of 0.91-0.93 for the tuberculosis dataset, and AUCs of 0.86 for the COVID-19 dataset.

The purpose of running both evaluations (abnormality detection and disease detection) is that the two are distinct: a given disease can present with or without a certain abnormality, and a certain abnormality can arise from multiple diseases. Our study evaluates both.

AUCs for the three evaluation setups:

                               1. General abnormalities   2. Unseen disease: tuberculosis   3. Unseen disease: COVID-19
Detect abnormalities           0.87-0.94                  0.91-0.93                         0.86
Detect the respective disease  n/a                        0.95-0.97                         0.65-0.68

The large drop in performance for COVID-19 is because many cases flagged by the system as “positive” for abnormalities were negative for COVID-19, but nevertheless contained abnormal CXR findings that needed attention. This further highlights the usefulness of abnormality detectors even if disease-specific models are available.

In addition, it’s important to note that there is a difference between generalization to unseen diseases (i.e., tuberculosis and COVID-19) versus generalization to unseen CXR findings (e.g., pleural effusion, consolidation/infiltrate). In this study, we demonstrated the generalizability of the system to unseen diseases but not necessarily unseen CXR findings.

Sample chest X-rays of true and false positives, and true and false negatives for (A) general abnormalities, (B) tuberculosis, and (C) COVID-19. On each CXR, we outline in red the areas on which the model focused to identify abnormalities (i.e., the class activation map), and outline the regions of interest indicated by a radiologist in yellow.

Potential Benefits in the Clinic
To understand the potential utility of the deep learning model in improving clinical workflow, we simulated its use for case prioritization, where abnormal cases are “expedited” ahead of normal cases. In these simulations, the system reduced the turnaround time for abnormal cases by up to 28%. This reprioritization setup could be used to divert complex abnormal cases to cardiothoracic specialist radiologists, enable rapid triage of cases that may need urgent decisions, and provide the opportunity to batch negative CXRs for streamlined review.

Impact of a simulated deep learning model–based prioritization in comparison with random review order for (A) general abnormalities, (B) tuberculosis, and (C) COVID-19. The bars indicate sequences of abnormal CXRs (red) and normal CXRs (pink); a greater density of red towards the left indicates abnormal CXRs are reviewed sooner than normal ones. The histograms indicate the average improvement in turnaround time.

Additionally, we found that the system can be used as a pre-trained model to improve other ML algorithms for chest X-rays, especially when data is limited. For example, we used the normal/abnormal classifier in our recent study to detect pulmonary tuberculosis from chest X-rays. Abnormality and tuberculosis detectors can play a critical role in supporting early diagnosis in regions that lack access to resources like trained radiologists or molecular testing.

Sharing Improved Reference Standard Labels
Much work remains to be done to realize the potential of ML to aid chest X-ray interpretation around the world. In particular, obtaining high-quality labels on de-identified data can be a significant barrier to developing and evaluating ML algorithms in healthcare. To accelerate these efforts, we are expanding upon our previous label release by releasing the labels used in this study for the publicly available ChestX-ray14 dataset. We look forward to future machine learning projects by the community in this space.

Acknowledgements
Key contributors to this project at Google include Zaid Nabulsi, Andrew Sellergren, Shahar Jamshy, Charles Lau, Eddie Santos, Atilla P. Kiraly, Wenxing Ye, Jie Yang, Rory Pilgrim, Sahar Kazemzadeh, Jin Yu, Greg S. Corrado, Lily Peng, Krish Eswaran, Daniel Tse, Neeral Beladia, Yun Liu, Po-Hsuan Cameron Chen, Shravya Shetty. Significant contributions and input were also made by radiologist collaborators Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia Vicente, David Melnick. For the CXR-14 dataset, we thank the NIH Clinical Center for making it publicly available. For tuberculosis data collection, thanks go to Sameer Antani, Stefan Jaeger, Sema Candemir, Zhiyun Xue, Alex Karargyris, George R. Thomas, Pu-Xuan Lu, Yi-Xiang Wang, Michael Bonifant, Ellan Kim, Sonia Qasba, and Jonathan Musco. The authors would also like to acknowledge many members of the Google Health Radiology and labeling software teams, in particular Shruthi Prabhakara, Scott McKinney, and Akib Uddin. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study; Jonny Wong for coordinating the imaging annotation work; Gavin Bee, Mikhail Fomitchev, Shabir Adeel, Jeff Bertram, and Benedict Noero for data releasing; David F. Steiner, Kunal Nagpal, and Michael D. Howell for providing feedback on the manuscript; Craig Mermel, Lauren Winer, Johnny Luu, Adrienne Welch, Annisah Um’rani, and Ashley Zlatinov for feedback on the blogpost.


1Labels include atelectasis, cardiomegaly, effusion, infiltration, mass, nodule, pneumonia, pneumothorax, consolidation, edema, emphysema, fibrosis, pleural thickening, hernia, other abnormality, and normal vs abnormal. 


What Is Accelerated Computing?

Accelerated computing is humming in the background, making life better even on a quiet night at home. It prevents credit card fraud when you buy a movie to stream. It recommends a dinner you might like and arranges a fast delivery. Maybe it even helped the movie’s director win an Oscar for stunning visuals.



Machine Learning Frameworks Interoperability, Part 3: Zero-Copy in Action using an E2E Pipeline

Machine Learning Frameworks Interoperability part 3 covers the implementation of an end-to-end pipeline that demonstrates the discussed techniques for optimal data transfer across data science frameworks.

Introduction

Efficient pipeline design is crucial for data scientists. When composing complex end-to-end workflows, you may choose from a wide variety of building blocks, each of them specialized for a dedicated task. Unfortunately, repeatedly converting between data formats is an error-prone and performance-degrading endeavor. Let’s change that!

In this blog series, we discuss different aspects of efficient framework interoperability:

  • In the first post, we discussed pros and cons of distinct memory layouts as well as memory pools for asynchronous memory allocation to enable zero-copy functionality.
  • In the second post, we highlighted bottlenecks occurring during data loading/transfers and how to mitigate them using Remote Direct Memory Access (RDMA) technology.
  • In this post, we dive into the implementation of an end-to-end pipeline demonstrating the discussed techniques for optimal data transfer across data science frameworks.

To learn more on framework interoperability, check out our presentation at NVIDIA’s GTC 2021 Conference.

Let’s dive into the implementation details of a fully functional pipeline for:

  • Parsing of 20 hours of continuously measured electrocardiograms (ECGs) from plain CSV files.
  • Unsupervised segmentation of this ECG stream into individual heartbeats using traditional signal processing techniques.
  • Subsequent training of a variational autoencoder (VAE) for outlier detection.
  • Final visualization of the results.

A different data science library is used for each of the previous steps, making efficient data conversion a crucial task. Most importantly, you should avoid costly CPU round trips when copying data from one GPU-based framework to another.

Zero-Copy in Action: End-to-End pipeline

Enough talk! Let’s see framework interoperability in action. In the following, we will discuss the end-to-end pipeline step by step. If you are an impatient person, you can directly download the full Jupyter notebook here. The source code can be executed from within a recent RAPIDS docker container. 

Step 1: Data loading

In the first step, we download 20 hours of electrocardiograms as a CSV file and write it to disk (see Cell 1). Afterward, we parse the 500 MB of scalar values from the CSV file and transfer them directly to the GPU using RAPIDS’ blazing fast CSV reader (see Cell 2). Now the data resides on the GPU and will stay there until the very end. Next, we plot the whole time series, consisting of 20 million scalar data points, using the cuxfilter (ku-cross-filter) framework (see Cell 3).
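
As a minimal sketch of the load step (the file name below is a placeholder for the CSV written in Cell 1):

import cudf

# RAPIDS parses the CSV directly into GPU memory; no host-side DataFrame is created.
gdf = cudf.read_csv('ecg_stream.csv')
print(gdf.head())
print(len(gdf), 'samples resident on the GPU')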

Figure 1. Parsing comma-separated values (CSV) using the RAPIDS CSV parser: the cudf.io.csv.read_csv call turns a table of floating-point numbers in CSV format into a RAPIDS data frame.

Step 2: Data segmentation

In the next step, we segment the 20-hour ECG into individual heartbeats using traditional signal processing techniques. We achieve that by convolving the ECG stream with the second derivative of a Gaussian distribution, also known as the Ricker wavelet, in order to isolate the corresponding frequency band of the initial peak in a prototypical heartbeat. Both the sampling of the wavelet and the FFT-based convolution are facilitated using CuPy, a CUDA-accelerated library for dense linear algebra and array operations. As a direct result, the RAPIDS cuDF dataframe storing the ECG data has to be converted to a CuPy array using DLPack as a zero-copy mechanism.
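
The following sketch shows the zero-copy handoff plus an FFT-based convolution with a Ricker wavelet, assuming the gdf DataFrame from Step 1 with a hypothetical signal column; the wavelet length and width are illustrative, and newer CuPy releases expose cp.from_dlpack in place of the older cp.fromDlpack.

import cupy as cp

# Zero-copy: expose the cuDF column through DLPack and wrap it as a CuPy array.
ecg = cp.fromDlpack(gdf['signal'].to_dlpack())

# Ricker ("Mexican hat") wavelet: proportional to the negative second
# derivative of a Gaussian of width a.
def ricker(points, a):
    t = cp.arange(points) - (points - 1) / 2.0
    return (1.0 - (t / a) ** 2) * cp.exp(-0.5 * (t / a) ** 2)

wavelet = ricker(301, 20.0)

# FFT-based convolution of the full ECG stream with the wavelet.
n = int(len(ecg) + len(wavelet) - 1)
response = cp.fft.irfft(cp.fft.rfft(ecg, n) * cp.fft.rfft(wavelet, n), n)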

Figure 2. Convolving the electrocardiogram (ECG) stream with a Ricker wavelet of fixed width using CuPy. Left: a single heartbeat and a Ricker wavelet. Right: 13 continuously measured heartbeats and the corresponding output of the convolution.

The feature response (result) of the convolution measures the presence of a fixed frequency content for each position in the stream. Note that we have chosen the wavelet so that local maxima correspond to the initial peak of a heartbeat.

Step 3: Local maxima detection

In the next step, we map these extremal points to a binary gate using a 1D variant of non-maximum suppression (NMS). NMS determines, for every position in the stream, whether the corresponding value is the maximum in a predefined window (neighborhood). A CUDA implementation of this embarrassingly parallel problem is straightforward. In our example, we use the just-in-time compiler Numba to allow for seamless Python integration. Both Numba and CuPy implement the CUDA array interface as a zero-copy mechanism, so the explicit cast from CuPy arrays to Numba device arrays can be fully avoided.
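
Here is a minimal sketch of such a kernel, assuming the response array from the previous step; the window size is an illustrative choice, and a real detector would also apply an amplitude threshold.

from numba import cuda
import cupy as cp

@cuda.jit
def nms_1d(response, window, gate):
    i = cuda.grid(1)
    if i < response.size:
        lo = max(i - window, 0)
        hi = min(i + window + 1, response.size)
        is_max = True
        for j in range(lo, hi):
            if response[j] > response[i]:
                is_max = False
        gate[i] = 1 if is_max else 0

gate = cp.zeros(len(response), dtype=cp.int8)
threads = 128
blocks = (len(response) + threads - 1) // threads

# CuPy arrays are passed to the Numba kernel directly: both libraries speak
# the CUDA array interface, so no explicit casts or copies are needed.
nms_1d[blocks, threads](response, 64, gate)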

Figure 3. 1D non-maximum suppression and embedding of heartbeats using Numba JIT. Left: segmentation gates for 13 continuously measured heartbeats from the 1D non-maximum suppression. Right: a few approximately aligned heartbeats embedded in a vector of length 256.

The length of each heartbeat is determined by computing the adjacent difference (finite step derivative) of the gate positions. We facilitate this by filtering the index domain with the predicate gate==1 followed by a call to cupy.diff(). The resulting histogram depicts the length distribution.
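
In code, again with the hypothetical variable names carried over from the sketches above:

# Indices where the gate fires, and the spacing (heartbeat length) between them.
positions = cp.nonzero(gate == 1)[0]
beat_lengths = cp.diff(positions)
print(int(beat_lengths.mean()), 'samples per beat on average')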

Step 4: Candidate pruning and embedding

We intend to train a (convolutional) Variational Autoencoder (VAE) on the set of heartbeats using a fixed-length input matrix. The embedding of heartbeats in a vector of zeros can be realized with a CUDA kernel. Here, we again use Numba for both candidate pruning and embedding. 

Step 5: Outliers detection

In this step, we train the VAE model on 75% of the data. DLPack is used again as the zero-copy mechanism to map the CuPy data matrix to a PyTorch tensor.
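
A sketch of that handoff, assuming a hypothetical heartbeats CuPy matrix produced in Step 4 (newer PyTorch versions can also call torch.from_dlpack directly on the CuPy array):

import torch
from torch.utils.dlpack import from_dlpack

# Zero-copy view of the CuPy embedding matrix as a PyTorch GPU tensor.
heartbeat_tensor = from_dlpack(heartbeats.toDlpack())
print(heartbeat_tensor.shape, heartbeat_tensor.device)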

Figure 4. Training a variational autoencoder using PyTorch: the network topology, with an approximately isotropic point cloud below the (middle) latent layer.

Step 6: Results visualization

In a final step, we visualize the latent space of the remaining 25% of the data.

Figure 5. Sampling and visualizing the latent space using RAPIDS cuxfilter: an approximately isotropic Gaussian point cloud in the 2D plane and nine heartbeats generated with the decoder of the autoencoder.

Conclusion

A key takeaway from this and the preceding blog posts is that interoperability is crucial for the design of efficient data pipelines. Copying and converting data between different frameworks is an expensive and incredibly time-consuming task that adds zero value to data science pipelines. Data science workloads are becoming increasingly complex, and the interaction between multiple software libraries is common practice. DLPack and the CUDA Array Interface are the de facto data format standards that guarantee zero-copy data exchange among GPU-based frameworks.

Support for external memory managers is a nice-to-have feature to consider when appraising which software libraries your pipeline will use. For instance, if your task requires both DataFrame and array data manipulation, a great choice of libraries is RAPIDS cuDF + CuPy. They both benefit from GPU acceleration, support DLPack to exchange data with zero copies, and share the same memory manager, RMM. Alternatively, RAPIDS cuDF + JAX would also be an excellent option. Nevertheless, the latter combination might require extra development effort to optimize memory usage because of JAX’s lack of support for external memory allocators.

Data loading and data transfer bottlenecks are frequent when dealing with large datasets. NVIDIA GPUDirect technology comes to the rescue, moving data into or out of GPU memory without burdening the CPU and reducing the number of data copies needed when transferring data between GPUs on different nodes to one.


NVIDIA Research: Tensors Are the Future of Deep Learning

This post discusses tensor methods, how they are used in NVIDIA, and how they are central to the next generation of AI algorithms.

This post discusses tensor methods, how they are used in NVIDIA, and how they are central to the next generation of AI algorithms.

Tensors in modern machine learning

Tensors, which generalize matrices to more than two dimensions, are everywhere in modern machine learning. From deep neural network features to videos or fMRI data, the structure in these higher-order tensors is often crucial.

Deep neural networks typically map between higher-order tensors. In fact, it is the ability of deep convolutional neural networks to preserve and leverage local structure that made the current levels of performance possible, along with large datasets and efficient hardware. Tensor methods enable you to preserve and leverage that structure further, for individual layers or whole networks.

Figure 1. Deep tensor networks bring compression, speed-ups, generalization, and robustness.

Combining tensor methods and deep learning can lead to better models, including:

  • Better performance and generalization, through better inductive biases
  • Improved robustness, from implicit (low-rank structure) or explicit (tensor dropout) regularization
  • Parsimonious models, with a large reduction in the number of parameters
  • Computational speed-ups by operating directly and efficiently on factorized tensors

One example is factorized convolution. With a CP structure, it is possible to decompose the kernel of a convolution and express it efficiently as a separable one. This decouples the dimensions and enables transduction, such as training on 2D data and generalizing to 3D, while leveraging the information learned in 2D.

Figure 2. The process of factorized convolution: how 2D information turns into 3D information.

The proper implementation of tensor-based deep neural networks can be tricky. Major neural network libraries such as PyTorch or TensorFlow do not provide layers based on tensor algebraic methods and have limited support for sparse tensors. At NVIDIA, we lead the development of a series of tools to make the use of tensor methods in deep learning seamless, through the TensorLy project and the Minkowski Engine.

TensorLy ecosystem

TensorLy offers a high-level API for tensor methods, including decomposition and algebra.

It enables you to use tensor methods easily without requiring a lot of background knowledge, and to seamlessly integrate with the computational backend of your choice (NumPy, PyTorch, MXNet, TensorFlow, CuPy, or JAX) without having to change your code.
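
For example, a CP decomposition takes only a few lines; this is a minimal sketch in which the tensor shape and rank are arbitrary (in older TensorLy releases, cp_to_tensor was named kruskal_to_tensor).

import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend('numpy')  # 'pytorch', 'cupy', or 'jax' work without code changes

# Rank-4 CP decomposition of a random 3rd-order tensor.
X = tl.tensor(np.random.rand(8, 9, 10))
cp_tensor = parafac(X, rank=4)

# Reconstruct the full tensor from its factors and check the approximation error.
X_hat = tl.cp_to_tensor(cp_tensor)
print(float(tl.norm(X - X_hat) / tl.norm(X)))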

Figure 3. The TensorLy-Torch layer diagram.

TensorLy-Torch is a new library that builds on top of TensorLy and provides PyTorch layers implementing these tensor operations. They can be used out-of-the-box and readily integrated in any deep neural network. At the core of it is the concept of factorized tensors: tensors are represented, stored, and manipulated directly in decomposed form. Whenever possible, operations then directly operate on these decomposed tensors.

These factorized tensors can then be used to parametrize deep neural network layers efficiently, such as factorized convolutions and linear layers. Finally, tensor hooks enable you to apply techniques such as generalized lasso and tensor dropout seamlessly for improved generalization and robustness.

Spatially sparse tensors and Minkowski Engine

In many high-dimensional problems, data become sparse as the volume of the space increases. The sparsity is mostly embedded in the spatial dimensions, where you can compute distances. The most well-known example of such sparsity is 3D data, such as meshes and scans.

Figure 4. 3D reconstruction of a room with two beds.

Here’s an example 3D reconstruction of a room with two beds. The 3D bounding volume that it occupies can be quite large, but the data, or the 3D surface reconstruction, occupies only a fraction of the space. In this example, 95.5% of the space is empty and less than 5% contains a valid surface. Using a dense tensor to represent such data results in wasting not just large amounts of memory but also computation if you were to process such data.

For such cases, you could use a sparse representation that does not waste memory and computation on the empty space when building a neural network or a learning algorithm. Specifically, you use sparse tensors, one of the most widely used representations for sparse data. Sparse tensors represent data as pairs of coordinates and the values of the nonzero entries.

The Minkowski Engine is a PyTorch extension that provides an extensive set of neural network layers for sparse tensors. All functions in the Minkowski Engine support CPU and CUDA operations, with CUDA operations running more than 100x faster than top-of-the-line CPUs.
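
The following is a minimal sketch of a sparse tensor and a sparse convolution with the Minkowski Engine; the coordinates and feature sizes are arbitrary, and keyword names such as features/coordinates differ slightly across Minkowski Engine versions.

import torch
import MinkowskiEngine as ME

# Sparse tensor: only the occupied coordinates and their features are stored.
# Each coordinate row is (batch_index, x, y, z).
coordinates = torch.IntTensor([[0, 0, 0, 0],
                               [0, 1, 2, 3],
                               [0, 4, 4, 1]])
features = torch.rand(3, 8)
sparse_input = ME.SparseTensor(features=features, coordinates=coordinates)

# A 3D sparse convolution layer operating only on the occupied sites.
conv = ME.MinkowskiConvolution(in_channels=8, out_channels=16,
                               kernel_size=3, dimension=3)
output = conv(sparse_input)
print(output.F.shape)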

Figure 5. Sparse representation benchmarks on V100 and A100 GPUs: number of non-zero elements versus time and versus speedup.

For other interesting projects, see all NVIDIA Research posts.


NVIDIA Research: Auditing AI Models for Verified Deployment under Semantic Specifications

Quality assurance and audits are necessary for deep learning models. Current AI models require large data sets for training or a designed reward function that must be optimized. Algorithmically, AI is prone to optimizing behaviors that were not intended by the human designer. To help combat this, the AuditAI framework was developed to help audit these problems, which increases safety and ethical use of deep learning models during deployment.

When you purchased your last car, did you check for safety ratings or quality assurances from the manufacturer? Perhaps, like most consumers, you simply went for a test drive to see if the car offered all the features and functionality you were looking for, from comfortable seating to electronic controls.

Audits and quality assurance are the norm across many industries. Consider car manufacturing, where the production of the car is followed by rigorous tests of safety, comfort, networking, and so on, before deployment to end users. Based on this, we ask the question, “How can we design a similarly motivated auditing scheme for deep learning models?”

AI has enjoyed widespread success in real-world applications. Current AI models—deep neural networks in particular—do not require exact specifications of the type of desired behavior. Instead, they require large datasets for training, or a designed reward function that must be optimized over time.

While this form of implicit supervision provides flexibility, it often leads to the algorithm optimizing for behavior that was not intended by the human designer. In many cases, it also leads to catastrophic consequences and failures in safety-critical applications, such as autonomous driving and healthcare.

As these models are prone to failure, especially under domain shifts, it is important to know before deployment when they might fail. As deep learning research becomes increasingly integrated with real-world applications, we must come up with schemes for formally auditing deep learning models.

Semantically aligned unit tests

One of the biggest challenges in auditing is in understanding how we can obtain human-interpretable specifications that are directly useful to end users. We addressed this challenge through a sequence of semantically aligned unit tests. Each unit test verifies whether a predefined specification (for example, accuracy over 95%) is satisfied with respect to controlled and semantically aligned variations in the input space (for example, in face recognition, the angle relative to the camera).

We perform these unit tests by directly verifying the semantically aligned variations in an interpretable latent space of a generative model. Our framework, AuditAI, bridges the gap between interpretable formal verification of software systems and scalability of deep neural networks.     
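
As a purely schematic illustration (every function and argument name below is a hypothetical placeholder, not the AuditAI API), a semantically aligned unit test can be thought of as a loop over controlled latent variations:

import numpy as np

def semantic_unit_test(classify, decode, encode, images, labels,
                       direction, magnitudes, min_accuracy=0.95):
    """Verify that accuracy stays above min_accuracy while one semantic
    attribute (e.g., face angle) is varied along a latent-space direction.
    All names here are hypothetical placeholders for illustration only."""
    z = encode(images)
    for m in magnitudes:
        preds = classify(decode(z + m * direction))
        accuracy = float(np.mean(preds == labels))
        if accuracy < min_accuracy:
            return False, m  # specification violated at this variation
    return True, None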

Figure 1. A general machine learning pipeline from development to deployment, broken down by the roles of model designer, verifier, and end user.

Consider a typical machine learning production pipeline with three parties: the end user of the deployed model, verifier, and model designer. The verifier plays the critical role of verifying whether the model from the designer satisfies the need of the end user. For example, unit test 1 could be verifying whether a given face classification model maintains over 95% accuracy when the face angle is within d degrees. Unit test 2 could be checking under what lighting condition the model has over 86% accuracy. After verification, the end user can then use the verified specification to determine whether to use the trained DL model during deployment.

Figure 2. Deep networks undergo certified training to ensure that unit tests (for example, on expression and facial angle) are likely to be satisfied.

Verified deployment

To verify the deep network for semantically aligned properties, we bridge it with a generative model, such that they share the same latent space and the same encoder that projects inputs to latent codes. In addition to verifying whether a unit test is satisfied, we can also perform certified training to ensure that the unit test is likely to be satisfied in the first place. The framework has appealing theoretical properties, and we show in our paper how the verifier is guaranteed to be able to generate a proof of whether the verification is true or false. For more information, see Auditing AI models for Verified Deployment under Semantic Specifications [LINK].

Verification and certified training of neural networks for pixel-based perturbations covers a much narrower range of semantic variations in the latent space, compared to AuditAI. To perform a quantitative comparison, for the same verified error, we project the pixel-bound to the latent space and compare it with the latent-space bound for AuditAI. We show that AuditAI can tolerate around 20% larger latent variations compared to pixel-based counterparts as measured by L2 norm, for the same verified error. For the implementation and experiments, we used NVIDIA V100 GPUs and Python with the PyTorch library. 

In Figure 3, we show qualitative results for generated outputs corresponding to controlled variations in the latent space. The top row shows visualizations for AuditAI, and the bottom row shows visualizations for pixel-perturbations for images of class hen on ImageNet, chest X-ray images with the condition pneumonia, and human faces with different degrees of smile respectively. From the visualizations, it is evident that wider latent variations correspond to a wider set of semantic variations in the generated outputs.

Figure 3. Top row: AuditAI visualizations, Bottom row: Visualizations for pixel-perturbations

Future work

In this paper, we developed a framework for auditing deep learning (DL) models. There are growing concerns about innate biases in the DL models that are deployed in a wide range of settings, and there have been multiple news articles about the necessity of auditing DL models before deployment. Our framework formalizes this audit problem, which we believe is a step towards increasing the safety and ethical use of DL models during deployment.

One of the limitations of AuditAI is that its interpretability is limited by that of the built-in generative model. While exciting progress has been made for generative models, we believe it is important to incorporate domain expertise to mitigate potential dataset biases and human error in both training and deployment.

Currently, AuditAI doesn’t directly integrate human domain experts in the auditing pipeline. It indirectly uses domain expertise in the curation of the dataset used for creating the generative model. Incorporating the former would be an important direction for future work.

Although we have demonstrated AuditAI primarily for auditing computer vision classification models, we hope that this would pave the way for more sophisticated domain-dependent AI-auditing tools and frameworks in language modeling and decision-making applications.

Acknowledgments

This work was conducted wholly at NVIDIA. We thank our co-authors De-An Huang, Chaowei Xiao, and Anima Anandkumar for helpful feedback, and everyone in the AI Algorithms team at NVIDIA for insightful discussions during the project.


Learn How to Use Deep Learning for Industrial Inspection

Register now for the Sept. 21 instructor-led training from DLI covering training, accelerating, and optimizing a defect detection classifier.

NVIDIA GPUs are used to develop the most accurate automated inspection solutions for manufacturing semiconductors, electronics, automotive components, and assemblies. Along with accompanying software tools, GPUs enable efficient training of models for greater accuracy and optimized inference deployment at the edge. These models dramatically improve the accuracy of industrial inspection, resulting in reduced test escapes and increased yield at greater throughput.

The NVIDIA Deep Learning Institute (DLI) is offering an instructor-led class about training, accelerating, and optimizing a defect detection classifier. 

You will start by exploring key challenges around industrial inspection and problem formulation, along with data curation, exploration, and formatting.

Then you will learn about the fundamentals of transfer learning, online augmentation, modeling, and fine-tuning. 

By the end of the workshop, you’ll be familiar with the key concepts of optimized inference, performance assessment, and interpretation of deep learning models.

By participating in this workshop, you will learn how to: 

  • Formulate an industrial inspection case study and curate datasets generated by automated optical inspection (AOI) machines.
  • Deal with the logistics and challenges of data handling in an industrial inspection workflow.
  • Extract meaningful insights from our dataset using pandas DataFrames and the NumPy library.
  • Apply transfer learning to a deep learning classification model (Inception v3). 
  • Fine-tune the deep learning model and set up evaluation metrics.
  • Optimize the trained Inception v3 model on an NVIDIA V100 Tensor Core GPU using NVIDIA TensorRT™ 5.
  • Experiment with FP16 half-precision fast inferencing with the V100’s Tensor Cores.

You will have access to a GPU-accelerated server in the cloud and earn an NVIDIA DLI certificate to demonstrate subject-matter competency and accelerate your career growth.

This workshop will be offered twice to accommodate both CEST and PDT timezones on:
Tue, Sept. 21, 2021, 9 am–5 pm, CEST/EMEA, UTC+2
Tue, Sept. 21, 2021, 9 am–5 pm, PDT, UTC-7

Register now, space is limited!


Introducing Omnimattes: A New Approach to Matte Generation using Layered Neural Rendering

Image and video editing operations often rely on accurate mattes — images that define a separation between foreground and background. While recent computer vision techniques can produce high-quality mattes for natural images and videos, allowing real-world applications such as generating synthetic depth-of-field, editing and synthesising images, or removing backgrounds from images, one fundamental piece is missing: the various scene effects that the subject may generate, like shadows, reflections, or smoke, are typically overlooked.

In “Omnimatte: Associating Objects and Their Effects in Video”, presented at CVPR 2021, we describe a new approach to matte generation that leverages layered neural rendering to separate a video into layers called omnimattes that include not only the subjects but also all of the effects related to them in the scene. Whereas a typical state-of-the-art segmentation model extracts masks for the subjects in a scene, for example, a person and a dog, the method proposed here can isolate and extract additional details associated with the subjects, such as shadows cast on the ground.

A state-of-the-art segmentation network (e.g., MaskRCNN) takes an input video (left) and produces plausible masks for people and animals (middle), but misses their associated effects. Our method produces mattes that include not only the subjects, but their shadows as well (right; individual channels for person and dog visualized as blue and green).

Also unlike segmentation masks, omnimattes can capture partially-transparent, soft effects such as reflections, splashes, or tire smoke. Like conventional mattes, omnimattes are RGBA images that can be manipulated using widely-available image or video editing tools, and can be used wherever conventional mattes are used, for example, to insert text into a video underneath a smoke trail.

Layered Decomposition of Video
To generate omnimattes, we split the input video into a set of layers: one for each moving subject, and one additional layer for stationary background objects. In the example below, there is one layer for the person, one for the dog, and one for the background. When merged together using conventional alpha blending, these layers reproduce the input video.
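
Conventional alpha blending here is just back-to-front "over" compositing of the RGBA layers; a minimal sketch:

import numpy as np

def composite(layers):
    """Merge RGBA layers (each H x W x 4, floats in [0, 1]) back to front.
    The first layer is the background, the last is the topmost subject."""
    out = np.zeros_like(layers[0][..., :3])
    for rgba in layers:
        rgb, alpha = rgba[..., :3], rgba[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out
    return out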

Besides reproducing the video, the decomposition must capture the correct effects in each layer. For example, if the person’s shadow appears in the dog’s layer, the merged layers would still reproduce the input video, but inserting an additional element between the person and dog would produce an obvious error. The challenge is to find a decomposition where each subject’s layer captures only that subject’s effects, producing a true omnimatte.

Our solution is to apply our previously developed layered neural rendering approach to train a convolutional neural network (CNN) to map the subject’s segmentation mask and a background noise image into an omnimatte. Due to their structure, CNNs are naturally inclined to learn correlations between image effects, and the stronger the correlation between the effects, the easier for the CNN to learn. In the above video, for example, the spatial relationships between the person and their shadow, and the dog and its shadow, remain similar as they walk from right to left. The relationships change more (hence, the correlations are weaker) between the person and the dog’s shadow, or the dog and the person’s shadow. The CNN learns the stronger correlations first, leading to the correct decomposition.

The omnimatte system is shown in detail below. In a preprocess, the user chooses the subjects and specifies a layer for each. A segmentation mask for each subject is extracted using an off-the-shelf segmentation network, such as MaskRCNN, and camera transformations relative to the background are found using standard camera stabilization tools. A random noise image is defined in the background reference frame and sampled using the camera transformations to produce per-frame noise images. The noise images provide image features that are random but consistently track the background over time, providing a natural input for the CNN to learn to reconstruct the background colors.

The rendering CNN takes as input the segmentation mask and the per-frame noise images and produces the RGB color images and alpha maps, which capture the transparency of each layer. These outputs are merged using conventional alpha-blending to produce the output frame. The CNN is trained from scratch to reconstruct the input frames by finding and associating the effects not captured in a mask (e.g., shadows, reflections or smoke) with the given foreground layer, and to ensure the subject’s alpha roughly includes the segmentation mask. To make sure the foreground layers only capture the foreground elements and none of the stationary background, a sparsity loss is also applied on the foreground alpha.

A new rendering network is trained for each video. Because the network is only required to reconstruct the single input video, it is able to capture fine structures and fast motion in addition to separating the effects of each subject, as seen below. In the walking example, the omnimatte includes the shadow cast on the slats of the park bench. In the tennis example, the thin shadow and even the tennis ball are captured. In the soccer example, the shadow of the player and the ball are decomposed into their proper layers (with a slight error when the player’s foot is occluded by the ball).

This basic model already works well, but one can improve the results by augmenting the input of the CNN with additional buffers such as optical flow or texture coordinates.

Applications
Once the omnimattes are generated, how can they be used? As shown above, we can remove objects, simply by removing their layer from the composition. We can also duplicate objects, by repeating their layer in the composition. In the example below, the video has been “unwrapped” into a panorama, and the horse duplicated several times to produce a stroboscopic photograph effect. Note that the shadow that the horse casts on the ground and onto the obstacle is correctly captured.

A more subtle, but powerful application is to retime the subjects. Manipulation of time is widely used in film, but usually requires separate shots for each subject and a controlled filming environment. A decomposition into omnimattes makes retiming effects possible for everyday videos using only post-processing, simply by independently changing the playback rate of each layer. Since the omnimattes are standard RGBA images, this retiming edit can be done using conventional video editing software.

The video below is decomposed into three layers, one for each child. The children’s initial, unsynchronized jumps are aligned by simply adjusting the playback rate of their layers, producing realistic retiming for the splashes and reflections in the water.

In the original video (left), each child jumps at a different time. After editing (right), everyone jumps together.

It’s important to consider that any novel technique for manipulating images should be developed and applied responsibly, as it could be misused to produce fake or misleading information. Our technique was developed in accordance with our AI Principles and only allows rearrangement of content already present in the video, but even simple rearrangement can significantly alter the effect of a video, as shown in these examples. Researchers should be aware of these risks.

Future Work
There are a number of exciting directions to improve the quality of the omnimattes. On a practical level, this system currently only supports backgrounds that can be modeled as panoramas, where the position of the camera is fixed. When the camera position moves, the panorama model cannot accurately capture the entire background, and some background elements may clutter the foreground layers (sometimes visible in the above figures). Handling fully general camera motion, such as walking through a room or down a street, would require a 3D background model. Reconstruction of 3D scenes in the presence of moving objects and effects is still a difficult research challenge, but one that has seen promising recent progress.

On a theoretical level, the ability of CNNs to learn correlations is powerful, but still somewhat mysterious, and does not always lead to the expected layer decomposition. While our system allows for manual editing when the automatic result is imperfect, a better solution would be to fully understand the capabilities and limitations of CNNs to learn image correlations. Such an understanding could lead to improved denoising, inpainting, and many other video editing applications besides layer decomposition.

Acknowledgements
Erika Lu, from the University of Oxford, developed the omnimatte system during two internships at Google, in collaboration with Google researchers Forrester Cole, Tali Dekel, Michael Rubinstein, William T. Freeman and David Salesin, and University of Oxford researchers Weidi Xie and Andrew Zisserman.

Thank you to the friends and families of the authors who agreed to appear in the example videos. The “horse jump low”, “lucia”, and “tennis” videos are from the DAVIS 2016 dataset. The soccer video is used by permission from Online Soccer Skills. The car drift video was licensed from Shutterstock.


New NVIDIA Updates for Unreal Engine Developers

Unreal Engine developers receive access to several NVIDIA updates. Our custom branch of Unreal Engine 4.27 (NvRTX) improves Deep Learning Super Sampling (DLSS), RTX Global Illumination (RTXGI), RTX Direct Illumination (RTXDI), and NVIDIA Real-Time Denoisers (NRD).

Custom RTX Branch of UE4.27 Drops, Global Illumination and Low Latency Solutions Level Up 

Today, Unreal Engine developers receive access to several NVIDIA updates. Our custom branch of Unreal Engine 4.27 (NvRTX) improves Deep Learning Super Sampling (DLSS), RTX Global Illumination (RTXGI), RTX Direct Illumination (RTXDI), and NVIDIA Real-Time Denoisers (NRD). The Unreal Engine RTXGI plugin has been updated, making it easy to add the latest version of this global illumination SDK (1.1.40) to your game. And NVIDIA Reflex is now a standard feature in Unreal Engine 4.27. Details for each product are below.

NvRTX 4.27 (Download here)

NVIDIA has made it easy for game developers to add leading-edge technologies to their Unreal Engine (UE4) games by providing custom UE4 branches for NVIDIA technologies on GitHub. NvRTX 4.27 will shorten your development cycles and help make your games look even more stunning.

In NvRTX 4.27, RTX Direct Illumination (RTXDI) and NVIDIA Real-Time Denoisers (NRD) improved support for Metahuman hair. Deep Learning Super Sampling (DLSS) added softening capability to the sharpness slider, and provided a workaround for packaged builds not initializing DLSS on d3d11 devices. RTXGI increased the number of volumes supported in the indirect lighting pass from 6 to 12, made screen-space indirect lighting resolution adjustable through cvars, and supported blending of more than 2 volumes.

RTXGI 1.1.40 UE4 plugin (Download here)

Leveraging the power of ray tracing, RTXGI provides scalable solutions to compute multi-bounce indirect lighting without bake times, light leaks, or expensive per-frame costs. RTXGI is supported on any DXR-enabled GPU, and is an ideal starting point to bring the benefits of ray tracing to your existing tools, knowledge, and capabilities. 

As in the NvRTX update, this plugin update doubles the number of volumes supported in the indirect lighting pass, enables developers to adjust screen-space indirect lighting resolution, and supports the blending of more than 2 volumes.

Reflex (now available in Unreal Engine 4.27)

The NVIDIA Reflex SDK allows game developers to implement a low latency mode that aligns game engine work to complete just-in-time for rendering, eliminating the GPU render queue and reducing CPU back pressure in GPU-bound scenarios. As a developer, System Latency (click-to-display) can be one of the hardest metrics to optimize for. In addition to latency reduction functions, the SDK also features measurement markers to calculate both Game and Render Latency – great for debugging and in-game performance counters.

NVIDIA Reflex is now mainlined within UE4.27 and natively supported. To get started adding NVIDIA Reflex to your game, check out our easy step-by-step integration guide.

Additional Resources: