Building World-Class AI Models with NVIDIA NeMo and DefinedCrowd

Speech is the most natural form of human communication. So, it’s not surprising that we’ve always wanted to interact with and command machines by voice. However, for conversational AI to provide a seamless, natural, and human-like experience, it needs to be trained on large amounts of data representative of the problem the model is trying … Continued

Speech is the most natural form of human communication. So, it’s not surprising that we’ve always wanted to interact with and command machines by voice. However, for conversational AI to provide a seamless, natural, and human-like experience, it needs to be trained on large amounts of data representative of the problem the model is trying to solve. The difficulty for machine learning teams is the scarcity of this high-quality, domain-specific data.

Companies are trying to solve this problem and accelerate the widespread adoption of conversational AI with innovative solutions that guarantee the scalability and internationality of models. NVIDIA and DefinedCrowd are two such companies. By providing machine learning engineers with a model-building toolkit and high-quality training data respectively, NVIDIA and DefinedCrowd integrate to create world-class AI simply, easily, and quickly.

DefinedCrowd, a one-stop shop for AI training data

I am the director of machine learning at DefinedCrowd, and our core business is providing high-quality AI training data to companies building world-class AI solutions. Our customers can access this data through DefinedData, an online marketplace of off-the-shelf AI training data available in multiple languages, domains, and recording types.

If you can’t find what you’re looking for in DefinedData, our workflows can serve as standalone or end-to-end data services to build any speech– or text-enabled AI architecture from scratch, to improve solutions already developed, or to evaluate models in production, all with the DefinedCrowd quality guarantee.

Creating conversational AI applications the easy way

NVIDIA NeMo is a toolkit built by NVIDIA for creating conversational AI applications. This toolkit includes collections of pretrained modules for automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS), enabling researchers and data scientists to easily compose complex neural network architectures and focus on designing their applications.

Video: Watch how simple and fast it is to create world class conversational AI with NVIDIA NeMo and DefinedCrowd. Build a world-class model quickly and easily with the NeMo toolkit, and train it for high-performance with high-quality training data from DefinedCrowd.

NeMo and DefinedCrowd integration

Here’s how to connect DefinedCrowd speech workflows to train and improve an ASR model using NVIDIA NeMo.  The code can also be accessed on this Google Colab link.

Step 1: Install NeMo Toolkit and dependencies

# First, install NeMo Toolkit and dependencies to run this notebook
!apt-get install -y libsndfile1 ffmpeg
!pip install Cython

## Install NeMo dependencies in the correct versions
!pip install torchtext==0.8.0 torch==1.7.1 pytorch-lightning==1.2.2

## Install NeMo
!python -m pip install nemo_toolkit[all]==1.0.0b3

Step 2: Obtain data using the DefinedCrowd API

Here’s how to connect to the DefinedCrowd API to obtain speech collected data. For more information, see DefinedCrowd API (v2).

# For the demo, use a sandbox environment
auth_url = ""
api_url = ""

# These variables should be obtained at the DefinedCrowd Enterprise Portal for your account.
client_id = ""
client_secret = ""
project_id = ""


payload = {
    "client_id": client_id,
    "client_secret": client_secret,
    "grant_type": "client_credentials",
    "scope": "PublicAPIv2",
files = []
headers = {}

# request the Auth 2.0 access token
response = requests.request(
    "POST", f"{auth_url}/connect/token", headers=headers, data=payload, files=files
if response.status_code == 200:
    print("Authentication success!")
    access_token = response.json()["access_token"]
    print("Authentication Failed")

Authentication success!

List of deliverables

# GET /projects/{project-id}/deliverables
headers = {"Authorization": "Bearer " + access_token}
response = requests.request(
    "GET", f"{api_url}/projects/{project_id}/deliverables", headers=headers

if response.status_code == 200:
    # Pretty print the response
    print(json.dumps(response.json(), indent=4))

    # Get the first deliverable ID
    deliverable_id = response.json()[0]["id"]

        "projectId": "eb324e45-c4f9-41e7-b5cf-655aa693ae75",
        "id": "258f9e15-2937-4846-b9c3-3ae1164b7364",
        "type": "Flat",
        "fileName": "",
        "createdTimestamp": "2021-03-22T14:34:37.8037259",
        "isPartial": false,
        "downloadCount": 2,
        "status": "Downloaded"

Final deliverable for speech data collection

# Name to give to the deliverable file
filename = ""

# GET /projects/{project-id}/deliverables/{deliverable-id}/download
headers = {"Authorization": "Bearer " + access_token}
response = requests.request(

if response.status_code == 200:
    # save the deliverable file
    with open(filename, "wb") as fp:
    print("Deliverable file saved with success!")

Deliverable file saved with success!

# Extract the contents from the downloaded file
!unzip &> /dev/null
!rm -f

Step 3: Analyze the speech dataset

Here’s how to analyze the data received from DefinedCrowd. The data is built of scripted speech data collected by the DefinedCrowd Neevo platform from several speakers in the UK (crowd members from DefinedCrowd).

Each row of the dataset contains information about the speech prompt, crowd member, device used, and the recording. The following data is found with this delivery:

  • Recording:
    • RecordingId
    • PromptId
    • Prompt
  • Audio File:
    • RelativeFileName
    • Duration
    • SampleRate
    • BitDepth
    • AudioCommunicationBand
    • RecordingEnvironment
  • Crowd Member:
    • SpeakerId
    • Gender
    • Age
    • Accent
    • LivingCountry
  • Recording Device:
    • Manufacturer
    • DeviceType
    • Domain

This data can be used for multiple purposes, but in this tutorial, I use it for improving an existent ASR model for British speakers.

import pandas as pd

# Look in the metadata file
dataset = pd.read_csv("metadata.tsv", sep="t", index_col=[0])

# Check the data for the first row

RecordingId                               165559628
PromptId                                   64977250
RelativeFileName                Audio/165559628.wav
Prompt                    The Avengers' extinction.
Duration                               00:00:02.815
SpeakerId                                    128209
Gender                                       Female
Age                                              26
Manufacturer                                  Apple
DeviceType                                iPhone 6s
Accent                                      Suffolk
Domain                                      generic
SampleRate                                    16000
BitDepth                                         16
AudioCommunicationBand                    Broadband
LivingCountry                        United Kingdom
Native                                         True
RecordingEnvironment                         silent
Name: 0, dtype: object

# How many rows do you have?


# Check some examples from the dataset
import librosa
import IPython.display as ipd

for index, row in dataset.sample(4, random_state=1).iterrows():

    print(f"Prompt: {dataset.iloc[index].Prompt}")
    audio_file = dataset.iloc[index].RelativeFileName

    # Load and listen to the audio file
    audio, sample_rate = librosa.load(audio_file)
    ipd.display(ipd.Audio(audio, rate=sample_rate))

For audio samples, see the DefinedCrowd x NeMo – ASR Training tutorial on Google Colab.

Step 4: Prepare the data

After downloading the speech data from DefinedCrowd API, you must adapt it for the format expected by NeMo for ASR training. For this, you create manifests for the training and evaluation data, including each audio file’s metadata.

NeMo requires that you adapt the data to a particular manifest format. Each line corresponding to one audio sample, so the line count equals the number of samples represented by the manifest. A line must contain the path to an audio file, the corresponding transcript, and the audio sample duration. For example, here is what one line might look like in a NeMo-compatible manifest:

{"audio_filepath": "path/to/audio.wav", "duration": 3.45, "text": "this is a nemo tutorial"}

For the creation of the manifest, also standardize the transcripts.

import os

# Function to build a manifest
def build_manifest(dataframe, manifest_path):
    with open(manifest_path, "w") as fout:
        for index, row in dataframe.iterrows():
            transcript = row["Prompt"]

            # The model uses lowercased data for training/testing
            transcript = transcript.lower()

            # Removing linguistic marks (they are not necessary for this demo)
            transcript = (
                transcript.replace("", "")
                .replace("", "")
                .replace("[b_s/]", "")
                .replace("[uni/]", "")
                .replace("[v_n/]", "")
                .replace("[filler/]", "")
                .replace('"', "")
                .replace("[n_s/]", "")

            audio_path = row["RelativeFileName"]

            # Get the audio duration
                duration = librosa.core.get_duration(filename=audio_path)
            except Exception as e:
                print("An error occurred: ", e)

            if os.path.exists(audio_path):
                # Write the metadata to the manifest
                metadata = {
                    "audio_filepath": audio_path,
                    "duration": duration,
                    "text": transcript,
                json.dump(metadata, fout)

Step 5: Train and test splits

To test the quality of the model, you must reserve some data for model testing. Evaluate the model performance on this data.

import json
from sklearn.model_selection import train_test_split
# Split 10% for testing (500 prompts) and 90% for training (4500 prompts)
trainset, testset = train_test_split(dataset, test_size=0.1, random_state=1)
# Build the manifests
build_manifest(trainset, "train_manifest.json")
build_manifest(testset, "test_manifest.json")

Step 6: Configure the model

Here’s how to use the QuartzNet15x5 model as a base model for fine-tuning with the data. To improve the recognition of the dataset, benchmark the model performance on the base model and later, on the fine-tuned version. Some of the following functions were retrieved from the Nemo Tutorial on ASR.

# Import Nemo and the functions for ASR
import torch
import nemo
import nemo.collections.asr as nemo_asr
import logging
from nemo.utils import _Logger
# Set up the log level by NeMo
logger = _Logger()

Step 7: Set training parameters

For training, NeMo uses a Python dictionary as data structure to keep all the parameters. For more information, see the NeMo ASR Config User Guide.

For this tutorial, load a preexisting file with the standard ASR configuration and change only the necessary fields.

## Download the config to use in this example
!mkdir configs
!wget -P configs/ &> /dev/null

# --- Config Information ---#
from ruamel.yaml import YAML

config_path = "./configs/config.yaml"

yaml = YAML(typ="safe")
with open(config_path) as f:
    params = yaml.load(f)

Step 8: Download the base model

For the ASR model, use a pretrained QuartzNet15x5 model from the NGC catalog.

QuartzNet15x5 model trained on six datasets: LibriSpeech, Mozilla Common Voice (validated clips from en_1488h_2019-12-10), WSJ, Fisher, Switchboard, and NSC Singapore English. It was trained with Apex/Amp optimization level O1 for 600 epochs. The model achieves a WER of 3.79% on LibriSpeech dev-clean, and a WER of 10.05% on dev-other.

# This line downloads the pretrained QuartzNet15x5 model from NGC and instantiates it for you
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="QuartzNet15x5Base-En", strict=False)

Step 9: Evaluate the base model performance

The word error rate (WER) is a valuable measurement tool for comparing different ASR model and evaluating improvements within one system. To obtain the results, assess how the model performs by using the testing set.

# Configure the model parameters for testing

# Parameters for training, validation, and testing are specified using the 
# train_ds, validation_ds, and test_ds sections of your configuration file

# Bigger batch-size = bigger throughput
params["model"]["validation_ds"]["batch_size"] = 8

# Set up the test data loader and make sure the model is on GPU
params["model"]["validation_ds"]["manifest_filepath"] = "test_manifest.json"

# Comment out this line if you don't want to use GPU acceleration
_ = quartznet.cuda()

# Compute the WER metric between the hypothesis and predictions.

wer_numerators = []
wer_denominators = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` gives you:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
with torch.no_grad():
    for test_batch in quartznet.test_dataloader():
        input_signal, input_signal_length, targets, targets_lengths = [x.cuda() for x in test_batch]
        log_probs, encoded_len, greedy_predictions = quartznet(
        # The model has a helper object to compute WER
        quartznet._wer.update(greedy_predictions, targets, targets_lengths)
        _, wer_numerator, wer_denominator = quartznet._wer.compute()

# First, sum all numerators and denominators. Then, divide.
print(f"WER = {sum(wer_numerators)/sum(wer_denominators)*100:.2f}%")

WER = 39.70%

Step 10: Fine-tune the model

The base model got 39.7% of WER, which is not so good. Maybe providing some data from the same domain and language dialects can improve the ASR model. For simplification, train for only one epoch using DefinedCrowd’s data.

import pytorch_lightning as pl
from omegaconf import DictConfig
import copy

# Before training, you must provide the train manifest for training
params["model"]["train_ds"]["manifest_filepath"] = "train_manifest.json"

# Use the smaller learning rate for fine-tuning
new_opt = copy.deepcopy(params["model"]["optim"])
new_opt["lr"] = 0.001

# Batch size depends on the GPU memory available
params["model"]["train_ds"]["batch_size"] = 8

# Point to the data to be used for fine-tuning as the training set

# Clean the torch cache

# Now you can create a PyTorch Lightning trainer.
trainer = pl.Trainer(gpus=1, max_epochs=1)

# The fit function starts the training

Step 11: Compare model performance

Compare the final model performance with the fine-tuned model that you received from training with additional data.

# Configure the model parameters for testing
params["model"]["validation_ds"]["batch_size"] = 8

# Set up the test data loader and make sure the model is on GPU
params["model"]["validation_ds"]["manifest_filepath"] = "test_manifest.json"
_ = quartznet.cuda()

# Compute the WER metric between the hypothesis and predictions.

wer_numerators = []
wer_denominators = []

# Loop over all test batches.
# Iterating over the model's `test_dataloader` gives you:
# (audio_signal, audio_signal_length, transcript_tokens, transcript_length)
# See the AudioToCharDataset for more details.
with torch.no_grad():
    for test_batch in quartznet.test_dataloader():
        input_signal, input_signal_length, targets, targets_lengths = [x.cuda() for x in test_batch]
        log_probs, encoded_len, greedy_predictions = quartznet(
        # The model has a helper object to compute WER
        quartznet._wer.update(greedy_predictions, targets, targets_lengths)
        _, wer_numerator, wer_denominator = quartznet._wer.compute()

# First, sum all numerators and denominators. Then, divide.
print(f"WER = {sum(wer_numerators)/sum(wer_denominators)*100:.2f}%")

WER = 24.36%

After training new epochs of the neural network ASR architecture, I achieved a WER of 24.36%, which is an improvement over the initial 39.7% from the base model using only one epoch for training. For better results, consider using more epochs in the training.


In this tutorial, I demonstrated how to load speech data collected by DefinedCrowd and how to use it to train and measure the performance of an ASR model. I hope I have shown you how easy it is to create world-class AI solutions with NVIDIA and DefinedCrowd.


Scaling Inference in High Energy Particle Physics at Fermilab Using NVIDIA Triton Inference Server

High-energy physics research aims to understand the mysteries of the universe by describing the fundamental constituents of matter and the interactions between them. Diverse experiments exist on Earth to re-create the first instants of the universe. Two examples of the most complex experiments in the world are at the Large Hadron Collider (LHC) at CERN … Continued

High-energy physics research aims to understand the mysteries of the universe by describing the fundamental constituents of matter and the interactions between them. Diverse experiments exist on Earth to re-create the first instants of the universe. Two examples of the most complex experiments in the world are at the Large Hadron Collider (LHC) at CERN and the Deep Underground Neutrino Experiment (DUNE) at Fermilab.

The LHC is home to the highest energy particle collisions in the world and the discovery of the Higgs boson. LHC detectors are like ultra–high-speed cameras that capture the remnants of those collisions every 25 nanoseconds to create a 5D image in space, time, and energy. LHC physicists collect huge datasets to find extremely rare events. Those events may give clues about the Higgs boson as a portal to new physics or the particle nature of dark matter.

The DUNE experiment sends a beam of particles called neutrinos from the west suburbs of Chicago to an underground mine 1,300 km away in South Dakota. There, a massive 40-kton detector is being constructed 1.5 km beneath the earth’s surface to observe these feebly interacting particles. Studying neutrinos can help us answer questions such as the origin of matter in the universe and the behavior of core-collapse supernova in the Milky Way galaxy.

These experiments consist of unique and cutting-edge particle detectors that create massive, complex, and rich datasets with billions of events. They require sophisticated algorithms to reconstruct and interpret the data.

Modern machine learning algorithms provide a powerful toolset to detect and classify particles, from familiar image-processing convolutional neural networks to newer graph neural network architectures. A full reconstruction of these particle collisions requires novel approaches to handle the computing challenge of processing so much raw data. In a series of studies, physicists from Fermilab, CERN, and university groups explored how to accelerate their data processing using NVIDIA Triton Inference Server.

Picture shows how the particle events are captured for ML processing.
Figure 2. A 6 GeV/c electron event recorded by the ProtoDUNE-SP detector (run 5770, event 59001). The x-axis shows the wire number. The y-axis shows the time tick in the unit of 0.5 μs. The color scale represents charge deposition. Source: DUNE Collaboration, JINST 15 (2020) P12004

The full offline reconstruction chain for the ProtoDUNE-SP detector is a good representative of event reconstruction in present and future accelerator-based neutrino experiments. For more information, see GPU-accelerated machine learning inference as a service for computing in neutrino experiments.

In each event, charged particles interact with the liquid argon in the detector, liberating ionization electrons that drift across the detector volume under the influence of an electric field.  These electrons induce signals as they pass through and are collected by a set of wire planes at the end of the drift path. Two spatial coordinates can be determined from the different angular orientations of the wires in each plane. The third coordinate can be determined from the drift time of the ionization electrons. As a result, a detailed 3D image of the neutrino interaction can be reconstructed.

The most computationally intensive step of the reconstruction process involves an ML algorithm that looks at 48×48 pixel cutouts, or patches. Those patches represent small sections of the full event and the algorithm identifies the particles in them. Importantly, over the entire ProtoDUNE-SP detector, there are thousands of 48×48 patches to be classified, such that a typical event may have approximately 55,000 patches to process. In the following section, we discuss the performance implications of this process and how using NVIDIA Triton Inference Server helps us to scale the deep learning inference.

Similarly, for the LHC, a series of neural networks can be used to process data from low-level cluster calibration and electron energy regression to jet (particle spray) classification.

Figure shows that the workflow is similar in LHC.
Figure 3. Calorimeter recorded hits combined into clusters. Source: Lindsey Gray, FNAL

Figure 3 shows how a similar paradigm is used for the LHC. Hits recorded by the calorimeter system are combined into clusters (zoomed-in section at right). These can then be further combined into higher-level reconstructed particle objects, such as the jet indicated at the bottom left. In simulated events such as this one, the reconstructed clusters can be related to the “truth” information from the simulation software (GEANT) to measure the accuracy of the algorithms.

Compute-intensive process

For the ProtoDUNE-SP detector, the reconstruction processing time is dominated by running convolutional neural network inference for the thousands of patches in each event. When you’re running inference on a typical CPU, this consumes 65% of the total time for reconstruction. The current dataset consists of 400 TB from hundreds of millions of neutrino events. The team decided to use NVIDIA T4 GPUs to speed up this most compute-intensive process. In the initial trial phase, they used T4 instances on Google Cloud.

In production, thousands of client nodes feed detector data (images) into the reconstruction process. The scale of computing is so large that a distributed worldwide grid of computing resources is needed. This poses challenges to coordinating and optimizing resources shared by different sites worldwide. To cope with these challenges, the team decided to use a novel inference-as-a-service computing paradigm for the first time.

Inference as a service with NVIDIA Triton Inference Server

The team implemented their generic approach, called SONIC (Services for Optimized Network Inference on Coprocessors), for inference as a service using NVIDIA Triton Inference Server. This technology is available from the NGC Catalog, a hub for GPU-optimized AI containers, models, and SDKs built to simplify and accelerate AI workflows.

NVIDIA Triton simplifies the deployment of AI models at scale in production. It’s an open-source inference serving software package that helps teams deploy trained AI models:

  • From any framework: TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework
  • From any storage: Local, Google Cloud Platform, Amazon S3, or Microsoft Azure Storage
  • On any GPU- or CPU-based infrastructure: Cloud, data center, or edge

The team deployed the NVIDIA Triton server as a container and used Kubernetes to orchestrate the various cloud resources. Each GPU server in the cluster runs an instance of the NVIDIA Triton server. The clients run on separate, CPU-only nodes and send inference requests using gRPC over the network. Kubernetes handles load balancing and resource scaling for the GPU cluster.


The use of T4 GPUs resulted in a 17x speed-up of the most time-consuming ML module of the workflow: track and particle shower hit identification. Overall workflow (event processing time) was accelerated by a factor of 2.7x.

The following are key benefits that the team achieved:

  • No disruption. The workflow was accelerated without disruption to any of the other algorithms or experiment software.
  • Allocation flexibility. In this deployment, many client nodes sent requests to a single GPU. This allowed heterogeneous resources to be allocated and reallocated based on demand and task, providing significant flexibility and potential cost reduction.
  • Reduced dependencies. There’s a reduced dependency on open-source ML frameworks in the experimental code base. Otherwise, the experiment would be required to integrate and support separate C++ APIs for every framework in use.
  • Concurrent use. NVIDIA Triton also used all available GPUs automatically when the servers had multiple GPUs, further increasing the flexibility of the server. In addition, NVIDIA Triton can execute multiple models from various ML frameworks concurrently.
  • Dynamic batching. NVIDIA Triton provides dynamic batching, which combines multiple requests into optimally sized batches to perform inference as efficiently as possible for the task at hand. This effectively enables simultaneous processing of multiple events without any changes to the experiment software framework.
To understand the relative position of the various components in the service
Figure 4. Architecture diagram of the NVIDIA Triton-based inference as a service.

To scale the NVIDIA T4 GPU throughput flexibly, we used a Google Kubernetes Engine (GKE) cluster for server-side workloads. Kubernetes Ingress was used as a load-balancing service to distribute incoming network traffic among the NVIDIA Triton pods. Prometheus-based monitoring was used for the following:

  • System metrics from the underlying virtual machine
  • Kubernetes metrics for the overall health and state of the cluster
  • Inference-specific metrics gathered from NVIDIA Triton through a built-in Prometheus publisher

All metrics were visualized through a Grafana instance, also deployed within the same cluster. The team kept the pod-to-node ratio at 1:1 throughout the studies, with each pod running an instance of NVIDIA Triton Inference Server (v20.02-py3) from NGC. The throughput was maximized when 68 CPU client processes sent requests to a single remote GPU. The exact ratio depends on the algorithm and workflow.


The offline neutrino reconstruction workflow was accelerated by deploying ML models on NVIDIA T4 GPUs. NVIDIA Triton and Kubernetes helped the team implement inference as a scalable service in a flexible and cost-effective way. Though we focused on a result specific to neutrino physics, a similar result was achieved for the LHC and constitutes a successful proof of concept. These results pave the way for deploying DL inference as a service at scale in high energy physics experiments.

For more information, see the following resources:


We would like to thank, globally, the multi-institutional team that performed these neutrino and LHC studies. For more information about their work, see Featured image of Protodune detector taken by Maximilien Brice from CERN.


Inception Spotlight: Assaia AI Ready for Takeoff at Kentucky Airport

Switzerland-based Assaia International AG is deploying a deep learning solution at Cincinnati/Northern Kentucky International Airport (CVG) to help airport employees monitor the turnaround time between flights.

Switzerland-based Assaia International AG, an NVIDIA Metropolis partner and member of the NVIDIA Inception acceleration platform for AI startups, is deploying a deep learning solution at Cincinnati/Northern Kentucky International Airport (CVG) to help airport employees monitor the turnaround time between flights. 

The Turnaround Control tool will help the airport work with its airline partners to improve turnaround transparency, identify situations that most often cause delayed flights, and notify employees of deviations from the schedule. 

“Assaia’s technology adds critical data points to CVG’s early-stage neural network for operational advancements,” said Brian Cobb, the airport’s chief innovation officer. “Structured data generated by artificial intelligence will provide information to make decisions, optimize airside processes, and improve efficiency and safety.”

The company uses NVIDIA Jetson AGX Xavier modules and the NVIDIA Metropolis intelligent video analytics platform to run image recognition and predictive analysis algorithms on video streams from multiple cameras around an airport. 

By installing cameras at several gates, airports can optimize the cleaning, restocking and servicing of planes — saving time for customers and costs for the airlines.

Assaia is also deploying AI solutions at London Gatwick Airport and Seattle-Tacoma International Airport. Watch a replay from the recent GPU Technology Conference for more:



Around the World in AI Ways: Video Explores Machine Learning’s Global Impact

You may have used AI in your smartphone or smart speaker, but have you seen how it comes alive in an artist’s brush stroke, how it animates artificial limbs or assists astronauts in Earth’s orbit? The latest video in the “I Am AI” series — the annual scene setter for the keynote at NVIDIA’s GTC Read article >

The post Around the World in AI Ways: Video Explores Machine Learning’s Global Impact appeared first on The Official NVIDIA Blog.


nvCOMP v2.0.0 Now Available: With New Compressors

Today, NVIDIA is announcing the availability of nvCOMP version 2.0.0. This software can be downloaded now free for members of the NVIDIA Developer Program.

Today, NVIDIA is announcing the availability of nvCOMP version 2.0.0. This software can be downloaded now free for members of the NVIDIA Developer Program.

Download Now

What’s New

  • Low-level light-weight C interface for expert users featuring batch compression/decompression support and fully asynchronous execution. 
  • High-level C/C++ interfaces for ease of use.
  • Remove old interfaces.
  • Added support for Snappy, Bitcomp, and GDeflate compressors

See the nvCOMP Release Notes for more information

About nvCOMP

nvCOMP is a CUDA library that features generic compression interfaces to enable developers to use high-performance GPU compressors in their applications.

nvCOMP 2.0.0 includes Cascaded, LZ4, and Snappy compression methods. It also adds support for the external Bitcomp and GDeflate methods. Cascaded compression methods demonstrate high performance with up to 500 GB/s throughput and a high compression ratio of up to 80x on numerical data from analytical workloads. Snappy and LZ4 methods can achieve up to 100 GB/s compression and decompression throughput depending on the dataset, and show good compression ratios for arbitrary byte streams.

Learn more:

Recent Developer Blog posts:

Optimizing Data Transfer Using Lossless Compression with NVIDIA nvcomp


NVIDIA Sets AI Inference Records, Introduces A30 and A10 GPUs for Enterprise Servers

NVIDIA AI Platform Smashes Every MLPerf Category, from Data Center to EdgeSANTA CLARA, Calif., April 21, 2021 (GLOBE NEWSWIRE) — NVIDIA today announced that its AI inference platform, newly …


Accelerating the Wide & Deep Model Workflow from 25 Hours to 10 Minutes Using NVIDIA GPUs

Recommender systems drive engagement on many of the most popular online platforms. As data volume grows exponentially, data scientists increasingly turn from traditional machine learning methods to highly expressive, deep learning models to improve recommendation quality. Often, the recommendations are framed as modeling the completion of a user-item matrix, in which the user-item entry is … Continued

Recommender systems drive engagement on many of the most popular online platforms. As data volume grows exponentially, data scientists increasingly turn from traditional machine learning methods to highly expressive, deep learning models to improve recommendation quality. Often, the recommendations are framed as modeling the completion of a user-item matrix, in which the user-item entry is the user’s interaction with that item.

Most current online recommender systems are implicit rating-based, clickthrough rate (CTR) prediction tasks. The model estimates the probability of positive action (click), given user and item characteristics. One of the most popular DNN-based methods is Google’s Wide & Deep Learning for Recommender Systems, which has emerged as a general tool for solving CTR prediction tasks, thanks to its power of generalization (Deep) and memorization (Wide).

The Wide & Deep model falls into a category of content-based recommender models that are like Facebook’s deep learning recommendation model (DLRM), where input to the model consists of characteristics of the User and Item and the output is some form of rating.

In this post, we detail the new TensorFlow2 implementation of the Wide & Deep model that was recently added to the NVIDIA Deep Learning Examples repository. It provides the end-to-end training for easily reproducible results in training the model, using the Kaggle Outbrain Click Prediction Challenge dataset. This implementation touches on two important aspects of building recommender systems: dataset preprocessing and model training.

First, we introduce the Wide & Deep model and the dataset. Then, we give details on the preprocessing completed in two variants, CPU and GPU. Finally, we discuss aspects of model convergence, training stability, and performance, both for training and evaluation.

Wide & Deep model overview

Wide & Deep refers to a class of networks that use the output of two parts working in parallel—a wide model and a deep model—to make binary prediction of CTR. The wide part is a linear model of features together with their transforms, responsible for the memorization of feature interactions. The deep part is a series of fully connected layers, allowing the model better generalization for unseen cross-features interactions. The model can handle both numerical continuous features as well as categorical features represented as dense embeddings. Figure 1 shows the architecture of the model. We changed the size of the deep part from the original of 1024, 512, 256 into five fully connected layers of 1024 neurons.

Architecture shows categorical features and numerical features of the model.
Figure 1. The architecture of the Wide & Deep model.

Outbrain dataset

The original Wide & Deep paper trains on the Google Play dataset. Because this data is proprietary to Google, we chose a publicly available dataset for easy reproduction. As a reference dataset, we used the Kaggle Outbrain Click Prediction Challenge data. This dataset is preprocessed to obtain a subset of the features engineered by the 19th-place finisher in the Kaggle Outbrain Click Prediction Challenge. This competition challenged competitors to predict the likelihood of a clickthrough for a particular website ad. Competitors were given information about the user, display, document, and ad to train their models. For more information, see Outbrain Click Prediction.

The Outbrain dataset is preprocessed to get feature input for the model. Each sample in the dataset consists of features of the Request (User) and Item, together with a binary output label. Request-level features describe the person and context to which to make recommendations, whereas Item-level features describe those objects to consider recommending. In the Outbrain dataset, these are ads. Request– and Item-level features contain both numerical features that you can input directly to the network. Categorical variables are represented as trainable embeddings of various dimensions. For more information about feature counts, cardinalities, embedding dimensions, and other dataset characteristics, see the WideAndDeep readme file on GitHub.


As in every other recommender system, preprocessing is a key for efficient recommendation here. We present and compare two dataset preprocessing workflows: Spark-CPU and NVTabular GPU. Both produce datasets of the same number, type, and meaning of features so that the model is agnostic to the type of dataset preprocessing used. The presented preprocessing aims to produce the dataset in a form of pre-batched TFRecords to be consumed by the data loader during model training.

Scope of preprocessing

The preprocessing is described in detail in the readme of the Deep Learning Examples repository. In this post, we give only the outlook on the scope of data wrangling to create the final 26 features: 13 categorical and 13 numerical, obtained from the original Outbrain dataset. Both of the workflows consist of the following operations:

  • Separating out the validation set for cross-validation.
  • Filling missing data with mode, median, or imputed values.
  • Joining click data, ad metadata, and document category, topic, and entity tables to create an enriched table.
  • Computing seven CTRs for ads grouped by seven features.
  • Computing the attribute cosine similarity between the landing page and featured ad.
  • Math transformations of the numeric features (logarithmic, scaling, and binning).
  • Categorizing data using hash-bucketing.
  • Storing the resulting set of features in pre-batched TFRecord format.

Comparison of preprocessing workflows

To compare the NVTabular and Spark workflows, we built both from a known-good Spark-CPU workflow, included in the NVIDIA Wide & Deep TensorFlow 2 GitHub repository. For simplicity, we limited the number of dataset features to calculate during preprocessing. We chose the most common ones used in recommender systems that both workflows (Spark and NVTabular) support. Because NVTabular is a relatively new framework still in active development, we limited the scope of comparison to features supported by the NVTabular library.

When comparing Spark and NVTabular, we extracted the most important metrics that influence the choice of framework in target preprocessing. Table 1 presents a snapshot comparison of two types of preprocessing using the following metrics:

  • Threshold result. The necessity of achieving MAP@12 greater than the arbitrary chosen threshold of 0.655 for the Outbrain dataset.
  • Source code lines. The lines needed to achieve the set of features that the model uses for training. This single metric tries to capture how difficult it is to create and maintain the production code. It also gives an intuition about the level of difficulty when experimenting with adding new dataset features or changing existing ones.
  • Total RAM consumption. This estimates the size and type of machine needed to perform preprocessing.
  • Preprocessing time that is critical for recommender systems. In production environments where you must retrain the model with new data, this metric is strictly bounded with the necessity of dataset preprocessing. You can remove a too-long preprocessing time for some applications. When that time is short, you can even include the preprocessing in end-to-end training. Test the hypothesis of variable importance with the hyperparameter tuning of the network.

We did not enforce 1:1 parity between the datasets, as convergence accuracy proves the validity of the features.

  CPU preprocessing: Spark on NVIDIA DGX-1 CPU preprocessing: Spark on NVIDIA DGX A100 GPU preprocessing: NVTabular on DGX-1 8-GPU GPU preprocessing: NVTabular DGX A100 8-GPU
Lines of code ~1,500 ~1,500 ~500 ~500
Top RAM consumption [GB] 167.0 223.4 48.7 50.6
Top VRAM consumption per GPU [GB] 0 0 13 67
Preprocessing time [min] 45.6 38.5 3.9 2.3
Table 1. A comparison of CPU preprocessing (Spark) and GPU preprocessing (NVTabular).

Convergence accuracy

On the chosen metric for the Outbrain dataset, Mean Average Precision at 12 (MAP@12), both the features produced by Spark-CPU and NVTabular achieve similar convergence accuracy, MAP@12>0.655.

Hardware requirements

You can run the NVTabular and Spark-CPU versions on DGX-1 V100 and DGX A100 supercomputers. Spark-CPU consumes around 170 GB of RAM while the RAM footprint of NVTabular is about 3x smaller. NVTabular can run successively even on a single-GPU machine and still be an order of magnitude faster than Spark-CPU without the need of memory-optimized computers.

End-to-end preprocessing time

The end-to-end preprocessing time is 17x faster on GPU for DGX A100 and 12x on GPU on DGX-1 comparing Spark CPU and NVTabular GPU preprocessing.

Bar chart shows that NVTabular processing time is much faster than Spark.
Figure 2. Outbrain dataset preprocessing time comparison.

Code brevity and legibility

The Spark code to generate the features spans over approximately 1,500 lines, while the NVTabular code is about 500 lines. The brevity in the NVTabular workflow also lends itself to legibility, as fewer lines of code and descriptive function signatures make it obvious what a given line is trying to accomplish.

The following list contains samples with side-to-side comparisons of Spark and NVTabular, showing the increase of code brevity and legibility in favor of NVTabular. The operation used in both is taking the TF-IDF cosine similarity between an ad and its landing page.

Model training and evaluation

Training the model is analyzed based on the following criteria:

  • Reaching evaluation metrics.
  • Fast and stable training (forward and backward pass): Constantly reaching an evaluation metric not dependent on initialization, hardware architecture, or other training features.
  • Fast scoring of the evaluation set: Reaching performance throughput to mimic model’s behavior in production.

We used the Mean Average Precision at 12 (MAP@12) metric, the same as used in the original Outbrain Kaggle competition. Direct comparison of the obtained accuracies is unjustified because, for the original Kaggle competition, there were data leaks that could be artificially used for post-processing of model results, resulting in higher MAP@12 score.

Figure 3. Learning curve of Wide & Deep model for NVTabular dataset for multiple training setups.

As there are multiple options for setups—two hardware architectures (A100 and V100), multiple floating-point number precisions (FP32, TF32, AMP), two versions of preprocessing (Spark-CPU and NVTabular), and the XLA optimizer, it is essential to be sure that the convergence is achieved in each setup. We performed multiple stability tests for accuracy that prove achieving MAP@12 above the selected threshold, regardless of training setup, training stability, and the impact of AMP on accuracy.

Training performance results

As stated earlier, we wanted the model to be fast in training. You can measure this in two ways: by the model throughput [samples/s] and time to train [min]. When training on CPU compared to GPU, you can experience speedups up to 108x for the NVIDIA Ampere Architecture and TF32 precision (Figure 4).

Speedup of CPU vs GPU training. CPU is 1x.
Figure 4. Training throughput speedup of GPU vs. CPU training.
Speedup of CPU vs GPU training is up to over 100x.
Figure 5. Training throughput for DGX-1 V100 and DGX A100 for AMP and FP32/TF32 precision for 1, 4, and 8 GPUs. Shown numbers are for XLA on.

Single-GPU configurations experience up to 1.2x speedup while using AMP for Ampere architecture. This number is even better for Volta, where the speedup is over 3x. Introducing multi-GPU training in a strong scaling mode ends up in speedups of 1.2x–4.6x in comparison to single-GPU training. Comparison of Ampere and Volta architectures for FP32 and TF32 training, respectively, shows a speedup of 2.2x (single GPU) to 4.5x (eight GPUs). Ampere is also 1.4x– 1.8x faster than Volta for AMP training. Bearing in mind that you don’t lose any accuracy with AMP, XLA and multi-GPU, this brings a huge value to recommender systems models.

Training time improves significantly when training on GPU in comparison to CPU and for best configuration is faster over 100x. TFRecords dataset consumes around 40GB of disk space. For best configuration of training (8x A100, TF32 precision, XLA on) this implementation of the Wide & Deep model performs a 20-epoch training within eight minutes, resulting in less than 25[s] per epoch during training.

Evaluating performance results

Having a model that trains with such throughput is beneficial. In fact, in offline scenarios, another parameter is important: how fast you can evaluate all pairs of users and items. If you have 106 distinct users and only 103 distinct items, that gives you 109 different user-item pairs. Fast evaluation on training models is a key concept. Figure 6 shows the evaluation performance for A100 and V100 varying batch size.

Evaluation throughput scales with batch size for all setups
Figure 6. Evaluation throughput for DGX-1 V100 and DGX A100 for AMP and FP32/TF32 precision for 1, 4, and 8 GPUs. Shown numbers are for XLA on.

Recommendation serving usually reflects the scenario that a single batch contains scoring all items for a single user. Using the presented native evaluation, you might expect over 1,000 users scored for 4,096 batch (items) for eight GPUs A100 in TF32 precision.

End-to-end training

We define the end-to-end training time to be the entire time to preprocess the data and train the model. It is important to account for these two steps, because the feature engineering steps and training with accuracy measurement are repeated. Shortening the end-to-end training is the equivalent of bringing the model to production faster or performing more experiments at the same time. With GPU preprocessing, you can experience a massive decrease in end-to-end training (Figure 7).

Four bar charts with the difference between preprocessing and preprocessing with model training, between one GPU and eight GPUs.
Figure 7. The comparison of end-to-end training (preprocessing with model training) for DGX-1 (1st row) and DGX A100 (2nd row) for Spark-CPU preprocessing and model training (1st column) and NVTabular GPU preprocessing. Orange bar accounts for preprocessing, blue one is the preprocessing with model training.

For both DGX-1 and DGX A100, the speedup in end-to-end training is tremendous. Because this setup involved training the model on GPU for both Spark and NVTabular, the speedup comes from the preprocessing steps. It results in up to 3.8x faster end-to-end training for DGX-1 and up to 5.4x for DGX A100. When using GPU for preprocessing the fraction, the important aspect is the decreased time that the preprocessing step takes in total end-to-end training, from ~75% for Spark GPU down to ~25% for NVTabular.


In this post, we demonstrated an end-to-end preprocessing and training pipeline for the Wide & Deep model. We showed you how to get at least a 10x reduction in dataset preprocessing time using GPU preprocessing with NVTabular. Such an incredible speedup enables you to quickly verify your hypotheses about the data and bring new features to production.

We also showed the stability of training while reaching the evaluation score of MAP@12 for multiple training setups:

  • NVIDIA Ampere Architecture
  • NVIDIA Volta Architecture
  • Multi-GPU training
  • AMP training
  • XLA

Thanks to the great speedup that these features provide, you can train on the 8-GB dataset in less than 25s/epoch. The model throughput compared to CPU is over 100x higher on GPU. Finally, we showed the evaluation throughput that achieves 21Mln [samples] from a model checkpoint.

Future work for the Wide & Deep TensorFlow 2 implementation will concentrate on inference in Triton Server, improving the data loader to support parquet input files, and upgrading preprocessing in NVTabular to a recently released API version.

We encourage you to check our implementation of the Wide & Deep model in the NVIDIA DeepLearningExamples GitHub repository. In the comments, please tell us how you plan to adopt and extend this project.


Flexible, Scalable, Differentiable Simulation of Recommender Systems with RecSim NG

Recommender systems are the primary interface connecting users to a wide variety of online content, and therefore must overcome a number of challenges across the user population in order to serve them equitably. To this end, in 2019 we released RecSim, a configurable platform for authoring simulation environments to facilitate the study of RL algorithms (the de facto standard ML approach for addressing sequential decision problems) in recommender systems. However, as the technology has progressed, it has become increasingly important to address the gap between simulation and real-world applications, ensuring that models are flexible and easily extendible, enabling probabilistic inference of user dynamics, and addressing computational efficiency.

To address these issues, we recently released RecSim NG, the “Next Generation” of simulators for recommender systems research and development. RecSim NG is a response to a set of use cases that have emerged as important challenges in the application of simulation to real-world problems. It addresses the gap between simulation and real-world applications, ensures the models are flexible and easily extendible, enables probabilistic inference of user dynamics, and addresses computational efficiency.

Overview of RecSim NG
RecSim NG is a scalable, modular, differentiable simulator implemented in Edward2 and TensorFlow. It offers a powerful, general probabilistic programming language for agent-behavior specification.

RecSim NG significantly expands the modeling capabilities of RecSim in two ways. First, the story API allows the simulation of scenarios where an arbitrary number of actors (e.g., recommenders, content consumers, content producers, advertisers) interact with one another. This enables the flexible modeling of entire recommender ecosystems, as opposed to the usual isolated user-recommender interaction setting. Second, we introduced a library of behavioral building blocks that, much like Keras layers, implement well-known modeling primitives that can be assembled to build complex models quickly. Following the object-oriented paradigm, RecSim NG uses entity patterns to encapsulate shared parameters that govern various agent behaviors, like user satisfaction, and uses templates to define large populations of agents concisely in a way that abstracts agent “individuality” without duplicating invariant behaviors.

Apart from the typical use of simulators to generate Monte Carlo samples, RecSim NG directly enables various other forms of probabilistic reasoning. While domain knowledge and intuition are key to modeling any recommendation problem, the simulation fidelity needed to bridge the so-called “sim2real” gap can only be achieved by calibrating the simulator’s model to observed data. For data-driven simulation, RecSim NG makes it easy to implement various model-learning algorithms, such as expectation-maximization (EM), generative adversarial training, etc.

Also available within RecSim NG are tools for probabilistic inference and latent-variable model learning, backed by automatic differentiation and tracing. RecSim NG exposes a small set of Edward2 program transformations tailored to simulation-specific tasks. Its log-probability module can evaluate the probabilities of trajectories according to the probabilistic graphical model induced by the simulation. This, together with the automatic differentiation provided by the TensorFlow runtime, enables the implementation of maximum-likelihood estimation and model learning within the simulation itself. RecSim NG can readily use the Markov-chain Monte Carlo (MCMC) machinery provided by TensorFlow Probability to power posterior inference and latent-variable model learning. For example, a simulation model that describes how latent user attributes (e.g., preferences, intents, satisfaction) are translated into observational data (e.g., clicks, ratings, comments) can be “run in reverse,” that is, real observational data generated by a recommender system can be used to identify the most likely configuration of latent user attributes, which in turn can be used to assess the quality of the user experience. This allows for a simulation model to be integrated directly into the full data-science and model-development workflow.

Assessing recommender ecosystem health, i.e., the long-term impact of recommendation strategies on aspects such as overall satisfaction, collective fairness, and safety, requires the simulation of large multi-agent systems in order to plausibly reproduce the interactions between the different participants of the ecosystem. This, along with the computational load of probabilistic inference tasks, requires an efficient simulation runtime. For computational performance, RecSim NG offers a TensorFlow-based runtime for running simulations on accelerated hardware. The simulation takes advantage of all optimizations offered by TensorFlow’s AutoGraph compiler, including accelerated linear algebra (XLA) if available. The simulation will automatically exploit all available cores on the host machine as well as specialized hardware (if run accordingly), such as Tensor Processing Units (TPUs). The core RecSim NG architecture is back-end independent, enabling applications to be developed within other computational frameworks (such as JAX or PyTorch).

Ecosystem Modeling as an Application
To demonstrate the capabilities of RecSim NG, we present a very simplified model of multi-agent interactions among users and content providers in a stylized recommender ecosystem1. The simulation captures the dynamics of a recommender system that mediates the interaction between users and content providers by recommending slates of those providers’ content items to users over time. We adopt a simplified user model whereby each user is characterized by a static, observable “user interest vector.” This vector determines a user’s affinity with a recommended item, which are then used as inputs to a choice model that determines a user’s item selection from a recommended slate. A user’s utility for any selected item is simply their affinity for the item, perturbed with Gaussian noise.

The aim of the recommender is to maximize cumulative user utility, over all users, over a fixed horizon. However, interesting ecosystem effects make this challenging, and emerge because of content provider behavior. Like users, each provider has an “interest vector” around which the content items it makes available are centered, reflecting that provider’s general expertise or tendencies. Providers have their own incentives for making content available: their utility is measured by the number of their items selected by any user over the recent past. Moreover, providers with higher utility generate or make available a greater number of items, increasing the “catalog” from which users (and the recommender) can choose.

We compare two different recommender policies in this setting. The first is a standard “myopic” policy that, for any user, always recommends the items that have the greatest predicted affinity for that user. Under such a policy, the behavior of providers has the potential to give rise to “rich-get-richer” phenomena: providers that initially attract users produce more items at subsequent periods, which increases the odds of attracting even further future engagement. This gradual concentration of available items around “mainstream” content providers has a negative impact on overall user utility over time. The second recommender policy is aware of these provider dynamics, which it counteracts by promoting under-served providers.2 While a simple heuristic, the provider-aware policy increases overall user utility over extended horizons.

The number of agents in the simulation is large and we templatize both users and content providers with reusable modeling blocks offered by RecSim NG. Determining how to execute the simulation in parallel is non-trivial, so it is critical to utilize TF’s AutoGraph and other computational optimizations.

Our hope is that RecSim NG will make it easier for both researchers and practitioners to develop, train and evaluate novel algorithms for recommender systems, especially algorithms intended to optimize system behavior over extended horizons, capture complex multi-agent interactions and incentives, or both. We are also investigating the release of increasingly realistic user models that can serve as benchmarks for the research community, as well as methods that can facilitate “sim2real” transfer using RecSim NG.

Further details regarding the RecSim NG framework can be found in the associated white paper, while code and colabs/tutorials are available here. A video about RecSim NG presented at RecSys-2020 is shown below:

We thank our collaborators and early adopters of RᴇᴄSɪᴍ NG, including the other members of the RecSim NG team: Vihan Jain, Eugene Ie, Chris Colby, Nicolas Mayoraz, Hubert Pham, Dustin Tran, Ivan Vendrov and Craig Boutilier.

1 This model is a much simpler version of that presented in this ICML-20 paper

2 This simple heuristic policy is used only to demonstrate RecSim NG’s capabilities. More sophisticated algorithms that compute policies that explicitly maximize long-term user utility are discussed in this ICML-20 paper


Aligning Time Series at the Speed of Light

To say it with the words of Eamonn Keogh: “Time series is a ubiquitous and increasingly prevalent type of data […]”. Virtually any incrementally measured signal, be it along a time axis or a linearly ordered set, can be treated as time series. Examples include electrocardiograms, temperature or voltage measurements, audio, server logs, but also … Continued

To say it with the words of Eamonn Keogh: “Time series is a ubiquitous and increasingly prevalent type of data […]”. Virtually any incrementally measured signal, be it along a time axis or a linearly ordered set, can be treated as time series. Examples include electrocardiograms, temperature or voltage measurements, audio, server logs, but also heavy-weight data such as video and time-resolved MRI volumes. Hence, the efficient yet exact processing of the ever-increasing amount of time series data is crucial for every data scientist.

In this blog, we introduce rapidAligner – a CUDA-accelerated library to align a short time series snippet (query) in an exceedingly long stream of time series (subject) using the following three popular lock-step measures for the local alignment of uniformly sampled time series:

  • Rolling Euclidean distance (sdist)
  • Rolling mean-adjusted Euclidean distance (mdist)
  • Rolling mean and amplitude-adjusted Euclidean distance (zdist)

The rapidAligner library is free software that can be integrated with a broad variety of popular data science and machine learning frameworks such as NumPy, CuPy, RAPIDS, Numba, and Pytorch. The source code is publicly available under NVIDIA rapidAligner.

The rest of the article is structured as follows: Section one provides a brief introduction on popular lock-step measures and (local) normalization techniques. Section two demonstrates the usage of the rapidAligner library. Section 3 concludes this blog post.

A brief introduction to time series data mining

Time series are sequences of pairs (t[i], x[i]) where the real-valued time stamps t[i] are linearly ordered and their corresponding values x[i] are quantities measured at time t[i]. If all timestamps are equally spaced, i.e., t[i+1]-t[i] = const for all i, then you can neglect time and call the sequence of measurements x[i] a uniformly sampled time series. In the following, we will simply refer to uniformly sampled time series with real-valued scalars x[i] as time series without fancy attributes.

Assume you want to compare two time series Q=(q[0], q[1], …, q[m-1]) and  S=(s[0], s[1], …, s[m-1]) of same length |Q|=|S|=m. An obvious way would be to interpret Q and S as m-dimensional vectors and compute the Lp norm of their difference.

Popular choices for the parameter p are p=2 for so-called Euclidean distance and p=1 for so-called Manhattan or taxicab distance (see Figure 1). In this blog post, we address similarity measures that compare residues q[i]-s[i] using a one-to-one assignment i->i of indices – so-called lock-step measures. In a future post, we will discuss CUDA-accelerated measures using dynamic assignments of indices such as q[i]-s[j], also known as the class of elastic measures.

: A plot consisting of two graphs depicting two similar and approximately aligned heart beats from an electrocardiogram (ECG) measurement with straight lines indicating the one-to-one correspondence between indices.
Figure 1: Two electrocardiogram (ECG) measurements Q (blue signal) and S (orange signal) both of length |Q|=|S|=421 and their index-wise residues (grey vertical bars) down-sampled by a factor of 4.

However, when aligning a short query Q of length |Q|=m in a long stream S of length |S|=n, i.e., 0

For each alignment position j you have to sum over m contributions. As a result, the asymptotic worst-case complexity to compute all lock-step alignments is proportional to the product of the time series lengths m and n — O((n-m+1) * m) to be precise. This number can be huge even for moderately sized queries and streams which may render large scale time series alignment computationally intractable when performed in a naïve way. In Section 3 we will discuss for the special case p=2 how to implement a CUDA-accelerated scheme which runs in blazingly fast log-linear time.

When looking at larger portion of an ECG stream (see Figure 2) you may observe a temporal drift in the average signal value, also known as baseline wandering. This artifact often occurs in continuously measured time series and may be caused by a broad variety of external factors such as change of skin conductivity due to sweat in ECGs, body movement affecting the electrodes in ECGs, drift of electric resistance and thus voltage due to temperature variation in power supplies, temperature drift when recording environmental quantities, seasonal effects such as Christmas, or the temporal drift of stock prices amidst a global pandemic.

A plot consisting of two graphs depicting a short heartbeat sequence aligned in a much longer stream of continuously measured heartbeats. The locally averaged stream values are drifting over time.
Figure 2: A short ECG query Q (blue signal) aligned in a longer stream S of heartbeats (orange signal) using Euclidean distance as rolling similarity measure. Note the temporal drift in the values of S.

Baseline wandering is problematic when mining a stream for similar shapes – two similar shapes with different offsets on the measurement axis may have a larger distance than two dissimilar ones with similar offsets. A surprisingly simple and effective countermeasure is to introduce a normalization procedure for the query and candidate sequences. As an example, you could compute the mean value of the query muQ and for each of the n-m+1 candidate sequences muS[j] to remove the offset in the corresponding window (see Figure 3). In the following, we will call locally mean-adjusted rolling Euclidean distance mdist:

A plot depicting two signals on with non-vanishing mean on the left and the same signal translated to have vanishing mean on the right to visualize mean adjustment.
Figure 3: A heartbeat with a non-vanishing mean (blue signal on the left) and its mean adjusted variant (orange signal on the right).

A closer look at Figure 1 and Figure 2 further reveals a temporal variation in amplitudes. The range of values in the blue query is significantly smaller than the amplitude of the orange candidate sequence in Figure 1. Temporal drift in the scale might lead to meaningless matches when mining shapes.  A straightforward solution is to normalize the scale by dividing the values by the standard deviation of the query sigmaQ and alignment candidates sigmaS[j], respectively. The proposed mean and amplitude adjustment is called z-normalization referring to z-scores of normal random variables with vanishing mean and unit variance (see Figure 4). The corresponding rolling measure shall be called zdist:

The library rapidAligner supports the CUDA-accelerated computation of the three aforementioned rolling measures sdist, mdist, and zdist in a massively parallel fashion. In the next section, you will see its simple usage from within JupyterLab.

rapidAligner in action

Let’s start crunching numbers. In this section, you will align a single heartbeat in a 22 hour ECG stream using the three discussed measures sdist, mdist, and zdist. The data set is part of the experiments listed on the  website of the award-winning UCR-Suite. After cloning the rapidAligner repository you immediately import the rapidAligner library alongside with CuPY, NumPy, and Matplotlib for later validation and visualization.

In the next step, you load the ECG data. The length of the query is rather short with 421 entries, yet the stream exhibits roughly 20 million time ticks. A first inspection of the full query (blue) and the initial 1000 values of the stream (orange) reveals temporal drift in both offsets and amplitudes.

In the following, you align the query in the stream using the sdist measure computing all 20,140,000-421+1 distance scores a subsequent argmin-reduction to determine the best alignment position. For our experiments we choose a single A100 GPU in a DGX A100 server. Note that if you are only interested in the best match one could further accelerate the already rapid computation by employing a lower-bound cascade as demonstrated here. In contrast, rapidAligner’s sdist call returns alignment scores for all positions to allow for later processing such as computing a non-overlapping partition of ranked matches. We further repeat the alignment a few times for robust runtime measurement. Both compute modes “fft” and “naive” are blazingly fast and return indistinguishable results:

  • “naive”: all n-m+1 alignment candidates are (optionally) normalized and compared individually with a minimal memory footprint but O(n*m) asymptotic computational complexity. This mode is still reasonably fast using warp-aggregated statistics and accumulation schemes.
  • “fft”: If m > log_2(n), we can exploit the Convolution Theorem to accelerate the computation significantly resulting in O(n * log n) runtime but a higher memory footprint. This compute mode is fully independent of the query’s length and thus advisable for large input. The higher memory usage is mainly caused by computationally fast but out-of-place primitives such as CUDA-accelerated Fast Fourier Transforms and Prefix Scans.

Both the query and stream (subject) are stored as plain NumPy arrays in double precision on the CPU. As a result, the measured runtimes include costly memory transfers between the CPU and GPU. rapidAligner allows for seamless interoperability with all CUDA array interface compliant frameworks such as PyTorch, CuPy, Numba, RAPIDS, and Jax. Hence, you can further reduce the runtime when caching the data in fast GPU RAM to avoid unnecessary memory movement between CPU and GPU. Now, it becomes even more obvious that the Fourier based mode outperforms the naïve one even on short queries

That corresponds to a whopping 2.5 billion full alignments per second on a single GPU. In the following, we report runtimes exclusively using input data residing on the GPU. The match produced by sdist is already a good one but let us check if mean-adjustment improves the result:

As expected, mean adjustment returns a better match since now candidates are considered with a non-vanishing relative offset along the measurement axis. The execution times remain low with 10 ms for 20 million alignment positions. That corresponds to 2 billion full alignments per second. You could still improve the amplitude mismatch using zdist:

Et voilà, z-normalized rolling Euclidean distance reveals a doppelgänger in the database which is almost indistinguishable from the query. Performance remains high in more than 1.6 billion alignments per second. A stunning property of the Fourier mode is that the runtime is effectively independent of the query length, i.e., the runtime is constant for fixed stream length and varying query size. This becomes handy when aligning extra-ordinarily long queries.


Finding shapes in long streams of time series data is a computationally demanding task frequently occurring as a standalone routine or embedded as a subroutine in high-level algorithms, e.g., for the detection of anomalies. Hence, massively parallel accelerators with their unprecedented memory bandwidth such as the NVIDIA A100 GPU are ideally suited to address this challenge. rapidAligner is lightweight library processing billions of alignments per second while supporting common normalization modes for the candidate sequences. You can further employ highly optimized FFT routines cuFFT and prefix scans CUB from the CUDA-X software stack to provide an alignment mode that is independent from the query length. The source code and notebooks are publicly available under NVIDIA rapidAligner

Happy time series mining!


Training MaskRCNN Models on Custom Dataset with Datature

Training MaskRCNN Models on Custom Dataset with Datature submitted by /u/xusty
[visit reddit] [comments]