Continuously Improving Recommender Systems for Competitive Advantage Using NVIDIA Merlin and MLOps

Recommendation systems must constantly evolve through the digestion of new data or algorithmic improvements of the model for its recommendations to stay effective and relevant. In this post, we focus on how NVIDIA Merlin components fit into a complete MLOps pipeline to operationalize a recommendation system, and continuously deliver improvements in production

Recommender systems are a critical resource for enterprises that are relentlessly striving to improve customer engagement. They work by suggesting potentially relevant products and services amongst an overwhelmingly large and ever-increasing number of offerings. NVIDIA Merlin is an application framework that accelerates all phases of recommender system development on NVIDIA GPUs, from experimentation (data processing, data loading, and model training) to production deployment either on-premises or in-cloud.

The term recommender systems implies that they are not just a mere model but an entire pipeline. It is important that all pieces work together like a well-oiled machine. More importantly, these are dynamic systems that need to constantly evolve and adapt (through digestion of new data or algorithmic improvement of the model). The ability to quickly and continuously integrate and deliver these improvements into production is critical for the recommendation system to stay effective.

According to Google Cloud, MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). MLOps takes both its name as well as some of the core principles and tooling from DevOps. This makes sense as the goals of MLOps and DevOps are practically the same: to reduce the time and effort required to develop, deploy, and maintain high-quality ML software in production.

In this post, we focus on how Merlin components fit into a complete MLOps pipeline and demonstrate with a hands-on example deployed with KubeFlow Pipelines on Google Kubernetes Engine (GKE). When we use the term Merlin MLOps in this post, we mean the act of operationalizing Merlin with MLOps tools and practices.

Reference architecture: MLOps for Merlin 

Here’s a quick review of the Merlin components, as well as different levels of MLOps. The Merlin application framework supports all phases of recommender system development on the GPUs.

  • Data preprocessing and feature engineering: Merlin NVTabular is a high-performance library designed for processing terabyte-scale tabular datasets. It scales seamlessly from single to multi-GPU systems. 
  • Model training: Merlin HugeCTR is a recommender system framework for training state-of-the-art deep learning recommendation models such as DLRM, Wide and Deep, Deep Cross Network (DCN), and so on. It scales seamlessly on multiple GPUs and multi-GPU nodes.
  • Production inference: The NVIDIA Triton Inference Server coupled with a HugeCTR inference backend provides a robust high-throughput and low-latency production environment. NVIDIA Triton can be deployed either on-premises or in-cloud, and it is fully compatible with the Kubernetes ecosystem.

Given the capabilities of Merlin, we now review the three levels of MLOps according to Google Cloud’s definition

  • Level 0: Manual process and pipeline.
  • Level 1: Pipeline with some automation, such as monitoring and triggers, automated retraining, and redeployment of ML models (continuous retraining).
  • Level 2: Fully automated pipeline with continuous integration and delivery (CI/CD).
Figure shows a high level overview of an MLOps pipeline for a recommender system built with NVIDIA Merlin components. It includes all the components from Data acquisition & validation, data preparation, training, model validation, deployment, Monitoring, logging and pipeline triggers.
Figure 1. A high-level overview of Merlin MLOps.

Figure 1 shows a Level 1 Merlin MLOps workflow, with a fully automated pipeline and continuous retraining. Look deeper into this architecture:

  • Data pipeline: Every recommender system starts with data about users, items, and their interactions. Data is collected and stored in a data lake. From the data lake, a subset of data (based on time range and number of features) is extracted and prepared for model training (preprocessing, feature engineering).  A data validation module ensures that the test data is as expected while also detecting data drift.
  • Continuous re-training: At first, the recommendation model is trained on a large amount of available data and deployed. Continuous incremental retraining ensures that the model stays up-to-date and captures the latest trends and user preferences. A model validation module ensures that the model meets a specified quality threshold. 
  • Deployment and serving: An automated redeployment pipeline puts the new qualified model into production in a seamless manner. The number of GPU inference servers automatically scales up and down as needed.
  • Logging and monitoring: Monitoring modules continuously monitor the quality of the recommendation in real-time through a range of KPIs, such as hit rate and conversion rate. The modules trigger full retraining should model drift happen, that is, if certain KPIs fall below known established baselines.

Merlin MLOps with Kubeflow Pipelines on Google Kubernetes Engine

In this section, we walk through a concrete example of realizing the workflow with Kubeflow pipelines and GKE.

GKE provides a managed environment for deploying, managing, and scaling containerized applications using Google Cloud infrastructure. Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. With an existing GKE cluster, Kubeflow pipelines can be installed easily with a push of a button. We selected Kubeflow Pipelines as the orchestrator that wields together the components of a Merlin MLOps pipeline.

In the Kubernetes world, applications are containerized. Merlin Docker containers are available on NGC, including the Merlin training and inference containers. These containers can be pulled, and then pushed to Google Cloud Container Registry, ready to be used with GKE.

Figure shows a reference architecture of a recommender system MLOps pipeline built with NVIDIA Merlin to accelerate all phases of recommender system development on GPUs. It uses Kubeflow to orchestrate the pipeline components on Google Kubernetes Engine (GKE).
Figure 2. Merlin Kubeflow pipelines architecture on GCP and GKE.

In Figure 2, we mapped the conceptual workflow components in Figure 1 to concrete GCP and GKE components:

  • Data pipeline: Data is collected and stored in a data store, which in this case is a Google Cloud Storage (GCS) bucket. A data extraction module extracts and copies the relevant data to a high-speed active working space. In this example, it is a GKE-persistent volume for preprocessing and model training. A data validation module based on TensorFlow Data Validation analyzes the training data to detect data drift.
  • Continuous re-training: A Merlin training pod is used for data preprocessing and model training.
    • NVTabular is responsible for data preprocessing, feature engineering, and persisting the preprocessed dataset into the pipeline-shared persistent volume.
    • Next, HugeCTR picks up the preprocessed data and trains a DCN model. The model can be updated either using incremental data or trained from scratch using all or a large amount of available data. 
  • Deployment and serving: The deployment module prepares the HugeCTR trained model for production. Prepared models are then stored in a model store in GCS. Depending on the application domains, model serving can involve two steps:
    • Candidate generation reduces the number of candidates from a space potentially as large as millions of items to a computationally manageable amount, for example, thousands of items.
    • The Merlin inference pod picks up and serves the latest HugeCTR trained model from the model store. This inference container contains the Triton Inference Server with a HugeCTR inference backend. The model re-ranks the generated candidates and serves the top scoring ones.
  • Logging and monitoring: The monitoring pod continuously monitors the quality of the recommendation in real-time (hit rate, conversion rate) and automatically triggers full retraining upon detecting significant model drift. NVIDIA Triton and the monitoring module log statistics into Prometheus and Grafana.

Criteo Terabyte click log dataset case study

In this example, we demonstrate the Merlin MLOps pipeline on Kubeflow pipelines and GKE using the Criteo Terabyte click log dataset, which is one of the largest public datasets in the recommendation domain. It contains ~1.3 TB of uncompressed click logs containing over four billion samples spanning 24 days, and can be used to train recommender system models that predict the ad clickthrough rate. Features are anonymized and categorical values are hashed to ensure privacy. Each record in this dataset contains 40 values:

  • A label indicating a click (value 1) or no click (value 0)
  • 13 values for numerical features
  • 26 values for categorical features

Because this data set contains only interaction data and no data on users, items, and their attributes, we skipped the candidate generation and final ranking parts and only implemented the deep learning scoring model to predict whether users will click on the ad.

Technical highlights

In this section, we discuss some of the major highlights pertaining to our implementation.

Multi-instance GPU on GKE

To maximize GPU usage, NVIDIA Triton is deployed on a GKE A100 MIG instance. NVIDIA Multi-instance GPU (MIG) technology partitions a single NVIDIA A100 GPU into as many as seven independent GPU instances. They run simultaneously, each with its own memory, cache, and streaming multiprocessors. That enables the A100 GPU to deliver guaranteed quality-of-service (QoS) at up to 7x higher utilization compared to prior GPUs. Small recommendation models that fit into the memory of a MIG instance can be deployed onto a GKE MIG instance of the appropriate size. That being said, we are working on relaxing this memory requirement through embedding table caching. Stay tuned!

GPU autoscaling

NVIDIA Triton deployment can be scaled using default metrics like CPU/GPU utilization, memory usage, and so on, and also using custom metrics. For this example, we use a custom metric exported to the Prometheus operator based on the average time spent by the incoming request in the inference queue. If the inference load on NVIDIA Triton increases, then the time spent by the incoming requests in the inference queue goes up as well.

To balance the increase in load, the Horizontal Pod Autoscaler (HPA) can schedule another NVIDIA Triton Pod on freely available GPU nodes. If no nodes are available in the GPU node pool, then the HPA kicks in the GKE node autoscaler that assigns a new GPU node to the GPU node pool. After a new node is available in the cluster, the Kubernetes Pod scheduler schedules a new instance of the NVIDIA Triton Pod on that GPU node. The load balancer can then route the pending incoming requests in the queue to the newly created NVIDIA Triton Pod. Subsequently, if the load decreases, the autoscaler can scale down the nodes.

Sending inference requests

An end user interacts with the inference server indirectly through a client application or recommendation API, which translates user requests and responses to inference requests. To this end, we include a test inference client app that can be used to read Criteo .parquet files and send inference gRPC requests to the NVIDIA Triton endpoint.


In an ML system, the relationship between the independent and the target variables can change over time. As a result, the model predictions can gradually become erroneous. In this example pipeline, we have a monitoring module that is tasked with tracking the performance (in this case, AUC score) and triggering another run of the pipeline if AUC drifts below a certain threshold. The monitoring module runs as a separate pod in the GKE cluster.

How does it get access to the request data? In the reference design, the test inference client is responsible for logging the inference requests using Cloud Pub/Sub, where the inference client publishes the requests and corresponding inference results to the Pub/Sub broker, and the monitoring module subscribes to it. Using this asynchronous mechanism, monitoring can assess the performance and take appropriate action like triggering the Kubeflow pipeline for retraining if required. It also writes these requests periodically to a volume, which a daemon job pushes to the GCS bucket for use in the next round of continuous training. This data collection closes the loop in the system, and allows the new incoming requests  as fresh data that the pipeline can use for incremental training from the previous checkpoint.

Scope for improvement

The high-level goal of this post was to show an example of a recommender system, built using Merlin components, running in the form of a Kubeflow pipeline. There are several pieces of this example that could be designed in an alternative way or further improved. For instance:

  • Cloud Pub/Sub is used for communicating request data from the inference client to the monitoring module. This gives you high scalability, reliability, and advantages of asynchronous behavior. However, this does add an additional dependency on GCP infrastructure. Alternatively, you could use other message queues, like Kafka.
  • Data drift could be monitored live, especially in cases where there is no user feedback for served recommendations to estimate model performance. You could plug in a solution similar to the data validation component in monitoring. Additionally, you should first filter outliers out from out-of-distribution samples.
  • The data validation component using TensorFlow Data Validation is a simple example showing where such a component could be plugged into the pipeline. There could be other appropriate actions on detecting drift, like notifications to users or taking corrective measures other than logging. There may be other libraries more suitable to your use case, like Great Expectations or Alibi Detect.


This example with Merlin components on a Kubeflow pipeline follows the reference architecture as described earlier. Most ML systems would follow a similar architecture, with components for data acquisition and cleaning, preprocessing, training, model serving, monitoring, and so on. As such, these blocks could be replaced with custom containers and code in the pipeline. Any additional modules could be either added to the pipeline itself (like data validation and training), or deployed as a separate pod in the cluster (like inference, and monitoring). This Merlin MLOps example serves as a reference on how you can create, compile, and run your pipelines.

The code and step-by-step instructions to run this Merlin MLOps example are available at the NVIDIA-Merlin/gcp-ml-ops GitHub repo. We’d love to hear about how this project relates to what you’re working on, especially if you have any questions or feedback! You can reach us through the repo or by leaving a comment here.


OSError: SavedModel file does not exist

Can anyone help me fix the error OSError: SavedModel file does not exist at: /mnt/Archive/Google_T5/11B/{saved_model.pbtxt|saved_model.pb} in this code:

import tensorflow.compat.v1 as tf import tensorflow_text as tf_text tf.reset_default_graph() sess = tf.Session() meta_graph_def = tf.saved_model.loader.load(sess, ["serve"], "/mnt/Archive/Google_T5/11B") signature_def = meta_graph_def.signature_def["serving_default"] 

submitted by /u/notooth1
[visit reddit] [comments]


Struggling to get Nvidia GPU to cooperate on Windows with TensorFlow

Hey, I’m using python 3.9.4 on Windows 10 and am trying to run an ML program in VScode, but despite all of the available resources I just can’t seem to get my GPU to run the program

When I run the following code:

import tensorflow as tf from tensorflow import keras print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU'))) 

It says:

Num GPUs Available: 0

Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use

GPU. Follow the guide at for how to download and setup the required libraries for your platform.

So I checked the site and added the necessary Cuda folders to the environment path and still no luck. If someone could give some guidance as to how they achieved this I’d be grateful


submitted by /u/Kubrickian75
[visit reddit] [comments]


Jetson Project of the Month: Robotic Arm Masters the Game of Cornhole

David Niewinski of Dave’s Armoury won the ‘Jetson Project of the Month’ for building a robot arm capable of playing a perfect game of cornhole.

David Niewinski of Dave’s Armoury won the ‘Jetson Project of the Month’ for building a robot arm capable of playing a perfect game of cornhole. The robot runs on an NVIDIA Jetson AGX Xavier Developer Kit and can throw a perfect cornhole game. 

For the uninitiated, Cornhole is a lawn game popular in the United States where players take turns using their aim and motor skills to throw bags of corn kernels at a raised platform which has a hole on the far side. Dave’s setup pairs the Jetson with a Kuka KR20 robot (fondly called ‘Susan’). A 1080p webcam serves as the eyes of Susan and a 2020 extrusion bar mimics the throwing arm of a player. The platform’s hole is colored red to make it easier for Susan to spot it from the background.

For the software, Dave used several OpenCV functions such as inRange to pick out the red hole from the scene, and findContours to establish the ring around the hole. Using the relative positions of the camera and the center of the hole, the angle and power for the throw are calculated on Jetson. Lastly, Jetson communicates these calculations to Susan through the network via the KUKA.ethernetKRL software package. 

In the demo video, Dave mentions that he enjoyed working on Jetson and added,“This [Jetson AGX Xavier] is an awesome computer — think of it like if a video card had a baby with a Raspberry Pi. It has a lot of parallel compute on it, so you can do neural networks, deep learning, machine vision, but it doesn’t actually draw all that much power and with a little mount, you can strap it directly onto Susan.”

This project demonstrates how Jetson AGX Xavier could be used as an intelligent robot controller and can be paired with robot arms for industrial applications. Summer is here in North America and we’ll take some inspiration from Susan for our next cornhole game. 

If you’re interested in learning more about the winning duo of Susan and Jetson, check out Dave’s code on GitHub.


Technology in Motion: NVIDIA and Volvo Cars Detail Software-Defined Future of Safe Transportation

All roads to the future of autonomous, electric and connected transportation run through one key innovation: software-defined, centralized computing. Today at Volvo Cars’ Tech Moment event, Ali Kani, NVIDIA vice president and general manager of Automotive, joined executives from Volvo Cars to outline the centralized compute architecture that will power these software-defined vehicles. This architecture, Read article >

The post Technology in Motion: NVIDIA and Volvo Cars Detail Software-Defined Future of Safe Transportation appeared first on The Official NVIDIA Blog.


Global Computer Makers Deliver Breakthrough MLPerf Results with NVIDIA AI

NVIDIA’s partners are delivering GPU-accelerated systems that train AI models faster than anyone on the planet, according to the latest MLPerf results released today. Seven companies put at least a dozen commercially available systems, the majority NVIDIA-Certified, to the test in the industry benchmarks. Dell, Fujitsu, GIGABYTE, Inspur, Lenovo, Nettrix and Supermicro joined NVIDIA to Read article >

The post Global Computer Makers Deliver Breakthrough MLPerf Results with NVIDIA AI appeared first on The Official NVIDIA Blog.


MLPerf v1.0 Training Benchmarks: Insights into a Record-Setting NVIDIA Performance

MLPerf v1.0 showcases the continuous innovation that is happening in the AI domain. In the last two-and-a-half years since the first MLPerf training benchmark launched, NVIDIA performance has increased by nearly 7x. In this post, we describe some of the major optimizations that enabled such improvements.

MLPerf is an industry-wide AI consortium tasked with developing a suite of performance benchmarks that cover a range of leading AI workloads widely in use. The latest MLPerf v1.0 training round includes vision, language and recommender systems, and reinforcement learning tasks. It is continually evolving to reflect the state-of-the-art AI applications.

NVIDIA submitted MLPerf v1.0 training results for all eight benchmarks, as is our tradition. In fact, systems built upon the NVIDIA AI platform are the only commercially available systems to make submissions across the board.

Compared to our previous MLPerf v0.7 submissions, we improved up to 2.1x on a chip-to-chip basis and up to 3.4x at scale. We set 16 performance records with eight on a per-chip basis and eight at-scale training in the commercially available solutions category.

Benchmark Max Scale Records (min)
DGX SuperPod
Per Accelerator Records (min)
Recommendation (DLRM) 0.99 15.3
NLP (BERT) 0.32 169.2
Image classification (ResNet-50 v1.5) 0.4 219.0
Speech recognition – Recurrent (RNN-T) 2.75 309.6
Image Segmentation (3D U-Net) 3.00 229.1
Object detection lightweight (SSD) 0.48 66.5
Object detection heavyweight (Mask R-CNN) 3.95 400.2
Reinforcement Learning 15.53 2156.3
Table 1. NVIDIA MLPerf AI records. (*)

(*) Per Accelerator performance for A100 computed using NVIDIA 8xA100 server time-to-train and multiplying it by 8 | Per Chip Performance comparisons to others arrived at by comparing performance at the closest similar scale.
Per-Accelerator Records:  BERT: 1.0-1033  | DLRM: 1.0-1037  |  Mask R-CNN: 1.0-1057  |  ResNet50 v1.5: 1.0-1038  |  SSD: 1.0-1038  |  RNN-T: 1.0-1060  |  3D U-Net: 1.0-1053  |  MiniGo: 1.0-1061
Max Scale Records:  BERT: 1.0-1077  | DLRM: 1.0-1067  |  Mask R-CNN: 1.0-1070  |  ResNet50 v1.5: 1.0-1076  |  SSD: 1.0-1072  |  RNN-T: 1.0-1074  |  3D U-Net: 1.0-1071  |  MiniGo: 1.0-1075
MLPerf name and logo are trademarks. For more information, see

This is the second MLPerf training round featuring NVIDIA A100 GPUs. Our continual year-over-year improvement on the same hardware is a lively testament to the strength of the NVIDIA platform and commitment to continuous software improvement. As in previous MLPerf rounds, NVIDIA engineers developed a host of innovations to achieve these new levels of performance:

  • Extending CUDA Graphs across all benchmarks. Neural networks are traditionally launched as individual kernels from the CPU to execute on the GPU. In MLPerf v1.0, we launched the entire sequence of kernels as a graph on the GPU, minimizing communication with the CPU.
  • Using SHARP to double the effective interconnect bandwidth between nodes. SHARP offloads collective operations from the CPU and GPU to the network and eliminates the need for sending data multiple times between endpoints.

The CUDA graph and SHARP enhancements enabled us to increase our scale to a record number of 4096 GPUs used to solve a single AI network.

  • Spatial parallelism enabled us to split a single image across eight GPUs for massive image segmentation networks like 3D U-Net and to use more GPUs for higher throughput.
  • Among hardware improvements, new HBM2e GPU memory on the NVIDIA A100 GPU increased memory bandwidth by nearly 30% to 2 TBps.

This post provides insights into many of the optimizations used to deliver the outstanding scale and performance. Many of these improvements are available on NGC, which is the hub for NVIDIA GPU-optimized software. You can realize the benefits of these optimizations in your real-world applications, instead of just observing better benchmark scores from the sideline.

At-scale training

Large-scale training requires system hardware and software to be precisely tuned to work together and support the unique performance requirements that arise at scale. NVIDIA made major advances on both dimensions, which are now available for production use.

On the system side, the key building block of our at-scale training is the NVIDIA DGX SuperPOD. DGX SuperPOD is the culmination of years of expertise in HPC and AI data centers. It is based on the NVIDIA DGX A100 with the latest NVIDIA A100 Tensor Core GPU, third-generation NVIDIA NVLink, NVSwitch, and the NVIDIA ConnectX-6 VPI 200 Gbps HDR InfiniBand. These were combined to make Selene a top 5 supercomputer in the Top 500 supercomputer list, with the following components:

  • 4480 NVIDIA A100 Tensor Core GPUs
  • 560 NVIDIA DGX A100 systems
  • 850 Mellanox 200G HDR InfiniBand switches

On the software side, the NGC container release v. 21.05 enhances and enables several capabilities:

  • Distributed optimizer support enhancement.
  • Improved communication efficiency with Mellanox HDR Infiniband and NCCL 2.9.9.
  • Added SHARP support. SHARP improves upon the performance of MPI and machine learning collective operations. SHARP support was added to NCCL to offload all-reduce collective operations into the network fabric, reducing the amount of data traversing between nodes.


In this section, we dive into the optimizations for selected individual MLPerf workloads.

Recommendation (DLRM)

Recommendation is arguably the most pervasive AI workload in data centers today. The NVIDIA MLPerf DLRM submission was based on HugeCTR, a GPU-accelerated recommendation framework that is part of the NVIDIA Merlin open beta framework. The HugeCTR v3.1 beta release added the following optimizations:

Hybrid embedding

One of the major challenges in scaling DLRM to multiple nodes is the ~10x difference in per-GPU all-to-all bandwidth between NVLink and Infiniband. This makes the embedding exchange between nodes a significant bottleneck during training.

To counteract this, HugeCTR implemented hybrid embedding, a novel embedding design that deduplicates the categories in a batch before doing the embedding weight exchange in the forward pass. It also reduces the gradients locally before doing gradient exchange in the backward pass.

For efficient deduplication, the hybrid embedding maps the categories to frequent and infrequent embeddings based on the statistical access frequency of categories. The frequent embedding is implemented in a data-parallel fashion that takes away most of the replicated categories in a batch, reducing the embedding exchange traffic. Infrequent embedding follows the distributed model parallel-embedding paradigm. This enables DLRM to scale to multiple nodes with unprecedented efficiency.

HugeCTR hybrid embedding deduplicates the categories in a batch before doing the embedding weight exchange.
Figure 1. Hybrid embedding in HugeCTR

Optimized collectives

All-to-all and all-reduce collective latencies play a significant role in scaling efficiency. Multinode all-to-all throughput for small message sizes was limited by the Infiniband message rate. To mitigate this, HugeCTR implemented fused NVLink aggregation using hierarchical all-to-all for embedding exchange.

HugeCTR implemented fused NVLink aggregation using hierarchical all-to-all for embedding exchange.
Figure 2. Collective optimizations in HugeCTR

You can optimize internode all-to-all and all-reduce latencies further:

  • Directly using the native IB verbs API and SHARP to mitigate library overheads.
  • Graph-capturable, GPU-initiated communication to reduce launch overheads.
  • One-sided eager protocol instead of two-sided rendezvous protocol to reduce network hops.
  • Eliminating redundant message buffer copies using persistent communication buffers.
  • Reduced NIC-GPU synchronization latency using NIC atomics directly on GPU memory instead of indirection through CPU.

Intranode all-reduce is also optimized using a single-shot reduction algorithm as opposed to ring.

Frequent embedding all-reduce and MLP all-reduce are fused into a single all-reduce operation to save on exposed all-reduce latency.

Optimized data reader

Input pipeline plays a significant role in training performance. To achieve peak I/O throughput, HugeCTR implemented a fully asynchronous data reader using the Linux asynchronous I/O library (AIO). Because hybrid embedding requires the whole batch to be present on all the GPUs, direct host-to-device (H2D) for each GPU would make PCIe a bottleneck. So, the data is copied onto the GPUs using a hierarchical approach, by first doing a H2D over PCIe and then a broadcast over NVLink.

The HugeCTR data reader is fully asynchronous using Linux asynchronous I/O.
Figure 3. HugeCTR multithread data reader

Moreover, H2D traffic from data readers may interfere with internode all-to-all and all-reduce traffic over PCIe. So, HugeCTR implements intelligent data-reader scheduling to avoid such interference.

Overlapping MLP with embedding

Because the bottom MLP has no data dependencies with embedding, several components of the bottom MLP could be overlapped with embedding for efficient utilization of GPU resources.

  • Bottom MLP forward is overlapped with the embedding forward pass
  • Frequent embedding local weight update is overlapped with all-reduce
  • MLP weight update is overlapped with internode all-to-all

cuBLASLt GEMM fusions

HugeCTR implemented a fused, fully connected layer that made use of cublasLt GEMM fusions:

  • GEMM + Relu + bias fusion for MLP fprop
  • GEMM + dRelu + dBias fusion for MLP bprop
cuBLASLt GEMM fusions speed up forward and backward passes.
Figure 4. Fused FC layer in HugeCTR using cuBLASLt GEMM fusions

Whole-iteration CUDA graph

To reduce launch latencies and prevent PCIe interference between kernel launches, data-reader, and communication traffic, all DLRM compute and communication kernels are designed to be stream-capturable. The whole training iteration is captured into a single CUDA graph.

With the preceding optimizations, we scaled to multiple nodes and completed the DLRM training task in just under a minute on 14 DGX-A100 nodes. This is a 3.3x speedup compared to the previous v0.7 submission.


BERT is arguably one of the most important workloads in the NLP domain today. In the MLPerf v1.0 round, we improved upon our v0.7 submission with the following optimizations:

Fused multihead attention

The size of the activation tensors inside the multihead attention block grows with the square of the sequence length. This results in increased memory footprint, as well as longer runtimes due to the accompanying memory access operations. We fused softmax, masking, and dropout operations into a single kernel both in forward pass and backward pass. By doing so, we avoided several memory access operations for large activation tensors inside the multihead attention block, which resulted in a meaningful performance boost.

For more information, see SelfMultiheadAttn in the NVIDIA Apex library.

Distributed LAMB

In this MLPerf round, we implemented distributed LAMB. In distributed LAMB, the gradients are first split across eight GPUs within each DGX-A100 node. This is followed by an all-reduce between the nodes in eight separate groups. After this operation, each GPU has one of eight chunks that constitute the all-reduced gradient tensor, and the LAMB optimizer is run on 1/8th of the full gradient tensor.

When necessary, gradient norms are computed by computing local norms and performing an all-reduce operation. After the optimizer, an intranode all-gather operation is performed at each node, so that each GPU has the full updated parameter tensor. Execution is continued with the forward pass of the next iteration.

Distributed LAMB substantially improves performance both for single-node and multinode configurations. For more information, see DistributedFusedLAMB in the Apex library.

Synchronization-free training

There are cases where the GPU execution depends on some value that is stored or calculated on the CPU. An example is when a specific tensor has a varying size that depends on the computation for each iteration. Because tensor size information is kept on the CPU, there must be a synchronization between GPU and CPU to pass the tensor size information for proper buffer allocation.

Our solution was using a tensor with fixed size, but indicating which elements are valid using a separate Boolean mask. With this approach, no CPU-GPU synchronization was needed, as the tensor sizes are known. When a subsequent computation must know the real size of the tensor, as for an averaging operation, the elements of the Boolean mask can be summed on the GPU.

Even though this approach resulted in slightly more access to GPU memory, it is much faster than having CPU synchronization in the critical path. This optimization resulted in a significant performance boost for small local batch size, which is the case for our max-scale configuration. This is because CPU synchronizations can’t keep up with fast GPU execution for small batch sizes.     

By using a fixed size tensor and a Boolean mask as described in the text, we are able to eliminate CPU synchronizations needed for dynamic-sized tensors.
Figure 5. Synchronization-free training eliminates CPU synchronization

Another source of CPU-GPU synchronization is the data that is kept on CPU, such as learning rate or potentially other optimizer states. We kept all the optimizer states on the GPU for distributed LAMB to achieve synchronization-free execution.

As a result of these optimizations, we eliminated all the synchronizations between CPU and GPU during a training cycle. The only synchronizations are the ones that happen at the evaluation points, to log the evaluation accuracy in a file in real time for every evaluation point.

CUDA Graphs in PyTorch

Traditionally, CPU launches each GPU kernel individually. In general, even though GPU kernels do more work for large batch sizes, CPU kernel launch work and related CPU overheads stay fixed, barring the variations in CPU scheduling. As a result, for small local batch sizes, CPU overhead can become a significant performance bottleneck. This is what happened in our max-scale BERT configuration in MLPerf.

On top of that, when CPU execution becomes a bottleneck, variations in CPU execution result in different runtimes across all GPUs for each iteration. This introduces a significant synchronization overhead when the workload is scaled to many GPUs (4096 GPUs in this case). Each GPU synchronizes every iteration for gradient reductions, and iteration time is determined by the slowest worker.     

CUDA Graphs is a feature that enables launching an entire sequence of kernels at one time, eliminating CPU overheads between kernel executions. CUDA Graphs recently became available in PyTorch. By graph capturing the model, we eliminated CPU overhead and the accompanying synchronization overhead. The CUDA Graphs implementation resulted in a 1.7x performance boost just by itself for our max-scale BERT configuration.

One-shot all-reduce with SHARP

SHARP improved the performance of collectives significantly for BERT, especially for our max-scale configuration. End-to-end performance boost from SHARP is 17% for this BERT configuration. 

Image classification (ResNet-50 v1.5)

ResNet-50 is the veteran among MLPerf workloads. In this edition of MLPerf, we continue to optimize ResNet with the following optimizations:

DALI optimizations

At large scales (>128 nodes) for ResNet-50, we reduced the local batch size per GPU to extremely small values. This often results in sub-20-ms iteration time. To reduce the overhead of the data pipeline, we introduced the input batch multiplier (IBM). DALI throughput is higher at large batch sizes than smaller batch sizes. To take advantage of this fact, we created super batches that are much larger than the local batch size. For each iteration, we then derived the needed samples from these super batches, increasing the DALI processing throughput and reducing the data pipeline overhead.

At these small iteration times, gapless and continuous execution is the key to perfect scaling. Pre-allocating DALI buffers through hints is another feature that we introduced to reduce the overhead of dynamic GPU memory allocation while exploring the dataset.

MXNet fused BN+ReLu and BN+Add+ReLu performance optimizations

For ResNet-50, batch norm (BN) is a significant portion of the network’s iteration time. We optimized the fused BN+ReLu and BN+Add+ReLu kernels in MXNet through vectorization, cache-friendly memory traversals, and reducing quantization.

Improved MXNet dependency engine improvements

The new MXNet dependency engine provides an asynchronous approach to scheduling work on the GPU, reducing the host (CPU) overhead and jitter such as overhead arising from MXNet and Horovord handshake.

In the new dependency engine, the operation updates the dependency as soon as the work is scheduled on the GPU, not when the work is finished. It is the subsequent operation that must perform the synchronization to ensure correctness. This is further enhanced by removing the need for synchronization and using cudaStreamWait events to manage dependencies.

Old MXNet dependency engine with significant CPU overhead and jitter such as overhead arising from MXNet and Horovord handshake.
Figure 6. ResNet-50 and the old MXNet dependency engine
cudaStreamWait Events manage dependencies and removes the need for synchronization.
Figure 7. ResNet-50 and the new dependency engine, cudaStreamWaitEvent

Image segmentation (3D U-Net)

U-Net3D is one of the two new workloads in this round of MLPerf training. We used the following optimizations:

Spatial parallelism

In 3D U-Net, because the sample number in the training dataset is relatively small, there is a fundamental limit to how much it can be scaled with naive data parallelism. To break that limit, we used spatial parallelism to split a single image across eight GPUs. At the end of the backward propagation, the gradients from each partition can be all-reduced as usual to get the resultant gradients, which can then be used to calculate the weight gradients.

The naive approach to implementing spatial parallel convolution is to transfer the halo information from the neighboring GPU before running the convolution. However, to increase efficiency, we implemented a different scheme, in which we transfer the halo from the neighboring GPU in parallel to running the main inner convolutions. The error term to this main convolution is calculated independently using the halo and added to get the result. By hiding the transfer costs, we saw much better scaling efficiency than with the naive approach.

spatial parallel convolution splits a single image across eight GPUs.
Figure 8. Spatial parallel convolution with fprop 

For the backward propagation, similarly, the halos needed for the dgrad operation are transferred in parallel with the computation of the weight gradients and data gradients. The halos transferred for the data gradients are then reused for computing the correction terms for both weight and data gradients.

Spatial parallel convolution splits a single image across eight GPUs.
Figure 9. Spatial parallel convolution with bprop 

3D U-Net has a bottleneck region in the middle of the network with much smaller activation sizes, which are not suited for spatial parallelism. We used a hybrid approach where we used spatial parallelism only for the layers that benefit from it. We gathered the activations for the sample from all GPUs right before this bottleneck region and executed these layers serially on each GPU. We split the work among the GPUs again when cooperating became beneficial. This made sure that we made the best choice separately for each region of the network.

Hybrid approach: Spatial parallelism in 3D U-Net splits the base of the pyramid but serializes the tip of the pyramids.
Figure 10. Application of spatial parallelism in 3D U-Net

Asynchronous evaluation

Evaluation contributes a significant amount of time in the reference code. Because evaluation can be run concurrently with training, we assigned dedicated nodes for running just evaluation.

To hide the evaluation behind the training cycle entirely, we used spatial parallelism to speed up the validation step. In addition, as evaluation uses the same set of images, the images were loaded only one time and then cached in the GPU memory.

Because the evaluation doesn’t start until a third of the way through the training, the evaluation nodes have enough time to load, process, and cache the dataset, as well as initialize all required libraries.

At the end of the training cycle, training nodes use InfiniBand to transfer the model quickly to the evaluation nodes and continue running subsequent training iterations. The evaluation nodes run evaluation after the model parameters are transferred. At the end of the evaluation, the evaluation node communicates to the training nodes if the target accuracy is reached.

The number of evaluation nodes added are just enough to hide the entire evaluation cycle behind the training cycle.

Asynchronous evaluation runs concurrently with training on dedicated nodes.]
Figure 11. Asynchronous evaluation schedule

Data loader

We optimized the data loader in two ways: optimizing the augmentations and caching the dataset.

Augmentations: 3D U-Net requires heavy augmentation due to the small size of the dataset. One of the most expensive operations is something that we call “biased crop”. On contrary to a random crop, biased crop selects regions with a positive label with a given probability. This requires heavy computations of 3D-connected components on labels every time the expensive path is selected. To avoid calculating the connected components every time that the sample is loaded, the result is cached in the host and reused so that it is calculated only one time.

Data loading: As the training gets faster with the new features, the I/O starts to show up as the bottleneck. To alleviate this, we cached the entire image dataset in the GPU memory. This removes the PCIe and I/O from the critical data loader path. While the images are loaded from the large and high-bandwidth GPU memory, the labels are loaded from the CPU to perform augmentations.

Data caching on GPU memory removes the PCIe and I/O from the critical data loader path.
Figure 12. Data caching on GPU memory

Channels-Last layout support

Because the Channels-Last layout is more efficient for convolution kernels, native support for the Channels-Last format was added in MXNet. This avoids any additional transposes needed in the model to take advantage of highly efficient GPU kernels.

Apart from these optimizations, 3D U-Net benefited from the optimized BatchNorm + ReLu activation kernel. The BatchNorm kernel was run repeatedly with a BatchSize value of 1 to get the Instance-Norm functionality. The asynchronous dependency engine implemented in MXNet, CUDA Graphs, and SHARP also helped performance significantly.

With the array of optimizations made for 3D U-Net, we scaled to 100 DGX A100 nodes (800 GPUs), with training running on 80 nodes (640 GPUs) and evaluation running on 20 nodes (160 GPUs). The max-scale configuration of 100 nodes got over 9.7x speedup as compared to the single-node configuration.

Object detection (lightweight) (SSD)

This is the fourth time that the lightweight SSD has been featured in MLPerf. In this round, the evaluation schedule was changed to happen every fifth epoch, starting from the first. In previous rounds, the evaluation scheduled started from the 40th epoch. Even with the extra computational requirement, we sped up our submissions time by more than x1.6.

SSD consists of many smaller convolution layers. The benchmark was particularly affected by the improvements to the MXNet dependency engine, CUDA Graphs, and the enablement of SHARP, as discussed earlier.

More efficient configs

The training time of a deep learning model is a multivariable function. In its most basic form, the equation is as follows:

Train; Time = Average; Iteration; Time; cdot; Number; of; Iterations

The goal is to minimize Train; Time where both it and Average; Iteration; Time and Number; of; Iterations are functions of the batch size.

Average; Iteration is a monotonically non-decreasing function. Batch sizes are more computationally efficient, but they take more time per iteration.

On the other hand, Number; of; Iterations, up to a certain batch size, is a monotonically nonincreasing function. Larger batch sizes require fewer iterations to converge because the model sees more images per iteration.

Batch size Iterations per epoch (1) Epochs to convergence (2) Total iterations
1024 115 50 5750
2048 58 65 3770
3072 39 75 2925
4096 29 90 2610
Table 2. Number of iterations required for convergence using different batch sizes
(1) MS-COCO epoch size is 117266 images; (2) empirical value

Compared to the v0.7 submission where we used a batch size of 2048, the v1.0 batch size was 3072, which required 22% fewer iterations. Because the larger iteration was only 20% slower, the result was an 8% faster time to convergence.

In this example, going to a batch size of 4096 instead of 3072 would’ve resulted in a longer training time. The 11% fewer iterations didn’t make up for the extra 20% run time per iteration.

Optimized evaluation

Evaluation can be broken into three phases:

  • Inference: Using the trained model to make predictions against the validation dataset. Runs on the GPU.
  • Scoring: Evaluating the inference results against the ground truth. Runs asynchronously on the CPU.

The new evaluation in v1.0 adds eight validation cycles to the base submission. Worse, improvements to the epoch train time means that scoring needs to take less than 2 seconds or the training time of five epochs. Otherwise, it won’t be fully hidden and any training time improvements are pointless.

To improve inference time, we made sure that the inference graph was static. We improved the nonmaximum suppression implementation and moved the Boolean mask, used to filter negative detections, to outside the graph. Static graphs save memory reallocation time and make switching between training and inference contexts faster.

For scoring, we used nv-cocoapi, which is a C++ implementation of cocoapi and 60x times faster. For v1.0, we improved the nv-cocoapi performance by 2x with multithreaded results accumulation, faster indices sorting, and caching the ground truth data structures.

Object detection (heavyweight) (Mask R-CNN)

We optimized object detection with the following techniques:

CUDA Graphs in PyTorch

Deep learning frameworks use GPUs to accelerate computations, but a significant amount of code still runs on CPU cores. CPU codes process meta-data like tensor shapes to prepare arguments needed to launch GPU kernels. Processing metadata is a fixed cost while the cost of the computational work done by the GPUs is positively correlated with batch size. For large batch sizes, CPU overhead is a negligible percentage of total run time cost. At small batch sizes, CPU overhead can become larger than GPU run time. When that happens, GPUs go idle between kernel calls.

This issue can be identified on an Nsight Systems timeline plot. The plot below shows the “backbone” portion of Mask R-CNN with per-GPU batch size of 1 before graphing. The green portion shows CPU load while the blue portion shows GPU load. In this profile, you see that the CPU is maxed out at 100% load while GPU is idle most of the time. There is a lot of empty space between GPU kernels.

Nsight timeline plot of Mask R-CNN shows that the CPU is maxed out at 100% load while GPU is idle most of the time, and empty space between GPU kernels.
Figure 13. Nsight timeline plot of Mask R-CNN

CUDA graph is a tool that can automatically eliminate CPU overhead when tensor shapes are static. A complete graph of all kernel calls is captured during the first step. In subsequent steps, the entire graph is launched with a single operation, eliminating all the CPU overhead. PyTorch now has support for CUDA graph, we used this to speed up Mask R-CNN for MLPerf 1.0.

With CUDA graph, the entire graph is launched with a single op, eliminating all the CPU overhead.]
Figure 14. CUDA graph optimization

With graphing, we see that the GPU kernels are tightly packed and GPU utilization remains high. The graphed portion now runs in 6 ms instead of 31 ms, a speedup of 5x. We mostly just graphed the ResNet backbone, not the entire model. Even then, we saw >2x uplift for the entire benchmark just from graphing.

Removing synchronization points

There are many PyTorch modules that make the main process wait until the GPU has finished all previously launched kernels. This can be detrimental to performance, because it makes the CPU sit idle when it could be working on launching more kernels. The CPU can get ahead of the GPU in low overhead segments and start launching kernels from succeeding segments. As long as total CPU overhead is less than total GPU kernel time, the CPU never becomes the bottleneck, but this breaks when sync points are introduced. Also, model segments that have sync points cannot be graphed with CUDA graph, so removing syncs is important.

We did some of this work for MLPerf 1.0. For instance, torch.randperm was rewritten to use CUB instead of Thrust because the latter is a synchronous C++ template library. These improvements are available in the latest NGC container.

Removing all the syncs improved the uplift that we saw from CUDA Graphs from 1.6x to 2.5x.

Asynchronous evaluation

Our MLPerf 0.7 submission did asynchronous evaluation, but it wasn’t fast enough to keep up with training after optimizations. Evaluation took 18 seconds per epoch, and 4 seconds of that was fully exposed time. Without changes to the evaluation code, our at-scale submission would have clocked in about 100 seconds slower.

Of the three evaluation phases, inference and prep account for all the exposed time. To speed up inference, we cached the test images in GPU memory, as they never change. We moved the prep phase to a pool of background processes, as each sample in the test dataset can be processed independently. We scored segmentation masks and boxes simultaneously in two background processes. These optimizations reduced evaluation time to ~4 seconds per epoch.

Dataloader optimization

This component loads and augments images during training. In our MLPerf 0.7 submission, all data loading work was done by CPU cores. The old dataloader was not fast enough to keep up with training after optimizations. To remedy that, we developed a hybrid dataloader.

The hybrid dataloader decodes the images on CPU and then does image augmentation work on GPU using Torchvision. To hide the cost of dataloading completely, we moved the load next image call in the main training loop after the loss backward call. The CPUs are idle for several milliseconds after the loss backward call because of theCUDA Graph launch. This is more than enough time to decode the next image. After the GPUs finish back propagation, they sit idle while the optimizer does all-reduce on the gradients. During this idle time, the dataloader does image augmentation work.

Speech recognition (RNN-T)

Speech recognition with RNN-T is the other new workload in this round of MLPerf training. We used the following optimizations:

Apex transducer loss

RNN-T uses a special loss function that we call transducer loss function. The algorithm that computes the loss is iterative in nature. A naive implementation is often inefficient due to the irregular memory access pattern and the exposed long memory read latency.

To overcome this difficulty, we developed apex.contrib.transducer.TransducerLoss. It uses a diagonal-wave-front-like computing paradigm to exploit the parallelism in the algorithm. Shared memory and registers are used extensively to cache the data exchanged between iterations. The loss function also employs prefetch to hide the memory access latency.

Apex transducer joint

Another component that is often found in a transducer-type network is the transducer joint operation. To accelerate this operation, we developed apex.contrib.transducer.TransducerJoint. This Apex extension is not only faster than its native PyTorch counterpart, but also enables output packing, reducing the workload seen by following layers.

Figure 15 shows the packing operation by the Apex transducer joint. In the baseline joint operation, the paddings from the input sequences are carried over to the output, as the joint operation is oblivious to the input padding. In the Apex transducer joint operation, the paddings are removed at the output, reducing the size of the tensor fed to the following operations.

Apex transducer joint packing operation explained: the padding created in the baseline joint operation is removed, hence reducing the size of the tensor fed to the following operations.
Figure 15. Apex transducer joint packing operation

Sequence splitting

To reduce computations of LSTMs that are wasted on paddings, we split batch processing into two phases (Figure 16). In the first pass, all the samples in the minibatch up to certain time steps (enclosed by the black boxes) are evaluated. Half of the samples in the minibatch are completed in the first pass. The remaining time steps of the other half of the samples (enclosed by the red boxes) are evaluated in the second pass. The regions enclosed by blue boxes represent the savings from batch splitting.

Sequence splitting avoids computations spent on the unnecessary computation.
Figure 16. Sequence splitting

The black dashed line in Figure 16 estimates the workload seen by the GPUs. Because the batch size is halved for the second pass, the workload seen by the GPU is roughly halved. In multi-GPU training, it is often the slowest GPU that limits the training throughput. The dashed line is obtained from the GPU with most workloads.

To mitigate this load imbalance, we employed a technique called presorting, where samples in a minibatch are sorted based on their sequence lengths. The longest and shortest sequences are placed on the same GPU to balance the workload. The intuition behind this is that GPUs with long sequences are likely to be the bottleneck. Therefore, short sequences should be placed on these GPUs as well to maximize the benefit of sequence splitting.

Batch splitting

RNN-T has an interesting network structure where the LSTMs deal with relatively small tensors, whereas the joint net takes much larger tensors. To enable LSTMs to run more efficiently with a large batch size while not exceeding the GPU memory capacity by having a huge tensor in the joint net, we employed a technique called batch splitting (Figure 17). We used a reasonably large batch size so that LSTMs achieved a decent GPU utilization. In contrast, joint net operates on a portion of the batch and loops through those subbatches one by one.

In Figure 17, a batch splitting factor of 2 is used. In this case, the batch sizes of the inputs to the LSTMs and the joint net are B and B/2, respectively. Because all the tensors generated by the joint net, except the gradients for the weights, are no longer needed after the backpropagation is completed, they can be released and create room for the next subbatch in the loop.

Batch splitting enables LSTMs to run more efficiently with a large batch size while not exceeding the GPU memory.]
Figure 17. Batch splitting

Batch evaluation with CUDA Graphs

Other than accelerating training, evaluation of RNN-T has also been scrutinized. The evaluation of RNN-T is iterative in nature and the evaluation of the predict network is performed step by step. Each sample in a batch might pick different code paths in the same time step, depending on the execution results. Because of these, a naive implementation leads to a low GPU utilization rate and a long evaluation time that is comparable to the training itself.

To overcome these difficulties, we performed two categories of optimizations in the evaluation. The first optimization performed evaluation in batch mode and take care of the different control flows in a batch with predicates. The second optimization graphed the main RNN-T evaluation loop, which consists of many short GPU kernels. We also used loop unrolling and overlapping CPU-GPU communication with GPU execution to amortize associated overheads. The optimized evaluation was more than 100x faster than the reference code for the single-node configuration, and more than 30x faster for the max-scale configuration.


MLPerf v1.0 showcases the continuous innovation happening in the AI domain. The NVIDIA AI platform delivers leadership performance with tight integration of hardware, data center technologies, and software to realize the full potential of AI.

In the last two-and-a-half years since the first MLPerf training benchmark launched, NVIDIA performance has increased by nearly 7x. The NVIDIA platform excels in both performance and usability, offering a single leadership platform from data center to edge to cloud.

All software used for NVIDIA submissions is available from the MLPerf repository, to enable you to reproduce our benchmark results. We constantly add these cutting-edge MLPerf improvements into our deep learning frameworks containers available on NGC, our software hub for GPU-optimized applications.


Taking it to the Street: NVIDIA DRIVE Ecosystem Brings AVs to Public Markets

Just like money, autonomous vehicles never sleep. And the companies developing them are working just as hard, rolling out transformative technology and growing into publicly traded entities. This ecosystem includes every aspect of the autonomous vehicle industry ― from sensors to software to mobility services ― all using the high-performance, energy-efficient NVIDIA DRIVE platform to Read article >

The post Taking it to the Street: NVIDIA DRIVE Ecosystem Brings AVs to Public Markets appeared first on The Official NVIDIA Blog.


Make Any Face Come to Life: NVIDIA’s Simon Yuen Talks Audio2Face

We all know about the applications for digital humans for films and video games, but at NVIDIA, Simon Yuen has discovered the vast need and potential for digital humans beyond the entertainment industry. Yuen spoke with NVIDIA AI Podcast host Noah Kravitz about how we’re getting to a point where the simulation of digital humans Read article >

The post Make Any Face Come to Life: NVIDIA’s Simon Yuen Talks Audio2Face appeared first on The Official NVIDIA Blog.



I may be losing my mind, but the tf website lists a 3D convolutional LSTM for keras in tf-nightly, but I can’t seem to find it after installing or updating.

Did it get dropped at some point? I’m hoping I can add the LSTM before flattening the data.

submitted by /u/QuantumSorcerer
[visit reddit] [comments]