
Faster Text Classification with Naive Bayes and GPUs

Accelerating classification on sparse data with GPUs and a splash of Bayes

Naive Bayes (NB) is a simple but powerful probabilistic classification technique that parallelizes well and can scale to datasets of massive size. 

If you have been working with text processing tasks in data science, you know that machine learning models can take a long time to train. Using GPU-accelerated computing on those models has often resulted in significant gains in time performance, and NB classifiers are no exception.

By using CUDA-accelerated operations, we achieved speedups of 5–20x, depending on the NB model used. Smart use of sparse data led to a 120x speedup for one of the models.

In this post, we present recent upgrades to the NB implementation in RAPIDS cuML and compare it to Scikit-learn’s implementation on the CPU. We provide benchmarks to demonstrate the performance benefits and walk through simple examples of each supported variant of the algorithm to help you determine which is best for your use case.

What is naive Bayes?

NB uses Bayes’ theorem (Figure 1) to model the conditional probability distribution used to predict a label or category (y) given some input features (x). In its simplest form, Bayes’ theorem computes this conditional probability from the joint probability of the features and each possible label, divided by the marginal probability of the features occurring across all possible labels.

Graphical representation of formula for Bayes’ Theorem with the conditional probability, joint probability, and marginal probability defined.
Figure 1. Bayes’ theorem represents the probability of a label (y) resulting from a set of features (x) as a conditional probability. It is computed using the joint probability of each label occurring with the set of features and the marginal probability of the features occurring across all possible labels
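
Because the figure itself is not reproduced here, the relationship it describes can be written out as follows. The notation (y for the label, x for the feature vector, y' for the summation index over labels) is our own choice for illustration.

```latex
P(y \mid x) \;=\; \frac{P(x, y)}{P(x)} \;=\; \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}
```

The numerator is the joint probability of the features and the label, and the denominator is the marginal probability of the features summed over all possible labels.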

NB algorithms have been shown to work well on text classification use cases. They are often applied to tasks such as filtering spam emails; predicting categories and sentiment for tweets, web pages, blog posts, user ratings, and forum posts; or ranking documents and web pages. 

The NB algorithms simplify the conditional probability distribution by making the naive assumption that each feature (for example, each column in an input vector x) is statistically independent of all the other features, given the class label. This assumption is what makes the algorithm easy to parallelize, because the statistics for each feature can be computed separately. Also, the general approach of computing simple co-occurrence probabilities between features and class labels enables the model to be trained incrementally, supporting datasets that don’t fit into memory.
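
Written out, the naive assumption turns the class-conditional probability of the whole feature vector into a product of per-feature terms. The symbols below (n features x_1 through x_n, predicted label ŷ) are our own notation for illustration.

```latex
P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} \;=\; \arg\max_{y} \Big[ \log P(y) + \sum_{i=1}^{n} \log P(x_i \mid y) \Big]
```

Working in log space replaces the product with a sum, which avoids numerical underflow when there are many features.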

NB comes in several variants, which make certain assumptions about the joint distribution of the features co-occurring with the various class labels.

Naive Bayes assumptions

To predict classes for unseen sets of input features, different assumptions about the joint distribution enable several different variants of the algorithm, which model the distribution of features by learning parameters for different probability distributions.

Table 1 models a simple document/term matrix that could come from a collection of text documents. The terms along the columns represent a vocabulary. A simple vocabulary might break a document into the set of unique words that occur in total across all the documents. 

        I  love  dogs  hate  and  knitting  is  my  hobby  session
Doc 1   1  1     1     0     0    0         0   0   0      0
Doc 2   1  0     1     1     1    1         0   0   0      0
Doc 3   0  0     0     0     1    1         1   2   1      1
Table 1. A document/term matrix containing documents along the rows and the vocabulary terms that occur in each document along the columns

In Table 1, each element could be a count, such as what is shown here, a 0 or 1 to denote the existence of a feature, or some other value such as a ratio, spread, or measure of dispersion for each term occurring across the entire set of documents.

In practice, a sliding window is often run across either the entire documents or the terms, dividing them further into small chunks of word sequences, known as n-grams. For the first document in Table 1, the 2-grams (or bigrams) would be “I love” and “love dogs”. It’s common for the vocabularies in these types of datasets to grow very large and become sparse. Preprocessing steps are often executed on the vocabulary to filter noise, for example, by removing common terms that appear in most documents.
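
The sliding window can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any particular vectorizer implements it. The function names `ngrams` and `ngram_range` are hypothetical.

```python
def ngrams(tokens, n):
    """Slide a window of length n over the tokens and join each window into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_range(tokens, lo, hi):
    """Collect every n-gram with lo <= n <= hi, shortest first."""
    terms = []
    for n in range(lo, hi + 1):
        terms.extend(ngrams(tokens, n))
    return terms


tokens = "I love dogs".split()
print(ngrams(tokens, 2))          # ['I love', 'love dogs']
print(ngram_range(tokens, 1, 3))  # unigrams, then bigrams, then the single trigram
```

Each additional n-gram length adds new columns to the vocabulary, which is why the feature count grows so quickly.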

The process of converting a document into a document-term matrix is known as vectorization. There are tools to accelerate this process, such as the CountVectorizer, TfidfVectorizer, or HashingVectorizer estimator objects in RAPIDS cuML.
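
To make vectorization concrete, here is a minimal pure-Python sketch that builds a vocabulary and a count matrix. It is a stand-in for what a count vectorizer does conceptually and is not the cuML implementation. The helper names and the three sentences are hypothetical, chosen only to resemble the shape of Table 1.

```python
def build_vocabulary(docs):
    """Map each unique lowercase token to a column index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(docs, vocab):
    """Return one row of term counts per document."""
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for token in doc.lower().split():
            row[vocab[token]] += 1
        matrix.append(row)
    return matrix


# Hypothetical sentences, not taken from the dataset used in this post.
docs = [
    "I love dogs",
    "I hate dogs and knitting",
    "Knitting is my hobby and my passion",
]
vocab = build_vocabulary(docs)
counts = vectorize(docs, vocab)
print(list(vocab))
for row in counts:
    print(row)
```

Most entries are zero even in this tiny example, which hints at how sparse the matrix becomes with a realistic vocabulary.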

Multinomial and Bernoulli distributions

Table 1 represents a set of documents, which have been vectorized into term counts such that each element in the resulting matrix represents the number of times a particular word appears in its corresponding document. This simple representation can be effective for classification tasks.

Because the features represent a frequency distribution, the multinomial naive Bayes variant can effectively model the joint distribution of the features and their associated classes with a multinomial distribution. 

The frequency distributions for each term can be enhanced by incorporating a measure of dispersion, such as Term Frequency Inverse Document Frequency (TF-IDF), which takes into account the number of documents each term occurs in. This can significantly improve classification performance by giving more weight to the terms that appear in fewer documents, and thus improve their discriminative abilities.
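
As a rough sketch of the weighting, the following computes one common textbook form of TF-IDF, count × log(N / df). Library vectorizers typically use smoothed or normalized variants, so the exact numbers here are an assumption for illustration only.

```python
import math


def tfidf(counts):
    """Weight each term count by log(N / df), where df is the number of documents containing the term."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in counts]


# Three documents, three terms; the first term appears in every document.
counts = [
    [2, 1, 0],
    [1, 0, 1],
    [3, 0, 0],
]
weights = tfidf(counts)
for row in weights:
    print(row)
```

The first term occurs in all three documents, so it receives zero weight, while the two rarer terms keep a positive weight.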

While the multinomial distribution works well when used directly with term frequencies, it has also been shown to perform well on fractional values, such as TF-IDF values. The multinomial naive Bayes variant covers a large number of use cases and so tends to be the most widely used. A similar variant is Bernoulli naive Bayes, which models the simple occurrence of each term rather than its frequency, resulting in a matrix of 0s and 1s (a Bernoulli distribution).
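
The multinomial variant can be sketched from scratch in plain Python. This is a toy illustration with Laplace smoothing (the `alpha` parameter), not the cuML or Scikit-learn implementation. The class name `TinyMultinomialNB` and the four-term dataset are hypothetical.

```python
import math


class TinyMultinomialNB:
    """Toy multinomial naive Bayes over dense rows of term counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, X, y):
        n_features = len(X[0])
        self.classes_ = sorted(set(y))
        feature_count = {c: [0.0] * n_features for c in self.classes_}
        class_count = {c: 0 for c in self.classes_}
        for row, label in zip(X, y):
            class_count[label] += 1
            for j, value in enumerate(row):
                feature_count[label][j] += value
        self.log_prior_ = {c: math.log(class_count[c] / len(y)) for c in self.classes_}
        self.log_prob_ = {}
        for c in self.classes_:
            total = sum(feature_count[c]) + self.alpha * n_features
            self.log_prob_[c] = [math.log((v + self.alpha) / total) for v in feature_count[c]]
        return self

    def predict(self, X):
        predictions = []
        for row in X:
            scores = {
                c: self.log_prior_[c] + sum(v * lp for v, lp in zip(row, self.log_prob_[c]))
                for c in self.classes_
            }
            predictions.append(max(scores, key=scores.get))
        return predictions


# Hypothetical columns: dogs, cats, stocks, bank
X = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 3, 1], [0, 0, 1, 2]]
y = ["pets", "pets", "finance", "finance"]
model = TinyMultinomialNB().fit(X, y)
print(model.predict([[1, 1, 0, 0], [0, 0, 1, 1]]))  # ['pets', 'finance']
```

Training only needs per-class feature sums and class counts, which is why the model can be updated one batch at a time.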

Unequal class distributions

It’s common to find imbalanced datasets in the real world. For example, you might have limited data samples for spam and malicious activity but an abundance of normal and benign samples.

The complement naive Bayes variant helps reduce the effects of unequal class distributions by using the complement of the joint distribution for each class during training, that is, the number of times a feature occurred in samples from all other classes.
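
The complement idea can be sketched as follows: for each class, count the features in every sample that does not belong to it, then predict the class whose complement matches the input worst. This simplified version omits the weight normalization that some implementations apply. The function names and the spam/ham data are hypothetical.

```python
import math


def complement_weights(X, y, alpha=1.0):
    """For each class c, estimate smoothed log feature probabilities from all samples NOT in c."""
    n_features = len(X[0])
    weights = {}
    for c in sorted(set(y)):
        comp = [0.0] * n_features
        for row, label in zip(X, y):
            if label != c:
                for j, v in enumerate(row):
                    comp[j] += v
        total = sum(comp) + alpha * n_features
        weights[c] = [math.log((v + alpha) / total) for v in comp]
    return weights


def predict_complement(weights, row):
    """Pick the class whose complement explains the row least well (lowest score)."""
    scores = {c: sum(v * w for v, w in zip(row, ws)) for c, ws in weights.items()}
    return min(scores, key=scores.get)


# Hypothetical columns: meeting, prize, today. Four "ham" samples, one "spam" sample.
X = [[3, 0, 1], [2, 0, 0], [4, 1, 0], [3, 0, 1], [0, 2, 1]]
y = ["ham", "ham", "ham", "ham", "spam"]
w = complement_weights(X, y)
print(predict_complement(w, [0, 1, 0]))  # spam
print(predict_complement(w, [2, 0, 0]))  # ham
```

Because the rare class is scored with statistics gathered from the abundant class, its estimates rest on more data than its own few samples would provide.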

Categorical distributions

You could also create bins for each of your features, maybe by quantizing some frequencies into a number of buckets such that frequencies of 0-5 go into bucket 0, frequencies of 6-10 go into bucket 1, and so on.

Another option could be to merge several terms together into a single feature, maybe by creating buckets for “animals” and “holidays,” where “animals” might have three buckets, zero for feline, one for canine, and two for rodents. “Holidays” might have two buckets, zero for personal holidays such as a birthday or wedding anniversary, and one for federal holidays.

The categorical naive Bayes variant assumes that the features follow a categorical distribution. The naive assumption works well for this case because it allows each feature to have a different set of categories, and it models the joint distribution using, you guessed it, a categorical distribution.
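
Both steps, binning and per-feature category probabilities, can be sketched in plain Python. The bucket width of 5 follows the 0-5 and 6-10 example above and is assumed to continue in steps of 5. The function names, labels, and data are hypothetical.

```python
import math


def bucket(freq, width=5):
    """Quantize a frequency: 0-5 -> bucket 0, 6-10 -> bucket 1, 11-15 -> bucket 2, and so on."""
    return max(0, (freq - 1) // width)


def fit_categorical(X, y, n_categories, alpha=1.0):
    """Estimate log P(feature j = k | class c) with Laplace smoothing.

    n_categories[j] is the number of categories of feature j, so each
    feature can have a different set of categories.
    """
    log_prob = {}
    for c in sorted(set(y)):
        rows = [row for row, label in zip(X, y) if label == c]
        log_prob[c] = []
        for j, k in enumerate(n_categories):
            counts = [0] * k
            for row in rows:
                counts[row[j]] += 1
            total = len(rows) + alpha * k
            log_prob[c].append([math.log((n + alpha) / total) for n in counts])
    return log_prob


print([bucket(f) for f in (0, 5, 6, 10, 11)])  # [0, 0, 1, 1, 2]

# Feature 0: "animals" (0 feline, 1 canine, 2 rodent); feature 1: "holidays" (0 personal, 1 federal)
X = [[0, 0], [0, 0], [1, 1], [2, 1]]
y = ["a", "a", "b", "b"]
log_prob = fit_categorical(X, y, n_categories=[3, 2])
print([round(math.exp(v), 2) for v in log_prob["a"][0]])  # [0.6, 0.2, 0.2]
```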

Continuous distributions

Finally, the Gaussian naive Bayes variant works great when features are continuous and it can be assumed that the distribution of features in each class can be modeled with Gaussian distributions, that is, with a simple mean and variance.
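
For reference, the per-feature likelihood under this assumption is the normal density with a class-specific mean and variance. The subscripted symbols are our own notation.

```latex
P(x_i \mid y) \;=\; \frac{1}{\sqrt{2\pi\,\sigma_{y,i}^{2}}}
\exp\!\left( -\frac{(x_i - \mu_{y,i})^{2}}{2\,\sigma_{y,i}^{2}} \right)
```

Here \mu_{y,i} and \sigma_{y,i}^{2} are the mean and variance of feature i over the training samples of class y.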

While this variant might demonstrate good performance on some datasets after TF-IDF normalization, it can also be useful on general machine learning datasets.

Algorithm     Type of input         Advantage
Multinomial   Frequencies, counts   Supports count data
Bernoulli     Frequencies, counts   Supports binary data
Complement    Frequencies, counts   Reduces impact of imbalanced data
Categorical   Categorical           Supports categorical data
Gaussian      Continuous            Supports general continuous data
Table 2. Comparison of the different NB algorithms

Real-world end-to-end examples

To demonstrate the benefits of each algorithm variant, as outlined in Table 2, we step through example notebooks of each algorithm variant. For a comprehensive end-to-end notebook that includes all the examples, see news_aggregator_a100.ipynb.

We used the News Aggregator dataset to demonstrate the performance of the NB variants. The dataset is available publicly from Kaggle and consists of 422K news headlines taken from multiple news sources. Each headline is labeled with one of four possible labels: business, science and technology, entertainment, and health. The data is loaded directly onto the GPU using RAPIDS cuDF and continues through preprocessing steps specific to each NB variant.

Gaussian naive Bayes

Starting with Gaussian naive Bayes, we ran a TF-IDF vectorizer to transform the text data into a real-valued vector that can be used for training.

By specifying ngram_range=(1,3) we indicated that we would learn on single words, as well as 2- and 3-grams. This significantly increases the number of terms or features to learn, from 15K words to 1.8M combinations. As most terms do not occur in most headlines, the resulting matrix is sparse with many values equal to zero. cuML supports special structures to represent data like this.
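
One widely used sparse layout is compressed sparse row (CSR), which stores only the nonzero values. The sketch below builds it by hand in plain Python to show the idea. We are using CSR as an assumed example, and this post does not state which layout a given cuML estimator uses internally.

```python
def to_csr(dense):
    """Convert dense rows to CSR arrays: nonzero values, their column indices, and row offsets."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(j)
        indptr.append(len(data))  # where the next row starts in data/indices
    return data, indices, indptr


def row_items(csr, i):
    """Return the (column, value) pairs stored for row i."""
    data, indices, indptr = csr
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))


dense = [
    [0, 0, 0.4, 0],
    [0.7, 0, 0, 0.1],
    [0, 0, 0, 0],
]
csr = to_csr(dense)
print(csr)                # ([0.4, 0.7, 0.1], [2, 0, 3], [0, 1, 3, 3])
print(row_items(csr, 1))  # [(0, 0.7), (3, 0.1)]
```

Storage and compute then scale with the number of nonzero entries rather than with the full 1.8M-column width.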

One additional benefit of NB classifiers is that they can be trained incrementally using a partial_fit method on the Estimator object. This technique is suited for massive datasets that might not fit into memory all at once or which must be distributed across multiple GPUs.

Our first example demonstrates incremental training using Gaussian naive Bayes by splitting the data into multiple chunks after preprocessing into continuous features with TF-IDF. The cuML version of Gaussian naive Bayes is 21x faster than Scikit-learn for training and 72x faster for inference.
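
To illustrate why incremental training works, here is a toy Gaussian naive Bayes in plain Python whose partial_fit keeps running per-class means and variances (using Welford's online update), so each chunk can be discarded after it is processed. It is a sketch under our own assumptions, not the cuML implementation. The class name, the `var_smoothing` value, and the data are hypothetical.

```python
import math


class TinyGaussianNB:
    """Toy Gaussian naive Bayes with chunk-by-chunk training."""

    def __init__(self, var_smoothing=1e-9):
        self.var_smoothing = var_smoothing  # keeps variances strictly positive
        self.stats = {}  # label -> [count, per-feature mean, per-feature sum of squared deviations]

    def partial_fit(self, X, y):
        for row, label in zip(X, y):
            if label not in self.stats:
                self.stats[label] = [0, [0.0] * len(row), [0.0] * len(row)]
            s = self.stats[label]
            s[0] += 1
            for j, x in enumerate(row):
                delta = x - s[1][j]
                s[1][j] += delta / s[0]          # update running mean
                s[2][j] += delta * (x - s[1][j])  # update running squared deviations
        return self

    def predict(self, X):
        total = sum(s[0] for s in self.stats.values())
        predictions = []
        for row in X:
            best, best_score = None, -math.inf
            for label, (n, mean, m2) in self.stats.items():
                score = math.log(n / total)  # log prior
                for x, mu, ss in zip(row, mean, m2):
                    var = ss / n + self.var_smoothing
                    score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
                if score > best_score:
                    best, best_score = label, score
            predictions.append(best)
        return predictions


# Train on two chunks, as if the full dataset did not fit in memory.
model = TinyGaussianNB()
model.partial_fit([[1.0, 2.0], [3.0, 4.0]], [0, 1])
model.partial_fit([[1.2, 1.8], [3.2, 4.1]], [0, 1])
print(model.predict([[1.1, 2.0], [3.1, 4.0]]))  # [0, 1]
```

The model state is just a count, a mean, and a squared-deviation sum per class and feature, so its size does not depend on how many chunks have been seen.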

Bernoulli naive Bayes

The next example demonstrates Bernoulli naive Bayes, without incremental training, using binary features that represent the presence or absence of each term. The CountVectorizer object can be used to accomplish this with the setting binary=True. We found a 14x speedup over Scikit-learn in this example.
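
The difference from the multinomial variant is that Bernoulli naive Bayes also scores the absence of a term. A minimal plain-Python sketch, with hypothetical function names and data, might look like this:

```python
import math


def binarize(row):
    """Turn counts into presence/absence indicators, like a count vectorizer with binary=True."""
    return [1 if v > 0 else 0 for v in row]


def fit_bernoulli(X, y, alpha=1.0):
    """For each class, store the log prior and the smoothed probability that each term is present."""
    model = {}
    for c in sorted(set(y)):
        rows = [binarize(r) for r, label in zip(X, y) if label == c]
        n = len(rows)
        p = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*rows)]
        model[c] = (math.log(n / len(y)), p)
    return model


def predict_bernoulli(model, row):
    x = binarize(row)

    def score(c):
        log_prior, p = model[c]
        # Present terms contribute log p, absent terms contribute log (1 - p).
        return log_prior + sum(math.log(pj) if xj else math.log(1 - pj) for xj, pj in zip(x, p))

    return max(model, key=score)


X = [[3, 0, 1], [1, 0, 0], [0, 2, 0], [0, 1, 1]]
y = ["a", "a", "b", "b"]
bnb = fit_bernoulli(X, y)
print(predict_bernoulli(bnb, [5, 0, 0]))  # a
print(predict_bernoulli(bnb, [0, 4, 0]))  # b
```

Note that the count 5 and the count 1 are treated identically: only presence matters.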

Multinomial naive Bayes

Multinomial naive Bayes is the most versatile and widely used variant. In this example, we used the TF-IDF vectorizer instead of CountVectorizer and measured a 5x speedup over Scikit-learn.

Complement naive Bayes

We demonstrated the power of complement naive Bayes using CountVectorizer and showed that it yielded a better classification score than both the Bernoulli and multinomial NB variants on our imbalanced dataset.

Categorical naive Bayes

Last but definitely not least is an example of categorical naive Bayes. To vectorize the data, we used k-means along with a model previously trained on another NB variant to group similar terms into the same categories based on their contribution to the resulting classes.

We found a 126x speedup over Scikit-learn to train a model with 315K news headlines and 23x speedup to perform inference and compute the model’s accuracy.


The charts in Figure 2 compare the performance of NB training and inference between RAPIDS cuML and Scikit-learn for all of the variants outlined in this post.

The benchmarks were performed on an a2-highgpu-8g Google Cloud Platform (GCP) instance provisioned with an NVIDIA Tesla A100 GPU and 96 Intel Cascade Lake vCPUs at 2.2 GHz.

Charts containing performance comparisons between RAPIDS cuML and Scikit-learn for the Naive Bayes variants outlined in this post. cuML is substantially faster than Scikit-learn during both training and testing phases.
Figure 2. Performance comparison between Scikit-learn (blue) and cuML (green)

GPU-accelerated naive Bayes

We were able to implement all the NB variants right in Python with CuPy, which is a GPU-accelerated near-drop-in replacement for NumPy and SciPy. CuPy also provides you with the capability to write custom CUDA kernels in Python. It uses the just-in-time (JIT) compilation abilities of NVRTC to compile and execute them on the GPU while the Python application is running.

At the core of all the NB variants lie two simple primitives written using CuPy’s JIT, to sum and count the features for each class.
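
As a CPU-only illustration of what such primitives compute, the sketch below accumulates per-class feature sums over a CSR-style sparse matrix and counts the samples in each class. This is our reading of the two operations, written as plain Python loops rather than CuPy kernels. The function names and the data are hypothetical.

```python
def sum_features_per_class(data, indices, indptr, labels, n_classes, n_features):
    """feature_count[c][j] = sum of feature j over all rows whose label is c."""
    feature_count = [[0.0] * n_features for _ in range(n_classes)]
    for i, c in enumerate(labels):
        # Visit only the nonzero entries stored for row i.
        for k in range(indptr[i], indptr[i + 1]):
            feature_count[c][indices[k]] += data[k]
    return feature_count


def count_classes(labels, n_classes):
    """class_count[c] = number of rows whose label is c."""
    class_count = [0] * n_classes
    for c in labels:
        class_count[c] += 1
    return class_count


# Three rows, three features, stored sparsely: row 0 = {0: 1, 2: 2}, row 1 = {1: 1}, row 2 = {2: 3}
data, indices, indptr = [1, 2, 1, 3], [0, 2, 1, 2], [0, 2, 3, 4]
labels = [0, 1, 0]
print(sum_features_per_class(data, indices, indptr, labels, n_classes=2, n_features=3))
print(count_classes(labels, n_classes=2))
```

Every nonzero entry is handled independently of the others, which is the property that makes this kind of accumulation a good fit for a GPU.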

When a single document-term matrix grows too large to process on a single GPU, the Dask library can make use of the incremental training feature to spread the processing over multiple GPUs and multiple nodes. Currently, the multinomial variant can be distributed with Dask in cuML.


NB algorithms should be in every data scientist’s toolkit. With RAPIDS cuML you can accelerate your implementations of NB on the GPU, without dramatically changing your code. These powerful and fundamental algorithms, combined with the speedup of cuML, provide everything you need to perform classification on extremely large or sparse datasets.

If you think that RAPIDS cuML can help accelerate your data science and machine learning workflows or is already doing so, then leave a comment because we’d love to hear about it.

As always, visit the rapidsai GitHub repo and let us know how we can help you. You can also follow us on Twitter at @rapidsai.

If you are new to RAPIDS, be sure to check out the Getting Started resources to get up and running quickly.


Accelerating Cloud-Native Applications at China Mobile Bigcloud

Cloud computing is designed to be agile and resilient to deliver additional value for businesses. China Mobile (CMCC), one of China’s largest telecom operators and cloud services providers, offers precisely this with its Bigcloud public cloud offering.

Bigcloud provides PaaS and SaaS services tailored to the needs of enterprise cloud and hybrid-cloud solutions for mission-critical applications. CMCC understands that businesses rely on their networking and communication infrastructure to stay competitive in an increasingly always-on, digital world.

When they started experiencing enormous demand for their cloud-native services, CMCC turned to network abstraction and virtualization through Open vSwitch (OVS) to automate their network and gain dynamic control over it, helping them handle the growing demand.

However, maintaining network performance under the added east-west network traffic became a serious challenge.

Virtual sprawl produced an explosion of east-west traffic that created increased network congestion.
Figure 1. Bigcloud networking solution

Identifying the network challenges

With the massive adoption of cloud services, CMCC experienced enormous growth in its virtualization environment. This virtual sprawl produced an explosion of east-west traffic between servers within their data centers.

Due to the rise in network traffic, they also saw an increase in network congestion, causing higher jitter and latency and hindering overall network throughput and application performance. This left them with insufficient effective bandwidth, and they were unable to keep up with the large number of network flows during peak business times.

As CMCC investigated the cause of these challenges, they determined that the root of these problems stemmed from four main issues with the Open vSwitch:

  • Inefficient vSwitch capacity for VXLAN encapsulation and decapsulation rule handling due to the server CPUs being tasked with both application and networking requests.
  • Poor performance of kernel-based vSwitch forwarding caused by frequent context switching between user space, kernel space, and memory, which created data copying overhead.
  • DPDK-based vSwitch forwarding created competition for server CPU resources, which were already severely limited.
  • Limited vSwitch flow rule capabilities because of lowered throughput caused by excessive packet loss, jitter, and latency.

These challenges became a bottleneck and prevented applications from receiving the high network traffic throughput they required at the lowest possible latency.

While OVS allows packets and flow rules to be forwarded between hosts as well as to the outside world, it is CPU-intensive. It affects system performance by consuming CPU cores that should be used for customer applications, and it prevents full utilization of available bandwidth.

CMCC wanted to ensure network application response times stayed low, that delivered bandwidth was consistent, and that they were able to meet peak demands.

CMCC used OVS and OVS DPDK to support a highly efficient SDN network.
Figure 2. CMCC faced challenges in their desire to support both OVS and OVS DPDK for their Bigcloud vSwitch Forwarding

CMCC turned to two experts in this area, NVIDIA and Nokia, who jointly provided a highly efficient, software-defined networking (SDN) solution. The solution combines the offloads, performance, and efficiency of NVIDIA ConnectX SmartNIC and the NVIDIA BlueField data processing unit (DPU) technology with the agility, elasticity, and automation of the Nuage Networks Virtualized Services Platform (VSP).

Together, NVIDIA and Nuage offload the computationally intensive packet processing operations associated with OVS and free costly compute resources so they can run applications instead of SDN tasks.

SmartNIC- and DPU-powered accelerated networking

The NVIDIA ConnectX series of SmartNICs and BlueField series of DPUs offer NVIDIA Accelerated Switching and Packet Processing (ASAP2) technology, which runs the OVS data plane within the NIC hardware while leaving the OVS control plane intact and completely transparent to applications.

ASAP2 has two modes. In the first mode, the hardware data plane is built on top of SR-IOV virtual functions (VFs) so that each network VF is connected directly to its corresponding VM.

An alternate approach that is also supported is VirtIO acceleration through virtual data path acceleration (vDPA). VirtIO gives virtual machines native access to hardware devices such as network adapters, while vDPA allows the connection to the VM to be established with the OVS data plane built between the network device and the standard VirtIO driver through device queues called virtqueues. This enables seamless integration between VMs and accelerated networking: the control plane is managed on the host, whereas the VirtIO data plane is accelerated by SmartNIC hardware.

BlueField DPUs provide hardware offload and acceleration to reduce network congestion
Figure 3. vDPA uses SmartNIC hardware to offload and accelerate traffic for each VM.

Seamless integration of Nuage Networks SDN with NVIDIA vDPA technology

Nuage Networks’ contribution to the solution is their Virtualized Services Platform (VSP). VSP performs the virtual routing and switching and is the distributed forwarding module based on Open vSwitch, serving as a virtual endpoint for network services. VSP immediately recognizes any changes in the compute environment, triggering instantaneous policy-based responses in network connectivity and configuration to ensure application performance.

Nuage Networks’ VSP uses tunneling protocols such as VXLAN to encapsulate the original payload as an overlay SDN solution.

Because standard NICs don’t recognize new packet header formats, traditionally all packet manipulation operations must be performed by the CPU, potentially over-taxing the CPU and causing significant network I/O performance degradation, especially as server I/O speeds increase.

For this reason, overlay network processing needs to be offloaded to an I/O-specific hardware adapter that can handle VXLAN, like ConnectX or BlueField, to reduce CPU strain.

Performance advantages of vDPA

ASAP2 uses hardware acceleration to increase performance compared to OVS DPDK.
Figure 4. Performance comparison of OVS DPDK in software versus ASAP2 vDPA hardware acceleration.

China Mobile decided to go with the VirtIO solution for maximum compatibility, and they wanted the ability to choose either straight OVS or OVS DPDK, depending on the use case. Working together, Nuage Networks and NVIDIA delivered an SDN solution for China Mobile’s public cloud that is agile, scalable, and hardware-accelerated and which supports both types of network virtualization.

The joint solution using Nuage Networks VSP with NVIDIA hardware-accelerated vDPA delivered significantly faster performance. The network throughput increased by 1.5x, the packet forwarding rate was 3x faster, and the Apache benchmark supported 7x more requests per second, compared to running OVS-DPDK in software alone.

Learn more

For more information about the differentiation between OVS offload technologies, why CMCC decided to use the VirtIO/vDPA solution, and how NVIDIA can help you improve efficiencies in cloud-native technologies, see the Turbocharge Cloud-Native Application with Virtual Data Plane Accelerated Networking joint GTC session between CMCC, Nuage Networks, and NVIDIA.


What’s New in NVIDIA AI Enterprise 2.1

Today, NVIDIA announced general availability of NVIDIA AI Enterprise 2.1. This latest version of the end-to-end AI and data analytics software suite is optimized, certified, and supported for enterprises to deploy and scale AI applications across bare metal, virtual, container, and cloud environments. 

Release highlights: New containers, public cloud support

The NVIDIA AI Enterprise 2.1 release offers advanced data science with the latest NVIDIA RAPIDS and low-code AI model development using the most recent release of NVIDIA TAO Toolkit.

Making enterprise AI even more accessible across hybrid or multi-cloud environments, AI Enterprise 2.1 includes added support for Red Hat OpenShift running in the public cloud and the new Microsoft Azure NVads A10 v5 series. These are the first NVIDIA virtual GPU instances offered from the public cloud, which enables affordable GPU sharing.

Support for the latest AI frameworks

NVIDIA AI Enterprise enables you to stay current with the latest AI tools for development and deployment, along with enterprise support and regular updates from NVIDIA. Support will continue for those relying on earlier versions of NVIDIA AI frameworks, ensuring the flexibility to manage infrastructure updates.

NVIDIA TAO Toolkit 22.05

The NVIDIA TAO Toolkit is a low-code solution of NVIDIA TAO, a framework that enables developers to create custom, production-ready models to power speech and vision AI applications.

The latest version of the TAO Toolkit is now supported through NVIDIA AI Enterprise, with new key features including REST API integration, pre-trained weights import, TensorBoard integration, and new pre-trained models.


NVIDIA RAPIDS 22.04

The RAPIDS 22.04 release provides more support for data workflows through the addition of new models, techniques, and data processing capabilities across all the NVIDIA data science libraries.

Red Hat OpenShift support in the public cloud 

Red Hat OpenShift, the industry’s leading enterprise Kubernetes platform with integrated DevOps capabilities, is now certified and supported for the public cloud with NVIDIA AI Enterprise, in addition to bare metal and VMware vSphere-based deployments. This enables a standardized AI workflow in a Kubernetes environment to scale across a hybrid-cloud environment.

Azure NVads A10 v5 support 

The Azure NVads A10 v5 series, powered by NVIDIA A10 Tensor Core GPUs, offers unprecedented GPU scalability and affordability with fractional GPU sharing for flexible GPU sizes ranging from one-sixth of an A10 GPU to two full A10 GPUs.

As part of the supported platforms, the NVads A10 v5 instances are certified with NVIDIA AI Enterprise to deliver optimized performance for deep learning inferencing, maximizing the utility and cost efficiency of at-scale deployments in the cloud.

Domino Data Lab Enterprise MLOps Platform Certification

NVIDIA AI Accelerated partner Domino Data Lab’s enterprise MLOps platform is now certified for NVIDIA AI Enterprise. This level of certification mitigates deployment risks and ensures reliable, high-performance integration with the NVIDIA AI platform.

This partnership pairs the Enterprise MLOps benefits of workload orchestration, self-serve infrastructure, and collaboration with cost-effective scale from virtualization on mainstream accelerated servers.

Try NVIDIA AI Enterprise 

NVIDIA LaunchPad provides organizations around the world with immediate, short-term access to the NVIDIA AI Enterprise software suite in a private accelerated computing environment that includes hands-on labs.

Experience the latest NVIDIA AI frameworks and tools, running on NVIDIA AI Enterprise, through new NVIDIA LaunchPad labs. Hosted on NVIDIA-accelerated infrastructure, the labs enable enterprises to speed up the development and deployment of modern, data-driven applications and quickly test and prototype the entire AI workflow on the same complete stack available for deployment.

Check out these new LaunchPad labs for NVIDIA AI Enterprise 2.1:

  • Multi-Node Training for Image Classification on VMware vSphere with Tanzu
  • Deploy a Fraud Detection XGBoost Model using NVIDIA Triton
  • Develop a Custom Object Detection Model with NVIDIA TAO Toolkit and Deploy with NVIDIA DeepStream

Integrating NVIDIA Reflex: Q&A with Pathea Head of Technology Jingyang Xu

NVIDIA spoke with Chief Wizard of Pathea, Jingyang Xu, about himself, his company, and the process of implementing NVIDIA Reflex in My Time at Sandrock, the studio’s latest release.

For those who may not know you, could you tell us about yourself?

My name is Jingyang Xu. I am the Chief Wizard for Pathea. In other words, I’m the Head of Technology. I’ve had various jobs in the software industry for the last 23 years and recently joined Pathea.

Tell us about Pathea and the success of the company thus far?

Pathea has developed a few games that the readers may have heard of, including Planet Explorers and My Time at Portia. Currently, we are working on a few titles, including My Time at Sandrock, part of the “My Time” series. Like Portia, you take up the role of a builder in a post-apocalyptic world. We’ve gone early access on Steam, WeGame, Epic, and Bilibili.

Picture of a farm with blooming crops.
Figure 1. A farm in My Time at Sandrock

Why did you decide to integrate Reflex?

We are always open to the latest and greatest in gaming technology out there to provide the best experience for the players. So, when NVIDIA kept telling us how great Reflex is, we went ahead with it.

What challenge were you looking to solve with Reflex?

A few spots in the game require quick reflexes to complete the mission with the best result. So, we hoped that Reflex would allow the players to have great fun doing those missions.

How long did it take for you to get Reflex up and running in My Time at Sandrock?

NVIDIA provided us with a plugin for Unity; it only took a couple of hours to get Reflex up and running.

Picture of a girl and man having a conversation sitting down.
Figure 2. Gameplay of My Time at Sandrock

How difficult was the Reflex integration process with Unity?

The hard part is to locate the issue. Our testing didn’t initially find any problems. However, there was an issue with some missing DLLs on certain players’ machines. NVIDIA quickly fixed the issues. The problem caused some players’ experiences to be less than satisfactory, but NVIDIA responded quickly and helped Sandrock and us overcome it.

Any surprises or unexpected challenges?

We were surprised by how quickly NVIDIA responded when we had the issues. When we were trying to employ Reflex, we talked with them, and they gave us many solutions and suggestions to help us. The turnaround was super quick! More than that, NVIDIA spent time helping us to test to ensure that it runs well.

How has Reflex affected gameplay?

We have tested the performance and the results are very positive. We saw a 20%-30% increase in input responsiveness with our test. You can easily find the speed changes, which make us feel more confident that the gaming experience will be much better than before.

Any tips or lessons learned for other developers looking into Reflex?

We think it’s worth trying, and don’t worry about any problems, because NVIDIA will help you fix them. Just keep regular communications with NVIDIA. If it fits your gameplay, we believe it shows promise.

Do you plan on integrating NVIDIA Reflex in future titles?

Definitely. We have a couple of games in the pipeline that I think Reflex will be great for in the future. We believe Reflex is an excellent way to reduce game latency. It has been proven in Sandrock, and we believe it can be used in our other games to get the same results. Based on that, Reflex is very useful for making better games.

More resources

For more information, see the full list of NVIDIA Reflex Compatible Products. For other NVIDIA resources, see Game Development.


Training Generalist Agents with Multi-Game Decision Transformers

Current deep reinforcement learning (RL) methods can train specialist artificial agents that excel at decision-making on various individual tasks in specific environments, such as Go or StarCraft. However, little progress has been made to extend these results to generalist agents that would be capable not only of performing many different tasks, but also of operating across a variety of environments with potentially distinct embodiments.

Looking across recent progress in the fields of natural language processing, vision, and generative models (such as PaLM, Imagen, and Flamingo), we see that breakthroughs in making general-purpose models are often achieved by scaling up Transformer-based models and training them on large and semantically diverse datasets. It is natural to wonder, can a similar strategy be used in building generalist agents for sequential decision making? Can such models also enable fast adaptation to new tasks, similar to PaLM and Flamingo?

As an initial step to answer these questions, in our recent paper “Multi-Game Decision Transformers” we explore how to build a generalist agent to play many video games simultaneously. Our model trains an agent that can play 41 Atari games simultaneously at close-to-human performance and that can also be quickly adapted to new games via fine-tuning. This approach significantly improves upon the few existing alternatives to learning multi-game agents, such as temporal difference (TD) learning or behavioral cloning (BC).

A Multi-Game Decision Transformer (MGDT) can play multiple games at a desired level of competency from training on a range of trajectories spanning all levels of expertise.

Don’t Optimize for Return, Just Ask for Optimality

In reinforcement learning, reward refers to the incentive signals that are relevant to completing a task, and return refers to cumulative rewards in a course of interactions between an agent and its surrounding environment. Traditional deep reinforcement learning agents (DQN, SimPLe, Dreamer, etc.) are trained to optimize decisions to achieve the optimal return. At every time step, an agent observes the environment (some also consider the interactions that happened in the past) and decides what action to take to help itself achieve a higher return magnitude in future interactions.
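
To make the distinction between reward and return concrete, the short sketch below computes, for every time step, the cumulative reward still to come in an episode. It assumes undiscounted rewards. The name `returns_to_go` follows common usage in the Decision Transformer literature rather than anything defined in this post, and the reward values are made up.

```python
def returns_to_go(rewards):
    """For each time step, sum the rewards from that step to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


rewards = [1.0, 0.0, 2.0, -1.0]  # hypothetical per-step rewards
print(returns_to_go(rewards))    # [2.0, 1.0, 1.0, -1.0]
```

The first entry is the return of the whole episode, and later entries show how much of it remains to be collected.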

In this work, we use Decision Transformers as our backbone approach to training an RL agent. A Decision Transformer is a sequence model that predicts future actions by considering past interactions between an agent and the surrounding environment, and (most importantly) a desired return to be achieved in future interactions. Instead of learning a policy to achieve high return magnitude as in traditional reinforcement learning, Decision Transformers map diverse experiences, ranging from expert-level to beginner-level, to their corresponding return magnitude during training. The idea is that training an agent on a range of experiences (from beginner to expert level) exposes the model to a wider range of variations in gameplay, which in turn helps it extract useful rules of gameplay that allow it to succeed under any circumstance. So during inference, the Decision Transformer can achieve any return value in the range it has seen during training, including the optimal return.

But, how do you know if a return is both optimal and stably achievable in a given environment? Previous applications of Decision Transformers relied on customized definitions of the desired return for each individual task, which required manually defining a plausible and informative range of scalar values that are appropriately interpretable signals for each specific game — a task that is non-trivial and rather unscalable. To address this issue, we instead model a distribution of return magnitudes based on past interactions with the environment during training. At inference time, we simply add an optimality bias that increases the probability of generating actions that are associated with higher returns.
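The optimality bias described above can be illustrated with a minimal sketch. It assumes the model outputs logits over a discretized set of return values and that the bias is a bonus growing linearly with the normalized return; the function names, the bin values, and the `kappa` strength parameter are hypothetical and are not taken from the paper's code.

```python
import math

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bias_toward_high_return(return_logits, return_bins, kappa):
    """Reweight a predicted return distribution toward higher returns.

    return_logits: unnormalized log-probabilities over return bins
    return_bins:   the return value each bin represents (hypothetical)
    kappa:         strength of the optimality bias (hypothetical)
    """
    low, high = min(return_bins), max(return_bins)
    span = (high - low) or 1.0
    # Add a bonus that grows linearly with each bin's normalized return.
    biased = [
        logit + kappa * (r - low) / span
        for logit, r in zip(return_logits, return_bins)
    ]
    return softmax(biased)

return_bins = [0, 10, 20, 30]
return_logits = [1.0, 1.0, 1.0, 1.0]  # model is indifferent before biasing
plain = softmax(return_logits)
biased = bias_toward_high_return(return_logits, return_bins, kappa=3.0)
```

With `kappa=0` the distribution is unchanged; larger values shift more probability mass onto the high-return bins, so actions conditioned on a sampled return lean toward better play.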

To more comprehensively capture spatial-temporal patterns of agent-environment interactions, we also modified the Decision Transformer architecture to consider image patches instead of a global image representation. Patches allow the model to focus on local dynamics, which helps model game-specific information in further detail.

These pieces together give us the backbone of Multi-Game Decision Transformers:

Each observation image is divided into a set of M patches of pixels, which are denoted O. Return R, action a, and reward r follow these image patches in each input causal sequence. A Decision Transformer is trained to predict the next input (except for the image patches) to establish causality.
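As a rough illustration of that layout, the sketch below splits a small 2D image into non-overlapping patches and appends the return, action, and reward tokens in the order described. The helper names and the tagged-tuple token representation are hypothetical; a real model would embed each element as a vector.

```python
def split_into_patches(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches,
    returned in row-major order. Assumes both dimensions divide evenly."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

def build_timestep_tokens(image, ret, action, reward, patch_size):
    """Lay out one timestep as M patch tokens, then return, action, reward."""
    tokens = [("patch", p) for p in split_into_patches(image, patch_size)]
    tokens += [("return", ret), ("action", action), ("reward", reward)]
    return tokens

# A 4x4 toy "image" split into M = 4 patches of 2x2 pixels.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = build_timestep_tokens(image, ret=30, action=2, reward=1, patch_size=2)
```

Concatenating these per-timestep token lists over time gives the causal sequence; only the return, action, and reward positions would serve as prediction targets.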

Training a Multi-Game Decision Transformer to Play 41 Games at Once
We train one Decision Transformer agent on a large (~1B) and broad set of gameplay experiences from 41 Atari games. In our experiments, this agent, which we call the Multi-Game Decision Transformer (MGDT), clearly outperforms existing reinforcement learning and behavioral cloning methods — by almost 2 times — on learning to play 41 games simultaneously and performs near human-level competency (100% in the following figure corresponds to the level of human gameplay). These results hold when comparing across training methods in both settings where a policy must be learned from static datasets (offline) as well as those where new data can be gathered from interacting with the environment (online).

Each bar is a combined score across 41 games, where 100% indicates human-level performance. Each blue bar is from a model trained on 41 games simultaneously, whereas each gray bar is from 41 specialist agents. Multi-Game Decision Transformer achieves human-level performance, significantly better than other multi-game agents, even comparable to specialist agents.
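The text does not define how the percentage scale is computed. One common convention in Atari benchmarks, assumed here, is a human-normalized score that maps a random agent to 0% and a human player to 100%. The function name and the raw scores below are hypothetical.

```python
def human_normalized_score(agent, random_baseline, human):
    """Return the agent's score as a percentage of human performance,
    where a random agent maps to 0 and the human reference maps to 100."""
    return 100.0 * (agent - random_baseline) / (human - random_baseline)

# Hypothetical raw scores for a single game.
random_score, human_score = 100.0, 1100.0
at_human_level = human_normalized_score(1100.0, random_score, human_score)
halfway = human_normalized_score(600.0, random_score, human_score)
```

Normalizing each game this way puts scores on a shared scale before they are combined across games.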

This result indicates that Decision Transformers are well-suited for multi-task, multi-environment, and multi-embodiment agents.

A concurrent work, “A Generalist Agent”, shows a similar result, demonstrating that large transformer-based sequence models can memorize expert behaviors very well across many more environments. In addition, their work and our work have nicely complementary findings: They show it’s possible to train across a wide range of environments beyond Atari games, while we show it’s possible and useful to train across a wide range of experiences.

In addition to the performance shown above, we found empirically that MGDT trained on a wide variety of experience is better than MGDT trained only on expert-level demonstrations or simply cloning demonstration behaviors.

Scaling Up Multi-Game Model Size to Achieve Better Performance
Arguably, scale has become the main driving force in many recent machine learning breakthroughs, and it is usually achieved by increasing the number of parameters in a transformer-based model. Our observation on Multi-Game Decision Transformers is similar: the performance increases predictably with larger model size. In particular, its performance appears to have not yet hit a ceiling, and compared to other learning systems, performance gains are more significant with increases in model size.

Performance of Multi-Game Decision Transformer (shown by the blue line) increases predictably with larger model size, whereas other models do not.

Pre-trained Multi-Game Decision Transformers Are Fast Learners
Another benefit of MGDTs is that they can learn how to play a new game from very few gameplay demonstrations (which don’t need to all be expert-level). In that sense, MGDTs can be considered pre-trained models capable of being fine-tuned rapidly on small amounts of new gameplay data. Compared with other popular pre-training methods, MGDT pre-training clearly shows consistent advantages in obtaining higher scores.

Multi-Game Decision Transformer pre-training (DT pre-training, shown in light blue) demonstrates consistent advantages over other popular models in adaptation to new tasks.

Where Is the Agent Looking?
In addition to the quantitative evaluation, it’s insightful (and fun) to visualize the agent’s behavior. By probing the attention heads, we find that the MGDT model consistently places weight on areas of the observed images that contain meaningful game entities. We visualize the model’s attention when predicting the next action for various games and find it consistently attends to entities such as the agent’s on-screen avatar, the agent’s free movement space, non-agent objects, and key environment features. For example, in an interactive setting, having an accurate world model requires knowing how and when to focus on known objects (e.g., currently present obstacles) as well as expecting and/or planning over future unknowns (e.g., negative space). This diverse allocation of attention to many key components of each environment ultimately improves performance.

Here we can see the amount of weight the model places on each key asset of the game scene. Brighter red indicates more emphasis on that patch of pixels.

The Future of Large-Scale Generalist Agents
This work is an important step in demonstrating the possibility of training general-purpose agents across many environments, embodiments, and behavior styles. We have shown the benefit of increased scale on performance and the potential with further scaling. These findings seem to point to a generalization narrative similar to other domains like vision and language — we look forward to exploring the great potential of scaling data and learning from diverse experiences.

We look forward to future research towards developing performant agents for multi-environment and multi-embodiment settings. Our code and model checkpoints can soon be accessed here.

We’d like to thank all remaining authors of the paper, including Igor Mordatch, Ofir Nachum, Menjiao Yang, Lisa Lee, Daniel Freeman, Sergio Guadarrama, Ian Fischer, Eric Jang, and Henryk Michalewski.


Performance Boosts and Enhanced Features in New Nsight Graphics, Nsight Aftermath Releases

Nsight Graphics 2022.3 and Nsight Aftermath 2022.2 have just been released and are now available to download. 

Nsight Graphics 2022.3

The Nsight Graphics 2022.3 release focuses on performance gains, bug fixes, and Vulkan improvements.

Performance for the Ray Tracing Acceleration Structure Viewer has improved by up to 20x in some complex scenes, thanks to better occlusion culling. Additionally, the viewer received improved handling of large instance counts to increase performance and reduce memory usage in scenes with duplicate geometry.

With the new VK_KHR_graphics_pipeline_library extension, your Vulkan application can now precompile shaders and link them at runtime at a substantially reduced cost. This is important because large 3D graphics applications such as games utilize complex algorithms that result in a large number of shaders. 

These algorithms often require different permutations of the shaders to account for different effects or lighting environments. The end result is thousands or hundreds of thousands of shaders that, in many cases, are compiled at runtime. This can result in mid-frame stuttering which negatively impacts the user experience. 

Download Nsight Graphics 2022.3 >>

Nsight Aftermath 2022.2

In addition to the improvements to the structure viewer and shaders in Nsight Graphics, the Nsight Aftermath 2022.2 release enhances your ability to find the root cause of GPU crashes on a user’s system. 

GPU shaders make frequent accesses to memory, which all go through a dedicated hardware unit called the MMU. Nsight Aftermath 2022.2 adds enhanced MMU fault correlation which provides the line of shader source code that initiated the memory request from the shader units.

In the case where the fault is caused by a memory write with no outstanding dependencies, the shader unit would have retired the warp, leaving no contextual data to help in the debugging process. A new (debugging-only) setting in the API addresses this, preventing the shader units from retiring a warp while there is an outstanding instruction with the potential for an MMU fault. 

Nsight Aftermath helps you locate GPU crashes so that you can ship fast and stable 3D graphics applications. Look for even better correlation of GPU crashes in future releases, so you can find exactly where a crash occurred in your code.

Download Nsight Aftermath 2022.2 >>

Additional resources

Want to help us build better tools for you? Share your thoughts with this Nsight Graphics survey that takes less than one minute to complete. 


Differences between AI Servers and AI Workstations

If you’re wondering how an AI server is different from an AI workstation, you’re not the only one. Assuming strictly AI use cases with minimal graphics workload, obvious differences can be minimal to none. You can technically use one as the other. However, the results from each will be radically different depending on the workload each is asked to perform. For this reason, it’s important to clearly understand the differences between AI servers and AI workstations.

Setting AI aside for a moment, servers in general tend to be networked and are available as a shared resource that runs services accessed across the network. Workstations are generally intended to execute the requests of a specific user, application, or use case. 

Can a workstation act as a server, or a server as a workstation? The answer is “yes,” but ignoring the design purpose of the workstation or server does not usually make sense. For example, both workstations and servers can support multithreaded workloads, but if a server can support 20x more threads than a workstation (all else being equal), the server will be better suited for applications that create many threads for a processor to simultaneously crunch.

Servers are optimized to scale in their role as a network resource to clients. Workstations are usually not optimized for massive scale, sharing, parallelism, and network capabilities.

Specific differences: Servers and workstations for AI

Servers often run an OS that is designed for the server use case, while workstations run an OS that is intended for workstation use cases. For example, Microsoft Windows 10 is designed for desktop and individual use, whereas Microsoft Windows Server runs on dedicated servers for shared network services.

The principle is the same for AI servers and workstations. The majority of AI workstations used for machine learning, deep learning, and AI development are Linux-based. The same is true for AI servers. Because the intended use of workstations and servers is different, servers can be equipped with processor clusters, larger CPU and GPU memory resources, more processing cores, and greater multithreading and network capabilities.

Note that because of the extreme demands placed on servers as a shared resource, there is generally an associated greater demand on storage capacity, flash storage performance, and network infrastructure.

The GPU: An essential ingredient

The GPU has become an essential element in modern AI workstations and AI servers. Unlike CPUs, GPUs have the ability to increase the throughput of data and number of concurrent calculations within an application.

GPUs were originally designed to accelerate graphics rendering. Because GPUs can simultaneously process many pieces of data, they have found new modern uses in machine learning, video editing, autonomous driving, and more.

Although AI workloads can be run on CPUs, the time-to-results with a GPU may be 10x to 100x faster. The complexity of deep learning in natural language processing, recommender engines, and image classification, for example, benefits greatly from GPU acceleration.

Performance is needed for initial training of machine learning and deep learning models. Performance is also mandatory when a model running in inference mode must respond in real time, as in conversational AI.

Enterprise use

It’s important that AI servers and workstations work seamlessly together within an enterprise and with the cloud. Each has a place within an enterprise organization.

AI servers

In the case of AI servers, large models are more efficiently trained on GPU-enabled servers and server clusters. They can also be efficiently trained using GPU-enabled cloud instances, especially for massive datasets and models that require extreme resolution. AI servers are often tasked to operate as dedicated AI inferencing platforms for a variety of AI applications.

AI workstations

Individual data scientists, data engineers, and AI researchers often use a personal AI or data science workstation in the process of building and maintaining AI applications. This tends to include data preparation, model design, and preliminary model training. GPU-accelerated workstations make it possible to build complete model prototypes using an appropriate subset of a large dataset. This is often done in hours to a day or two.

Certified hardware compatibility along with seamless compatibility across AI tools is very important. NVIDIA-Certified Workstations and Servers provide tested enterprise seamlessness and robustness across certified platforms.


Understanding the Need for Time-Sensitive Networking for Critical Applications

In the old days of 10 Mbps Ethernet, long before Time-Sensitive Networking became a thing, state-of-the-art shared networks basically required that packets would collide. For the primitive technology of the time, this was eminently practical… computationally preferable to any solution that would require carefully managed access to the medium.

After mangling each other’s data, two competing stations would wait (randomly wasting even more time) before they would try to transmit again. This was deemed acceptable because the minimum-size frame was 64 bytes (512 bits), and a reasonable estimate of how long this frame would occupy the wire follows from the network speed: at 10 million bits per second, each bit takes ~0.1 microseconds, so 512 bits take at least 51.2 microseconds.
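The same arithmetic is easy to repeat for other link speeds. The short sketch below computes how long a minimum-size frame occupies the wire; the function name is illustrative, and the calculation ignores preamble and inter-frame gap.

```python
def frame_time_us(frame_bytes, link_bps):
    """Time in microseconds that a frame of the given size occupies the wire."""
    bits = frame_bytes * 8
    return bits / link_bps * 1e6

MIN_FRAME_BYTES = 64  # minimum Ethernet frame size, 512 bits

for label, bps in [("10 Mbps", 10e6), ("1 Gbps", 1e9), ("400 Gbps", 400e9)]:
    print(f"{label}: {frame_time_us(MIN_FRAME_BYTES, bps):.5f} us")
```

Because the time per frame is the reciprocal of the link speed, every increase in speed shrinks the window in which events must be told apart by the same factor.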

Ethernet technology has evolved from 10 Mbps in the early 80s to 400 Gbps as of today, with future plans for 800 Gbps and 1.6 Tbps (Figure 1).

Graph showing Ethernet speed evolution from 1980 and into the future towards 2030.
Figure 1. Ethernet speed evolution over time

It should be clear that wanting your networks to go faster is an ongoing trend! As such, any application that must manage events across those networks requires a well-synchronized, commonly understood, network-spanning sense of time, at time resolutions that get progressively narrower as networks become faster.

This is why the IEEE has been investigating how to support time-sensitive network applications since at least 2008, initially for audio and video applications but now for a much richer set of much more important applications.

Three use cases for time-sensitive networking

The requirements for precise and accurate timing extend beyond the physical and data link layers to certain applications that are highly dependent on predictable, reliable service from the network. These new and emerging applications leverage the availability of a precise, accurate, and high-resolution understanding of time.

5G, 6G, and beyond

Starting with the 5G family of protocols from 3GPP, some applications such as IoT or IIoT do not necessarily require extremely high bandwidth. They do require tight control over access to the wireless medium to achieve predictable access with low latency and low jitter. This is achieved by the delivery of precise, accurate, and high-resolution time to all participating stations.

Picture of a wireless tower with multiple antennas.
Figure 2. Wireless tower

In the time domain style of access, each station asks for and is granted permission to use the medium, after which a network scheduler informs the station of when and for how long they may use the medium.

5G and future networks deliver this accurate, precise, and high-resolution time to all participating stations to enable this kind of new high-value application. Previously, the most desirable attribute of new networks was speed. These new applications actually need control rather than speed.

Successfully enabling these applications requires that participating stations have the same understanding of the time in absolute terms so that they do not either start transmitting too soon, or too late, or for the wrong amount of time.

If a station were to transmit too soon or for too long, it may interfere with another station. If it were to begin transmitting too late, it may waste some of its precious opportunity to use the medium, as in situations where it might transmit for less time than it had been granted permission to transmit.
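To make the effect of clock disagreement concrete, here is a minimal sketch of two stations with back-to-back grants. The `Grant` type, the microsecond units, and the offset values are hypothetical; this is a toy model, not a 5G scheduler.

```python
from dataclasses import dataclass

@dataclass
class Grant:
    station: str
    start_us: float     # scheduled start, in the scheduler's time
    duration_us: float  # how long the station may transmit

def actual_window(grant, clock_offset_us):
    """Interval the station really occupies, in scheduler time.

    A positive offset means the station's clock runs ahead, so it
    starts transmitting early by that amount.
    """
    start = grant.start_us - clock_offset_us
    return start, start + grant.duration_us

def overlap_us(window_a, window_b):
    """Length of time two transmission windows collide (0 if none)."""
    return max(0.0, min(window_a[1], window_b[1]) - max(window_a[0], window_b[0]))

a = Grant("A", start_us=0.0, duration_us=100.0)
b = Grant("B", start_us=100.0, duration_us=100.0)

# With synchronized clocks the windows touch but never overlap.
synced = overlap_us(actual_window(a, 0.0), actual_window(b, 0.0))
# If B's clock runs 5 us ahead, B starts early and collides with A.
skewed = overlap_us(actual_window(a, 0.0), actual_window(b, 5.0))
```

A clock that runs behind causes the opposite problem: no collision, but the station begins late and wastes part of the slot it was granted.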

I should state that 5G clearly isn’t Ethernet, but Ethernet technology is how the 5G radio access network is tied together, through backhaul networks extending out from the metropolitan area data centers. The time-critical portion of the network extends from this Ethernet backhaul domain, both into the data centers and out into the radio access network.

What kinds of applications need this level of precision?

Applications such as telemetry need this precision. Future measurements can implicitly recover from a missed reading just by waiting for the next reading. For example, a meter reading might be generated once every 30 minutes.

What about robots that must have their position understood with submillisecond resolution? Missing a few position reports could lead to damaging the robot, damaging nearby or connected equipment, damaging the materials that the robot is handling, or even the death of a nearby human.

You might think this has nothing to do with 5G, as it’s clearly a manufacturing use case. This is a situation where 5G might be a better solution because the Precision Time Protocol (PTP) is built into the protocol stack from the get-go.

PTP (IEEE 1588-2008) is the foundation of a suite of protocols and profiles that enable highly accurate time to be synchronized across networked devices to high precision and at high resolution.
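The post does not describe the message exchange, so the following is only a simplified sketch of the standard delay request-response calculation that PTP uses to estimate a local clock's offset from a reference clock. It assumes the path delay is the same in both directions, and the timestamp values are made up for illustration.

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and one-way delay from four timestamps.

    t1: reference clock sends Sync            (reference time)
    t2: local clock receives Sync             (local time)
    t3: local clock sends Delay_Req           (local time)
    t4: reference clock receives Delay_Req    (reference time)

    Assumes the network delay is symmetric in both directions.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

# Illustrative values in microseconds: the local clock is 3 us ahead
# of the reference and the one-way path delay is 2 us.
offset_us, delay_us = ptp_offset_and_delay(t1=100.0, t2=105.0, t3=110.0, t4=109.0)
```

The local clock then subtracts the estimated offset to align itself with the reference. Asymmetric paths violate the assumption and show up as a residual error.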

Time-sensitive networking technology enables 5G (or subsequent) networks to serve thousands or tens of thousands of nodes. It delivers an ever-shifting mix of high-speed, predictable latency, or low-jitter services, according to the demands of the connected devices.

Yes, these might be average users with mobile phones, industrial robots, or medical instruments. The key thing is that with time-sensitive networking built in, the network can satisfy a variety of use cases as long as bandwidth (and time) is available.

PTP implementations in products containing NVIDIA Cumulus Linux 5.0 and higher regularly deliver deep submillisecond (even submicrosecond) precision, supporting the diverse requirements of 5G applications.

Media and entertainment

The majority of the video content in the television industry currently exists in serial digital interface (SDI) format. However, the industry is transitioning to an Internet Protocol (IP) model.

In the media and entertainment industry, there are several scenarios to consider, including studio work (for example, combining multiple camera feeds and overlays), video production, video broadcast (from a single point to multiple users), and multiscreen.

Time synchronization is critical for these types of activities.

Picture of a production studio with multiple camera and light sources.
Figure 3. Production studio

In the media and broadcast world, consistent time synchronization is of the utmost importance to provide the best viewing experience and to prevent frame alignment, lip syncing, and video and audio syncing issues.

In the baseband world, reference black or genlock was used to keep camera and other video source frames in sync and to avoid introducing nasty artifacts when switching from one source to another.

However, with IP adoption and more specifically, SMPTE-2110 (or SMPTE-2022-6 with AES67), you needed a different way to provide timing. Along came PTP, also referred to as IEEE 1588 (PTP V2).

PTP is fully network-based and can travel over the same data network connections that are already being used to transmit and receive essence streams. Various profiles, such as SMPTE 2059-2 and AES67, provide a standardized set of configurations and rules that meet the requirements for the different types of packet networks.

Spectrum fully supports PTP 1588 under SMPTE 2059-2 and other profiles.

Automotive applications

New generations of car-area networks (CANs) have evolved from shared/bus architectures toward architectures that you might find in 5G radio-access networks (RANs) or in IT environments: switched topologies.

Picture of an NVIDIA branded car with multiple graphics dashboards and sensors.
Figure 4. Autonomous vehicle

When switches are involved, there is an opportunity for packet loss or delay variability, arising from contention or buffering, which limits or eliminates predictable access to the network that might be needed for various applications in the automobile.

Self-driving cars must regularly, at a fairly high frequency, process video and other sensor inputs to determine a safe path forward for the vehicle. The guiding intelligence in the vehicle depends on regularly accessing its sensors, so the network must be able to guarantee that access to the sensors is frequent enough to support the inputs to the algorithms that must interpret them.

For instance, the steering wheel and brakes are reading friction, engaging antilock and antislip functions, and trading off regenerative energy capture compared to friction braking. The video inputs, and possibly radar and lidar (light detection and ranging), are constantly scanning the road ahead. They enable the interpretation algorithms to determine if new obstacles have become visible that would require steering, braking, or stopping the vehicle.

All this is happening while the vehicle’s navigation subsystem uses GPS to receive and align coarse position data against a map, which is combined with the visual inputs from the cameras to establish accurate positioning information over time, to determine the maximum legally allowed speed and to combine legal limits with local conditions to determine a safe speed.

These varied sensors and associated independent subsystems must be able to deliver their inputs to the main processors and their self-driving algorithms on a predictable-latency/low-jitter basis, while the network is also supporting non-latency-critical applications. The correct, predictable operation of this overall system is life-critical for the passengers (and for pedestrians!).

Beyond the sensors and software that support the safe operation of the vehicle, other applications running on the CAN are still important to the passengers, while clearly not life-critical:

  • Operating the ventilation or climate-control system to maintain a desirable temperature at each seat (including air motion, seat heating or cooling, and so on)
  • Delivering multiple streams of audio or video content to various passengers
  • Gaming with other passengers or passengers in nearby vehicles
  • Important mundane maintenance activities like measuring the inflation pressure of the tires, level of battery charge, braking efficiency (which could indicate excessive wear), and so on

Other low-frequency yet also time-critical sensor inputs provide necessary inputs to the vehicle’s self-diagnostics that determine when it should take itself back to the maintenance depot for service, or just to recharge its batteries.

The requirement for all these diverse applications to share the same physical network in the vehicle (to operate over the same CAN) is the reason why PTP is required.

Engineers will design the CAN to have sufficient instantaneous bandwidth to support the worst-case demand from all critical devices (such that contention is either rare or impossible), while dynamically permitting all devices to request access in the amounts and with the latency bounds that each requires, which can change over time. Pun intended.

In a world of autonomous vehicles, PTP is the key to enabling in-car technology, supporting the safe operation of vehicles while delivering rich entertainment and comfort.


You’ve seen three examples of applications where control over access to the network is as important as raw speed. In each case, the application defines the requirements for precise/accurate/high-resolution timing, but the network uses common mechanisms to deliver the required service.

As networks continue to get faster, the time resolution for discriminating events scales linearly as the reciprocal of the bandwidth.

Powerful PTP implementations, such as that in NVIDIA Cumulus Linux 5.0-powered devices, embody scalable protocol mechanisms that will adapt to the faster networks of the future. They will deliver timing accuracy and precision that adjusts to the increasing speeds of these networks.

Future applications can expect to continue to receive the predictable time-dependent services that they need. This will be true even though the networks continue to become more capable of supporting more users, at faster speeds, with even finer-grained time resolution.


Choosing the Right Storage for Enterprise AI Workloads

Artificial intelligence (AI) is becoming pervasive in the enterprise. Speech recognition, recommenders, and fraud detection are just a few applications among hundreds being driven by AI and deep learning (DL).

To support these AI applications, businesses look toward optimizing AI servers and performance networks. Unfortunately, storage infrastructure requirements are often overlooked in the development of enterprise AI. Yet for the successful adoption of AI, it is vital to consider a comprehensive storage deployment strategy that considers AI growth, future proofing, and interoperability.

This post highlights important factors that enterprises should consider when planning data storage infrastructure for AI applications to maximize business results. I discuss cloud compared to on-premises storage solutions as well as the need for higher-performance storage within GPU-enabled virtual machines (VMs).

Why AI storage decisions are needed for enterprise deployment

The popular phrase, “You can pay me now—or pay me later” implies that it’s best to think about the future when making current decisions. Too often, storage solutions for supporting an AI or DL app only meet the immediate needs of the app without full consideration of the future cost and flexibility.

Spending money today to future-proof your AI environment from a storage standpoint can be more cost-effective in the long run. Decision-makers must ask themselves:

  • Can my AI storage infrastructure adapt to a cloud or hybrid model?
  • Will choosing object, block, or file storage limit flexibility in future enterprise deployments?
  • Is it possible to use lower-cost storage tiers or a hybrid model for archiving, or for datasets that do not require expensive, fast storage?

The impact of enterprise storage decisions on AI deployment is not always obvious without a direct A/B comparison. Wrong decisions today can result in lower performance and the inability to efficiently scale-out business operations in the future.

Main considerations when planning AI storage infrastructure

Following are a variety of factors to consider when deploying and planning storage. Table 1 shows an overview of data center, budget, interoperability, and storage type considerations.

Data center | Budget               | Interoperability      | Storage type
DPU         | Existing vs. new     | Cloud and data center | Object/Block/File
Network     | All Flash/HDD/Hybrid | VM environments       | Flash/HDD/Hybrid
Table 1. Storage considerations for IT when deploying AI solutions on GPU-accelerated AI applications

AI performance and the GPU

Before evaluating storage performance, consider that a key element of AI performance is having high-performance enterprise GPUs to accelerate training for machine-learning, DL, and inferencing apps.

Many data center servers do not have GPUs to accelerate AI apps, so it’s best to first look at GPU resources when looking at performance.

Large datasets do not always fit within GPU memory. This is important because GPUs deliver less performance when the complete dataset does not fit within GPU memory. In such cases, data is swapped to and from GPU memory, thus impacting performance. Model training takes longer, and inference performance can be impacted.

Certain apps, such as fraud detection, may have extreme real-time requirements that are affected when GPU memory is waiting for data.

Storage considerations

Storage is always an important consideration. Existing storage solutions may not work well when deploying a new AI app.

It may be that you now require the speed of NVMe flash storage or direct GPU memory access for desired performance. However, you may not know what tomorrow’s storage expectations will be, as demands for AI data from storage increase over time. For certain applications, there is almost no such thing as too much storage performance, especially in the case of real-time use cases such as pre-transaction fraud detection.

There is no “one-size-fits-all” storage solution for AI-driven apps. 

Performance is only one storage consideration. Another is scale-out ability. Training data is growing. Inferencing data is growing. Storage must be able to scale in both capacity and performance—and across multiple storage nodes in many cases. Simply put, a storage device that meets your needs today may not always scale for tomorrow’s challenges.

The bottom line: as training and inference workloads grow, capacity and performance must also grow. IT should only consider scalable storage solutions with the performance to keep GPUs busy for the best AI performance.

Data center considerations

The data processing unit (DPU) is a recent addition to infrastructure technology that takes data center and AI storage to a completely new level.

Although not a storage product, the DPU redefines data center storage. It is designed to integrate storage, processing, and networks such that whole data centers act as a computer for enterprises.

It’s important to understand DPU functionality when planning and deploying storage, because the DPU offloads storage services from data center processors and storage devices. For many storage products, a DPU-interconnected data center enables more efficient scale-out.

As an example, the NVIDIA BlueField DPU supports the following functionality:

  • NVMe over Fabrics (NVMe-oF)
  • GPUDirect Storage
  • Encryption
  • Elastic block storage
  • Erasure coding (for data integrity)
  • Decompression
  • Deduplication

With a DPU, remote storage can perform as if it were directly attached to the AI server. The DPU also helps to enable scalable software-defined storage, in addition to networking and cybersecurity acceleration.

Budget considerations

Cost remains a critical factor. While deploying the highest-throughput, lowest-latency storage is desirable, it is not always necessary, depending on the AI app.

To stretch your storage budget further, IT must understand the storage performance requirements of each AI app (bandwidth, IOPS, and latency).

For example, if an AI app has a large dataset but minimal performance requirements, traditional hard disk drives (HDD) may be sufficient while lowering storage costs substantially. This is especially true when the “hot” data of the dataset fits wholly within GPU memory.

Another cost-saving option is to use hybrid storage that uses flash as a cache to accelerate performance while lowering storage costs for infrequently accessed data residing on HDDs. There are hybrid flash/HDD storage products that perform nearly as well as all-flash, so exploring hybrid storage options can make a lot of sense for apps that don’t have extreme performance requirements.
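One way to make this budget reasoning concrete is to write down each app's bandwidth, IOPS, and latency requirements and pick the cheapest storage tier that satisfies all three. The Python sketch below does that. The tier names follow the HDD, hybrid, and all-flash options discussed above, but every number is a made-up placeholder and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StorageTier:
    name: str
    bandwidth_gb_s: float  # sustained read bandwidth, GB/s
    iops: int              # random-read IOPS
    latency_ms: float      # typical read latency
    cost_per_tb: float     # relative cost units, not real prices


@dataclass(frozen=True)
class AppRequirement:
    bandwidth_gb_s: float
    iops: int
    max_latency_ms: float


def cheapest_adequate_tier(req: AppRequirement,
                           tiers: List[StorageTier]) -> Optional[StorageTier]:
    """Return the lowest-cost tier meeting all three requirements, or None."""
    adequate = [
        t for t in tiers
        if t.bandwidth_gb_s >= req.bandwidth_gb_s
        and t.iops >= req.iops
        and t.latency_ms <= req.max_latency_ms
    ]
    return min(adequate, key=lambda t: t.cost_per_tb, default=None)


# Placeholder figures for illustration only; substitute measured values.
TIERS = [
    StorageTier("hdd", 1.0, 2_000, 10.0, 1.0),
    StorageTier("hybrid", 5.0, 200_000, 1.0, 3.0),
    StorageTier("all-flash", 20.0, 1_000_000, 0.2, 8.0),
]

batch_app = AppRequirement(bandwidth_gb_s=0.5, iops=500, max_latency_ms=50.0)
realtime_app = AppRequirement(bandwidth_gb_s=10.0, iops=500_000, max_latency_ms=0.5)

print(cheapest_adequate_tier(batch_app, TIERS).name)
print(cheapest_adequate_tier(realtime_app, TIERS).name)
```

With these placeholder figures, the undemanding batch app lands on HDD while the real-time app needs all-flash. A result of None signals that no listed tier meets the stated requirements.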

Older, archived, and infrequently used data and datasets may still have future value, but it is not cost-effective to keep them on expensive primary storage.

HDDs can still make a lot of financial sense, especially if data can be seamlessly accessed when needed. A two-tiered cloud and on-premises storage solution can also make financial sense depending on the size and frequency of access. There are many of these solutions on the market.

Interoperability factors

Evaluating cloud and data center interoperability from a storage perspective is important. Even within VM-driven data centers, there are interoperability factors to evaluate. 

Cloud and data center considerations

Will the AI app run on-premises, in the cloud, or both? Even if the app can be run in either place, there is no guarantee that the performance of the app won’t change with location. For example, there may be performance problems if the class of storage used in the cloud differs from the storage class used on-premises. Storage class must be considered.

Assume that a job retraining a large recommender model completes within a required eight-hour window using data center GPU-enabled servers that use high-performance flash storage. Moving the same application to the cloud with equivalent GPU horsepower may cause training to complete in 24 hours, well outside the required eight-hour window. Why?

Some AI apps require a certain class of storage (fast flash, large storage cache, DMA storage access, storage class memory (SCM) read performance, and so on) that is not always available through cloud services.
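The eight-hour versus 24-hour scenario can be sketched with simple arithmetic. The Python example below assumes, optimistically, that reading data and computing overlap perfectly, so the job takes as long as the slower of the two. The dataset size, throughput figures, and compute time are hypothetical values chosen only to mirror the scenario above, not measurements from any real system.

```python
def estimate_job_hours(dataset_tb: float, passes: int,
                       storage_gb_per_s: float, compute_hours: float) -> float:
    """Lower-bound wall-clock hours for a training job.

    Assumes I/O and GPU compute overlap perfectly, so the job is limited
    by whichever takes longer. Uses 1 TB = 1000 GB.
    """
    io_seconds = dataset_tb * 1000.0 * passes / storage_gb_per_s
    io_hours = io_seconds / 3600.0
    return max(io_hours, compute_hours)


# Hypothetical inputs: same GPUs (7.5 compute hours), different storage classes.
on_prem_hours = estimate_job_hours(dataset_tb=108, passes=1,
                                   storage_gb_per_s=12.5, compute_hours=7.5)
cloud_hours = estimate_job_hours(dataset_tb=108, passes=1,
                                 storage_gb_per_s=1.25, compute_hours=7.5)

print(f"on-premises: {on_prem_hours:.1f} h, cloud: {cloud_hours:.1f} h")
```

With these assumed numbers, the on-premises job is compute-bound and finishes inside the window, while the cloud job becomes I/O-bound even though the GPUs are equivalent.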

The point is that certain AI applications will yield similar results regardless of data center or cloud storage choices. Other applications can be storage-sensitive.

Containerizing an app and orchestrating it with Kubernetes in the cloud does not guarantee results similar to those in the data center. Viewed this way, containers do not always provide interoperability between the data center and the cloud when performance is considered. For effective data center and cloud interoperability, ensure that storage choices in both domains yield good results.

VM considerations

Today, most data center servers do not have GPUs to accelerate AI and creative workloads. Tomorrow, the data center landscape may look quite different. Businesses are being forced to use AI to be competitive, whether conversational AI, fraud detection, recommender systems, video analytics, or a host of other use cases.

GPUs are common on workstations, but the acceleration provided by GPU workstations cannot easily be shared within an organization.

The paradigm shift that enterprises must prepare for is the sharing of server-based, GPU-enabled resources within VM environments. The availability of solutions such as NVIDIA AI Enterprise enables sharing GPU-enabled VMs with anyone in the enterprise.

Put simply, it is now possible for anyone in an enterprise to easily run power-hungry AI apps within a VM in the vSphere environment.

So what does this mean for VM storage? Storage for GPU-enabled VMs must address the shared performance requirement of both the AI apps and users of the shared VM. This implies higher storage performance for a given VM than would be required in an unshared environment.

It also means that physical storage allocated for such VMs will likely need to be more scalable in both capacity and performance. Within a heavily shared VM, it can make sense to use dedicated all-flash storage-class memory (SCM) arrays connected to the GPU-enabled servers through RDMA over Converged Ethernet (RoCE) for the highest performance and scale-out.

Storage type

An in-depth discussion on the choice of object, block, or file storage for AI apps goes beyond the scope of this post. That said, I mention it here because it’s an important consideration but not always a straightforward decision.

Object storage

If a desired app requires object storage, the required storage type is obvious. Some AI apps take advantage of object metadata while also benefiting from the near-limitless scale of a flat-address-space object storage architecture. AI analytics can use rich object metadata for precise data categorization and organization, making data more useful and easier to manage and understand.
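To illustrate what a flat address space with per-object metadata looks like, here is a toy in-memory sketch in Python. It is not a real object storage API: the class, its methods, and the metadata keys are all hypothetical, and a production system would use the interface of an actual object store.

```python
from typing import Dict, List, Tuple


class TinyObjectStore:
    """Toy stand-in for an object store: flat keys, each with its own metadata."""

    def __init__(self) -> None:
        self._objects: Dict[str, Tuple[bytes, Dict[str, str]]] = {}

    def put(self, key: str, data: bytes, **metadata: str) -> None:
        # No directories or hierarchy: every object lives in one flat key space.
        self._objects[key] = (data, dict(metadata))

    def find(self, **wanted: str) -> List[str]:
        """Return the sorted keys whose metadata matches every requested field."""
        return sorted(
            key for key, (_, meta) in self._objects.items()
            if all(meta.get(field) == value for field, value in wanted.items())
        )


store = TinyObjectStore()
store.put("img-0001", b"\x00", label="cat", source="camera-a")
store.put("img-0002", b"\x00", label="dog", source="camera-a")
store.put("img-0003", b"\x00", label="cat", source="camera-b")

print(store.find(label="cat"))
print(store.find(label="cat", source="camera-b"))
```

Because the metadata travels with each object, a training set can be selected by attributes such as label or source without relying on a directory layout.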

Block storage

Although block storage is supported in the cloud, truly massive cloud datasets tend to be object-based. Block storage can yield higher performance for structured data and transactional applications.

Block storage lacks metadata, which rules it out for any app that is designed to benefit from metadata. Many traditional enterprise apps were built on a block storage foundation, but the advent of object storage in the cloud has caused many modern applications to be designed specifically for native cloud deployment using object storage.

File storage

When an AI app accesses data across common file protocols, the obvious storage choice will be file-based. For example, AI-driven image recognition and categorization engines may require access to file-based images.

Deployment options can vary from dedicated file servers to NAS heads built on top of an object or block storage architecture. NAS heads can export NFS or SMB file protocols for file access to an underlying block or object storage architecture. This can provide a high level of flexibility and future-proofing with block or object storage used as a common foundation for file storage access by AI and data center network clients.

Storage type decisions for AI must be based on a good understanding of what is needed today as well as a longer-term AI deployment strategy. Fully evaluate the pros and cons of each storage type. There is frequently no one-size-fits-all answer, and there will also be cases where all three storage types (object, block, and file) make sense.

Key takeaways on enterprise storage decision making

There is no single approach to addressing storage requirements for AI solutions. However, here are a few core principles by which wise AI storage decisions can be made:

  • Any storage choice for AI solutions may be pointless if training and inference are not GPU-accelerated.
  • Prepare for the possibility of needing IT resources and related storage that is well beyond current estimates.
  • Don’t assume that existing storage is “good enough” for new or expanded AI solutions. Storage with higher cost, performance, and scalability may actually be more effective and efficient, over time, compared to existing storage. 
  • Always consider interoperability with the cloud as on-premises storage options may not be available with your cloud provider.
  • Strategic IT planning should consider the infrastructure and storage benefits of DPUs.

As you plan for AI in your enterprise, don’t put storage at the bottom of the list. The impact of storage on your AI success may be greater than you think.


Shifting Into High Gear: Lunit, Maker of FDA-Cleared AI for Cancer Analysis, Goes Public in Seoul

South Korean startup Lunit, developer of two FDA-cleared AI models for healthcare, went public this week on the country’s Kosdaq stock market. The move marks the maturity of the Seoul-based company, which was founded in 2013 and has for years been part of the NVIDIA Inception program that nurtures cutting-edge startups. Lunit’s AI software …

The post Shifting Into High Gear: Lunit, Maker of FDA-Cleared AI for Cancer Analysis, Goes Public in Seoul appeared first on NVIDIA Blog.