Categories
Misc

Reach New Frontiers of Immersive Production with NVIDIA CloudXR and HTC VIVE Focus 3

Woman wearing the HTC VIVE Focus 3 headset. HTC released a CloudXR client to support their VIVE Focus 3, which provides a “best of both worlds” solution to the difficult tradeoffs VR developers face.

Whether building immersive theatrical experiences or virtual training solutions, XR development continues to push the limits of both content and device performance. Often, this means having to compromise on either fidelity or mobility. But with the NVIDIA CloudXR streaming solution, this is no longer the case.

This month at NVIDIA GTC, HTC announced the release of an NVIDIA CloudXR client to support their VIVE Focus 3, available now on GitHub. This provides a “best of both worlds” solution to the difficult tradeoffs VR developers typically face.

“High fidelity, cloud-based VR streaming represents the next big evolution in the XR industry, and we’re excited to continue working closely with the teams at NVIDIA to keep pushing the industry forwards,” said Shen Ye, senior director and global head of products at HTC.

The VIVE Focus 3 is the first commercially available VR headset with a custom NVIDIA CloudXR client. With seamless remote rendering powered by NVIDIA CloudXR, creative studios like Agile Lens can design extremely high fidelity immersive experiences, which would otherwise be impossible to run on a mobile chipset.

Alex Coulombe, cofounder and creative director at Agile Lens, has wanted to bring the intimacy and power of theater to the masses using technologies like VR. His latest venture, Heavenue, is a platform that integrates NVIDIA CloudXR to deliver high fidelity immersive live performances to VR headsets from the cloud.

Heavenue’s first partner is the Actors Theatre of Louisville, which is producing an immersive rendition of the classic A Christmas Carol this December. This will be the first simultaneous live stage and virtual theater performance of its kind.

Image of actors using VR equipment to create avatars shown on a computer screen.
Figure 1. Actors Theatre of Louisville combined live motion capture with virtual avatars using Heavenue for their production of A Christmas Carol.

The producers combined motion capture from live performers with facial and voice tracking, overlaid onto virtual avatars in Unreal Engine 4 to complete the experience. It is all hosted by CoreWeave, which offers powerful servers with a broad range of NVIDIA GPUs, including North America’s largest deployment of A40s. The experience is rendered in the cloud and streamed with NVIDIA CloudXR to end users on standalone VR headsets such as the VIVE Focus 3. The result is a rich, immersive experience in which users can move freely around the theater environment without tether restrictions.

“NVIDIA CloudXR is like a bridge to the future,” Coulombe said. “This has never been possible before. It’s only now, with the advent of technologies like these that we can start to build a platform that democratizes the experience of an incredibly immersive, compelling, vivid, high fidelity live performance.”

With NVIDIA CloudXR, virtual productions and location-based experiences (LBE) are able to use the VIVE Focus 3 to deliver a more immersive experience. This shifts computing to centralized computers and removes the need to spend time navigating around cords. Having a centralized computing environment makes debugging and troubleshooting easy, creating a more user-friendly system and experience. All of this comes without impacting quality or graphical fidelity, taking full advantage of a 5K resolution and 120-degree field of view.

This application extends to any use case where mobility and high fidelity are required, such as in enterprise training, product and building design, or manufacturing floor planning.

HTC VIVE has open-sourced their CloudXR sample client for the VIVE Focus 3 on GitHub. Developers can extend the sample client source code to add bespoke features and customized user interfaces. Those less familiar with Android development, or just wanting to try it out, can download the prebuilt APK and install it directly on their headset to get started.

Learn more about NVIDIA CloudXR and download the HTC VIVE Focus 3 sample client today.

Categories
Offsites

Predicting Text Selections with Federated Learning

Smart Text Selection, launched in 2017 as part of Android O, is one of Android’s most frequently used features, helping users select, copy, and use text easily and quickly by predicting the desired word or set of words around a user’s tap, and automatically expanding the selection appropriately. Through this feature, selections are automatically expanded, and for selections with defined classification types, e.g., addresses and phone numbers, users are offered an app with which to open the selection, saving users even more time.

Today we describe how we have improved the performance of Smart Text Selection by using federated learning to train the neural network model on user interactions responsibly while preserving user privacy. This work, which is part of Android’s new Private Compute Core secure environment, enabled us to improve the model’s selection accuracy by up to 20% on some types of entities.

Server-Side Proxy Data for Entity Selections
Smart Text Selection, which is the same technology behind Smart Linkify, does not predict arbitrary selections, but focuses on well-defined entities, such as addresses or phone numbers, and tries to predict the selection bounds for those categories. In the absence of multi-word entities, the model is trained to only select a single word in order to minimize the frequency of making multi-word selections in error.

The Smart Text Selection feature was originally trained using proxy data sourced from web pages to which schema.org annotations had been applied. These entities were then embedded in a selection of random text, and the model was trained to select just the entity, without spilling over into the random text surrounding it.

While this approach of training on schema.org annotations worked, it had several limitations. The data was quite different from the text that we expect users to see on-device. For example, websites with schema.org annotations typically have entities with more proper formatting than what users might type on their phones. In addition, the text samples in which the entities were embedded for training were random and did not reflect realistic context on-device.

On-Device Feedback Signal for Federated Learning
With this new launch, the model no longer uses proxy data for span prediction, but is instead trained on-device on real interactions using federated learning. This is a training approach for machine learning models in which a central server coordinates model training that is split among many devices, while the raw data used stays on the local device. A standard federated learning training process works as follows: The server starts by initializing the model. Then, an iterative process begins in which (a) devices get sampled, (b) selected devices improve the model using their local data, and (c) then send back only the improved model, not the data used for training. The server then averages the updates it received to create the model that is sent out in the next iteration.
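As a rough illustration of this loop (a simplified sketch, not Android’s production implementation; the client list and local_update_fn are placeholders), one round of federated averaging can be written as:

import numpy as np

def federated_averaging_round(global_weights, clients, local_update_fn, clients_per_round=100):
    """One illustrative round of federated averaging.
    local_update_fn(weights, local_data) trains on one device's local data and
    returns updated weights; only those weights leave the device, never the raw data."""
    sampled = np.random.choice(len(clients), size=clients_per_round, replace=False)
    updates = [local_update_fn(global_weights, clients[i]) for i in sampled]
    # the server averages the received updates to form the next global model
    return [np.mean([u[layer] for u in updates], axis=0)
            for layer in range(len(global_weights))]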

For Smart Text Selection, each time a user taps to select text and corrects the model’s suggestion, Android gets precise feedback for what selection span the model should have predicted. In order to preserve user privacy, the selections are temporarily kept on the device, without being visible server-side, and are then used to improve the model by applying federated learning techniques. This technique has the advantage of training the model on the same kind of data that it sees during inference.

Federated Learning & Privacy
One of the advantages of the federated learning approach is that it enables user privacy, because raw data is not exposed to a server. Instead, the server only receives updated model weights. Still, to protect against various threats, we explored ways to protect the on-device data, securely aggregate gradients, and reduce the risk of model memorization.

The on-device code for training Federated Smart Text Selection models is part of Android’s Private Compute Core secure environment, which makes it particularly well situated to securely handle user data. This is because the training environment in Private Compute Core is isolated from the network and data egress is only allowed when federated and other privacy-preserving techniques are applied. In addition to network isolation, data in Private Compute Core is protected by policies that restrict how it can be used, thus protecting from malicious code that may have found its way onto the device.

To aggregate model updates produced by the on-device training code, we use Secure Aggregation, a cryptographic protocol that allows servers to compute the mean update for federated learning model training without reading the updates provided by individual devices. In addition to being individually protected by Secure Aggregation, the updates are also protected by transport encryption, creating two layers of defense against attackers on the network.

Finally, we looked into model memorization. In principle, it is possible for characteristics of the training data to be encoded in the updates sent to the server, survive the aggregation process, and end up being memorized by the global model. This could make it possible for an attacker to attempt to reconstruct the training data from the model. We used methods from Secret Sharer, an analysis technique that quantifies to what degree a model unintentionally memorizes its training data, to empirically verify that the model was not memorizing sensitive information. Further, we employed data masking techniques to prevent certain kinds of sensitive data from ever being seen by the model.

In combination, these techniques help ensure that Federated Smart Text Selection is trained in a way that preserves user privacy.

Achieving Superior Model Quality
Initial attempts to train the model using federated learning were unsuccessful. The loss did not converge and predictions were essentially random. Debugging the training process was difficult, because the training data was on-device and not centrally collected, and so, it could not be examined or verified. In fact, in such a case, it’s not even possible to determine if the data looks as expected, which is often the first step in debugging machine learning pipelines.

To overcome this challenge, we carefully designed high-level metrics that gave us an understanding of how the model behaved during training. Such metrics included the number of training examples, selection accuracy, and recall and precision metrics for each entity type. These metrics are collected during federated training via federated analytics, a process similar to the collection of the model weights. Through these metrics and many analyses, we were able to better understand which aspects of the system worked well and where bugs could exist.

After fixing these bugs and making additional improvements, such as implementing on-device filters for data, using better federated optimization methods and applying more robust gradient aggregators, the model trained nicely.

Results
Using this new federated approach, we were able to significantly improve Smart Text Selection models, with the degree depending on the language being used. Typical improvements ranged between 5% and 7% for multi-word selection accuracy, with no drop in single-word performance. The accuracy of correctly selecting addresses (the most complex type of entity supported) increased by between 8% and 20%, again, depending on the language being used. These improvements lead to millions of additional selections being automatically expanded for users every day.

Internationalization
An additional advantage of this federated learning approach for Smart Text Selection is its ability to scale to additional languages. Server-side training required manual tweaking of the proxy data for each language in order to make it more similar to on-device data. While this only works to some degree, it takes a tremendous amount of effort for each additional language.

The federated learning pipeline, however, trains on user interactions, without the need for such manual adjustments. Once the model achieved good results for English, we applied the same pipeline to Japanese and saw even greater improvements, without needing to tune the system specifically for Japanese selections.

We hope that this new federated approach lets us scale Smart Text Selection to many more languages. Ideally this will also work without manual tuning of the system, making it possible to support even low-resource languages.

Conclusion
We developed a federated way of learning to predict text selections based on user interactions, resulting in much improved Smart Text Selection models deployed to Android users. This approach required the use of federated learning, since it works without collecting user data on the server. Additionally, we used many state-of-the-art privacy approaches, such as Android’s new Private Compute Core, Secure Aggregation and the Secret Sharer method. The results show that privacy does not have to be a limiting factor when training models. Instead, we managed to obtain a significantly better model, while ensuring that users’ data stays private.

Acknowledgements
Many people contributed to this work. We would like to thank Lukas Zilka, Asela Gunawardana, Silvano Bonacina, Seth Welna, Tony Mak, Chang Li, Abodunrinwa Toki, Sergey Volnov, Matt Sharifi, Abhanshu Sharma, Eugenio Marchiori, Jacek Jurewicz, Nicholas Carlini, Jordan McClead, Sophia Kovaleva, Evelyn Kao, Tom Hume, Alex Ingerman, Brendan McMahan, Fei Zheng, Zachary Charles, Sean Augenstein, Zachary Garrett, Stefan Dierauf, David Petrou, Vishwath Mohan, Hunter King, Emily Glanz, Hubert Eichner, Krzysztof Ostrowski, Jakub Konecny, Shanshan Wu, Janel Thamkul, Elizabeth Kemp, and everyone else involved in the project.

Categories
Misc

NVIDIA Merlin Extends Open Source Interoperability for Recommender Workflows with Latest Update

NVIDIA Merlin recommender workflow. Check out NVIDIA Merlin’s latest updates, including Transformers4Rec and SparseOperationsKit.

Data scientists and machine learning engineers use many methods, techniques, and tools to prep, build, train, deploy, and optimize their machine learning models. While technical leads cite the importance of leveraging open source software for recommender team workflows, the majority of popular machine learning methods, libraries, and frameworks are not designed to support and accelerate recommender workflows. 

NVIDIA Merlin is designed to streamline recommender workflows. The latest update includes Transformers4Rec, a new library that wraps Hugging Face transformer architectures to build pipelines for session-based recommendations. It also adds SparseOperationsKit (SOK), a new Python package that supports sparse training and inference with deep learning (DL).

This latest release reaffirms the commitment of NVIDIA to help machine learning engineers and data scientists develop and optimize their recommender systems—with open source canonical building blocks. 

Merlin Transformers4Rec, designed for recommenders and solving cold-start problems

Recommender methods popularized in mainstream media often rely upon long-term user profiles or lifetime user behavior. Yet ecommerce and media companies working to acquire new active users must provide relevant recommendations to first-time and early-visit users. Relevant recommendations enable increased user engagement, retention, and conversion to subscription services.

Utilizing session-based recommenders with Transformers4Rec, data scientists and machine learning engineers are able to solve the cold-start problem by leveraging contextual and recent user interactions to predict a user’s next action and provide relevant recommendations. The NVIDIA Merlin team designed Transformers4Rec to be used as a standalone solution or within an ensemble of recommendation models.
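As a toy illustration of the idea (this is a generic PyTorch sketch, not the Transformers4Rec API), a session-based model scores the next item using only the current session’s recent interactions:

import torch
import torch.nn as nn

class ToySessionRecommender(nn.Module):
    """Illustrative next-item predictor: recent item IDs in, next-item scores out."""
    def __init__(self, n_items, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_items)

    def forward(self, session_items):            # (batch, session_length) item IDs
        hidden = self.encoder(self.item_emb(session_items))
        return self.head(hidden[:, -1])           # scores for the next item

model = ToySessionRecommender(n_items=1000)
scores = model(torch.tensor([[12, 7, 88]]))       # a cold-start session of three clicks
top_items = scores.topk(5).indices                # candidate recommendations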

SparseOperationsKit, sparse training, and inference with deep learning 

Recommender teams that work with massive datasets benefit from using deep learning (DL) recommenders. Merlin HugeCTR is a DL training framework designed for recommender systems and the latest update includes SOK, a new open source Python package that supports sparse training and inference. 

SOK is compatible with DL frameworks, including TensorFlow, and provides embedding model parallelism that scales from a single GPU to multiple GPUs. Most common DL frameworks do not support model parallelism, which makes it challenging to use all available GPUs in a cluster; SOK helps fill that void.

Download and try NVIDIA Merlin 

The latest update to NVIDIA Merlin, including Transformers4Rec and SOK, further streamlines and accelerates recommender workflows with open source interoperability and performance enhancements.

For more information about the latest release, download NVIDIA Merlin today.

Categories
Misc

‘Paint Me a Picture’: NVIDIA Research Shows GauGAN AI Art Demo Now Responds to Words

A picture worth a thousand words now takes just three or four words to create, thanks to GauGAN2, the latest version of NVIDIA Research’s wildly popular AI painting demo. The deep learning model behind GauGAN allows anyone to channel their imagination into photorealistic masterpieces — and it’s easier than ever. Simply type a phrase like …


Categories
Misc

BERT FineTuning training time

Hello

I am trying to fine-tune a Bert model using the well-known Movie Review dataset on M1 Chip.

The ETA for one epoch is estimated at 10 hours when fine-tuning all 66M parameters.

To reduce the ETA, I tried setting the first two layers to `trainable=False`, so the trainable parameters are now only about 2K.

Even though I dropped the trainable parameters, nothing changed; the ETA is still 10h.
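Roughly what I am doing (a simplified sketch; the exact checkpoint and head may differ):

import tensorflow as tf
from transformers import TFBertModel

encoder = TFBertModel.from_pretrained("bert-base-uncased")
encoder.trainable = False                        # freeze the BERT layers

input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32)
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32)
pooled = encoder(input_ids, attention_mask=attention_mask).pooler_output
outputs = tf.keras.layers.Dense(2, activation="softmax")(pooled)

model = tf.keras.Model([input_ids, attention_mask], outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()                                  # trainable params drop to ~2K, but the epoch ETA is still ~10h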

Do you think this is normal, or is there something wrong on my side?

Thanks

submitted by /u/i_cook_bits

Categories
Misc

find the maximum number in each feature map inside a tensor

Hi everybody,

Please, is there a way to find the maximum pixel value in each feature map inside a tensor?

For example, suppose we have a tensor x of shape (None, 48, 48, 32), which consists of 32 feature maps of size 48×48. How can I find the maximum pixel value in the fifth feature map?
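One way to do it (a sketch, assuming TensorFlow and a channels-last tensor):

import tensorflow as tf

x = tf.random.uniform((2, 48, 48, 32))           # stand-in for the (None, 48, 48, 32) tensor
per_map_max = tf.reduce_max(x, axis=[1, 2])      # shape (batch, 32): one max per feature map
fifth_map_max = per_map_max[:, 4]                # max pixel value of the fifth feature map, per example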

Thanks.

submitted by /u/Ali_Q_Saeed

Categories
Misc

An Elevated Experience: Xpeng G9 Takes EV Innovation Higher with NVIDIA DRIVE Orin

You don’t need a private plane to be at the forefront of personal travel. Electric automaker Xpeng took the wraps off the G9 SUV this week at the international Auto Guangzhou show in China. The intelligent, software-defined vehicle is built on the high-performance compute of NVIDIA DRIVE Orin and delivers AI capabilities that are continuously …


Categories
Offsites

Decisiveness in Imitation Learning for Robots

Despite considerable progress in robot learning over the past several years, some policies for robotic agents can still struggle to decisively choose actions when trying to imitate precise or complex behaviors. Consider a task in which a robot tries to slide a block across a table to precisely position it into a slot. There are many possible ways to solve this task, each requiring precise movements and corrections. The robot must commit to just one of these options, but must also be capable of changing plans each time the block ends up sliding farther than expected. Although one might expect such a task to be easy, that is often not the case for modern learning-based robots, which often learn behavior that expert observers describe as indecisive or imprecise.

Example of a baseline explicit behavior cloning model struggling on a task where the robot needs to slide a block across a table and then precisely insert it into a fixture.

To encourage robots to be more decisive, researchers often utilize a discretized action space, which forces the robot to choose option A or option B, without oscillating between options. For example, discretization was a key element of our recent Transporter Networks architecture, and is also inherent in many notable achievements by game-playing agents, such as AlphaGo, AlphaStar, and OpenAI’s Dota bot. But discretization brings its own limitations — for robots that operate in the spatially continuous real world, there are at least two downsides to discretization: (i) it limits precision, and (ii) it triggers the curse of dimensionality, since considering discretizations along many different dimensions can dramatically increase memory and compute requirements. Related to this, in 3D computer vision much recent progress has been powered by continuous, rather than discretized, representations.

With the goal of learning decisive policies without the drawbacks of discretization, today we announce our open source implementation of Implicit Behavioral Cloning (Implicit BC), which is a new, simple approach to imitation learning and was presented last week at CoRL 2021. We found that Implicit BC achieves strong results on both simulated benchmark tasks and on real-world robotic tasks that demand precise and decisive behavior. This includes achieving state-of-the-art (SOTA) results on human-expert tasks from our team’s recent benchmark for offline reinforcement learning, D4RL. On six out of seven of these tasks, Implicit BC outperforms the best previous method for offline RL, Conservative Q Learning. Interestingly, Implicit BC achieves these results without requiring any reward information, i.e., it can use relatively simple supervised learning rather than more-complex reinforcement learning.

Implicit Behavioral Cloning
Our approach is a type of behavior cloning, which is arguably the simplest way for robots to learn new skills from demonstrations. In behavior cloning, an agent learns how to mimic an expert’s behavior using standard supervised learning. Traditionally, behavior cloning involves training an explicit neural network (shown below, left), which takes in observations and outputs expert actions.

The key idea behind Implicit BC is to instead train a neural network to take in both observations and actions, and output a single number that is low for expert actions and high for non-expert actions (below, right), turning behavioral cloning into an energy-based modeling problem. After training, the Implicit BC policy generates actions by finding the action input that has the lowest score for a given observation.

Depiction of the difference between explicit (left) and implicit (right) policies. In the implicit policy, the “argmin” means the action that, when paired with a particular observation, minimizes the value of the energy function.
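As a rough sketch under an assumed energy_net(observation, action) interface (illustrative only, not the released implementation), the argmin can be approximated without gradients by scoring a batch of sampled candidate actions:

import torch

def implicit_policy_act(energy_net, obs, action_low, action_high, n_samples=1024):
    """Derivative-free approximation of the argmin: sample candidate actions
    uniformly within bounds and return the one with the lowest energy."""
    action_dim = action_low.shape[0]
    candidates = action_low + (action_high - action_low) * torch.rand(n_samples, action_dim)
    obs_tiled = obs.unsqueeze(0).expand(n_samples, -1)
    energies = energy_net(obs_tiled, candidates).squeeze(-1)
    return candidates[torch.argmin(energies)]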

To train Implicit BC models, we use an InfoNCE loss, which trains the network to output low energy for expert actions in the dataset, and high energy for all others (see below). It is interesting to note that this idea of using models that take in both observations and actions is common in reinforcement learning, but not so in supervised policy learning.
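A minimal sketch of such a loss, assuming the same energy_net(observation, action) interface as above (illustrative, not the paper’s code):

import torch
import torch.nn.functional as F

def info_nce_loss(energy_net, obs, expert_action, negative_actions):
    """The expert action should receive the lowest energy among itself and the
    sampled negative actions; class 0 in the cross-entropy is the expert."""
    candidates = torch.cat([expert_action.unsqueeze(0), negative_actions], dim=0)  # (1+N, action_dim)
    obs_tiled = obs.unsqueeze(0).expand(candidates.shape[0], -1)                   # (1+N, obs_dim)
    energies = energy_net(obs_tiled, candidates).squeeze(-1)                       # (1+N,)
    return F.cross_entropy(-energies.unsqueeze(0), torch.zeros(1, dtype=torch.long))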

Animation of how implicit models can fit discontinuities — in this case, training an implicit model to fit a step (Heaviside) function. Left: 2D plot fitting the black (X) training points — the colors represent the values of the energies (blue is low, brown is high). Middle: 3D plot of the energy model during training. Right: Training loss curve.

Once trained, we find that implicit models are particularly good at precisely modeling discontinuities (above) on which prior explicit models struggle (as in the first figure of this post), resulting in policies that are newly capable of switching decisively between different behaviors.

But why do conventional explicit models struggle? Modern neural networks almost always use continuous activation functions — for example, TensorFlow, JAX, and PyTorch all ship only with continuous activation functions. In attempting to fit discontinuous data, explicit networks built with these activation functions cannot represent discontinuities, so must draw continuous curves between data points. A key aspect of implicit models is that they gain the ability to represent sharp discontinuities, even though the network itself is composed only of continuous layers.

We also establish theoretical foundations for this aspect, specifically a notion of universal approximation. This characterizes the class of functions that implicit neural networks can represent, which can help justify and guide future research.

Examples of fitting discontinuous functions, for implicit models (top) compared to explicit models (bottom). The red highlighted insets show that implicit models represent discontinuities (a) and (b) while the explicit models must draw continuous lines (c) and (d) in between the discontinuities.

One challenge faced by our initial attempts at this approach was “high action dimensionality”, which means that a robot must decide how to coordinate many motors all at the same time. To scale to high action dimensionality, we use either autoregressive models or Langevin dynamics.
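For intuition, this is roughly what a gradient-based (Langevin) sampler looks like under the same assumed energy_net interface (a sketch, not the released code):

import torch

def langevin_action_sample(energy_net, obs, action_dim, steps=100, step_size=0.05, noise_scale=0.1):
    """Start from a random action and follow the negative energy gradient with
    added noise, so the sampler settles near low-energy (expert-like) actions."""
    action = torch.rand(1, action_dim, requires_grad=True)
    obs_batch = obs.unsqueeze(0)
    for _ in range(steps):
        energy = energy_net(obs_batch, action).sum()
        grad, = torch.autograd.grad(energy, action)
        action = (action - step_size * grad
                  + noise_scale * torch.randn_like(action)).detach().requires_grad_(True)
    return action.detach()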

Highlights
In our experiments, we found Implicit BC does particularly well in the real world, including an order of magnitude (10x) better on the 1mm-precision slide-then-insert task compared to a baseline explicit BC model. On this task the implicit model does several consecutive precise adjustments (below) before sliding the block into place. This task demands multiple elements of decisiveness: there are many different possible solutions due to the symmetry of the block and the arbitrary ordering of push maneuvers, and the robot needs to discontinuously decide when the block has been pushed far “enough” before switching to slide it in a different direction. This is in contrast to the indecisiveness that is often associated with continuous-controlled robots.

Example task of sliding a block across a table and precisely inserting it into a slot. These are autonomous behaviors of our Implicit BC policies, using only images (from the shown camera) as input.

A diverse set of different strategies for accomplishing this task. These are autonomous behaviors from our Implicit BC policies, using only images as input.

In another challenging task, the robot needs to sort blocks by color, which presents a large number of possible solutions due to the arbitrary ordering of sorting. On this task the explicit models are customarily indecisive, while implicit models perform considerably better.

Comparison of implicit (left) and explicit (right) BC models on a challenging continuous multi-item sorting task. (4x speed)

In our testing, implicit BC models can also exhibit robust reactive behavior, even when we try to interfere with the robot, despite the model never seeing human hands.

Robust behavior of the implicit BC model despite interfering with the robot.

Overall, we find that Implicit BC policies can achieve strong results compared to state of the art offline reinforcement learning methods across several different task domains. These results include tasks that, challengingly, have either a low number of demonstrations (as few as 19), high observation dimensionality with image-based observations, and/or high action dimensionality up to 30 — which is a large number of actuators to have on a robot.

Policy learning results of Implicit BC compared to baselines across several domains.

Conclusion
Despite its limitations, behavioral cloning with supervised learning remains one of the simplest ways for robots to learn from examples of human behaviors. As we showed here, replacing explicit policies with implicit policies when doing behavioral cloning allows robots to overcome the “struggle of decisiveness”, enabling them to imitate much more complex and precise behaviors. While the focus of our results here was on robot learning, the ability of implicit functions to model sharp discontinuities and multimodal labels may have broader interest in other application domains of machine learning as well.

Acknowledgements
Pete and Corey summarized research performed together with other co-authors: Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, and Jonathan Tompson. The authors would also like to thank Vikas Sindwhani for project direction advice; Steve Xu, Robert Baruch, Arnab Bose for robot software infrastructure; Jake Varley, Alexa Greenberg for ML infrastructure; and Kamyar Ghasemipour, Jon Barron, Eric Jang, Stephen Tu, Sumeet Singh, Jean-Jacques Slotine, Anirudha Majumdar, Vincent Vanhoucke for helpful feedback and discussions.

Categories
Misc

In Pursuit of Smart City Vision, Startup Two-i Keeps an AI on Worker Safety

When Julien Trombini and Guillaume Cazenave founded video-analytics startup Two-i four years ago, they had an ambitious goal: improving the quality of urban life by one day being able to monitor a city’s roads, garbage collection and other public services. Along the way, the pair found a wholly different niche. Today, the company’s technology …


Categories
Misc

The Need for Speed: Edge AI with NVIDIA GPUs and SmartNICs Part 1

How to integrate the NVIDIA GPU and Network Operators

NVIDIA Operators simplify GPU and SmartNIC management on Kubernetes. This post shows how to integrate NVIDIA Operators into new edge AI platforms using preinstalled drivers. This is the first post in a two-part series. The next post describes how to integrate NVIDIA Operators using custom driver containers.

A Future-proof edge AI platform

Today every industry uses edge AI. Servers deployed in airplanes, stores, and factories respond to IoT sensors in real time. They predict weather, prevent theft, and guarantee manufacturing quality.

AI makes sensor data actionable. Trained AI models recognize patterns and trigger responses. A trained AI model represents a company’s business intelligence. Just as crude oil becomes valuable when refined into petroleum, AI transforms sensor data into insight.

But unlike oil, the amount of IoT sensor data is growing exponentially. The vast amount of data generated at the edge can overwhelm an edge server’s ability to process it.

That is why edge AI needs acceleration. NVIDIA GPUs and SmartNICs future-proof an edge AI platform against exponential data growth.

Edge AI is Cloud Native

This post describes how to integrate NVIDIA accelerators with Kubernetes. Why focus on Kubernetes? Because edge AI is cloud native. Most AI applications are container-based microservices. Kubernetes is the unofficial standard for container orchestration.

Edge AI platforms build on Kubernetes due to its flexibility. The Kubernetes API supports declarative automation and is extensible through custom resource definitions. A robust software ecosystem supports Kubernetes day one and day two operations.

NVIDIA Fleet Command is one example of a Kubernetes-based Edge AI platform. Fleet Command is a hybrid cloud service designed for security and performance. It manages AI application lifecycle on bare metal edge nodes. Fleet Command also integrates with NGC, NVIDIA’s curated registry of more than 700 GPU-optimized applications.

While Fleet Command supports NVIDIA GPUs and SmartNICs, many edge platforms do not. For those, NVIDIA provides open-source Kubernetes operators to enable GPU and SmartNIC acceleration. There are two operators: the NVIDIA GPU Operator and the NVIDIA Network Operator.

The NVIDIA GPU Operator automates GPU deployment and management on Kubernetes. The GPU Operator Helm Chart is available on NGC. It includes several components:

This chart represents the components that make up the NVIDIA GPU Operator. They include a Driver container, container toolkit, Kubernetes device plugin, NVIDIA DCGM Exporter, GPU Feature Discovery, and GPU MIG Manager.
Figure 1. These components make up the NVIDIA GPU Operator

The NVIDIA Network Operator automates CONNECTX SmartNIC configuration for Kubernetes pods that need fast networking. It is also delivered as a Helm chart. The Network Operator adds a second network interface to a pod using the Multus CNI plug-in. It supports both Remote Direct Memory Access (RDMA) and Shared Root I/O Virtualization (SRIOV).

The NVIDIA Network Operator includes the following components:

  • The NVIDIA OFED Driver container automates network driver and library installation.
  • The Kubernetes RDMA shared-device plug-in attaches RDMA devices to pods. It supports Infiniband and RDMA over Converged Ethernet (RoCE).
  • The SRIOV device plug-in attaches SRIOV Virtual Functions (VFs) to pods.
  • The Containernetworking CNI plug-in is a standard interface for extending Kubernetes networking capabilities.
  • The Whereabouts CNI plug-in manages cluster-wide automatic IP address creation and assignment.
  • The MACVLAN CNI functions as a virtual switch to connect pods to network functions.
  • The Multus CNI plug-in enables attaching multiple network devices to a Kubernetes pod.
  • The Host-device CNI plug-in moves an existing device (such as an SRIOV VF) from the host network namespace into the pod’s.
This image shows the components that make up the NVIDIA Network Operator. The components include the MOFED Driver Container, RDMA Shared Device Plugin, SRIOV Device Plugin, Container Network Plugin, Multus CNI Plugin, and IPAM CNI Plugin.
Figure 2. These components make up the NVIDIA Network Operator

Both operators use Node Feature Discovery. This service identifies which cluster nodes have GPUs and SmartNICs.

The operators work together or separately. Deploying them together enables GPUDirect RDMA. This feature bypasses host buffering to increase throughput between the NIC and GPU.

The NVIDIA operators are open source software. They already support popular Kubernetes distributions running on NVIDIA Certified servers. But many edge platforms run customized Linux distributions the operators do not support. This post explains how to integrate NVIDIA operators with those platforms.

Two Paths, One Way

Preinstalled drivers are signed drivers that offer secure and measured boot. 

Custom driver containers offer an immutable image for heterogeneous clusters.
Figure 3. There are two methods for integrating NVIDIA Operators: preinstalled drivers or custom driver containers

Portability is one of the main benefits of cloud native software. Containers bundle applications with their dependencies. This lets them run, scale, and migrate across different platforms without friction.

NVIDIA operators are container-based, cloud native applications. Most of the operator services do not need any integration to run on a new platform. But both operators include driver containers, and drivers are the exception. Drivers are kernel-dependent. Integrating NVIDIA operators with a new platform involves rebuilding the driver containers for the target kernel. The platform may be running an unsupported Linux distribution or a custom-compiled kernel. 

There are two approaches to delivering custom drivers:

First, by installing the drivers onto the host before installing the operators. Many edge platforms deliver signed drivers in their base operating system image to support secure and measured boot. Platforms requiring signed drivers cannot use the driver containers deployed by the operators. NVIDIA Fleet Command follows this pattern. Both the Network and GPU operators support preinstalled drivers by disabling their own driver containers. 

The second approach is to replace the operator’s driver containers with custom containers. Edge platforms with immutable file systems prefer this method. Edge servers often run as appliances. They use read-only file systems to increase security and prevent configuration drift. Running driver and application containers in memory instead of adding them to the immutable image reduces its size and complexity. This also allows the same image to run on nodes with different hardware profiles.

This post explains how to set up both patterns. The first section of the post describes driver preinstallation. The second section describes how to build and install custom driver containers.

Apart from the driver containers, the remaining operator services generally run on new platforms without modification. NVIDIA tests both operators on leading container runtimes such as Docker Engine, CRI-O, and Containerd. The GPU Operator also supports the runtime class resource for per-pod runtime selection.

Preinstalled driver integration

The rest of this post shows how to integrate NVIDIA operators with custom edge platforms. It includes step-by-step procedures for both the driver preinstallation and driver container methods.

Table 1 describes the test system used to demonstrate these procedures.

TABLE 1: Test System Description

Linux Distribution: CentOS 7.9.2009
Kernel version: 3.10.0-1160.45.1.el7.custom
Container runtime: CRI-O 1.21.3
Kubernetes: 1.21.3-0
Helm: v3.3.3
Cluster network: Calico v3.20.2
Compiler: GCC 4.8.5 2015062
Developer tools: Elfutils 0.176-5
Server: NVIDIA DRIVE Constellation
Server BIOS: v5.12
CPU: (2) Intel Xeon Gold 6148
GPU: A100-PCIE-40GB
SmartNIC: ConnectX-6 Dx MT2892
SmartNIC Firmware: 22.31.1014
GPU Operator: v1.8.2
GPU Driver (operator): 470.74
GPU Driver (local): 470.57.02
Network Operator: v1.0.0
MOFED (operator): 5.4-1.0.3.0
MOFED (local): 5.4-1.0.3.0
CUDA: 11.4
Node Feature Discovery: v0.8.0

The operating system, Linux kernel, and container runtime combination on the test system is not supported by either operator. The Linux kernel is custom compiled, so precompiled drivers are not available. The test system also uses the Cri-o container runtime, which is less common than alternatives like Containerd and Docker Engine.

Prepare the System

  1. First, verify that the CONNECTX SmartNIC and NVIDIA GPU are visible on the test system.
$ lspci | egrep 'nox|NVI'
23:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)
49:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
49:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
5e:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
e3:00.0 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
e3:00.1 Ethernet controller: Mellanox Technologies MT2892 Family [ConnectX-6 Dx]
e6:00.0 3D controller: NVIDIA Corporation Device 20f1 (rev a1)

2. View the operating system and Linux kernel versions. In this example, the Centos 7 3.10.0-1160.45.1 kernel was recompiled to 3.10.0-1160.45.1.el7.custom.x86_64.

$ cat /etc/redhat-release 
CentOS Linux release 7.9.2009 (Core)
 
$ uname -r
3.10.0-1160.45.1.el7.custom.x86_64

3. View the Kubernetes version, network configuration, and cluster nodes. This output shows a single node cluster, which is a typical pattern for edge AI deployments. The node is running Kubernetes version 1.21. 

$ kubectl get nodes
NAME     STATUS   ROLES           AGE   VERSION
cgx-20   Ready    control-plane   23d   v1.21.3

4. View the installed container runtime. This example shows the cri-o container runtime.

$ kubectl get node cgx-20 -o yaml | grep containerRuntime
    containerRuntimeVersion: cri-o://1.21.3

5. NVIDIA delivers operators through Helm charts. View the installed Helm version.

$ helm version
version.BuildInfo{Version:"v3.3.3", GitCommit:"55e3ca022e40fe200fbc855938995f40b2a68ce0", GitTreeState:"clean", GoVersion:"go1.14.9"}

Install the Network Operator with preinstalled Drivers

The Mellanox OpenFabrics Enterprise Distribution for Linux (MOFED) installs open source drivers and libraries for high-performance networking. The NVIDIA Network Operator optionally installs a MOFED container to load these drivers and libraries on Kubernetes. This section describes the process for preinstalling MOFED drivers on the host in the event that the included driver container cannot be used.

  1. Install the prerequisites.
$ yum install -y perl numactl-libs gtk2 atk cairo gcc-gfortran tcsh libnl3 tcl tk python-devel pciutils make lsof redhat-rpm-config rpm-build libxml2-python ethtool iproute net-tools openssh-clients git openssh-server wget fuse-libs

2. Download and extract the MOFED archive for the Linux distribution.

$ wget https://www.mellanox.com/downloads/ofed/MLNX_OFED-5.4-1.0.3.0/MLNX_OFED_LINUX-5.4-1.0.3.0-rhel7.9-x86_64.tgz
 
$ tar zxf MLNX_OFED_LINUX-5.4-1.0.3.0-rhel7.9-x86_64.tgz

3. Install the kernel-space drivers using mlnxofedinstall. The install script may automatically update the CONNECTX SmartNIC firmware.

$ cd MLNX_OFED_LINUX-5.4-1.0.3.0-rhel7.9-x86_64
 
$ ./mlnxofedinstall --without-rshim-dkms --without-iser-dkms --without-isert-dkms --without-srp-dkms --without-kernel-mft-dkms --without-mlnx-rdma-rxe-dkms 

4. Reboot to load the new drivers.

$ sudo shutdown -r now

5. After reboot, make sure that the drivers are loaded.

$ /etc/init.d/openibd status
 
  HCA driver loaded
 
Configured Mellanox EN devices:
enp94s0
ens13f0
ens13f1
ens22f0
ens22f1
 
Currently active Mellanox devices:
enp94s0
ens13f0
ens13f1
ens22f0
ens22f1
 
The following OFED modules are loaded:
 
  rdma_ucm
  rdma_cm
  ib_ipoib
  mlx5_core
  mlx5_ib
  ib_uverbs
  ib_umad
  ib_cm
  ib_core
  mlxfw

Once MOFED is successfully installed and the drivers are loaded, proceed to installing the NVIDIA Network Operator.  

6. Identify the secondary network device name. This will be the device or devices plumbed into the pod as a secondary network interface.

$ ibdev2netdev
mlx5_0 port 1 ==> ens13f0 (Up)
mlx5_1 port 1 ==> ens13f1 (Down)
mlx5_2 port 1 ==> enp94s0 (Up)
mlx5_3 port 1 ==> ens22f0 (Up)
mlx5_4 port 1 ==> ens22f1 (Down)

7. By default the Network Operator does not deploy to a Kubernetes master. Remove the master label from the node to accommodate the all-in-one cluster deployment. 

$ kubectl label nodes --all node-role.kubernetes.io/master- --overwrite

Note this is a temporary workaround to allow Network Operator to schedule pods to the master node in a single node cluster. Future versions of the Network Operator will add toleration and nodeAffinity to avoid this workaround.

8. Add the Mellanox Helm chart repository.

$ helm repo add mellanox https://mellanox.github.io/network-operator
$ helm repo update
$ helm repo ls
NAME            URL                                             
mellanox        https://mellanox.github.io/network-operator

9. Create a values.yaml to specify Network Operator configuration. This example deploys the RDMA shared device plug-in and specifies ens13f0 as the RDMA-capable interface.

$ cat roce_shared_values.yaml 
nfd:
  enabled: true
deployCR: true
sriovDevicePlugin:
  deploy: false
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a
      vendors: [15b3]
      deviceIDs: [101d]
      ifNames: [ens13f0]

10. Install the Network Operator Helm chart, overriding the default values.yaml with the new configuration file.

$ helm install -f ./roce_shared_values.yaml -n network-operator --create-namespace --wait network-operator mellanox/network-operator

11. Verify that all Network Operator pods are in Running status.

$ kubectl get pods -n nvidia-network-operator-resources
NAME                      READY   STATUS    RESTARTS   AGE
cni-plugins-ds-fcrsq      1/1     Running   0          3m44s
kube-multus-ds-4n526      1/1     Running   0          3m44s
rdma-shared-dp-ds-5rq4x   1/1     Running   0          3m44s
whereabouts-9njxm         1/1     Running   0          3m44s

Note that some versions of Calico are incompatible with certain Multus CNI versions. Change the Multus API version after the Multus daemonset starts.

$ sed -i 's/0.4.0/0.3.1/' /etc/cni/net.d/00-multus.conf

12. The Helm chart creates a configMap that is used to label the node with the selectors defined in the values.yaml file. Verify that the node is correctly labeled by NFD and that the RDMA shared devices are created. 

$ kubectl describe cm -n nvidia-network-operator-resources rdma-devices | grep 15b3
{ "configList": [ { "resourceName": "rdma_shared_device_a", "rdmaHcaMax": 1000, "selectors": { "vendors": ["15b3"], "deviceIDs": ["101d"], "drivers": [], "ifNames": ["ens13f0"], "linkTypes": [] } } ] } 
 
$ kubectl describe node cgx-20 | egrep '15b3|rdma_shared'
                    feature.node.kubernetes.io/pci-15b3.present=true
                    feature.node.kubernetes.io/pci-15b3.sriov.capable=true
  rdma/rdma_shared_device_a:  1k
  rdma/rdma_shared_device_a:  1k
  rdma/rdma_shared_device_a  0           0

Install the GPU Operator with preinstalled Drivers

Follow a similar process to install the GPU Operator with preinstalled drivers.

  1. First, disable the nouveau GPU driver, blacklist it from loading, and rebuild the initial RAMdisk.
$ cat 

2. Download the NVIDIA GPU driver install script for Linux. In this example, we are using driver version 470.57.02.

$ wget https://us.download.nvidia.com/tesla/470.57.02/NVIDIA-Linux-x86_64-470.57.02.run

3. Run the install script. The script automatically compiles a driver for the target operating system kernel.

$ sh NVIDIA-Linux-x86_64-470.57.02.run -q -a -n -X -s

4. Verify the driver loads successfully.

$ modinfo -F version nvidia
470.57.02

5. Disable SELinux in the Cri-o container runtime configuration and restart the service. 

Note that SELinux is in Permissive mode on the test system. Additional steps are needed when SELinux is in Enforcing mode.

$ cat 

6. Remove the taint on scheduling to the master node.

$ kubectl taint nodes --all node-role.kubernetes.io/master-
node/cgx-20 untainted

7. Install the GPU Operator Helm chart repository.

$ helm repo add nvidia https://nvidia.github.io/gpu-operator
 
$ helm repo update
 
# helm repo ls
NAME    	URL                                        
nvidia  	https://nvidia.github.io/gpu-operator      
mellanox	https://mellanox.github.io/network-operator

8. Install the GPU Operator Helm chart. Overriding the driver.enabled parameter to false disables driver container installation. Also specify crio as the container runtime.

$ helm install --generate-name nvidia/gpu-operator --set driver.enabled=false --set toolkit.version=1.7.1-centos7 --set operator.defaultRuntime=crio
 
$ helm ls
NAME                   	NAMESPACE	REVISION	UPDATED                                	STATUS  	CHART              	APP VERSION
gpu-operator-1635194696	default  	1       	2021-10-25 16:44:57.237363636 -0400 EDT	deployed	gpu-operator-v1.8.2	v1.8.2 

9. View the GPU Operator resources. All pods should be in status Running or Completed.

$ kubectl get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-6kpxt                1/1     Running     0          114s
nvidia-container-toolkit-daemonset-sprjb   1/1     Running     0          114s
nvidia-cuda-validator-ndc78                0/1     Completed   0          90s
nvidia-dcgm-exporter-n9xnp                 1/1     Running     0          114s
nvidia-dcgm-pfknx                          1/1     Running     0          114s
nvidia-device-plugin-daemonset-4qnh6       1/1     Running     0          114s
nvidia-device-plugin-validator-845pw       0/1     Completed   0          84s
nvidia-mig-manager-rf7vz                   1/1     Running     0          114s
nvidia-operator-validator-5ngbk            1/1     Running     0          114s

10. View the validation pod logs to verify validation tests completed.

$ kubectl logs -n gpu-operator-resources nvidia-device-plugin-validator-845pw
device-plugin workload validation is successful
 
$ kubectl logs -n gpu-operator-resources nvidia-cuda-validator-ndc78 
cuda workload validation is successful

11. Run nvidia-smi from within the validator container to display the GPU, driver, and CUDA versions. This also validates that the container runtime prestart hook works as expected.

$ kubectl exec -n gpu-operator-resources -i -t nvidia-operator-validator-5ngbk --container nvidia-operator-validator -- nvidia-smi
Mon Oct 25 20:57:28 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:23:00.0 Off |                    0 |
| N/A   26C    P0    32W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCI...  Off  | 00000000:E6:00.0 Off |                    0 |
| N/A   26C    P0    32W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Test the Preinstalled Driver Integration

Test the preinstalled driver integration by creating test pods.

1. Create a network attachment definition. A network attachment definition is a custom resource that allows pods to connect to one or more networks. This network attachment definition defines a MACVLAN network that bridges multiple pods across a secondary interface. The Whereabouts CNI automates IP address assignments for pods connected to the secondary network.

$ cat 

2. Apply the network attachment definition.

$ kubectl create -f roce_shared_macvlan_net.yaml
 
$ kubectl describe network-attachment-definition roce-shared-macvlan-network | grep Config
  Config:  { "cniVersion":"0.4.0", "name":"roce-shared-macvlan-network", "type":"macvlan","master": "ens13f0","mode" : "bridge","mtu" : 1500,"ipam":{"type":"whereabouts","datastore":"kubernetes","kubernetes":{"kubeconfig":"/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"},"range":"192.168.2.225/28","exclude":["192.168.2.229/30","192.168.2.236/32"],"log_file":"/var/log/whereabouts.log","log_level":"info","gateway":"192.168.2.1"} }

3. Create the test pod spec file. The spec file should include an annotation for the network attachment and resources limits for the RDMA device.

$ cat 

4. Create the test pod.

$ kubectl create -f roce_shared_pod.yaml 
 
$ kubectl get pods | grep roce
roce-shared-pod               1/1     Running   0          6m46s

5. View the test pod logs to verify the network attachment. The secondary interface in this example is named net1.

$ kubectl describe pod roce-shared-pod | grep -B1 rdma
    Limits:
      rdma/rdma_shared_device_a:  1
    Requests:
      rdma/rdma_shared_device_a:  1
 
$ kubectl logs roce-shared-pod
/dev/infiniband:
total 0
crw-rw-rw-. 1 root root 231,  64 Oct 13 22:48 issm0
crw-rw-rw-. 1 root root  10,  56 Oct 13 22:48 rdma_cm
crw-rw-rw-. 1 root root 231,   0 Oct 13 22:48 umad0
crw-rw-rw-. 1 root root 231, 192 Oct 13 22:48 uverbs0
 
/sys/class/net:
total 0
lrwxrwxrwx. 1 root root 0 Oct 13 22:48 eth0 -> ../../devices/virtual/net/eth0
lrwxrwxrwx. 1 root root 0 Oct 13 22:48 lo -> ../../devices/virtual/net/lo
lrwxrwxrwx. 1 root root 0 Oct 13 22:48 net1 -> ../../devices/virtual/net/net1
lrwxrwxrwx. 1 root root 0 Oct 13 22:48 tunl0 -> ../../devices/virtual/net/tunl0

6. View the address assignment on net1.

$ kubectl exec -ti roce-shared-pod -- ifconfig net1
  mtu 1500
        inet 192.168.2.225  netmask 255.255.255.240  broadcast 192.168.2.239
        inet6 fe80::6871:9cff:fe1b:afe4  prefixlen 64  scopeid 0x20
        ether 6a:71:9c:1b:af:e4  txqueuelen 0  (Ethernet)
        RX packets 405  bytes 24300 (23.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 9  bytes 698 (698.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

7. The GPU Operator creates pods to validate the driver, container runtime, and Kubernetes device plug-in. Create an additional GPU test pod.

$ cat 

8. View the results.

$ kubectl get pod cuda-vectoradd
NAME             READY   STATUS      RESTARTS   AGE
cuda-vectoradd   0/1     Completed   0          34s
 
$ kubectl logs cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

9. Load the nvidia-peermem driver. It provides GPUDirect RDMA for CONNECTX SmartNICs. This driver is included in NVIDIA Linux GPU driver version 470 and greater. It is compiled automatically during Linux driver installation if both the ib_core  and NVIDIA GPU driver sources are present on the system. This means the MOFED driver should be installed before the GPU driver so the MOFED source is available to build the nvidia-peermem driver.

$ modprobe nvidia-peermem

$ lsmod | grep nvidia_peermem
nvidia_peermem         13163  0 
nvidia              35224507  113 nvidia_modeset,nvidia_peermem,nvidia_uvm
ib_core               357959  9 rdma_cm,ib_cm,iw_cm,mlx5_ib,ib_umad,nvidia_peermem,ib_uverbs,rdma_ucm,ib_ipoib

Learn more about the process for integrating NVIDIA accelerators using custom driver containers.

Part 2 of this series will be published on 11/22. It will describe how to integrate the NVIDIA GPU and Network Operators with custom driver containers.