
Low-Code Building Blocks for Speech AI Robotics

When examining an intricate speech AI robotic system, it’s easy for developers to feel intimidated by its complexity. Arthur C. Clarke famously observed, “Any sufficiently advanced technology is indistinguishable from magic.”

From accepting natural-language commands to safely interacting in real-time with its environment and the humans around it, today’s speech AI robotics systems can perform tasks to a level previously unachievable by machines.


Take Spot, for example: a speech AI-enabled robot that can fetch drinks on its own. To easily add speech AI skills such as automatic speech recognition (ASR) or text-to-speech (TTS), many developers rely on simpler low-code building blocks when assembling complex robot systems.

Photograph of Spot, an intelligent robotic system, after it has successfully completed a drink order.
Figure 1. Spot, a robotic dog, fetches a drink in real time after processing an order using ASR and TTS skills provided by NVIDIA Riva.

For developers creating robotic applications with speech AI skills, this post breaks down the low-code building blocks provided by the NVIDIA Riva SDK.

By following along with the provided code examples, you’ll learn how speech AI technology makes it possible for intelligent robots to take food orders, relay those orders to a restaurant employee, and finally navigate back home when prompted.

Design an AI robotic system using building blocks

Complex systems consist of several building blocks. Each building block is much simpler to understand on its own.

When you understand the function of each component, the end product becomes less daunting. With low-code building blocks handling the common functionality, you can focus your effort on the domain-specific customizations that require it.

Our latest project uses “Spot,” a four-legged robot, and an NVIDIA Jetson Orin, which is connected to Spot through an Ethernet cable. This project is a prime example of using AI building blocks to form a complex speech AI robot system.

Architectural diagram of an AI robotics system with Riva low-code speech AI blocks shown as core components for platform, navigation, and interaction.
Figure 2. A speech AI robot system with Riva low-code speech AI blocks to add ASR and TTS skills

Our goal was to build a robot that could fetch us snacks on its own from a local restaurant, with as little intervention from us as possible. We also set out to write as little code as possible by using what we could from open-source libraries and tools. Almost all the software used in this project was freely available.

To achieve this goal, an AI system must be able to interact with humans vocally, perceive its environment (in our case, with an embedded camera), and navigate through the surroundings safely. Figure 2 shows how interaction, platform, and navigation represent our Spot robot’s three fundamental operation components, and how those components are further subdivided into low-code building blocks.

This post focuses solely on the human interaction blocks from the Riva SDK.

Add speech recognition and speech synthesis skills using Riva

We have so many interactions with people every day that it is easy to overlook how complex those interactions actually are. Speaking comes naturally to humans, but understanding speech and talking back is far from simple for an intelligent machine.

Riva is a fully customizable, GPU-accelerated speech AI SDK that handles ASR and TTS skills, and is deployable on-premises, in all clouds, at the edge, and on embedded devices. It facilitates human-machine speech interactions.

Riva runs entirely locally on the Spot robot. Therefore, processing is secure and does not require internet access. It is also completely configurable with a simple parameter file, so no extra coding is needed.

Riva code examples for each speech AI task

Riva provides ready-to-use Python scripts and command-line tools for real-time transformation of audio data captured by a microphone into text (ASR, speech recognition, or speech-to-text) and for converting text into an audio output (TTS, or speech synthesis).

Adapting these scripts for compatibility with the Robot Operating System (ROS) from Open Robotics requires only minor changes, which helps simplify the robotic system development process.

ASR customizations

The out-of-the-box (OOTB) Riva Python client ASR script is named transcribe_mic.py. By default, it prints ASR output to the terminal. With a small modification, the ASR output is instead published to ROS topics, where it can be read by anything on the ROS network. The critical additions to the script’s main() function are shown in the following code example:

    # Requires: import rospy; from std_msgs.msg import String
    inter_pub = rospy.Publisher('intermediate', String, queue_size=10)  # interim (partial) transcripts
    final_pub = rospy.Publisher('final', String, queue_size=10)         # finalized transcripts
    rospy.init_node('riva_asr', anonymous=True)

The following code example includes more critical additions to main:

        for response in responses:
            if not response.results:
                continue
            partial_transcript = ""
            for result in response.results:
                if not result.alternatives:
                    continue
                transcript = result.alternatives[0].transcript
                if result.is_final:
                    # Publish each finalized alternative to the 'final' topic
                    for i, alternative in enumerate(result.alternatives):
                        final_pub.publish(alternative.transcript)
                else:
                    # Accumulate interim results into one partial transcript
                    partial_transcript += transcript
            if partial_transcript:
                inter_pub.publish(partial_transcript)

TTS customizations

Riva also provides the talk.py script for TTS. By default, you enter text in a terminal or Python interpreter, and Riva generates audio output from it. For Spot to speak, the talk.py script is modified so that the input text comes from a ROS callback rather than a human’s keystrokes. The key changes to the OOTB script include this function for extracting the text:

def callback(msg):
    global TTS
    TTS = msg.data  # store the latest text to be synthesized

They also include these additions to the main() function:

    rospy.init_node('riva_tts', anonymous=True)
    rospy.Subscriber("speak", String, callback)

These altered conditional statements in the main() function are also key:

        while not rospy.is_shutdown():
            if TTS is not None:
                text = TTS  # hand the received text to the Riva TTS request

Voice interaction script

Simple scripts like voice_control.py consist primarily of the callback and talker functions. They tell Spot what words to listen for and how to respond. 

def callback(msg):
    global pub, order
    rospy.loginfo(msg.data)
    text = msg.data.lower()
    if "hey spot" in text and "fetch me" in text:
        order_start = text.index("fetch me")
        order = msg.data[order_start + 9:]  # everything after "fetch me "
        pub.publish("Fetching " + order)

def talker():
    global pub
    rospy.init_node("spot_voice_control", anonymous=True)
    pub = rospy.Publisher("speak", String, queue_size=10)  # text sent here is spoken by Riva TTS
    rospy.Subscriber("final", String, callback)            # finalized ASR transcripts
    rospy.spin()

In other words, if the text contains “Hey Spot, … fetch me…” Spot saves the rest of the sentence as an order. After the ASR transcript indicates that the sentence is finished, Spot activates the TTS client and recites the word “Fetching” plus the contents of the order. Other scripts then engage a ROS action server instructing Spot to navigate to the restaurant, while taking care to avoid cars and other obstacles.

When Spot reaches the restaurant, it waits for a person to take its order by saying “Hello Spot.” If the ASR analysis script detects this sequence, Spot recites the order and ends it with “please.” The restaurant employee places the ordered food and any change in the appropriate container on Spot’s back. Spot returns home after Riva ASR recognizes that the restaurant staffer has said, “Go home, Spot.”
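The following sketch shows how the restaurant-side phrases could be handled in the same style as voice_control.py. It is a hypothetical illustration: the topic names and the order variable mirror the earlier snippets, while the go_home_pub publisher is assumed for the example rather than taken from the project code.

def restaurant_callback(msg):
    global pub, go_home_pub, order
    text = msg.data.lower()
    if "hello spot" in text and order:
        # Recite the saved order to the restaurant employee, ending with "please"
        pub.publish(order + ", please")
    elif "go home spot" in text:
        # Hand off to the navigation stack to return to the start position
        go_home_pub.publish("home")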

The technology behind a speech AI SDK like Riva, which lets you build and deploy fully customizable, real-time speech AI applications on-premises, in any cloud, at the edge, or on embedded devices, brings AI robotics into the real world.

When a robot can interact seamlessly with people, it opens up a world of new areas where robots can help, without needing a technical person at a computer to translate between human and machine.

Deploy your own speech AI robot with a low-code solution

Teams such as NVIDIA and Open Robotics, and the robotics community in general, have done a fantastic job of solving speech AI and robotics problems and making that technology available and accessible to everyday robotics users.

Anyone eager to get into the industry or to improve the technology they already have can look to these groups for inspiration and examples of cutting-edge technology. These technologies are usable through free SDKs (Riva, ROS, NVIDIA DeepStream, NVIDIA CUDA) and capable hardware (robots, NVIDIA Jetson Orin, sensors).

I am thrilled to see this level of community support from technology leaders and invite you to build your own speech AI robot. Robots are awesome!



Inside AI: NVIDIA DRIVE Ecosystem Creates Pioneering In-Cabin Features with NVIDIA DRIVE IX

As personal transportation becomes electrified and automated, time in the vehicle has begun to resemble that of a living space rather than a mind-numbing commute. Companies are creating innovative ways for drivers and passengers to make the most of this experience, using the flexibility and modularity of NVIDIA DRIVE IX. In-vehicle technology companies Cerence, Smart…



Enhancing Digital Twin Models and Simulations with NVIDIA Modulus v22.09

The latest version of NVIDIA Modulus, an AI framework that enables users to create customizable training pipelines for digital twins, climate models, and physics-based modeling and simulation, is now available for download. 

This release of the physics-ML framework, NVIDIA Modulus v22.09, includes key enhancements to increase composition flexibility for neural operator architectures, features to improve training convergence and performance, and most importantly, significant improvements to the user experience and documentation. 

You can download the latest version of the Modulus container from DevZone, NGC, or access the Modulus repo on GitLab.

Neural network architectures

This update extends the Fourier Neural Operator (FNO), physics-informed neural operator (PINO), and DeepONet network architecture implementations to support customization using other built-in networks in Modulus. More specifically, with this update, you can:

  • Achieve better initialization, customization, and generalization across problems with improved FNO, PINO, and DeepONet architectures.
  • Explore new network configurations by combining any point-wise network within Modulus such as Sirens, Fourier Feature networks, and Modified Fourier Feature networks for the decoder portion of FNO/PINO with the spectral encoder.
  • Use any network in the branch net and trunk net of DeepONet to experiment with a wide selection of architectures. This includes the physics-informed neural networks (PINNs) in the trunk net. FNO can be used in the branch net of DeepONet as well.
  • Demonstrate DeepONet improvements with a new DeepONet example for modeling Darcy flow through porous media.

Model parallelism has been introduced as a beta feature with model-parallel AFNO. This enables parallelizing the model across multiple GPUs along the channel dimension. This decomposition distributes the FFTs and IFFTs in a highly parallel fashion. The matrix multiplies are partitioned so each GPU holds a different portion of each MLP layer’s weights with appropriate gather, scatter, reductions, and other communication routines implemented for the forward and backward passes.

In addition, support for the self-scalable tanh (Stan) activation function is now available. Stan has been shown to improve convergence characteristics and increase accuracy when training PINN models.
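For reference, the following is a minimal PyTorch sketch of the commonly cited Stan formulation, stan(x) = tanh(x) + beta * x * tanh(x), with a learnable per-feature beta. It illustrates the idea only and is not the Modulus implementation.

import torch
import torch.nn as nn

class Stan(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        # one learnable scale per feature, initialized to 1
        self.beta = nn.Parameter(torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(x) + self.beta * x * torch.tanh(x)

layer = nn.Sequential(nn.Linear(2, 64), Stan(64), nn.Linear(64, 1))
print(layer(torch.randn(8, 2)).shape)  # torch.Size([8, 1])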

Finally, support for kernel fusion of the Sigmoid Linear Unit (SiLU) through TorchScript is now added with upstream changes to the PyTorch symbolic gradient formula. This is especially useful for problems that require computing higher-order derivatives for physics-informed training, providing up to 1.4x speedup in such instances.

Modeling enhancements and training features

Each NVIDIA Modulus release improves the modeling aspects to better map partial differential equations (PDEs) to neural network models and to improve training convergence.

New recommended practices in Modulus are available to facilitate scaling and nondimensionalizing PDEs to help you properly scale your system’s units (a brief sketch of the workflow follows the list):

  • Defining a physical quantity with its value and its unit
  • Instantiating a nondimensionalized object to scale the quantity 
  • Tracking the nondimensionalized quantity through the algebraic manipulations
  • Scaling back the nondimensionalized quantity to any target quantity with user-specified units for post-processing purposes
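The following is a framework-agnostic sketch of that workflow, not the Modulus API; the characteristic scales and numbers are made up for illustration.

# Characteristic scales chosen for the problem (illustrative values)
length_scale = 0.02     # m, e.g., channel height
velocity_scale = 1.5    # m/s, e.g., inlet velocity

# Define a physical quantity and nondimensionalize it before it enters the PDE
inlet_velocity = 1.5                                  # m/s
inlet_velocity_nd = inlet_velocity / velocity_scale   # value used in training

# ... train or solve with the nondimensionalized quantities ...

# Scale a nondimensional result back to physical units for post-processing
u_nd = 0.37
u = u_nd * velocity_scale
print(f"u = {u:.3f} m/s")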

You now also have the ability to effectively handle different scales within a system with Selective Equations Term Suppression (SETS). This enables you to create different instances of the same PDE and freeze certain terms in a PDE. That way, the losses for the smaller scales are minimized, improving convergence on stiff PDEs in the PINNs. 

In addition, new Modulus APIs, configured in the Hydra configuration YAML file, enable the end user to terminate the training based on convergence criteria like total loss or individual loss terms or another metric that they can specify.

The new causal weighting scheme addresses the bias of continuous-time PINNs that violate physical causality for transient problems. By reformulating the losses for the residual and initial conditions, you can get better convergence and better accuracy of PINNs for dynamic systems.

Modulus training performance, scalability, and usability 

Each NVIDIA Modulus release focuses on improving training performance and scalability. With this latest release, FuncTorch was integrated into Modulus for faster gradient calculations in PINN training. Regular PyTorch Autograd uses reverse mode automatic differentiation and has to calculate Jacobian and Hessian terms row by row in a for loop. FuncTorch removes unnecessary weight gradient computations and can calculate Jacobian and Hessian more efficiently using a combination of reverse and forward mode automatic differentiation, thereby improving the training performance.
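As a rough illustration of that forward-over-reverse pattern, the following sketch uses the functorch-style transforms (now exposed as torch.func in recent PyTorch releases) to compute the derivatives a physics-informed loss typically needs. The toy field u is an assumption for the example.

import torch
from torch.func import jacrev, jacfwd

def u(x):
    # toy scalar field u(x, y) = sin(x) * cos(y)
    return torch.sin(x[0]) * torch.cos(x[1])

x = torch.tensor([0.3, 0.7])
grad_u = jacrev(u)(x)            # reverse mode: first derivatives
hess_u = jacfwd(jacrev(u))(x)    # forward-over-reverse: second derivatives
print(grad_u, hess_u, sep="\n")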

The Modulus v22.09 documentation improvements provide more context and detail about the key concepts of the framework’s workflow to help new users. 

Enhancements have been made to the Modulus Overview with more example-guided workflows for physics-only driven, purely data-driven, and both physics– and data-driven modeling approaches. Modulus users can now follow improved introductory examples to build step-by-step in line with each workflow’s key concepts. 

Get more details about all Modulus functionalities by visiting the Modulus User Guide and the Modulus Configuration page. You can also provide feedback and contributions through the Modulus GitLab repo.

Check out the NVIDIA Deep Learning Institute self-paced course, Introduction to Physics-Informed Machine Learning with Modulus. Join us for these GTC 2022 featured sessions to learn more about NVIDIA Modulus research and breakthroughs.


Achieving Supercomputing-Scale Quantum Circuit Simulation with the NVIDIA DGX cuQuantum Appliance

Quantum circuit simulation is critical for developing applications and algorithms for quantum computers. Because of the disruptive nature of known quantum computing algorithms and use cases, quantum algorithms researchers in government, enterprise, and academia are developing and benchmarking novel quantum algorithms on ever-larger quantum systems.

In the absence of large-scale, error-corrected quantum computers, the best way of developing these algorithms is through quantum circuit simulation. Quantum circuit simulations are computationally intensive, and GPUs are a natural tool for calculating quantum states. To simulate larger quantum systems, it is necessary to distribute the computation across multiple GPUs and multiple nodes to leverage the full computational power of a supercomputer.

NVIDIA cuQuantum is a software development kit (SDK) that enables users to easily accelerate and scale quantum circuit simulations with GPUs, facilitating a new capacity for exploration on the path to quantum advantage. 

This SDK includes the recently released NVIDIA DGX cuQuantum Appliance, a deployment-ready software container, with multi-GPU state vector simulation support. Generalized multi-GPU APIs are also now available in cuStateVec, for easy integration into any simulator. For tensor network simulation, the slicing API provided by the cuQuantum cuTensorNet library enables accelerated tensor network contractions distributed across multiple GPUs or multiple nodes. This has allowed users to take advantage of DGX A100 systems with nearly linear strong scaling. 

The NVIDIA cuQuantum SDK features libraries for state vector and tensor network methods.  This post focuses on cuStateVec for multi-node state vector simulation and the DGX cuQuantum Appliance. If you are interested in learning more about cuTensorNet and tensor network methods, see Scaling Quantum Circuit Simulation with NVIDIA cuTensorNet.

What is a multi-node, multi-GPU state vector simulation?

A node is a single package unit made up of tightly interconnected processors optimized to work together while maintaining a rack-ready form factor. Multi-node multi-GPU state vector simulations take advantage of multiple GPUs within a node and multiple nodes of GPUs to provide faster time to solution and larger problem sizes than would otherwise be possible. 

DGX enables users to take advantage of high memory, low latency, and high bandwidth. The DGX H100 system is made up of eight H100 Tensor Core GPUs, leveraging the fourth-generation NVLink and third-generation NVSwitch. This node is a powerhouse for quantum circuit simulation. 

Running on a DGX A100 node with the NVIDIA multi-GPU-enabled DGX cuQuantum Appliance on all eight GPUs resulted in 70x to 290x speedups over dual 64-core AMD EPYC 7742 processors for three common quantum computing algorithms: Quantum Fourier Transform, Shor’s Algorithm, and the Sycamore Supremacy circuit. This has enabled users to simulate up to 36 qubits with full state vector methods using a single DGX A100 node (eight GPUs). The results shown in Figure 1 are 4.4x better than the benchmarks we last announced for this capability, due to software-only enhancements our team has implemented.

Graph showing acceleration ratio of simulations for popular quantum circuits show a speed up of 70-294x with GPUs over data center- and HPC-grade CPUs. Performance of GPU simulations was measured on DGX A100 and compared to the performance of two sockets of the EPYC 7742 CPU.
Figure 1. DGX cuQuantum Appliance multi-GPU speedup over state-of-the-art dual socket CPU server

The NVIDIA cuStateVec team has intensively investigated a performant means of leveraging multiple nodes in addition to multiple GPUs within a single node. Because most gate applications are perfectly parallel operations, GPUs within and across nodes can be orchestrated to divide and conquer. 

During the simulation, the state vector is split and distributed among GPUs, and each GPU can apply a gate to its part of the state vector in parallel. In many cases this can be handled locally; however, gate applications to high-order qubits require communication among distributed state vectors. 

One typical approach is to first reorder qubits and then apply gates in each GPU without accessing other GPUs or nodes. This reordering itself needs data transfers between devices. To do this efficiently, high interconnect bandwidth becomes incredibly important. Efficiently taking advantage of this parallelizability is non-trivial across multiple nodes.
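To make the “perfectly parallel” point concrete, the following NumPy sketch (not cuStateVec code) applies a single-qubit gate to a state vector. A gate on qubit k only mixes pairs of amplitudes whose indices differ in bit k, so gates on low-order qubits can be applied independently to each locally stored slice, while gates on high-order qubits pair amplitudes that live on different devices and therefore require communication or qubit reordering.

import numpy as np

def apply_single_qubit_gate(state, gate, k):
    """Apply a 2x2 gate to qubit k of a state vector of length 2**n."""
    n = int(np.log2(state.size))
    psi = state.reshape([2] * n)      # one axis per qubit
    axis = n - 1 - k                  # qubit 0 is the least significant bit
    psi = np.moveaxis(psi, axis, 0)   # bring the target qubit's axis to the front
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    psi = np.moveaxis(psi, 0, axis)
    return psi.reshape(-1)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
state = np.zeros(2**3, dtype=complex)
state[0] = 1.0                        # start in |000>
state = apply_single_qubit_gate(state, H, 0)
print(np.round(state, 3))             # equal amplitude on |000> and |001>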

Introducing the multi-node DGX cuQuantum Appliance

A performant way to arbitrarily scale state vector-based quantum circuit simulation is here. NVIDIA is pleased to announce the multi-node, multi-GPU capability delivered in the new DGX cuQuantum Appliance. In our next release, any cuQuantum container user will be able to quickly and easily leverage an IBM Qiskit frontend to simulate quantum circuits on the largest NVIDIA systems in the world.

The cuQuantum mission is to enable as many users as possible to easily accelerate and scale quantum circuit simulation. To that end, the cuQuantum team is working to productize the NVIDIA multi-node approach into APIs, which will be in general availability early next year. With this approach, you will be able to leverage a wider range of NVIDIA GPU-based systems to scale your state vector quantum circuit simulations. 

The NVIDIA multi-node DGX cuQuantum Appliance is in its final stages of development, and you will soon be able to take advantage of the best-in-class performance available with NVIDIA DGX SuperPOD systems. This will be offered as an NGC-hosted container image that you can quickly deploy with the help of Docker and a few lines of code.
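For orientation, this is roughly what a Qiskit state vector simulation looks like when targeting a GPU-backed Aer simulator. It is a generic qiskit-aer sketch with an arbitrary example circuit, not the DGX cuQuantum Appliance integration itself, which may expose a different backend configuration.

from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

qc = QuantumCircuit(30)
qc.h(0)
for q in range(29):
    qc.cx(q, q + 1)                 # 30-qubit GHZ state
qc.measure_all()

sim = AerSimulator(method="statevector", device="GPU")   # GPU-backed state vector simulator
result = sim.run(transpile(qc, sim), shots=1000).result()
print(result.get_counts())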

With the fastest I/O architecture of any DGX system, NVIDIA DGX H100 is the foundational building block for large AI clusters like NVIDIA DGX SuperPOD, the enterprise blueprint for scalable AI, and now, quantum circuit simulation infrastructure. The eight NVIDIA H100 GPUs in the DGX H100 use the new high-performance fourth-generation NVLink technology to interconnect through four third-generation NVSwitches. 

The fourth-generation NVLink technology delivers 1.5x the communications bandwidth of the prior generation and is up to 7x faster than PCIe Gen5. It delivers up to 7.2 TB/sec of total GPU-to-GPU throughput, an improvement of almost 1.5x compared to the prior generation DGX A100. 

Along with the eight included NVIDIA ConnectX-7 InfiniBand / Ethernet adapters, each running at 400 GB/sec, the DGX H100 system provides a powerful high-speed fabric saving overhead in global communications among state vectors distributed across multiple nodes. The combination of multi-node, multi-GPU cuQuantum with massive GPU-accelerated compute leveraging state-of-the-art networking hardware and software optimizations means that DGX H100 systems can scale to hundreds or thousands of nodes to meet the biggest challenges, such as scaling full state vector quantum circuit simulation past 50 qubits. 

To benchmark this work, the multi-node DGX cuQuantum Appliance is run on the NVIDIA Selene Supercomputer, the reference architecture for NVIDIA DGX SuperPOD systems. As of June 2022, Selene is ranked eighth on the TOP500 list of supercomputing systems executing the High Performance Linpack (HPL) benchmark with 63.5 petaflops, and number 22 on the Green500 list with 24.0 gigaflops per watt.  

NVIDIA ran benchmarks leveraging the multi-node DGX cuQuantum Appliance: Quantum Volume, the Quantum Approximate Optimization Algorithm (QAOA), and Quantum Phase Estimation. The Quantum Volume circuit ran with a depth of 10 and 30. QAOA is a common algorithm used to solve combinatorial optimization problems on relatively near-term quantum computers. We ran it with two parameters.

Both weak and strong scaling are demonstrated in the preceding algorithms. It is clear that scaling to a supercomputer like the NVIDIA DGX SuperPOD is valuable for both accelerating time-to-solution and extending the phase space researchers can explore with state vector quantum circuit simulation techniques. 

Chart showing the scaling state vector based quantum circuit simulations from 32 to 40 qubits, of Quantum Volume with a depth of 30, Quantum Approximate Optimization Algorithm and Quantum Phase Estimation. All runs were conducted on multiple GPUs, going up to 256 total NVIDIA A100 GPUs on NVIDIA’s Selene supercomputer, made easy by the DGX cuQuantum Appliance multi-node capability.
Figure 2. DGX cuQuantum Appliance multi-node weak scaling performance from 32 to 40 qubits 

We are further enabling users to achieve scale with our updated DGX cuQuantum Appliance. By introducing multi-node capabilities, we are enabling users to move beyond 32 qubits on one GPU, and 36 qubits on one NVIDIA Ampere Architecture node. We simulate a total of 40 qubits with 32 DGX A100 nodes. Users will now be able to scale out even further depending upon system configurations, with a software limit of 56 qubits or millions of DGX A100 nodes. Our other preliminary tests on NVIDIA Hopper GPUs have shown us that these numbers will be even better on our next-generation architecture. 

We also measured the strong scaling of our multi-node capabilities. We focused on Quantum Volume for simplicity. Figure 3 describes the performance when we solved the same problem multiple times while changing the number of GPUs. Compared to a state-of-the-art dual-socket CPU server, we obtained 320x to 340x speedups when leveraging 16 DGX A100 nodes. This is also 3.5x faster than the previous state-of-the-art implementation of quantum volume (depth=10 for 36 qubits with just two DGX A100 nodes). When adding more nodes, this speedup becomes more dramatic.

Graph showing accelerating 32 qubit implementations of Quantum Volume with a depth of 10 and a depth of 30, with GPUs in NVIDIA’s Selene, we show that you can easily take advantage of GPUs and scale to speed up quantum circuit simulations with cuStateVec when compared to CPUs. We leveraged up to 16 nodes of NVIDIA DGX A100s, which totaled 128 NVIDIA A100 GPUs.
Figure 3. DGX cuQuantum Appliance multi-node speedup for 32 qubit Quantum Volume compared to a state-of-the-art CPU server

Simulate and scale quantum circuits on the largest NVIDIA systems

The cuQuantum team at NVIDIA is scaling up state vector simulation to multi-node, multi-GPU. This enables end users to conduct quantum circuit simulation for full state vectors larger than ever before. Not only has cuQuantum enabled scale, but also performance, showing weak scaling and strong scaling across nodes. 

In addition, cuQuantum introduced the first cuQuantum-powered IBM Qiskit image. In our next release, you will be able to pull this container, making it easier and faster to scale up quantum circuit simulations with this popular framework.

While the multi-node DGX cuQuantum Appliance is in private beta today, NVIDIA expects to release it publicly over the coming months. The cuQuantum team intends to release multi-node APIs within the cuStateVec library by spring 2023.

Getting started with DGX cuQuantum Appliance

When the multi-node DGX cuQuantum Appliance is in general availability later this year, you will be able to pull the Docker image from the NGC catalog for containers.

You can reach out to the cuQuantum team with questions through the Quantum Computing Forum. Contact us on the NVIDIA/cuQuantum GitHub repo with feature requests or to report a bug. 



View Synthesis with Transformers

A long-standing problem in the intersection of computer vision and computer graphics, view synthesis is the task of creating new views of a scene from multiple pictures of that scene. This has received increased attention [1, 2, 3] since the introduction of neural radiance fields (NeRF). The problem is challenging because to accurately synthesize new views of a scene, a model needs to capture many types of information — its detailed 3D structure, materials, and illumination — from a small set of reference images.

In this post, we present recently published deep learning models for view synthesis. In “Light Field Neural Rendering” (LFNR), presented at CVPR 2022, we address the challenge of accurately reproducing view-dependent effects by using transformers that learn to combine reference pixel colors. Then in “Generalizable Patch-Based Neural Rendering” (GPNR), to be presented at ECCV 2022, we address the challenge of generalizing to unseen scenes by using a sequence of transformers with canonicalized positional encoding that can be trained on a set of scenes to synthesize views of new scenes. These models have some unique features. They perform image-based rendering, combining colors and features from the reference images to render novel views. They are purely transformer-based, operating on sets of image patches, and they leverage a 4D light field representation for positional encoding, which helps to model view-dependent effects.

We train deep learning models that are able to produce new views of a scene given a few images of it. These models are particularly effective when handling view-dependent effects like the refractions and translucency on the test tubes. This animation is compressed; see the original-quality renderings here. Source: Lab scene from the NeX/Shiny dataset.

Overview
The input to the models consists of a set of reference images and their camera parameters (focal length, position, and orientation in space), along with the coordinates of the target ray whose color we want to determine. To produce a new image, we start from the camera parameters of the input images, obtain the coordinates of the target rays (each corresponding to a pixel), and query the model for each.

Instead of processing each reference image completely, we look only at the regions that are likely to influence the target pixel. These regions are determined via epipolar geometry, which maps each target pixel to a line on each reference frame. For robustness, we take small regions around a number of points on the epipolar line, resulting in the set of patches that will actually be processed by the model. The transformers then act on this set of patches to obtain the color of the target pixel.

Transformers are especially useful in this setting since their self-attention mechanism naturally takes sets as inputs, and the attention weights themselves can be used to combine reference view colors and features to predict the output pixel colors. These transformers follow the architecture introduced in ViT.

To predict the color of one pixel, the models take a set of patches extracted around the epipolar line of each reference view. Image source: LLFF dataset.

Light Field Neural Rendering
In Light Field Neural Rendering (LFNR), we use a sequence of two transformers to map the set of patches to the target pixel color. The first transformer aggregates information along each epipolar line, and the second along each reference image. We can interpret the first transformer as finding potential correspondences of the target pixel on each reference frame, and the second as reasoning about occlusion and view-dependent effects, which are common challenges of image-based rendering.

LFNR uses a sequence of two transformers to map a set of patches extracted along epipolar lines to the target pixel color.
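The following is a highly simplified PyTorch sketch of this two-stage aggregation, not the released LFNR implementation; the tensor shapes, feature sizes, and the mean-pooling used to collapse each stage are illustrative assumptions.

import torch
import torch.nn as nn

class TwoStageAggregator(nn.Module):
    def __init__(self, dim=128, heads=4, depth=2):
        super().__init__()
        make_encoder = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True),
            num_layers=depth)
        self.epipolar_tf = make_encoder()  # stage 1: attends over points on one epipolar line
        self.view_tf = make_encoder()      # stage 2: attends over the reference views
        self.to_rgb = nn.Linear(dim, 3)

    def forward(self, patch_feats):
        # patch_feats: (batch, n_views, n_points_per_line, dim) patch embeddings
        b, v, p, d = patch_feats.shape
        x = self.epipolar_tf(patch_feats.reshape(b * v, p, d))
        per_view = x.mean(dim=1).reshape(b, v, d)   # collapse each epipolar line
        y = self.view_tf(per_view)
        return self.to_rgb(y.mean(dim=1))           # predicted target pixel color

feats = torch.randn(8, 5, 16, 128)   # 8 rays, 5 reference views, 16 points per line
print(TwoStageAggregator()(feats).shape)  # torch.Size([8, 3])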

LFNR improved the state-of-the-art on the most popular view synthesis benchmarks (Blender and Real Forward-Facing scenes from NeRF and Shiny from NeX) with margins as large as 5dB peak signal-to-noise ratio (PSNR). This corresponds to a reduction of the pixel-wise error by a factor of 1.8x. We show qualitative results on challenging scenes from the Shiny dataset below:

LFNR reproduces challenging view-dependent effects like the rainbow and reflections on the CD, reflections, refractions and translucency on the bottles. This animation is compressed; see the original quality renderings here. Source: CD scene from the NeX/Shiny dataset.
Prior methods such as NeX and NeRF fail to reproduce view-dependent effects like the translucency and refractions in the test tubes on the Lab scene from the NeX/Shiny dataset. See also our video of this scene at the top of the post and the original quality outputs here.

Generalizing to New Scenes
One limitation of LFNR is that the first transformer collapses the information along each epipolar line independently for each reference image. This means that it decides which information to preserve based only on the output ray coordinates and patches from each reference image, which works well when training on a single scene (as most neural rendering methods do), but it does not generalize across scenes. Generalizable methods are important because they can be applied to new scenes without needing to retrain.

We overcome this limitation of LFNR in Generalizable Patch-Based Neural Rendering (GPNR). We add a transformer that runs before the other two and exchanges information between points at the same depth over all reference images. For example, this first transformer looks at the columns of the patches from the park bench shown above and can use cues like the flower that appears at corresponding depths in two views, which indicates a potential match. Another key idea of this work is to canonicalize the positional encoding based on the target ray, because to generalize across scenes, it is necessary to represent quantities in relative and not absolute frames of reference. The animation below shows an overview of the model.

GPNR consists of a sequence of three transformers that map a set of patches extracted along epipolar lines to a pixel color. Image patches are mapped via the linear projection layer to initial features (shown as blue and green boxes). Then those features are successively refined and aggregated by the model, resulting in the final feature/color represented by the gray rectangle. Park bench image source: LLFF dataset.

To evaluate the generalization performance, we train GPNR on a set of scenes and test it on new scenes. GPNR improved the state-of-the-art on several benchmarks (following IBRNet and MVSNeRF protocols) by 0.5–1.0 dB on average. On the IBRNet benchmark, GPNR outperforms the baselines while using only 11% of the training scenes. The results below show new views of unseen scenes rendered with no fine-tuning.

GPNR-generated views of held-out scenes, without any fine tuning. This animation is compressed; see the original quality renderings here. Source: IBRNet collected dataset.
Details of GPNR-generated views on held-out scenes from NeX/Shiny (left) and LLFF (right), without any fine tuning. GPNR reproduces more accurately the details on the leaf and the refractions through the lens when compared against IBRNet.

Future Work
One limitation of most neural rendering methods, including ours, is that they require camera poses for each input image. Poses are not easy to obtain and typically come from offline optimization methods that can be slow, limiting possible applications, such as those on mobile devices. Research on jointly learning view synthesis and input poses is a promising future direction. Another limitation of our models is that they are computationally expensive to train. There is an active line of research on faster transformers which might help improve our models’ efficiency. For the papers, more results, and open-source code, you can check out the projects pages for “Light Field Neural Rendering” and “Generalizable Patch-Based Neural Rendering“.

Potential Misuse
In our research, we aim to accurately reproduce an existing scene using images from that scene, so there is little room to generate fake or non-existing scenes. Our models assume static scenes, so synthesizing moving objects, such as people, will not work.

Acknowledgments
All the hard work was done by our amazing intern – Mohammed Suhail – a PhD student at UBC, in collaboration with Carlos Esteves and Ameesh Makadia from Google Research, and Leonid Sigal from UBC. We are thankful to Corinna Cortes for supporting and encouraging this project.

Our work is inspired by NeRF, which sparked the recent interest in view synthesis, and IBRNet, which first considered generalization to new scenes. Our light ray positional encoding is inspired by the seminal paper Light Field Rendering and our use of transformers follow ViT.

Video results are from scenes from LLFF, Shiny, and IBRNet collected datasets.


HARMAN to Deliver Immersive In-Vehicle Experience With NVIDIA DRIVE IX

Breakthroughs in centralized, high performance computing aren’t just opening up new functionality for autonomous driving, but for the in-vehicle experience as well. With the introduction of NVIDIA DRIVE Thor, automakers can build unified AI compute platforms that combine advanced driver-assistance systems and in-vehicle infotainment. The centralized NVIDIA DRIVE architecture supports novel features in the vehicle, …



Solving AI Inference Challenges with NVIDIA Triton

Deploying AI models in production to meet the performance and scalability requirements of the AI-driven application while keeping the infrastructure costs low is a daunting task.


This post provides you with a high-level overview of AI inference challenges that commonly occur when deploying models in production, along with how NVIDIA Triton Inference Server is being used today across industries to solve these cases.

We also examine some of the recently added features, tools, and services in Triton that simplify the deployment of AI models in production, with peak performance and cost efficiency.

Challenges to consider when deploying AI inference

AI inference is the production phase of running AI models to make predictions. Inference is complex but understanding the factors that affect your application’s speed and performance will help you deliver fast, scalable AI in production.

Challenges for developers and ML engineers

  • Many types of models: AI, machine learning, and deep learning (neural network–based) models with different architectures and different sizes.
  • Different inference query types: Real-time, offline batch, streaming video and audio, and model pipelines make meeting application service level agreements challenging.
  • Constantly evolving models: Models in production must be updated continuously based on new data and algorithms, without business disruptions.

Challenges for MLOps, IT, and DevOps practitioners

  • Multiple model frameworks: There are different training and inference frameworks like TensorFlow, PyTorch, XGBoost, TensorRT, ONNX, or just plain Python. Deploying and maintaining each of these frameworks in production for applications can be costly.
  • Diverse processors: The models can be executed on a CPU or GPU. Having a separate software stack for each processor platform leads to unnecessary operational complexity.
  • Diverse deployment platforms: Models are deployed on public clouds, on-premises data centers, at the edge, and on embedded devices on bare metal, virtualized, or a third-party ML platform. Disparate or less-than-optimal solutions for a given platform lead to poor ROI, which might include slower rollouts, poor app performance, or higher resource usage.

A combination of these factors makes it challenging to deploy AI inference in production with the desired performance and cost efficiency.

New AI inference use cases using NVIDIA Triton

NVIDIA Triton Inference Server (Triton) is open-source inference serving software that supports all major model frameworks (TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, OpenVINO, Python, and others). Triton can be used to run models on x86 and Arm CPUs, NVIDIA GPUs, and AWS Inferentia. It addresses the complexities discussed earlier through standard features.

Triton is used by thousands of organizations across industries worldwide. Here’s how Triton helps solve AI inference challenges for some customers.

NIO Autonomous Driving

NIO uses Triton to run their online services models in the cloud and data center. These models process data from autonomous driving vehicles. NIO used the Triton model ensemble feature to move their pre- and post-processing functions from the client application to Triton Inference Server. The preprocessing was accelerated by 5x, increasing their overall inference throughput and enabling them to cost-efficiently process more data from the vehicles.

GE Healthcare

GE Healthcare uses Triton in their Edison platform to standardize inference serving across different frameworks (TensorFlow, PyTorch, ONNX, and TensorRT) for in-house models. The models are deployed on a variety of hardware systems from embedded (for example, an x-ray system) to on-premises servers.

Wealthsimple

The online investment management firm uses Triton on CPUs to run their fraud detection and other fintech models. Triton helped them consolidate their different serving software across applications into a single standard for multiple frameworks.

Tencent

Tencent uses Triton in their centralized ML platform for unified inference for several business applications. In aggregate, Triton helps them process 1.5M queries per day. Tencent achieved low cost of inference through Triton dynamic batching and concurrent model execution capabilities.

Alibaba Intelligent Connectivity

Alibaba Intelligent Connectivity is developing AI systems for their smart speaker applications. They use Triton in the data center to run models that generate streaming text-to-speech for the smart speaker. Triton delivered the lowest first-packet latency needed for a good audio experience.

Yahoo Japan

Yahoo Japan uses Triton on CPUs in the data center to run models to find similar locations for the “spot search” functionality in the Yahoo Browser app. Triton is used to run the complete image search pipeline and is also integrated into their centralized ML platform to support multiple frameworks on CPUs and GPUs.             

Airtel

The second largest wireless provider in India, Airtel uses Triton for automatic speech recognition (ASR) models in their contact center application to improve customer experience. Triton helped them upgrade to a more accurate ASR model and still get a 2x throughput increase on GPUs compared to the previous serving solution.

How Triton Inference Server addresses AI inference challenges

From fintech to autonomous driving, all applications can benefit from out-of-the-box functionality to deploy models into production easily.

This section discusses a few key new features, tools, and services that Triton provides out of the box that can be applied to deploy, run, and scale models in production.

Model orchestration with new management service

Triton brings a new model orchestration service for efficient multi-model inference. This software application, currently in early access, helps simplify the deployment of Triton instances in Kubernetes with many models in a resource-efficient way. Some of the key features of this service include the following:

  • Loading models on demand and unloading models when not in use.
  • Efficiently allocating GPU resources by placing multiple models on a single GPU server wherever possible
  • Managing custom resource requirements for individual models and model groups

For a short demo of this service, see Take Your AI Inference to the Next Level. The model orchestration feature is in private early access (EA). If you are interested in trying it out, sign up now.

Large language model inference

In the area of natural language processing (NLP), the size of models is growing exponentially (Figure 1). Large transformer-based models with hundreds of billions of parameters can solve many NLP tasks such as text summarization, code generation, translation, or PR headline and ad generation.

Chart shows different NLP models to show that sizes have been growing exponentially over the years.
Figure 1. Growing size of NLP models

But these models are so large that they cannot fit in a single GPU. For example, Turing-NLG with 17.2B parameters needs at least 34 GB of memory to store weights and biases in FP16 and GPT-3 with 175B parameters needs at least 350 GB. To use them for inference, you need multi-GPU and increasingly multi-node execution for serving the model.
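The arithmetic behind those figures is simple: FP16 stores each parameter in 2 bytes, so weight memory is roughly twice the parameter count in bytes, with activations, KV caches, and workspace adding more on top. A quick sanity check:

def fp16_weight_gb(num_params: float) -> float:
    # 2 bytes per FP16 parameter, reported in GB (1e9 bytes)
    return num_params * 2 / 1e9

print(fp16_weight_gb(17.2e9))   # ~34.4 GB for Turing-NLG
print(fp16_weight_gb(175e9))    # ~350 GB for GPT-3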

Triton Inference Server has a backend called FasterTransformer that brings multi-GPU multi-node inference for large transformer models like GPT, T5, and others. The large language model is converted to FasterTransformer format with optimizations and distributed inference capabilities and is then run using Triton Inference Server across GPUs and nodes.

Figure 2 shows the speedup observed with Triton to run the GPT-J (6B) model on a CPU or one and two A100 GPUs.

Chart showing speedup achieved with running LLM on GPUs compared to CPU with FasterTransformer backend delivering the best performance.
Figure 2. Model speedup with the FasterTransformer backend

For more information about large language model inference with the Triton FasterTransformer backend, see Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server and Deploying GPT-J and T5 with NVIDIA Triton Inference Server.

Inference of tree-based models

Triton can be used to deploy and run tree-based models from frameworks such as XGBoost, LightGBM, and scikit-learn RandomForest on both CPUs and GPUs with explainability using SHAP values. It accomplishes that using the Forest Inference Library (FIL) backend that was introduced last year.

The advantage of using Triton for tree-based model inference is better performance and standardization of inference across machine learning and deep learning models. It is especially useful for real-time applications such as fraud detection, where bigger models can be used easily for better accuracy.

For more information about deploying a tree-based model with Triton, see Real-time Serving for XGBoost, Scikit-Learn RandomForest, LightGBM, and More. The post includes a fraud detection notebook.

Try this NVIDIA Launchpad lab to deploy an XGBoost fraud detection model with Triton.

Optimal model configuration with Model analyzer

Efficient inference serving requires choosing optimal values for parameters such as batch size, model concurrency, or precision for a given target processor. These values dictate throughput, latency, and memory requirements. It can take weeks to try hundreds of combinations manually across a range of values for each parameter.

The Triton model analyzer tool reduces the time that it takes to find the optimal configuration parameters, from weeks to days or even hours. It does this by running hundreds of simulations of inference with different values of batch size and model concurrency for a given target processor offline. At the end, it provides charts like Figure 3 that make it easy to choose the optimal deployment configuration. For more information about the model analyzer tool and how to use it for your inference deployment, see Identifying the Best AI Model Serving Configurations at Scale with NVIDIA Triton Model Analyzer.

Screenshot shows the sample output from the Triton model analyzer tool, showing the best configurations and memory usage for a BERT large model.
Figure 3. Output chart from model analyzer tool

Model pipelines with business logic scripting

Two diagrams depicting model execution pipelines with the business logic scripting feature. Left is conditional model execution and right is autoregressive modeling.
Figure 4. Model ensembles with business logic scripting

With the model ensemble feature in Triton, you can build out complex model pipelines and ensembles with multiple models and pre– and post-processing steps. Business logic scripting enables you to add conditionals, loops, and reordering of steps in the pipeline.

Using the Python or C++ backends, you can define a custom script that can call any other model being served by Triton based on conditions that you choose. Triton efficiently passes data to the newly called model, avoiding unnecessary memory copying whenever possible. Then the result is passed back to your custom script, from which you can continue further processing or return the result.
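The following condensed sketch shows what such a script can look like in the Python backend, based on the documented triton_python_backend_utils business logic scripting interface; the model, tensor, and output names are placeholders.

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "IMAGE")

            # Call a first model served by Triton from inside this script
            detect = pb_utils.InferenceRequest(
                model_name="detector",
                requested_output_names=["NUM_OBJECTS"],
                inputs=[image]).exec()
            num_objects = pb_utils.get_output_tensor_by_name(detect, "NUM_OBJECTS")

            # Conditional execution: only run the second model when needed
            if num_objects.as_numpy().sum() > 0:
                classify = pb_utils.InferenceRequest(
                    model_name="classifier",
                    requested_output_names=["LABELS"],
                    inputs=[image]).exec()
                out = pb_utils.get_output_tensor_by_name(classify, "LABELS")
            else:
                out = num_objects

            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses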

Figure 4 shows two examples of business logic scripting:

  • Conditional execution helps you use resources more efficiently by avoiding the execution of unnecessary models.
  • Autoregressive models, like transformer decoding, require the output of a model to be repeatedly fed back into itself until a certain condition is reached. Loops in business logic scripting enable you to accomplish that.

For more information, see Business Logic Scripting.

Auto-generation of model configuration

Triton can automatically generate config files for your models for faster deployment. For TensorRT, TensorFlow, and ONNX models, Triton generates the minimum required config settings to run your model by default when it does not detect a config file in the repository.

Triton can also detect if your model supports batched inference. It sets max_batch_size to a configurable default value.

You can also include commands in your own custom Python and C++ backends to generate model config files automatically based on the script contents. These features are especially useful when you have many models to serve, as it avoids the step of manually creating the config files. For more information, see Auto-generated Model Configuration.

Decoupled input processing

Diagram shows a client application sending a single large request to Triton Inference Server and receiving several smaller responses in return.
Figure 5. A one-request to many-response scenario enabled by decoupled input processing

While many inference settings require a one-to-one correspondence between inference requests and responses, this isn’t always the optimal data flow.

For instance, with ASR models, sending the full audio and waiting for the model to finish execution may not result in a good user experience. The wait can be long. Instead, Triton can send back the transcribed text in multiple short chunks (Figure 5), reducing the latency and time to the first response.

With decoupled model processing in the C++ or Python backend, you can send back multiple responses for a single request. Of course, you could also do the opposite: send multiple small requests in chunks and get back one big response. This feature provides flexibility in how you process and send your inference responses. For more information, see Decoupled Models.
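In the Python backend, decoupled mode is exposed through a response sender on each request. The sketch below follows that documented pattern; the transcribe helper and the tensor names are hypothetical.

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()

            # Stream partial transcripts back as soon as each chunk is ready
            for chunk in np.array_split(audio, 4):
                text = self.transcribe(chunk)   # hypothetical ASR helper
                out = pb_utils.Tensor("TEXT", np.array([text], dtype=object))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            # Signal that no more responses will follow for this request
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        return None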

For more information about recently added features, see the NVIDIA Triton release notes.

Get started with scalable AI model deployment

You can deploy, run, and scale AI models with Triton to effectively mitigate AI inference challenges that you may have with multiple frameworks, a diverse infrastructure, large language models, optimal model configurations, and more.

Triton Inference Server is open-source and supports all major model frameworks such as TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, OpenVINO, Python, and even custom frameworks on GPU and CPU systems. Explore more ways to integrate Triton with any application, deployment tool, and platform, on the cloud, on-premises, and at the edge.



Powering Ultra-High Speed Frame Rates in AI Medical Devices with the NVIDIA Clara Holoscan SDK

In the operating room, the latency and reliability of surgical video streams can make all the difference for patient outcomes. Ultra-high-speed frame rates from sensor inputs that enable next-generation AI applications can provide surgeons with new levels of real-time awareness and control.

To build real-time AI capabilities into medical devices for use cases like surgical navigation, image-guided intervention such as endoscopy, and medical robotics, developers need AI pipelines that allow for low-latency processing of combined sensor data from multiple channels.

As announced at GTC 2022, NVIDIA Clara Holoscan SDK v0.3 now provides a lightning-fast frame rate of 240 Hz for 4K video. This enables developers to combine data from more sensors and build AI applications that can provide surgical guidance. With faster data transfer enabled through high-speed Ethernet-connected sensors, developers have even more tools to build accelerated AI pipelines.

Real-time AI processing of frontend sensors

NVIDIA Clara Holoscan enables high-speed sensor input through the ConnectX SmartNIC and NVIDIA Rivermax SDK with GPUDirect RDMA that bypasses the CPU. This allows for high-speed Ethernet output of data from sensors into the AI compute system. The result is unmatched performance for edge AI.

While traditional GStreamer and OpenGL-based endoscopy pipelines have an end-to-end latency of 220 ms on a 1080p 60 Hz stream, high-speed pipelines with Clara Holoscan boast an end-to-end latency of only 10 ms on a 4K 240 Hz stream.

When streaming data at 4K 60 Hz with under 50 ms of latency on an NVIDIA RTX A6000, teams can run 15 concurrent AI video streams and 30 concurrent models.

NVIDIA Rivermax SDK

The NVIDIA Rivermax SDK, included with NVIDIA Clara Holoscan, enables direct data transfers to and from the GPU. Bypassing host memory and using the offload capabilities of the ConnectX SmartNIC, it delivers best-in-class throughput and latency with minimal utilization for streaming workloads. NVIDIA Clara Holoscan leverages the Rivermax functionalities to bring scalable connectivity for high-bandwidth network sensors and support very fast data transfer.

NVIDIA G-SYNC

NVIDIA G-SYNC enables high display performance by synchronizing display refresh rates to the GPU, thus eliminating screen tearing and minimizing display stutter and input lag. As a result, the AI inference can be shown with very low latency.

NVIDIA Clara HoloViz

Clara HoloViz is a module in Holoscan for visualizing data. Clara HoloViz composites real-time streams of frames with multiple other layers, such as segmentation mask layers, geometry layers, and GUI layers.

For maximum performance, Clara HoloViz makes use of Vulkan, which is already installed as part of the NVIDIA driver.

Clara HoloViz uses the concept of the immediate mode design pattern for its API. No objects are created and stored by the application. This makes it easy to quickly build and change the visualization in a Holoscan application.

Improved developer experience

The NVIDIA Clara Holoscan SDK v0.3 release brings significant improvements to the development experience. First, the addition of a new C++ API for the creation of GXF extensions gives developers an additional pathway to building their desired applications. Second, the support for x86 processors allows developers to quickly get started with developing AI applications which can then be easily deployed on IGX development kits. Third, the Bring Your Own Model (BYOM) support has been enriched in this latest version.

Holoscan C++ API

The Holoscan C++ API provides a new convenient way to compose GXF workflows, without the need to write YAML files. The Holoscan C++ API enables a more flexible and scalable approach to creating applications. It has been designed for use as a drop-in replacement for the GXF Framework’s API and provides a common interface for GXF components.

Diagram showing the main components of the Holoscan API.
Figure 1. The main components of the Holoscan API

Application: An application acquires and processes streaming data. An application is a collection of fragments where each fragment can be allocated to execute on a physical node of a Holoscan cluster.

Fragment: A fragment is a building block of the application. It is a directed acyclic graph (DAG) of operators. A fragment can be assigned to a physical node of a Holoscan cluster during execution. The run-time execution manages communication across fragments. In a fragment, operators (graph nodes) are connected to each other by flows (graph edges).

Operator: An operator is the most basic unit of work in this framework. An operator receives streaming data at an input port, processes it, and publishes it to one of its output ports. A codelet in GXF would be replaced with an operator in the framework. An operator encapsulates receivers and transmitters of a GXF entity as I/O ports of the operator.

Resource: A resource is something an operator needs to perform its job, such as system memory or a GPU memory pool. Resources are allocated during the initialization phase of the application. A resource matches the semantics of the GXF Memory Allocator or any other component derived from the GXF component class.

Condition: A condition is a predicate that can be evaluated at runtime to determine if an operator should execute. This matches the semantics of the GXF Scheduling Term class.

Port: An interaction point between two operators. Operators ingest data at input ports and publish data at output ports. Receiver, Transmitter, and MessageRouter in GXF are replaced by the concept of I/O ports on the operator.

Executor: An executor manages the execution of a fragment on a physical node. The framework provides a default executor that uses a GXF scheduler to execute an application.

You can find more information about the new C++ API in the SDK documentation. See an example of a full AI application for endoscopy tool tracking using the new C++ API in the public source code repository.
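To make these concepts concrete, here is a minimal sketch of an application composed with the C++ API. The two operators are written purely for illustration, and the header, macro, and method names are assumptions based on public releases of the SDK, so they may differ slightly in v0.3.

```cpp
#include <holoscan/holoscan.hpp>

// Illustrative source operator with one output port that emits an integer.
class TxOp : public holoscan::Operator {
 public:
  HOLOSCAN_OPERATOR_FORWARD_ARGS(TxOp)
  TxOp() = default;

  void setup(holoscan::OperatorSpec& spec) override { spec.output<int>("out"); }

  void compute(holoscan::InputContext&, holoscan::OutputContext& op_output,
               holoscan::ExecutionContext&) override {
    int value = count_++;
    op_output.emit(value, "out");
  }

 private:
  int count_ = 0;
};

// Illustrative sink operator with one input port that consumes the value.
class RxOp : public holoscan::Operator {
 public:
  HOLOSCAN_OPERATOR_FORWARD_ARGS(RxOp)
  RxOp() = default;

  void setup(holoscan::OperatorSpec& spec) override { spec.input<int>("in"); }

  void compute(holoscan::InputContext& op_input, holoscan::OutputContext&,
               holoscan::ExecutionContext&) override {
    // The exact return type of receive() varies by SDK version.
    auto value = op_input.receive<int>("in");
    (void)value;
  }
};

// The application's compose() builds the fragment's DAG of operators.
class PingApp : public holoscan::Application {
 public:
  void compose() override {
    using namespace holoscan;
    // A CountCondition stops the source operator after 10 executions.
    auto tx = make_operator<TxOp>("tx", make_condition<CountCondition>(10));
    auto rx = make_operator<RxOp>("rx");
    add_flow(tx, rx);  // connects tx's output port to rx's input port
  }
};

int main() {
  auto app = holoscan::make_application<PingApp>();
  app->run();  // the default executor schedules the fragment with GXF
  return 0;
}
```

In this sketch, the operators play the role of GXF codelets, the ports stand in for GXF receivers and transmitters, and add_flow() creates the graph edges of the fragment, matching the component definitions above.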

Support for x86 systems

The NVIDIA Clara Holoscan SDK has been designed with a variety of hardware systems in mind. It can be used on x86 systems in addition to the NVIDIA IGX DevKit and the Clara AGX DevKit. With x86 support, researchers and developers who do not have a DevKit can use the Holoscan SDK on their own x86 machines to quickly build AI applications for medical devices.

Bring Your Own Model

The Holoscan SDK provides AI libraries and pretrained AI models to jump-start the development of your own AI applications. It also includes reference applications for endoscopy and ultrasound with Bring Your Own Model (BYOM) support.

As a developer, you can quickly build AI pipelines by dropping your own models into the reference applications provided as part of the SDK. Finally, the SDK also includes sensor I/O integration options and performance tools that optimize AI applications for production deployment.

Software stack updates

The NVIDIA Clara Holoscan SDK v0.3 release also integrates an upgrade from the JetPack-based NVIDIA HoloPack 1.0 (HP1) to HoloPack 1.1, running Tegra Board Support Package (BSP) version 34.1.3, as well as an upgrade of GXF from version 2.4.2 to version 2.4.3.

Get started building AI for medical devices

From training AI models to verifying and validating AI applications and ultimately deploying them for commercial production, Clara Holoscan helps streamline AI development and deployment.

Visit the Clara Holoscan SDK web page to access healthcare-specific acceleration libraries, pretrained AI models, sample applications, documentation, and more to get started building software-defined medical devices.

You can also request a free hands-on lab with NVIDIA LaunchPad to experience how Clara Holoscan simplifies the development of AI pipelines for endoscopy and ultrasound.

Categories
Misc

Building Cloud-Native, AI-Powered Avatars with NVIDIA Omniverse ACE

Explore the AI technology powering Violet, the interactive avatar showcased this week in the NVIDIA GTC 2022 keynote. Learn new details about NVIDIA Omniverse…

Explore the AI technology powering Violet, the interactive avatar showcased this week in the NVIDIA GTC 2022 keynote. Learn new details about NVIDIA Omniverse Avatar Cloud Engine (ACE), a collection of cloud-native AI microservices for faster, easier deployment of interactive avatars, and NVIDIA Tokkio, a domain-specific AI reference application that leverages Omniverse ACE for creating fully autonomous interactive customer service avatars.

AI-powered avatars in the cloud

Digital assistants and avatars can take many different forms and shapes, from common text-driven chatbots to fully animated digital humans and physical robots that can see and hear people. These avatars will populate virtual worlds to help people create and build things, serve as brand ambassadors and customer service agents, help you find something on a website, take your order at a drive-through, or recommend a retirement or insurance plan.

A real-time interactive 3D avatar can deliver a natural, engaging experience that makes people feel more comfortable. AI-based virtual assistants can also use non-verbal cues, such as your facial expressions and eye contact, to enhance communication and understanding of your requests and intent.

Photo shows a woman and child interacting with an avatar at a restaurant ordering kiosk.
Figure 1. NVIDIA Omniverse ACE powers AI avatars like Violet, from the NVIDIA Tokkio reference application showcased at GTC

But building these avatar applications at scale requires a broad range of expertise, including computer graphics, AI, and DevOps. Most current methods for animating avatars leverage traditional motion capture solutions, which are challenging to use for real-time applications.

Cutting-edge NVIDIA AI technologies, such as Omniverse Audio2Face, NVIDIA Riva, and NVIDIA Metropolis, change the game by enabling avatar motion to be driven by audio and video. Connecting character animation directly to an avatar’s conversational intelligence enables faster, easier engineering and deployment of interactive avatars at scale.

When an avatar is created, it must also be integrated into an application and deployed. This requires powerful GPUs to drive both the rendering of sophisticated 3D characters and the AI intelligence that brings them to life. Monolithic solutions are optimized for specific endpoints, while cloud-native solutions are more scalable across all endpoints, including mobile, web, and limited compute devices such as augmented reality headsets.

NVIDIA Omniverse Avatar Cloud Engine (ACE) helps address these challenges by delivering all the necessary AI building blocks to bring intelligent avatars to life, at scale.

Omniverse ACE and AI microservices

Omniverse ACE is a collection of cloud-native AI models and microservices for building, customizing, and deploying intelligent and engaging avatars easily. These AI microservices power the backend of interactive avatars, making it possible for these virtual robots to see, perceive, intelligently converse, and provide recommendations to users.

Omniverse ACE uses Universal Scene Description (USD) and the NVIDIA Unified Compute Framework (UCF), a fully accelerated framework that enables you to combine optimized and accelerated microservices into real-time AI applications.

Every microservice has a bounded domain context (animation AI, conversational AI, vision AI, data analytics, or graphics rendering) and can be independently managed and deployed from UCF Studio.

The AI microservices include the following:

  • Animation AI: Omniverse Audio2Face simplifies the animation of a 3D character to match any voice-over track, helping users animate characters for games, films, or real-time digital assistants.
  • Conversational AI: Includes the NVIDIA Riva SDK for speech AI and the NVIDIA NeMo Megatron framework for natural language processing. These tools enable you to quickly build and deploy cutting-edge applications that deliver high-accuracy, expressive voices, and real-time responses.
  • Vision AI: NVIDIA Metropolis enables computer vision workflows—from model development to deployment—for individual developers, higher education and research, and enterprises.
  • Recommendation AI: NVIDIA Merlin is an open-source framework for building high-performing recommender systems at scale. It includes libraries, methods, and tools that streamline recommender builds.

NVIDIA UCF includes validated deployment-ready microservices to accelerate application development. The abstraction of each domain from the application alleviates the need for low-level domain and platform knowledge. New and custom microservices can be created using NVIDIA SDKs.

Sign up to be notified about the UCF Studio Early Access program.

No-code design tools for cloud deployment

Application developers will be able to bring all these UCF-based microservices together using NVIDIA UCF Studio, a no-code application builder tool to create, manage, and deploy applications to a private or public cloud of choice.

Designs are visualized as a combination of microservice processing pipelines. Using drag-and-drop operations, you can quickly create and combine these pipelines to build powerful applications that incorporate different AI modalities, graphics, and other processing functions.

Diagram shows how Omniverse ACE microservices connect together to form an avatar AI workflow pipeline.
Figure 2. Example of an avatar AI workflow pipeline built with Omniverse ACE

Built-in design rules and verification are part of the UCF Studio development environment to ensure that applications built there are correct-by-construction. When they’re complete, applications can be packaged into NVIDIA GPU-enabled containers and deployed to the cloud easily, using Helm charts.

Building Violet, the NVIDIA Tokkio avatar

NVIDIA Tokkio, showcased in the GTC keynote, represents the latest evolution of avatar development using Omniverse ACE. In the demo, Jensen Huang introduces Violet, a cloud-based, interactive customer service avatar that is fully autonomous.

Violet was developed using the NVIDIA Tokkio application workflow, which enables interactive avatars to see, perceive, converse intelligently, and provide recommendations to enhance customer service, both online and in places like restaurants and stores.

While the user interface and specific AI microservice components will continue to be refined within UCF Studio, the core process of how to create an avatar AI workflow pipeline and deploy it will remain the same. You will be able to quickly select, drag-and-drop, and switch between microservices to easily customize your avatars.

Video 1. The NVIDIA GTC demo showcased Violet, an AI-powered avatar that responds to natural speech and makes intelligent recommendations

You start with a fully rigged avatar and some basic animation that was rendered in Omniverse. With UCF Studio, you can select the necessary components to make the Violet character interactive. This example includes Riva automatic speech recognition (ASR) and text-to-speech (TTS) features to make her listen and speak, and Omniverse Audio2Face to provide the necessary animation.

Then, connect Violet to a food ordering dataset to enable her to handle customer orders and queries. When you’re done, UCF Studio generates a Helm chart that can be deployed onto a Kubernetes cluster through a series of CLI commands. Now, the Violet avatar is running in the cloud and can be interacted with through a web-based application or a physical food service kiosk.

Next, update her language model so that she can answer questions that don’t relate to food orders. The NVIDIA Tokkio application framework includes a customizable pretrained natural language processing (NLP) model built using NVIDIA NeMo Megatron. Her language model can be updated, in this case to a predeployed Megatron large language model (LLM) microservice, by going back into UCF Studio and updating the inference settings. Violet is redeployed and can now respond to broader, open-domain questions.

Omniverse ACE microservices will also support avatars rendered in third-party engines, so you can switch out the avatar that this NVIDIA Tokkio pipeline is driving. Back in UCF Studio, redirect the output of the Omniverse Audio2Face microservice to drive UltraViolet, an avatar created with Epic’s MetaHuman in Unreal Engine 5.

Learn more about Omniverse ACE

The more companies rely on AI-assisted virtual agents, the more they will want to ensure that users feel relaxed, trusting, and comfortable interacting with these agents and AI-assisted cloud applications.

With Omniverse ACE and domain-specific AI reference applications like NVIDIA Tokkio, you can more easily meet the demand for intelligent and responsive avatars like Violet. Take 3D models built and rendered with popular platforms like Unreal Engine, then connect these characters to AI microservices from Omniverse ACE to bring them to life.

Interested in Omniverse ACE and getting early access when it becomes available?

To learn more about Omniverse ACE and how to build and deploy interactive avatars on the cloud, add the Building the Future of Work with AI-powered Digital Humans GTC session to your calendar.

Categories
Misc

Now You’re Speaking My Language: NVIDIA Riva Sets New Bar for Fully Customizable Speech AI

Whether for virtual assistants, transcriptions or contact centers, voice AI services are turning words and conversations into bits and bytes of business magic. Today at GTC, NVIDIA announced new additions to NVIDIA Riva, a GPU-accelerated software development kit for building and deploying speech AI applications. Riva’s pretrained models are now offered in seven languages.
