Solving AI Inference Challenges with NVIDIA Triton

Deploying AI models in production to meet the performance and scalability requirements of an AI-driven application while keeping infrastructure costs low is a daunting task.

This post provides a high-level overview of the AI inference challenges that commonly occur when deploying models in production, along with how NVIDIA Triton Inference Server is being used today across industries to address them.

We also examine some of the recently added features, tools, and services in Triton that simplify the deployment of AI models in production, with peak performance and cost efficiency.

Challenges to consider when deploying AI inference

AI inference is the production phase of running AI models to make predictions. Inference is complex but understanding the factors that affect your application’s speed and performance will help you deliver fast, scalable AI in production.

Challenges for developers and ML engineers

  • Many types of models: AI, machine learning, and deep learning (neural network–based) models with different architectures and different sizes.
  • Different inference query types: Real-time, offline batch, streaming video and audio, and model pipelines make meeting application service level agreements challenging.
  • Constantly evolving models: Models in production must be updated continuously based on new data and algorithms, without business disruptions.

Challenges for MLOps, IT, and DevOps practitioners

  • Multiple model frameworks: There are different training and inference frameworks like TensorFlow, PyTorch, XGBoost, TensorRT, ONNX, or just plain Python. Deploying and maintaining each of these frameworks in production for applications can be costly.
  • Diverse processors: The models can be executed on a CPU or GPU. Having a separate software stack for each processor platform leads to unnecessary operational complexity.
  • Diverse deployment platforms: Models are deployed on public clouds, in on-premises data centers, at the edge, and on embedded devices, on bare metal, virtualized, or on a third-party ML platform. Disparate or suboptimal solutions for a given platform lead to poor ROI, which might include slower rollouts, poor app performance, or higher resource usage.

A combination of these factors makes it challenging to deploy AI inference in production with the desired performance and cost efficiency.

New AI inference use cases using NVIDIA Triton

NVIDIA Triton Inference Server (Triton), is an open source inference serving software that supports all major model frameworks (TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, OpenVINO, Python, and others). Triton can be used to run models on x86 and Arm CPUs, NVIDIA GPUs, and AWS Inferentia. It addresses the complexities discussed earlier through standard features.

Triton is used by thousands of organizations across industries worldwide. Here’s how Triton helps solve AI inference challenges for some customers.

NIO Autonomous Driving

NIO uses Triton to run their online services models in the cloud and data center. These models process data from autonomous driving vehicles. NIO used the Triton model ensemble feature to move their pre- and post-processing functions from the client application to Triton Inference Server. The preprocessing was accelerated by 5x, increasing their overall inference throughput and enabling them to cost-efficiently process more data from the vehicles.

GE Healthcare

GE Healthcare uses Triton in their Edison platform to standardize inference serving across different frameworks (TensorFlow, PyTorch, ONNX, and TensorRT) for in-house models. The models are deployed on a variety of hardware systems from embedded (for example, an x-ray system) to on-premises servers.

Wealthsimple

The online investment management firm uses Triton on CPUs to run their fraud detection and other fintech models. Triton helped them consolidate their different serving software across applications into a single standard for multiple frameworks.

Tencent

Tencent uses Triton in their centralized ML platform for unified inference for several business applications. In aggregate, Triton helps them process 1.5M queries per day. Tencent achieved a low cost of inference through Triton’s dynamic batching and concurrent model execution capabilities.

Alibaba Intelligent Connectivity

Alibaba Intelligent Connectivity is developing AI systems for their smart speaker applications. They use Triton in the data center to run models that generate streaming text-to-speech for the smart speaker. Triton delivered the low first-packet latency needed for a good audio experience.

Yahoo Japan

Yahoo Japan uses Triton on CPUs in the data center to run models to find similar locations for the “spot search” functionality in the Yahoo Browser app. Triton is used to run the complete image search pipeline and is also integrated into their centralized ML platform to support multiple frameworks on CPUs and GPUs.             

Airtel

Airtel, the second largest wireless provider in India, uses Triton to run automatic speech recognition (ASR) models for their contact center application to improve customer experience. Triton helped them upgrade to a more accurate ASR model while still getting a 2x throughput increase on GPUs compared to the previous serving solution.

How Triton Inference Server addresses AI inference challenges

From fintech to autonomous driving, all applications can benefit from out-of-the-box functionality to deploy models into production easily.

This section discusses a few key new features, tools, and services that Triton provides out of the box that can be applied to deploy, run, and scale models in production.

Model orchestration with new management service

Triton brings a new model orchestration service for efficient multi-model inference. This software application, currently in early access, helps simplify the deployment of Triton instances in Kubernetes with many models in a resource-efficient way. Some of the key features of this service include the following:

  • Loading models on demand and unloading models when not in use.
  • Efficiently allocating GPU resources by placing multiple models on a single GPU server wherever possible.
  • Managing custom resource requirements for individual models and model groups.

For a short demo of this service, see Take Your AI Inference to the Next Level. The model orchestration feature is in private early access (EA). If you are interested in trying it out, sign up now.

Large language model inference

In the area of natural language processing (NLP), the size of models is growing exponentially (Figure 1). Large transformer-based models with hundreds of billions of parameters can solve many NLP tasks such as text summarization, code generation, translation, or PR headline and ad generation.

Figure 1. NLP model sizes have grown exponentially over the years

But these models are so large that they cannot fit on a single GPU. For example, Turing-NLG with 17.2B parameters needs at least 34 GB of memory to store weights and biases in FP16, and GPT-3 with 175B parameters needs at least 350 GB. To use them for inference, you need multi-GPU and increasingly multi-node execution for serving the model.
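
As a quick sanity check of these numbers, a common rule of thumb is two bytes per parameter in FP16. The sketch below uses that assumption and ignores activation memory, KV caches, and framework overhead, so real deployments need more memory than this lower bound.

```python
def fp16_weight_memory_gb(num_params: float) -> float:
    """Rough lower bound on memory needed to hold model weights in FP16.

    Assumes 2 bytes per parameter; activations and framework overhead
    are ignored, so actual requirements are higher.
    """
    bytes_per_param = 2  # FP16 = 16 bits
    return num_params * bytes_per_param / 1e9  # decimal GB

print(fp16_weight_memory_gb(17.2e9))  # Turing-NLG: ~34 GB
print(fp16_weight_memory_gb(175e9))   # GPT-3: ~350 GB
```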

Triton Inference Server has a backend called FasterTransformer that brings multi-GPU, multi-node inference for large transformer models like GPT, T5, and others. The large language model is converted to the FasterTransformer format, which adds optimizations and distributed inference capabilities, and is then run with Triton Inference Server across GPUs and nodes.

Figure 2 shows the speedup observed with Triton when running the GPT-J (6B) model on a CPU compared to one or two A100 GPUs.

Figure 2. Speedup from running the LLM on GPUs with the FasterTransformer backend, compared to a CPU

For more information about large language model inference with the Triton FasterTransformer backend, see Accelerated Inference for Large Transformer Models Using NVIDIA Triton Inference Server and Deploying GPT-J and T5 with NVIDIA Triton Inference Server.

Inference of tree-based models

Triton can be used to deploy and run tree-based models from frameworks such as XGBoost, LightGBM, and scikit-learn RandomForest on both CPUs and GPUs, with explainability using SHAP values. It accomplishes this using the Forest Inference Library (FIL) backend, which was introduced last year.

The advantage of using Triton for tree-based model inference is better performance and standardization of inference across machine learning and deep learning models. It is especially useful for real-time applications such as fraud detection, where bigger models can be used easily for better accuracy.
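
As a rough illustration of what serving a tree-based model with the FIL backend involves, the sketch below trains a small XGBoost classifier and saves it into a Triton-style model repository. The directory layout and the config.pbtxt contents mentioned in the comments are assumptions based on the FIL backend documentation; see the post linked below for the exact configuration options.

```python
# Minimal sketch: train an XGBoost model and save it where a Triton FIL
# backend deployment would expect it. Paths and config values are
# illustrative assumptions, not a verified deployment recipe.
import os
import numpy as np
import xgboost as xgb

# Toy "fraud" data: 1,000 transactions with 16 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16)).astype(np.float32)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(np.int32)

model = xgb.XGBClassifier(n_estimators=50, max_depth=6)
model.fit(X, y)

# Triton model repositories use <repo>/<model_name>/<version>/<model file>.
model_dir = os.path.join("model_repository", "fraud_detection", "1")
os.makedirs(model_dir, exist_ok=True)
model.save_model(os.path.join(model_dir, "xgboost.json"))

# A config.pbtxt alongside the version directory would then select the FIL
# backend (backend: "fil") and set parameters such as model_type
# "xgboost_json"; see the FIL backend docs for the full set of options.
```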

For more information about deploying a tree-based model with Triton, see Real-time Serving for XGBoost, Scikit-Learn RandomForest, LightGBM, and More. The post includes a fraud detection notebook.

Try this NVIDIA Launchpad lab to deploy an XGBoost fraud detection model with Triton.

Optimal model configuration with Model analyzer

Efficient inference serving requires choosing optimal values for parameters such as batch size, model concurrency, or precision for a given target processor. These values dictate throughput, latency, and memory requirements. It can take weeks to try hundreds of combinations manually across a range of values for each parameter.

The Triton model analyzer tool reduces the time that it takes to find the optimal configuration parameters from weeks to days or even hours. It does this by running hundreds of inference simulations offline with different batch size and model concurrency values for a given target processor. At the end, it provides charts like Figure 3 that make it easy to choose the optimal deployment configuration. For more information about the model analyzer tool and how to use it for your inference deployment, see Identifying the Best AI Model Serving Configurations at Scale with NVIDIA Triton Model Analyzer.

Figure 3. Sample output from the Triton model analyzer tool, showing the best configurations and memory usage for a BERT large model

Model pipelines with business logic scripting

Figure 4. Model ensembles with business logic scripting: conditional model execution (left) and autoregressive modeling (right)

With the model ensemble feature in Triton, you can build out complex model pipelines and ensembles with multiple models and pre- and post-processing steps. Business logic scripting enables you to add conditionals, loops, and reordering of steps in the pipeline.

Using the Python or C++ backends, you can define a custom script that can call any other model being served by Triton based on conditions that you choose. Triton efficiently passes data to the newly called model, avoiding unnecessary memory copying whenever possible. Then the result is passed back to your custom script, from which you can continue further processing or return the result.

Figure 4 shows two examples of business logic scripting:

  • Conditional execution helps you use resources more efficiently by avoiding the execution of unnecessary models.
  • Autoregressive models, like transformer decoding, require the output of a model to be repeatedly fed back into itself until a certain condition is reached. Loops in business logic scripting enable you to accomplish that.
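
As an illustration of the conditional execution pattern above, here is a minimal sketch of a Triton Python backend model that routes a request to one of two other deployed models. The model names, tensor names, and routing condition are hypothetical placeholders; the `pb_utils` calls follow the business logic scripting API described in the Triton Python backend documentation.

```python
# model.py for a hypothetical BLS "router" model in the Triton Python backend.
# Model and tensor names below are placeholders for illustration only.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")

            # Hypothetical condition: route small batches to a light model,
            # everything else to a heavier, more accurate model.
            target = "light_model" if input_tensor.as_numpy().shape[0] <= 4 else "heavy_model"

            bls_request = pb_utils.InferenceRequest(
                model_name=target,
                requested_output_names=["OUTPUT"],
                inputs=[input_tensor],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())

            output = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses
```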

For more information, see Business Logic Scripting.

Auto-generation of model configuration

Triton can automatically generate config files for your models for faster deployment. For TensorRT, TensorFlow, and ONNX models, Triton generates the minimum required config settings to run your model by default when it does not detect a config file in the repository.

Triton can also detect if your model supports batched inference. It sets max_batch_size to a configurable default value.

You can also include commands in your own custom Python and C++ backends to generate model config files automatically based on the script contents. These features are especially useful when you have many models to serve, as it avoids the step of manually creating the config files. For more information, see Auto-generated Model Configuration.
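
One way to inspect the configuration that Triton generated is to query the server with the Triton client library. A minimal sketch follows, assuming the `tritonclient` package is installed, a server is running locally on the default HTTP port, and `my_model` is a placeholder model name.

```python
# Sketch: fetch the (possibly auto-generated) configuration of a deployed model.
import json
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
config = client.get_model_config("my_model")  # returned as a dict over HTTP
print(json.dumps(config, indent=2))
```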

Decoupled input processing

Figure 5. A one-request to many-response scenario enabled by decoupled input processing: the client sends a single large request and receives several smaller responses in return

While many inference settings require a one-to-one correspondence between inference requests and responses, this isn’t always the optimal data flow.

For instance, with ASR models, sending the full audio and waiting for the model to finish execution may not result in a good user experience. The wait can be long. Instead, Triton can send back the transcribed text in multiple short chunks (Figure 5), reducing the latency and time to the first response.

With decoupled model processing in the C++ or Python backend, you can send back multiple responses for a single request. Of course, you could also do the opposite: send multiple small requests in chunks and get back one big response. This feature provides flexibility in how you process and send your inference responses. For more information, see Decoupled Models.
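
The sketch below shows the general shape of a decoupled model in the Triton Python backend: each request gets a response sender that can emit any number of responses before signaling completion. The chunking logic and tensor names are illustrative assumptions; the `pb_utils` calls follow the decoupled API in the Python backend documentation, and the model configuration must also enable the decoupled transaction policy.

```python
# model.py sketch for a hypothetical decoupled Triton Python backend model
# that streams results back in chunks. Tensor names are placeholders, and the
# config.pbtxt must set model_transaction_policy { decoupled: True }.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            audio = pb_utils.get_input_tensor_by_name(request, "AUDIO").as_numpy()

            # Emit one response per processed chunk instead of waiting for
            # the full input to finish (e.g. partial ASR transcripts).
            for chunk in np.array_split(audio, 4):
                out = pb_utils.Tensor("PARTIAL_RESULT", chunk.astype(np.float32))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            # Tell Triton this request will produce no further responses.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)

        # Decoupled models return None; responses are delivered via senders.
        return None
```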

For more information about recently added features, see the NVIDIA Triton release notes.

Get started with scalable AI model deployment

You can deploy, run, and scale AI models with Triton to effectively mitigate AI inference challenges that you may have with multiple frameworks, a diverse infrastructure, large language models, optimal model configurations, and more.

Triton Inference Server is open-source and supports all major model frameworks such as TensorFlow, PyTorch, TensorRT, XGBoost, ONNX, OpenVINO, Python, and even custom frameworks on GPU and CPU systems. Explore more ways to integrate Triton with any application, deployment tool, and platform, on the cloud, on-premises, and at the edge.

Powering Ultra-High Speed Frame Rates in AI Medical Devices with the NVIDIA Clara Holoscan SDK

In the operating room, the latency and reliability of surgical video streams can make all the difference for patient outcomes. Ultra-high-speed frame rates from sensor inputs that enable next-generation AI applications can provide surgeons with new levels of real-time awareness and control.

To build real-time AI capabilities into medical devices for use cases like surgical navigation, image-guided intervention such as endoscopy, and medical robotics, developers need AI pipelines that allow for low-latency processing of combined sensor data from multiple channels.

As announced at GTC 2022, NVIDIA Clara Holoscan SDK v0.3 now provides a lightning-fast frame rate of 240 Hz for 4K video. This enables developers to combine data from more sensors and build AI applications that can provide surgical guidance. With faster data transfer enabled through high-speed Ethernet-connected sensors, developers have even more tools to build accelerated AI pipelines.

Real-time AI processing of frontend sensors

NVIDIA Clara Holoscan enables high-speed sensor input through the ConnectX SmartNIC and NVIDIA Rivermax SDK with GPUDirect RDMA that bypasses the CPU. This allows for high-speed Ethernet output of data from sensors into the AI compute system. The result is unmatched performance for edge AI.

While traditional GStreamer and OpenGL-based endoscopy pipelines have an end-to-end latency of 220 ms on a 1080p 60 Hz stream, high-speed pipelines with Clara Holoscan boast an end-to-end latency of only 10 ms on a 4K 240 Hz stream.

When streaming 4K 60 Hz data at under 50 ms of latency on an NVIDIA RTX A6000, teams can run 15 concurrent AI video streams and 30 concurrent models.

NVIDIA Rivermax SDK

The NVIDIA Rivermax SDK, included with NVIDIA Clara Holoscan, enables direct data transfers to and from the GPU. Bypassing host memory and using the offload capabilities of the ConnectX SmartNIC, it delivers best-in-class throughput and latency with minimal CPU utilization for streaming workloads. NVIDIA Clara Holoscan leverages Rivermax functionality to bring scalable connectivity for high-bandwidth network sensors and support very fast data transfer.

NVIDIA G-SYNC

NVIDIA G-SYNC enables high display performance by synchronizing display refresh rates to the GPU, thus eliminating screen tearing and minimizing display stutter and input lag. As a result, the AI inference can be shown with very low latency.

NVIDIA Clara HoloViz

Clara HoloViz is a module in Holoscan for visualizing data. Clara HoloViz composites real-time streams of frames with multiple other layers, such as segmentation mask layers, geometry layers, and GUI layers.

For maximum performance, Clara HoloViz makes use of Vulkan, which is already installed as part of the NVIDIA driver.

Clara HoloViz uses the concept of the immediate mode design pattern for its API. No objects are created and stored by the application. This makes it easy to quickly build and change the visualization in a Holoscan application.

Improved developer experience

The NVIDIA Clara Holoscan SDK v0.3 release brings significant improvements to the development experience. First, the addition of a new C++ API for the creation of GXF extensions gives developers an additional pathway to building their desired applications. Second, the support for x86 processors allows developers to quickly get started with developing AI applications which can then be easily deployed on IGX development kits. Third, the Bring Your Own Model (BYOM) support has been enriched in this latest version.

Holoscan C++ API

The Holoscan C++ API provides a new convenient way to compose GXF workflows, without the need to write YAML files. The Holoscan C++ API enables a more flexible and scalable approach to creating applications. It has been designed for use as a drop-in replacement for the GXF Framework’s API and provides a common interface for GXF components.

Figure 1. The main components of the Holoscan API

Application: An application acquires and processes streaming data. An application is a collection of fragments where each fragment can be allocated to execute on a physical node of a Holoscan cluster.

Fragment: A fragment is a building block of the application. It is a directed acyclic graph (DAG) of operators. A fragment can be assigned to a physical node of a Holoscan cluster during execution. The run-time execution manages communication across fragments. In a fragment, operators (graph nodes) are connected to each other by flows (graph edges).

Operator: An operator is the most basic unit of work in this framework. An operator receives streaming data at an input port, processes it, and publishes it to one of its output ports. A codelet in GXF would be replaced with an operator in the framework. An operator encapsulates receivers and transmitters of a GXF entity as I/O ports of the operator.

Resource: A resource, such as system memory or a GPU memory pool, is something that an operator needs to perform its job. Resources are allocated during the initialization phase of the application. A resource matches the semantics of the GXF Memory Allocator or any other component derived from the GXF component class.

Condition: A condition is a predicate that can be evaluated at runtime to determine if an operator should execute. This matches the semantics of the GXF Scheduling Term class.

Port: An interaction point between two operators. Operators ingest data at input ports and publish data at output ports. Receiver, transmitter, and MessageRouter in GXF are replaced with the concept of I/O port of the operator.

Executor: An executor manages the execution of a fragment on a physical node. The framework provides a default executor that uses a GXF scheduler to execute an application.

You can find more information about the new C++ API in the SDK documentation. See an example of a full AI application for endoscopy tool tracking using the new C++ API in the public source code repository.
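
To make the relationships among these concepts concrete, here is a minimal conceptual sketch in Python. The class and method names are illustrative only; they are not the Holoscan C++ API, just a picture of how an application composes operators into a fragment's DAG.

```python
# Conceptual sketch only: NOT the Holoscan C++ API. It mirrors the ideas of
# operators (nodes), flows (edges), and a fragment-level DAG described above.
class Operator:
    def __init__(self, name):
        self.name = name

    def compute(self, data):
        # An operator receives data on input ports, processes it, and
        # publishes results on output ports.
        return data


class Fragment:
    def __init__(self):
        self.flows = []  # directed edges: (upstream operator, downstream operator)

    def add_flow(self, upstream: Operator, downstream: Operator):
        self.flows.append((upstream, downstream))


class Application(Fragment):
    def compose(self):
        # Build the DAG: sensor source -> inference -> visualization.
        source = Operator("video_source")
        inference = Operator("tool_tracking_inference")
        viz = Operator("holoviz_sink")
        self.add_flow(source, inference)
        self.add_flow(inference, viz)


app = Application()
app.compose()  # an executor/scheduler would then run the fragment's DAG
```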

Support for x86 systems

The NVIDIA Clara Holoscan SDK has been designed with various hardware systems in mind. It supports using the SDK on x86 systems, in addition to the NVIDIA IGX DevKit and the Clara AGX DevKit. With x86 support, researchers and developers who do not have a DevKit can use the Holoscan SDK on their x86 machines to quickly build AI applications for medical devices.

Bring Your Own Model

The Holoscan SDK provides AI libraries and pretrained AI models to jump-start the timeline for building your own AI applications. It also includes reference applications for endoscopy and ultrasound with Bring Your Own Model (BYOM) support.

As a developer, you can quickly build AI pipelines by dropping your own models in the reference applications provided as part of the SDK. Finally, the SDK also includes sensor I/O integration options and performance tools that optimize AI applications for production deployment.

Software stack updates

The NVIDIA Clara Holoscan SDK v0.3 release also integrates an upgrade from NVIDIA JetPack HP1 to Holopack 1.1, running the Tegra Board Support Package (BSP) version 34.1.3, as well as an upgrade for GXF from version 2.4.2 to version 2.4.3.

Get started building AI for medical devices

From training AI models to verifying and validating AI applications and ultimately deploying for commercial production, Clara Holoscan helps streamline AI development and deployment. 

Visit the Clara Holoscan SDK web page to access healthcare-specific acceleration libraries, pretrained AI models, sample applications, documentation, and more to get started building software-defined medical devices.

You can also request a free hands-on lab with NVIDIA LaunchPad to experience how Clara Holoscan simplifies the development of AI pipelines for endoscopy and ultrasound.

Building Cloud-Native, AI-Powered Avatars with NVIDIA Omniverse ACE

Explore the AI technology powering Violet, the interactive avatar showcased this week in the NVIDIA GTC 2022 keynote. Learn new details about NVIDIA Omniverse Avatar Cloud Engine (ACE), a collection of cloud-native AI microservices for faster, easier deployment of interactive avatars, and NVIDIA Tokkio, a domain-specific AI reference application that leverages Omniverse ACE for creating fully autonomous interactive customer service avatars.

AI-powered avatars in the cloud

Digital assistants and avatars can take many different forms and shapes, from common text-driven chatbots to fully animated digital humans and physical robots that can see and hear people. These avatars will populate virtual worlds to help us create and build things, serve as brand ambassadors and customer service agents, help you find something on a website, take your order at a drive-through, or recommend a retirement or insurance plan.

A real-time interactive 3D avatar can deliver a natural, engaging experience that makes people feel more comfortable. AI-based virtual assistants can also use non-verbal cues like your facial expressions and eye contact to enhance communication and understanding of your requests and intent.

Figure 1. NVIDIA Omniverse ACE powers AI-powered avatars like Violet, from the NVIDIA Tokkio reference application showcased at GTC, shown here in a restaurant ordering kiosk

But building these avatar applications at scale requires a broad range of expertise, including computer graphics, AI, and DevOps. Most current methods for animating avatars leverage traditional motion capturing solutions, which are challenging to use for real-time applications.

Cutting-edge NVIDIA AI technologies, such as Omniverse Audio2Face, NVIDIA Riva, and NVIDIA Metropolis, change the game by enabling avatar motion to be driven by audio and video. Connecting character animation directly to an avatar’s conversational intelligence enables faster, easier engineering and deployment of interactive avatars at scale.

When an avatar is created, it must also be integrated into an application and deployed. This requires powerful GPUs to drive both the rendering of sophisticated 3D characters and the AI intelligence that brings them to life. Monolithic solutions are optimized for specific endpoints, while cloud-native solutions are more scalable across all endpoints, including mobile, web, and limited compute devices such as augmented reality headsets.

NVIDIA Omniverse Avatar Cloud Engine (ACE) helps address these challenges by delivering all the necessary AI building blocks to bring intelligent avatars to life, at scale.

Omniverse ACE and AI microservices

Omniverse ACE is a collection of cloud-native AI models and microservices for building, customizing, and deploying intelligent and engaging avatars easily. These AI microservices power the backend of interactive avatars, making it possible for these virtual robots to see, perceive, intelligently converse, and provide recommendations to users.

Omniverse ACE uses Universal Scene Description (USD) and the NVIDIA Unified Compute Framework (UCF), a fully accelerated framework that enables you to combine optimized and accelerated microservices into real-time AI applications.

Every microservice has a bounded domain context (animation AI, conversational AI, vision AI, data analytics, or graphics rendering) and can be independently managed and deployed from UCF Studio.

The AI microservices include the following:

  • Animation AI: Omniverse Audio2Face simplifies the animation of a 3D character to match any voice-over track, helping users animate characters for games, films, or real-time digital assistants.
  • Conversational AI: Includes the NVIDIA Riva SDK for speech AI and the NVIDIA NeMo Megatron framework for natural language processing. These tools enable you to quickly build and deploy cutting-edge applications that deliver high-accuracy, expressive voices, and real-time responses.
  • Vision AI: NVIDIA Metropolis enables computer vision workflows—from model development to deployment—for individual developers, higher education and research, and enterprises.
  • Recommendation AI: NVIDIA Merlin is an open-source framework for building high-performing recommender systems at scale. It includes libraries, methods, and tools that streamline recommender builds.

NVIDIA UCF includes validated deployment-ready microservices to accelerate application development. The abstraction of each domain from the application alleviates the need for low-level domain and platform knowledge. New and custom microservices can be created using NVIDIA SDKs.

Sign up to be notified about the UCF Studio Early Access program.

No-code design tools for cloud deployment

Application developers will be able to bring all these UCF-based microservices together using NVIDIA UCF Studio, a no-code application builder tool to create, manage, and deploy applications to a private or public cloud of choice.

Designs are visualized as a combination of microservice processing pipelines. Using drag-and-drop operations, you can quickly create and combine these pipelines to build powerful applications that incorporate different AI modalities, graphics, and other processing functions.

Figure 2. Example of an avatar AI workflow pipeline built with Omniverse ACE

Built-in design rules and verification are part of the UCF Studio development environment to ensure that applications built there are correct-by-construction. When they’re complete, applications can be packaged into NVIDIA GPU-enabled containers and deployed to the cloud easily, using Helm charts.

Building Violet, the NVIDIA Tokkio avatar

NVIDIA Tokkio, showcased in the GTC keynote, represents the latest evolution of avatar development using Omniverse ACE. In the demo, Jensen Huang introduces Violet, a cloud-based, interactive customer service avatar that is fully autonomous.

Violet was developed using the NVIDIA Tokkio application workflow, which enables interactive avatars to see, perceive, converse intelligently, and provide recommendations to enhance customer service, both online and in places like restaurants and stores.

While the user interface and specific AI microservice components will continue to be refined within UCF Studio, the core process of how to create an avatar AI workflow pipeline and deploy it will remain the same. You will be able to quickly select, drag-and-drop, and switch between microservices to easily customize your avatars.

Video 1. The NVIDIA GTC demo showcased Violet, an AI-powered avatar that responds to natural speech and makes intelligent recommendations

You start with a fully rigged avatar and some basic animation that was rendered in Omniverse. With UCF Studio, you can select the necessary components to make the Violet character interactive. This example includes Riva automatic speech recognition (ASR) and text-to-speech (TTS) features to make her listen and speak, and Omniverse Audio2Face to provide the necessary animation.

Then, connect Violet to a food ordering dataset to enable her to handle customer orders and queries. When you’re done, UCF Studio generates a Helm chart that can be deployed onto a Kubernetes cluster through a series of CLI commands. Now, the Violet avatar is running in the cloud and can be interacted with through a web-based application or a physical food service kiosk.

Next, update her language model so that she can answer questions that don’t relate to food orders. The NVIDIA Tokkio application framework includes a customizable pretrained natural language processing (NLP) model built using NVIDIA NeMo Megatron. Her language model can be updated, in this case to a predeployed Megatron large language model (LLM) microservice, by going back into UCF Studio and updating the inference settings. Violet is redeployed and can now respond to broader, open-domain questions.

Omniverse ACE microservices will also support avatars rendered in third-party engines. You can switch out the avatar that this NVIDIA Tokkio pipeline is driving. Back in UCF Studio, replace the current microservice output of Omniverse Audio2Face to drive UltraViolet, an avatar created using Epic’s MetaHuman in Unreal Engine 5.

Learn more about Omniverse ACE

The more companies rely on AI-assisted virtual agents, the more they will want to ensure that users are relaxed, trusting, and comfortable interacting with these virtual agents and AI-assisted cloud applications.

With Omniverse ACE and domain-specific AI reference applications like NVIDIA Tokkio, you can more easily meet the demand for intelligent and responsive avatars like Violet. Take 3D models built and rendered with popular platforms like Unreal Engine and then connect these characters to AI microservices from Omniverse ACE, to bring them to life.

Interested in Omniverse ACE and getting early access when it becomes available?

To learn more about Omniverse ACE and how to build and deploy interactive avatars on the cloud, add the Building the Future of Work with AI-powered Digital Humans GTC session to your calendar.

Now You’re Speaking My Language: NVIDIA Riva Sets New Bar for Fully Customizable Speech AI

Whether for virtual assistants, transcriptions or contact centers, voice AI services are turning words and conversations into bits and bytes of business magic. Today at GTC, NVIDIA announced new additions to NVIDIA Riva, a GPU-accelerated software development kit for building and deploying speech AI applications. Riva’s pretrained models are now offered in seven languages.

Unlocking New Opportunities with AI Cloud Infrastructure for 5G vRAN

The cellular industry spends over $50 billion on radio access networks (RAN) annually, according to a recent GSMA report on the mobile economy. Dedicated and overprovisioned hardware is primarily used to provide capacity for peak demand. As a result, most RAN sites have an average utilization below 25%.

This has been the industry reality for years as technology evolved from 2G to 4G. But it is set to become even more pronounced in 5G as the push for densification, combined with the use of mmWave, leads to a near doubling of the number of cell sites by 2027 to over 17 million. The implication is that RAN capital expenditures, as a share of overall network total cost of ownership (TCO), will grow to as much as 65% in 5G, compared to 45-50% in 4G. 

A new game-changing approach turns underutilization into an opportunity: leverage the same cloud and data center infrastructure used for AI to dynamically load share with 5G virtual RAN (vRAN). This approach creates a new opportunity for cloud providers and helps reduce operational costs. 

Figure 1 shows how NVIDIA is enabling the CloudRAN solution with the NVIDIA A100X converged card, Spectrum SN3750SX Switch, NVIDIA Aerial SDK, and containerized DU.

Figure 1. The NVIDIA CloudRAN solution, offering flexible mapping and dynamic compute utilization for 5G and AI workloads

The opportunity of RAN underutilization 

By pooling baseband computing resources into a cloud-native environment, the CloudRAN solution delivers significant improvements in asset utilization, creating efficiency gains for telcos and revenue opportunities for cloud service providers (CSPs). The solution achieves this by dynamically orchestrating resources between the 5G RAN workload and off-peak workloads.

Examples of 5G RAN off-peak workloads are AI workloads such as drive mapping, federated learning, offline video analytics, predictive maintenance, factory digital twins, and many more. 5G compute capacity is mapped to changes in traffic demands from 5G radios, while the remaining compute is used for AI workloads. For telcos, this can increase RAN operational efficiency by more than 2x for an estimated 25% or more impact on the operating margin. 

For CSPs, running 5G vRAN as a workload alongside AI workloads within their existing data center architecture is a significant opportunity. To look at a specific example, the United States wireless market includes about 420,000 cell sites. If the telcos are using Centralized-RAN (C-RAN) for 50% of their network (mostly urban areas) and running a 4:1 configuration, then they will be using 52,000 GPUs to run their network. 

In a typical data center, the hourly GPU compute rate is $2. A C-RAN configuration that uses dynamic orchestration to combine 5G and AI workloads, together with the CSP’s monetization of those 52,000 GPUs, leads to a $500 million revenue opportunity. Globally, this is a multi-billion dollar opportunity.
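
The revenue figure follows from the earlier assumptions; the sketch below reproduces the arithmetic. The number of monetizable off-peak hours per day is not stated in the text, so it is shown here as an explicit assumption chosen to land near the quoted $500 million figure.

```python
# Back-of-the-envelope check of the C-RAN GPU monetization estimate.
# The off-peak-hours figure is an assumption, not from the source.
cell_sites = 420_000            # US wireless market
cran_share = 0.50               # half the network runs Centralized-RAN
sites_per_gpu = 4               # 4:1 C-RAN configuration
gpu_rate_per_hour = 2.0         # typical data center GPU compute rate ($)
off_peak_hours_per_day = 13     # ASSUMPTION: monetizable idle hours per day

gpus = cell_sites * cran_share / sites_per_gpu            # ~52,500 GPUs
annual_revenue = gpus * gpu_rate_per_hour * off_peak_hours_per_day * 365

print(f"GPUs: {gpus:,.0f}")
print(f"Annual revenue opportunity: ${annual_revenue / 1e6:,.0f}M")
```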

Figure 2 shows the off-peak AI workloads that can dynamically share the C-RAN GPU with the 5G workload during off-peak periods. 

Figure 2. Off-peak workloads that can dynamically share the CloudRAN GPU with 5G

To provide an example, the combination of an in-vehicle AI supercomputer and GPU resources in the cloud enables offline processing for self-driving cars. The system incorporates deep learning algorithms to detect lanes, signs, and other landmarks using a combination of AI and visual simultaneous localization and mapping (VSLAM).

The industry recognizes the general trend to pull softwarized RAN resources together in a few centralized hub locations. While this improves RAN TCO by more than 30% over distributed-RAN topology, it does not solve RAN underutilization. Why? The unused RAN compute resource goes to waste during off-peak periods. 

NVIDIA CloudRAN: Five building blocks 

To realize the CloudRAN solution, current vRANs will need to evolve in the following five key focus areas. 

First, there is a need for a software-defined fronthaul (SD-FH) with an optimal timing mechanism. Second, the RAN hardware needs to evolve from bespoke, dedicated, and (in many cases) non-cloud-native architecture, to COTS-based and cloud-native hardware. Third, the softwarized 5G RAN needs to be programmable in real time and capable of running on cloud infrastructure. Fourth, the lifecycle management (LCM) of the 5G RAN needs to be dynamic and based on open APIs. 

Finally, there is a need for an end-to-end (E2E) network and service orchestrator that can dynamically manage RAN and other off-peak workloads based on network and infrastructure utilization information. E2E service orchestration executes service intent by dynamically composing the workflow based on service models, policy, and context and using closed-loop control to automate the entire service and network.

NVIDIA CloudRAN identifies and offers an enabling solution for each of the five identified evolution needs of vRAN, as shown in Figure 3.

Figure 3. Components of the NVIDIA CloudRAN solution
  • SD-FH switch: NVIDIA SN3750-SX is a new 200G Ethernet switch based on the Spectrum-2 ASIC. It was purpose-built to provide the network fabric for CloudRAN-converged infrastructure, where it runs AI training workloads alongside 5G networking. It has software-defined and hardware-accelerated 5G fronthaul capabilities to steer 5G traffic from available DUs to RUs based on orchestrator mapping. It offers 5G time sync protocols with PTP telco profiles, SyncE, and PPS in/out. The switch also supports the NetQ validation toolset.
  • General purpose computing hardware: The NVIDIA A100X converged card combines the power of the NVIDIA A100 Tensor Core GPU with the advanced networking capabilities of the NVIDIA BlueField-2 DPU in a single, unique platform. This convergence delivers unparalleled performance for GPU-powered, I/O-intensive workloads, such as distributed AI training in the enterprise data center and 5G vRAN processing as another workload within existing data center architecture. 
  • Programmable and cloud-native 5G software: The NVIDIA Aerial SDK provides the 5G workload in the CloudRAN solution. NVIDIA Aerial is a fully cloud-native virtual 5G RAN solution running on COTS servers. It realizes RAN functions as microservices in containers over bare-metal servers, using Kubernetes and applying DevOps principles. It provides a 5G RAN solution with inline L1 GPU acceleration for 5G NR PHY processing and supports a full-stack framework for gNB integration at L2/L3 (MAC, RLC, PDCP), along with manageability and orchestration.
  • Open API lifecycle management: The O-RAN disaggregated, software-centric approach can help automate and orchestrate RAN complexity, irrespective of multi-vendor or multi-technology networks. Ultimately, the service management orchestration (SMO) will provide open and Kubernetes cluster APIs for RAN automation.
  • E2E network and service orchestrator: E2E orchestration enables dynamic applications and services by consolidating an E2E view in real time across all technology and cloud domains. A single pane of glass enables automation of all aspects of cross-domain services and manages lifecycle management, optimization, and assurance of various workloads. The E2E orchestrator will also have an interface to interact with the cloud infrastructure manager. 

Delivering the CloudRAN solution

The NVIDIA CloudRAN solution delivers a compelling value proposition with an SD-FH, general purpose data center compute, cloud-native architecture, RAN domain orchestrator, and E2E service and network orchestrator. 

NVIDIA and its ecosystem partners are building a Kubernetes-based SMO and E2E service orchestrator to support dynamic workload management. With telcos, NVIDIA is working on COTS-based and cloud-native vRAN software. With CSPs, NVIDIA is working to optimize data center hardware to support 5G workloads. 

Join us for the GTC 2022 session, Using AI Infrastructure in the Cloud for 5G vRAN to learn more about the CloudRAN solution. 

A Podcast With Teeth: How Overjet Brings AI to Dentists’ Offices

Dentists get a bad rap. Dentists also get more people out of more aggravating pain than just about anyone. Which is why the more technology dentists have, the better. Overjet, a member of the NVIDIA Inception program for startups, is moving fast to bring AI to dentists’ offices.

Open-Source Healthcare AI Innovation Continues to Expand with MONAI v1.0

Developing for the medical imaging AI lifecycle is a time-consuming and resource-intensive process that typically includes data acquisition, compute, and training time, and a team of experts who are knowledgeable in creating models suited to your specific challenge. Project MONAI, the medical open network for AI, is continuing to expand its capabilities to help make each of these hurdles easier no matter where developers are starting in their medical AI workflow. 

A growing open-source platform for better medical AI

MONAI is the domain-specific, open-source medical AI framework that drives research breakthroughs and accelerates AI into clinical impact. It unites doctors with data scientists to unlock the power of medical data for deep learning models and deployable applications in medical AI workflows. MONAI features domain-specific tools in data labeling, model training, and application deployment that enable you to develop, reproduce, and standardize on medical AI lifecycles. 

The release of MONAI v1.0 brings a number of exciting new updates and tools for developers, including:

  • Model Zoo
  • Active Learning in MONAI Label
  • Auto-3D Segmentation
  • Federated Learning

MONAI is the fastest growing open-source platform that provides deep learning infrastructure and workflows optimized for medical imaging in a native PyTorch paradigm. Freely available and optimized for supercomputing scale, MONAI is backed by 12 of the top Academic Medical Centers (AMCs) and has 50,000 downloads per month. From research to clinical products, the launch of MONAI v1.0 allows researchers and developers to build models and applications in a quick and standardized way. 

Jump-start training workflows with MONAI Model Zoo

Training and constructing your own AI models takes significant time, data, compute power, and knowledge of training algorithms. MONAI Model Zoo enables developers to quickly discover pretrained and openly available models specific to medical imaging. By using the MONAI Bundle Format, you can get started with these models in just a few commands.

MONAI Model Zoo offers a curated collection of medical imaging AI models. It is also a framework for you as a developer to create and publish your own models, resulting in an open-source collection of pretrained medical imaging models that can be used to speed up the development process. 

Driven by the community, Model Zoo makes cutting-edge medical AI tasks accessible and helps you get started quickly in your workflows with plug-and-play documentation, examples, and bundles. The main contributors to Model Zoo include NVIDIA, KCL, Kitware, Vanderbilt, and Charite. The collection includes more than 15 models across imaging modalities such as CT, pathology, ultrasound, and endoscopy that perform segmentation, classification, annotation, and other tasks.
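
For example, pulling a pretrained bundle from the Model Zoo takes only a couple of calls with the `monai.bundle` API. A minimal sketch follows; the bundle name is an example from the Zoo, so substitute the model you need and check the Model Zoo pages for exact names and any extra arguments.

```python
# Sketch: download a pretrained Model Zoo bundle and load its network weights.
# The bundle name is an example; check the Model Zoo for available bundles.
from monai.bundle import download, load

download(name="spleen_ct_segmentation", bundle_dir="./models")

# load() reconstructs the network defined in the bundle and loads its weights.
model = load(name="spleen_ct_segmentation", bundle_dir="./models")
```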

Figure 1. Access and download models from MONAI Model Zoo with just a few clicks

Build better datasets with active learning

The process of labeling data can be time consuming, and the experts who can annotate these images may not have time to annotate every image. MONAI Label has enhanced active learning capabilities. Active learning is a process that aims to use the least amount of labeled data to achieve the highest possible model performance. Choosing the data that will have the greatest influence on your overall model accuracy lets human annotators focus on the annotations that will have the highest impact on model performance.

MONAI Label provides a clinician-friendly application to expertly label data in a fraction of the time while simultaneously training a model at the push of a button. With approaches like active learning, AI-powered algorithms can intelligently select the most difficult images for clinical inputs and increase the performance of the AI model as it learns from the expert. This enables human annotators to focus on the annotations that will provide the highest gain in model performance and address areas with model uncertainty. 

Active learning serves to build better datasets in a fraction of the time it would take humans to curate. MONAI Label can now automatically review and label large datasets, flag image data that requires human input, and then query the clinician to label it before it is added back into the training data. 

Developers can see up to a 75% reduction in training costs with active learning in MONAI Label, with increased labeling and training efficiency while achieving better model performance. With active learning, only 25% of the training dataset was needed to achieve the same 0.82 Dice score as training on 100% of the dataset.

Figure 2. The six steps of the active learning framework on MONAI Label

Accelerate 3D segmentation

The process of model training to achieve a state-of-the-art 3D segmentation model takes significant time, compute, and developer and researcher expertise. To help accelerate this process, MONAI now offers a low-code 3D medical image segmentation framework that speeds up model training time without human interaction.

The MONAI Auto-3D Segmentation tool is a low-code framework that allows developers and researchers of any skill level to train models that can quickly delineate regions of interest in data from 3D imaging modalities like CT and MRI. It accelerates training time for developers from one week to two days with effective models, efficient workflows, and customizability to user needs.

Features include:

  • Data analysis tool
  • Automated configuration
  • Model training in a MONAI bundle
  • Model ensemble tool
  • Workflow manager
  • Trained model weights
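
A minimal usage sketch of the Auto-3D Segmentation workflow is shown below, assuming an MSD-style datalist JSON and a CT dataset. The `AutoRunner` entry point and the exact input keys should be checked against the MONAI Auto3DSeg documentation; the paths and task values here are placeholders.

```python
# Sketch of the low-code Auto3DSeg workflow; paths and keys are assumptions.
from monai.apps.auto3dseg import AutoRunner

task = {
    "name": "my_ct_task",          # placeholder task name
    "task": "segmentation",
    "modality": "CT",
    "datalist": "./datalist.json", # train/val file lists in MSD-style JSON
    "dataroot": "./data",          # directory containing the images/labels
}

# Runs data analysis, algorithm configuration, training, and ensembling.
runner = AutoRunner(input=task)
runner.run()
```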

Federated learning on MONAI

MONAI v1.0 includes the federated learning (FL) client algorithm APIs that are exposed as an abstract base class for defining an algorithm to be run on any federated learning platform.

NVIDIA FLARE, the NVIDIA federated learning platform, has already integrated with these new APIs. Using MONAI Bundle configurations with the new federated learning APIs, any bundle can be seamlessly extended to a federated paradigm. We welcome other federated learning toolkits to integrate with the MONAI FL APIs, building a common foundation for carrying out collaborative learning in medical imaging.
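
In outline, a federated learning platform drives a MONAI client algorithm through a small set of calls. The sketch below shows the general shape of a `ClientAlgo` subclass; the method names and arguments are simplified here and should be checked against the MONAI FL API reference before use.

```python
# Schematic sketch of a MONAI federated learning client; the real ClientAlgo
# base class in monai.fl defines the exact signatures and exchange objects,
# so treat the method names and arguments below as simplified placeholders.
from monai.fl.client import ClientAlgo


class MyTrainer(ClientAlgo):
    def initialize(self, extra=None):
        # Build the local model, data loaders, and optimizer here.
        ...

    def train(self, data, extra=None):
        # 'data' carries the global weights sent by the FL server
        # (e.g. NVIDIA FLARE); run local training rounds on local data.
        ...

    def get_weights(self, extra=None):
        # Return the locally updated weights for server-side aggregation.
        ...
```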

Get started with MONAI

To get started with MONAI v1.0, visit the MONAI website. Access Python libraries, Jupyter notebooks, and MONAI tutorials on the Project MONAI GitHub repo.

You can also request a free hands-on lab through NVIDIA LaunchPad to get started annotating and adapting medical imaging models with MONAI Label.

Democratizing and Accelerating Genome Sequencing Analysis with NVIDIA Clara Parabricks v4.0

The field of computational biology relies on bioinformatics tools that are fast, accurate, and easy to use. As next-generation sequencing (NGS) is becoming faster and less costly, a data deluge is emerging, and there is an ever-growing need for accessible, high-throughput, industry-standard analysis.

At GTC 2022, we announced the release of NVIDIA Clara Parabricks v4.0, which brings significant improvements to how genomic researchers and bioinformaticians deploy and scale genome sequencing analysis pipelines.

  • Clara Parabricks software is now free to researchers on NGC as individual tools or as a unified container. A licensed version is available through NVIDIA AI Enterprise for customers requiring enterprise-grade support.
  • Clara Parabricks is now easily integrated into common workflow languages such as Workflow Description Language (WDL) and NextFlow, for the interweaving of GPU-accelerated and third-party tools, and scalable deployment on-premises and in the cloud. The Cromwell workflow management system from the Broad Institute is also supported. 
  • Clara Parabricks can now be deployed on the Broad Institute’s Terra SaaS platform, making it available to the 25,000+ Terra scientists. Genome analysis is reduced to just over one hour with Clara Parabricks compared to 24 hours in a CPU environment, while reducing costs by 50% for whole genome sequencing analysis.
  • Clara Parabricks continues to focus on GPU-accelerated, industry-standard, and deep-learning-based tools and has included the latest DeepVariant v1.4 germline caller. Development in the areas of sequencer-agnostic tooling and deep learning approaches are a focus of Clara Parabricks.
  • Clara Parabricks is now available through more cloud providers and partners, including Amazon Web Services, Google Cloud Platform, Terra, DNAnexus, Lifebit, Agilent Technologies, UK Biobank Research Analysis Platform (RAP), Oracle Cloud Infrastructure, Naver Cloud, Alibaba Cloud, and Baidu AI Cloud.

License-free use for research and development

Clara Parabricks v4.0 is now available entirely free of charge for research and development. This means fewer technical barriers than ever before, including the removal of the install scripts and the enterprise license server present in previous versions of the genomic analysis software. 

This also means significant simplification in deployment, with the ability to pull and run Clara Parabricks Docker containers quickly and easily, on any NVIDIA-certified systems, with maximum ease of use on-premises or in the cloud.

Commercial users that require enterprise-level technical and engineering support for their production workflows, or to work with NVIDIA experts on new features, applications, and performance optimizations, can now subscribe to NVIDIA AI Enterprise Support. This support will be available for Parabricks v4.0 with the upcoming release of NVIDIA AI Enterprise v3.0.

An NVIDIA AI Enterprise Support subscription comes with full-stack support (from container-level, through to full on-premises and cloud deployment), access to NVIDIA Parabricks experts, security notifications, enterprise training in areas such as IT or data science, and deep learning support for TensorFlow, PyTorch, NVIDIA TensorRT, and NVIDIA RAPIDS. Learn more about NVIDIA AI Enterprise Support Services and Training

Figure 1. Clara Parabricks license options: access all the tools, including the pipelines and workflows, at no cost

Deploying in WDL and NextFlow workflows

You can now pull Clara Parabricks directly from NGC collection containers with no licensing server, meaning that it can easily be run as part of scalable and flexible bioinformatics workflows on a variety of systems and platforms.

This includes popular bioinformatics workflow managers WDL and NextFlow that are available on the new Clara-Parabricks-Workflows GitHub repo for general use by the bioinformatics community. You can find WDL and NextFlow workflows or modules for the following:

  • BWA-MEM alignment and processing with Clara Parabricks FQ2BAM
  • A germline calling workflow running accelerated HaplotypeCaller and DeepVariant, with the option to apply the GATK best practices
  • A BAM2FQ2BAM workflow to extract reads and realign to new reference genomes (such as the T2T completed human genome)
  • A somatic workflow using accelerated Mutect2, with an optional panel of normals
  • A workflow to generate a new panel of normals for somatic variant calling from VCFs
  • A workflow to build reference indexes (required for several of the workflows and tasks listed earlier)

In addition, a workflow for calling de novo mutations in trio data developed in collaboration with researchers at the National Cancer Institute will be available later this year.

These workflows bring impressive flexibility, enabling users to interweave the GPU-accelerated tools of Clara Parabricks with third-party tooling. They can specify individual compute resources for each task, before deploying at a massive scale on local clusters (on SLURM, for example) or on cloud platforms. See the Clara-Parabricks-Workflows GitHub repo for example configurations and recommended GPU instances.

Figure 2. Pull directly from the Clara Parabricks Docker container and specify gpuType and gpuCount compute requirements

Run on-premises or in the cloud

Clara Parabricks is well-suited to cloud deployment. It is available to run on several cloud platforms, including Amazon Web Services, Google Cloud Services, DNAnexus, Lifebit, Baidu AI Cloud, Naver Cloud, Oracle Cloud Infrastructure, Alibaba Cloud, Terra, and more.

Clara Parabricks v4.0 WDL workflows are now integrated into the Broad Institute’s Terra platform for its 25,000+ scientists to run accelerated genomic analyses. Terra’s scalable platform runs on top of Google Cloud, which hosts a fleet of NVIDIA GPUs. A FASTQ to VCF analysis on a 30x whole genome takes 24 hours in a CPU environment compared to just over one hour with Clara Parabricks in Terra. In addition, costs are reduced by over 50%, from $5 to $2 (Figure 3).

In the Terra platform, researchers can gain access to a wealth of data much more easily than in an on-premises environment. They can access the Clara Parabricks workspace at the push of a button, rather than manually managing and configuring the hardware. Get started at the Clara Parabricks page on the Terra Community Workbench.

Figure 3. Time and cost comparison of FASTQ to VCF runs on CPU and GPU for 30x whole genome sequencing in Terra

Runtimes and compute cost (preemptible pricing) for germline analysis of a 30x whole genome (including BWA-MEM, MarkDuplicates, BQSR, and HaplotypeCaller) are greatly reduced when using Clara Parabricks and NVIDIA GPUs.

Clara Parabricks v4.0 tools and features

Clara Parabricks v4.0 is a more focused genomic analysis toolset than previous versions, with rapid alignment, gold standard processing, and high accuracy variant calling. It offers the flexibility to freely and seamlessly intertwine GPU and CPU tasks and prioritize the GPU-acceleration of the most popular and bottlenecked tools in the genomics workflow. Clara Parabricks can also integrate cutting-edge deep learning approaches in genomics.

Figure 4. The NVIDIA Clara Parabricks v4.0 toolset

The individual Clara Parabricks tools are also now offered in individual containers in the Clara Parabricks collection on NGC, or as a unified container that encompasses all tools in one. With individual containers, bioinformaticians can access leaner images, and the Clara Parabricks team can push more frequent, agile per-tool releases to provide access to the latest versions.

The first of these releases is for DeepVariant v1.4. This latest version of DeepVariant increases accuracy across multiple genomics sequencers. There is an additional read insert size feature for Illumina whole genome and whole exome models, which reduces errors by 4-10%, and direct phasing for more accurate variant calling in PacBio sequencing runs. This means that you can now perform the high-accuracy process of phased variant calling for PacBio data directly in DeepVariant, with pipelines such as DeepVariant-WhatsHap-DeepVariant or PEPPER-Margin-DeepVariant.

DeepVariant v1.4 is also compatible with multiple custom DeepVariant models for emerging genomics sequencing instruments. The models have been GPU-accelerated in collaboration with the NVIDIA Clara Parabricks team to provide rapid and high-accuracy variant calls across sequencing instruments. DeepVariant v1.4 is now available in the Clara Parabricks collection on NGC.

Deep learning approaches to genomics and precision medicine are a big focus for Clara Parabricks and are highlighted in the GTC 2022 NVIDIA and Broad Institute announcement on further developments on the Genome Analysis Toolkit (GATK) and large language models for DNA and RNA.

Get started with Clara Parabricks v4.0 

To start using Clara Parabricks for free, visit the Clara Parabricks collection on NGC. You can also request a free Clara Parabricks NVIDIA LaunchPad lab to get hands-on experience running accelerated industry-standard tools for germline and somatic analysis for an exome and whole genome dataset.

For more information about Clara Parabricks, including technical details on the tools available, see the Clara Parabricks documentation.

New NVIDIA DGX System Software and Infrastructure Solutions Supercharge Enterprise AI

At GTC today, NVIDIA unveiled a number of updates to its DGX portfolio to power new breakthroughs in enterprise AI development. NVIDIA DGX H100 systems are now available for order. These infrastructure building blocks support NVIDIA’s full-stack enterprise AI solutions. With 32 petaflops of performance at FP8 precision, NVIDIA DGX H100 delivers a leap in performance.

No Hang Ups With Hangul: KT Trains Smart Speakers, Customer Call Centers With NVIDIA AI

South Korea’s most popular AI voice assistant, GiGA Genie, converses with 8 million people each day. The AI-powered speaker from telecom company KT can control TVs, offer real-time traffic updates and complete a slew of other home-assistance tasks based on voice commands. It has mastered its conversational skills in the highly complex Korean language thanks to NVIDIA AI.
