Categories
Misc

Building a Speech-Enabled AI Virtual Assistant with NVIDIA Riva on Amazon EC2

Learn how to get started with NVIDIA Riva, a fully accelerated speech AI SDK, on AWS EC2 using Jupyter Notebooks and a sample virtual assistant application.

Figure illustrating a screenshot of an NVIDIA Riva sample virtual assistant application running on a GPU-powered AWS EC2 instance through a web browser.

Speech AI can assist human agents in contact centers, power virtual assistants and digital avatars, generate live captioning in video conferencing, and much more. Under the hood, these voice-based technologies orchestrate a network of automatic speech recognition (ASR) and text-to-speech (TTS) pipelines to deliver intelligent, real-time responses.

Building these real-time speech AI applications from scratch is no easy task. From setting up GPU-optimized development environments to running inference with customized, large transformer-based language models in under 300 ms, speech AI pipelines demand dedicated time, expertise, and investment.

In this post, we walk through how you can simplify the speech AI development process by using NVIDIA Riva to run GPU-optimized applications. Even with no prior Riva experience, you learn how to quickly configure a GPU-optimized development environment and run NVIDIA Riva ASR and TTS examples in Jupyter notebooks. By the end, you will have the virtual assistant demo running in your web browser, powered by NVIDIA GPUs on Amazon EC2.

Along with the step-by-step guide, we also provide you with resources to help expand your knowledge so you can go on to build and deploy powerful speech AI applications with NVIDIA support.

But first, here is how the Riva SDK works. 

How does Riva simplify speech AI?

Riva is a GPU-accelerated SDK for building real-time speech AI applications. It helps you quickly build intelligent speech applications, such as AI virtual assistants. 

Using powerful optimizations from NVIDIA TensorRT and NVIDIA Triton Inference Server, Riva lets you build and deploy customizable, pretrained, out-of-the-box models that deliver interactive responses in less than 300 ms, with 7x higher throughput on NVIDIA GPUs compared with CPUs.

The state-of-the-art Riva speech models have been trained for millions of GPU-compute hours on thousands of hours of audio data. When you deploy Riva on your platform, these models are ready for immediate use.

Riva can also be used to develop and deploy speech AI applications on NVIDIA GPUs anywhere: on premises, embedded devices, any public cloud, or the edge.

Here are the steps to follow for getting started with Riva on AWS.

Running Riva ASR and TTS examples to launch a virtual assistant

If AWS is where you develop and deploy workloads, you already have access to all the requirements needed for building speech AI applications. With a broad portfolio of NVIDIA GPU-powered Amazon EC2 instances combined with GPU-optimized software like Riva, you can accelerate every step of the speech AI pipeline.

There are four simple steps to get started with Riva on an NVIDIA GPU-powered Amazon EC2 instance:

  1. Launch an Amazon EC2 instance with the NVIDIA GPU-optimized AMI.
  2. Pull the Riva container from the NGC catalog.
  3. Run the Riva ASR and TTS Hello World examples with Jupyter notebooks.
  4. Launch an intelligent virtual assistant application.

To follow along, make sure that you have an AWS account with access to NVIDIA GPU-powered instances (for example, Amazon EC2 G and P instance types such as P4d instances for NVIDIA A100 GPUs and G4dn instances for NVIDIA T4 GPUs).

Step 1: Launch an EC2 instance with the NVIDIA GPU-optimized AMI

In this post, you use the NVIDIA GPU-optimized AMI available on the AWS Marketplace. It is preconfigured with NVIDIA GPU drivers, CUDA, Docker toolkit, runtime, and other dependencies. It also provides a standardized stack for you to build speech AI applications. This AMI is validated and updated quarterly by NVIDIA with the newest drivers, security patches, and support for the latest GPUs to maximize performance.

Choose an instance type

In the AWS Management Console, launch an instance from the AWS Marketplace, using the NVIDIA GPU-Optimized AMI.

Instance types available may vary by region. For more information about choosing an appropriate instance type for your use case, see Choosing the right GPU for deep learning on AWS.

We recommend NVIDIA A100 GPUs (P4d instances) for the best performance at scale, but for this guide, a single NVIDIA A10G GPU instance (g5.xlarge) powered by the NVIDIA Ampere architecture is sufficient.

For a greater number of pre- or postprocessing steps, consider larger sizes with the same single GPU, more vCPUs, and higher system memory, or consider the P4d instances that take advantage of 8x NVIDIA A100 GPUs.

Configure the instance

To connect to the EC2 instance securely, create a key pair.

  • For Key pair type, select RSA.
  • For Private key file format, select .ppk for use with PuTTY or .pem for OpenSSH, depending on how you plan to connect to the instance.

After the key pair is created, a file is downloaded to your local machine. You need this file in future steps for connecting to the EC2 instance. 

Network settings enable you to control the traffic into and out of your instance. Select Create security group and check the rule Allow SSH traffic from: Anywhere. At any point in the future, this can be customized based on your security preferences.

Finally, configure the storage. For this example, 100 GiB on a general purpose SSD should be plenty.

Now you are ready to launch the instance. If successful, your screen should look like Figure 1.

Screenshot of AWS environment after configuring all required settings to launch an NVIDIA GPU-powered EC2 instance.
Figure 1. Success message after launching an instance

Connect to the instance

After a few minutes, under Instances in the sidebar, you see your running instance with a public IPv4 DNS. Keep this address handy, as it is used to connect to the instance over SSH. The address changes every time you stop and start your EC2 instance.

There are a number of ways to connect to your EC2 instance. This post uses the PuTTY SSH client to open a session and tunnel into the instance.

Once connected, you can begin working with your NVIDIA GPU-powered Amazon EC2 instance.

Screenshot of the PuTTY terminal window after a user successfully accesses the NVIDIA GPU-Optimized AMI on an EC2 instance.
Figure 2. Starting screen of the NVIDIA GPU-Optimized AMI on an EC2 instance

Log in with the username ubuntu, and verify that the NVIDIA GPUs are visible:

nvidia-smi

Step 2: Pull the Riva container from the NGC catalog

To access Riva from your terminal, first create a free NGC account. The NGC catalog is a one-stop shop for GPU-optimized software: containers, pretrained AI models, SDKs, Helm charts, and other helpful AI tools. By signing up, you get access to the complete NVIDIA suite of monthly updated, GPU-optimized frameworks and training tools so that you can build your AI application in no time.

After you create an account, generate an NGC API key. Keep your generated API key handy.

Now you can configure the NGC CLI (preinstalled with the NVIDIA GPU-Optimized AMI), by executing the following command:

ngc config set

Enter your NGC API key from earlier, choose ASCII or JSON as the CLI output format, and follow the prompts shown in the Choices section of the command line.

After configuration, on the Riva Skills Quick Start page, copy the download command by choosing Download at the top right side. Run the command in your PuTTY terminal. This initiates the Riva Quick Start resource to download onto your EC2 Linux instance.

Initialize Riva

After the download is completed, you are ready to initialize and start Riva. 

The default settings prepare all of the underlying pretrained models during the Riva start-up process, which can take up to a couple of hours depending on your Internet speed. However, you can modify the config.sh file within the /quickstart directory to choose which subset of models to retrieve from NGC and speed up this process.

Within this file, you can also adjust the storage location and specify which GPU to use if more than one is installed on your system. This post uses the default configuration settings. Use the version (vX.Y.Z) of Riva Quick Start that you downloaded when running the following commands (v2.3.0 is the version used in this post).

cd riva_quickstart_v2.3.0
bash riva_init.sh
bash riva_start.sh

Riva is now running on your virtual machine. To familiarize yourself with Riva, run the Hello World examples next.

Step 3: Run the Riva ASR and TTS Hello World examples 

There are plenty of tutorials available in the /nvidia-riva GitHub repo. The TTS and ASR Python basics notebooks explore how you can use the Riva API.

Before getting started, you must clone the GitHub repo, set up your Python virtual environment, and install Jupyter on your machine by running the following commands in the /riva_quickstart_v2.3.0 directory:

git clone https://github.com/nvidia-riva/tutorials.git

Install the venv module and create a Python virtual environment named venv-riva-tutorials:

sudo apt install python3-venv
python3 -m venv venv-riva-tutorials
source venv-riva-tutorials/bin/activate

With the virtual environment activated, install the Riva API client and Jupyter, and create an IPython kernel, all from the /riva_quickstart_v2.3.0 directory.

pip3 install riva_api-2.3.0-py3-none-any.whl
pip3 install nvidia-riva-client
pip3 install jupyter
ipython kernel install --user --name=venv-riva-tutorials

To run some simple Hello World examples, open the /tutorials directory and launch the Jupyter notebook with the following commands:

cd tutorials
jupyter notebook --generate-config
jupyter notebook --ip=0.0.0.0 --allow-root

The GPU-powered Jupyter notebook is now running and is accessible through the web. Copy and paste one of the URLs shown on your terminal to start interacting with the GitHub tutorials.

Open the tts-python-basics.ipynb and asr-python-basics.ipynb notebooks in your browser and trust them by choosing Not Trusted at the top right of the screen. To choose the venv-riva-tutorials kernel, choose Kernel, Change kernel.

You are now ready to work through the notebook to run your first Hello World Riva API calls using out-of-the-box models (Figure 3).

Screenshot of Jupyter Notebook running two scripts titled ‘How do I use Riva ASR APIs with out-of-the-box models?’ and ‘How do I use Riva TTS APIs with out-of-the-box models?’
Figure 3. Example Hello World Riva API notebooks
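
As a preview of what the tts-python-basics notebook walks through, here is a minimal sketch of a TTS request made with the nvidia-riva-client package. The riva.client names below are assumptions based on that package; the riva_api wheel bundled with Quick Start v2.3.0 uses a different module name, so follow the notebook if these names do not match.

import wave

import riva.client

# Connect to the Riva server started by riva_start.sh (default gRPC port 50051)
auth = riva.client.Auth(uri="localhost:50051")
tts = riva.client.SpeechSynthesisService(auth)

# Request 16-bit PCM audio for a short utterance using an out-of-the-box voice
response = tts.synthesize(
    "Hello world, this is NVIDIA Riva running on Amazon EC2.",
    language_code="en-US",
    sample_rate_hz=44100,
)

# Wrap the raw PCM samples in a WAV container so any player can open them
with wave.open("hello_riva.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)          # 16-bit samples
    out.setframerate(44100)
    out.writeframes(response.audio)

The asr-python-basics notebook covers the reverse direction: transcribing audio through the Riva ASR service.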

Explore the other notebooks to take advantage of the more advanced Riva customization features, such as word boosting, updating vocabulary, TAO fine-tuning, and more. You can exit Jupyter by pressing Ctrl+C while on the PuTTY terminal, and exit the virtual environment with the deactivate command.

Step 4: Launch an intelligent virtual assistant

Now that you are familiar with how Riva operates, you can explore how it can be applied through the intelligent virtual assistant found in the /nvidia-riva/sample-apps GitHub repo.

To launch this application on your browser, run the following command in the /riva_quickstart_v2.3.0 directory:

git clone https://github.com/nvidia-riva/sample-apps.git

Create a Python virtual environment, and install the necessary dependencies: 

python3 -m venv apps-env
. apps-env/bin/activate
pip3 install riva_api-2.3.0-py3-none-any.whl
pip3 install nvidia-riva-client
cd sample-apps/virtual-assistant
pip3 install -U pip
pip3 install -r requirements.txt

Before you run the demo, you must update the config.py file in the virtual-assistant directory. Vim is one text editor that you can use to modify the file:

vim config.py

Screenshot of PuTTY terminal where users can edit the virtual assistant application’s config.py file.
Figure 4. Editing the virtual assistant application’s config.py file

Make sure that the PORT variable in client_config is set to 8888 and the RIVA_SPEECH_API_URL value is set to localhost:50051.

To allow the virtual assistant to access real-time weather data, sign up for the free tier of weatherstack, obtain your API access key, and insert the key value under WEATHERSTACK ACCESS KEY in riva_config.
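
For reference, the relevant entries in config.py look roughly like the following sketch. The exact dictionary layout and key names can vary between sample-app versions, and the weatherstack key shown is a placeholder, so adjust to match the file in your checkout.

# Sketch of the settings discussed in this post; the real config.py contains more options.
client_config = {
    "PORT": 8888,  # port the virtual assistant web client listens on
}

riva_config = {
    "RIVA_SPEECH_API_URL": "localhost:50051",  # gRPC endpoint of the local Riva server
    "WEATHERSTACK_ACCESS_KEY": "<your-weatherstack-api-key>",  # placeholder: paste your key here
}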

Now you are ready to deploy the application! 

Deploy the assistant

Run python3 main.py and go to the following URL: https://localhost:8888/rivaWeather. This webpage opens the weather chatbot.

Screenshot of Riva’s sample virtual assistant application.
Figure 5. NVIDIA Riva-powered intelligent virtual assistant

Congratulations! 

You’ve launched an NVIDIA GPU-powered Amazon EC2 instance with the NVIDIA GPU-Optimized AMI, downloaded Riva from NGC, executed basic Riva API commands for ASR and TTS services, and launched an intelligent virtual assistant!

You can stop Riva at any time by executing the following command in the riva_quickstart_v2.3.0 directory:

bash riva_stop.sh

Resources for exploring speech AI tools

You have access to several resources designed to help you learn how to build and deploy speech AI applications:

  • The /nvidia-riva/tutorials GitHub repo contains beginner to advanced scripts to walk you through ASR and TTS augmentations such as ASR word boosting and adjusting TTS pitch, rate, and pronunciation settings. 
  • To build and customize your speech AI pipeline, you can use the NVIDIA TAO Toolkit for low-code AI model development, or the NVIDIA NeMo framework if you want more visibility under the hood when fine-tuning the fully customizable Riva ASR and TTS pipelines.
  • Finally, to deploy speech AI applications at scale, you can deploy Riva on Amazon EKS and set up auto-scaling features with Kubernetes.

Interested in learning about how customers deploy Riva in production? Minerva CQ, an AI platform for agent assist in contact centers, has deployed Riva on AWS alongside their own natural language and intent models to deliver a unique and elevated customer support experience in the electric mobility market. 

“Using NVIDIA Riva to process the automatic speech recognition (ASR) on the Minerva CQ platform has been great. Performance benchmarks are superb, and the SDK is easy to use and highly customizable to our needs,” said Cosimo Spera, CEO of Minerva CQ.

Explore other real-world speech AI use cases in Riva customer stories and see how your company can get started with Riva Enterprise.

Categories
Misc

NVIDIA Studio Laptops Offer Students AI, Creative Capabilities That Are Best in… Class

Selecting the right laptop is a lot like trying to pick the right major. Both can be challenging tasks where choosing wrongly costs countless hours. But pick the right one, and graduation is just around the corner. The tips below can help the next generation of artists select the ideal NVIDIA Studio laptop to maximize performance for the critical workload demands of their unique creative fields — all within budget.

Categories
Misc

Welcome Back, Commander: ‘Command & Conquer Remastered Collection’ Joins GeForce NOW

Take a trip down memory lane this week with an instantly recognizable classic, Command & Conquer Remastered Collection, joining the nearly 20 Electronic Arts games streaming from the GeForce NOW library. Speaking of remastered, GeForce NOW members can enhance their gameplay further with improved resolution scaling in the 2.0.43 app update.

Categories
Misc

How’s That? Startup Ups Game for Cricket, Football and More With Vision AI

Sports produce a slew of data. In a game of cricket, for example, each play generates millions of video-frame data points for a sports analyst to scrutinize, according to Masoumeh Izadi, managing director of deep-tech startup TVConal. The Singapore-based company uses NVIDIA AI and computer vision to power its sports video analytics platform.

Categories
Misc

Just Released: HPC SDK v22.7 with AWS Graviton3 C7g Support

Four panels vertically laid out, each showing a simulation with a black background
Enhancements, fixes, and new support for AWS Graviton3 C7g instances, Arm SVE, Rocky Linux OS, OpenMP Tools visibility in Nsight Developer Tools, and more.

Categories
Misc

Enhanced Image Analysis with Multidimensional Image Processing

Often, two dimensions are insufficient for analyzing image data. cuCIM is an open-source, accelerated computer-vision and image-processing software library for multidimensional images.

Image data can generally be described through two dimensions (rows and columns), with a possible additional dimension for the colors red, green, blue (RGB). However, sometimes further dimensions are required for more accurate and detailed image analysis in specific applications and domains.

For example, you may want to study a three-dimensional (3D) volume, measuring the distance between two parts or modeling how that 3D volume changes over time (the fourth dimension). In these instances, you need more than two dimensions to make sense of what you are seeing.

Multidimensional image processing, or n-dimensional image processing, is the broad term for analyzing, extracting, and enhancing useful information from image data with two or more dimensions. It is particularly helpful and needed for medical imaging, remote sensing, material science, and microscopy applications.

Some methods in these applications may involve data from more channels than traditional grayscale, RGB, or red, green, blue, alpha (RGBA) images. N-dimensional image processing helps you study and make informed decisions using devices enabled with identification, filtering, and segmentation capabilities. 

Multidimensional image processing extends traditional two-dimensional filtering functions to scientific applications with more dimensions. Within medical imaging specifically, computed tomography (CT) and magnetic resonance imaging (MRI) scans require multidimensional image processing to form images of the body and its functions. For example, multidimensional image processing is used in medical imaging to detect cancer or estimate tumor size (Figure 1).

Multidimensional, high-resolution images of tissue.
Figure 1. Tissue can be rendered and examined quickly in multidimensional digital pathology use cases

Multidimensional image processing developer challenges

Outside of identifying, acquiring, and storing the image data itself, working with multidimensional image data comes with its own set of challenges.

First, multidimensional images are larger in size than their 2D counterparts and typically of high resolution, so loading them to memory and accessing them is time-consuming.

Second, processing each additional dimension of image data requires additional time and processing power. Analyzing more dimensions enlarges the scope of consideration.

Third, computer-vision and image-processing algorithms, including the low-level operations and primitives, take longer to analyze each additional dimension. The complexity of multidimensional filters, gradients, and histograms grows with each added dimension.

Finally, when the data is manipulated, dataset visualization for multidimensional image processing is further complicated by the additional dimensions under consideration and the quality to which the data must be rendered. In biomedical imaging, the level of detail required can make the difference in identifying cancerous cells and damaged organ tissue.

Multidimensional input/output

If you’re a data scientist or researcher working in multidimensional image processing, you need software that can make data loading and handling for large image files efficient. Popular multidimensional file formats include the following:

  • NumPy binary format (.npy)
  • Tag Image File Format (TIFF)
  • TFRecord (.tfrecord)
  • Zarr
  • Variants of the formats listed above

Because every pixel counts, you have to process image data accurately with all the processing power available. Graphics processing unit (GPU) hardware gives you the processing power and efficiency needed to handle and balance the workload of analyzing complex, multidimensional image data in real time.

cuCIM

Compute Unified Device Architecture Clara IMage (cuCIM) is an open-source, accelerated, computer-vision and image-processing software library that uses the processing power of GPUs to address the needs and pain points of developers working with multidimensional images.

Data scientists and researchers need software that is fast, easy to use, and reliable for an increasing workload. While specifically tuned for biomedical applications, cuCIM can be used for geospatial, material and life sciences, and remote sensing use cases.

cuCIM offers 200+ computer-vision and image-processing functions for color conversion, exposure, feature extraction, measuring, segmentation, restoration, and transforms. 

cuCIM is capable, fast image-processing software that requires minimal changes to your existing pipeline. It equips you with enhanced digital image-processing capabilities that can be integrated into existing workflows:

You can integrate using either a C++ or Python application programming interface (API) that matches OpenSlide for I/O and scikit-image for processing in Python. 

The cuCIM Python bindings offer many commonly used computer-vision and image-processing functions that are easy to integrate into developer workflows.

You don’t have to learn a new interface or programming language to use cuCIM. In most instances, only one line of code is added for transferring images to the GPU. The cuCIM coding structure is nearly identical to that used for the CPU, so there’s little change needed to take advantage of the GPU-enabled capabilities.
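
For example, a scikit-image-style call usually needs only one extra line to move the array to the GPU. The following is a minimal sketch, assuming CuPy and cuCIM are installed; the gaussian filter mirrors skimage.filters.gaussian.

import cupy as cp
import numpy as np
from cucim.skimage import filters

# A synthetic 3D volume standing in for a CT or MRI stack (z, y, x)
volume_cpu = np.random.random((64, 512, 512)).astype(np.float32)

# The one extra line: transfer the array to GPU memory
volume_gpu = cp.asarray(volume_cpu)

# Same call signature as skimage.filters.gaussian, but executed on the GPU
smoothed_gpu = filters.gaussian(volume_gpu, sigma=2)

# Copy the result back to host memory when needed
smoothed_cpu = cp.asnumpy(smoothed_gpu)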

Because cuCIM is also enabled for GPUDirect Storage (GDS), you can efficiently transfer and write data directly from storage to the GPU without making an intermediate copy in host (CPU) memory. That saves time on I/O tasks.

With its quick set-up, cuCIM provides the benefit of GPU-accelerated image processing and efficient I/O with minimal developer effort and with no low-level compute unified device architecture (CUDA) programming required.

Free downloads and resources

cuCIM can be downloaded for free through Conda or PyPi. For more information, see the cuCIM developer page. You’ll learn about developer challenges, primitives, and use cases and get links to references and resources.

Categories
Offsites

Look and Talk: Natural Conversations with Google Assistant

In natural conversations, we don’t say people’s names every time we speak to each other. Instead, we rely on contextual signaling mechanisms to initiate conversations, and eye contact is often all it takes. Google Assistant, now available in more than 95 countries and over 29 languages, has primarily relied on a hotword mechanism (“Hey Google” or “OK Google”) to help more than 700 million people every month get things done across Assistant devices. As virtual assistants become an integral part of our everyday lives, we’re developing ways to initiate conversations more naturally.

At Google I/O 2022, we announced Look and Talk, a major development in our journey to create natural and intuitive ways to interact with Google Assistant-powered home devices. This is the first multimodal, on-device Assistant feature that simultaneously analyzes audio, video, and text to determine when you are speaking to your Nest Hub Max. Using eight machine learning models together, the algorithm can differentiate intentional interactions from passing glances in order to accurately identify a user’s intent to engage with Assistant. Once within 5ft of the device, the user may simply look at the screen and talk to start interacting with the Assistant.

We developed Look and Talk in alignment with our AI Principles. It meets our strict audio and video processing requirements, and like our other camera sensing features, video never leaves the device. You can always stop, review and delete your Assistant activity at myactivity.google.com. These added layers of protection enable Look and Talk to work just for those who turn it on, while keeping your data safe.

Google Assistant relies on a number of signals to accurately determine when the user is speaking to it. On the right is a list of signals used with indicators showing when each signal is triggered based on the user’s proximity to the device and gaze direction.

Modeling Challenges
The journey of this feature began as a technical prototype built on top of models developed for academic research. Deployment at scale, however, required solving real-world challenges unique to this feature. It had to:

  1. Support a range of demographic characteristics (e.g., age, skin tones).
  2. Adapt to the ambient diversity of the real world, including challenging lighting (e.g., backlighting, shadow patterns) and acoustic conditions (e.g., reverberation, background noise).
  3. Deal with unusual camera perspectives, since smart displays are commonly used as countertop devices and look up at the user(s), unlike the frontal faces typically used in research datasets to train models.
  4. Run in real-time to ensure timely responses while processing video on-device.

The evolution of the algorithm involved experiments with approaches ranging from domain adaptation and personalization to domain-specific dataset development, field-testing and feedback, and repeated tuning of the overall algorithm.

Technology Overview
A Look and Talk interaction has three phases. In the first phase, Assistant uses visual signals to detect when a user is demonstrating an intent to engage with it and then “wakes up” to listen to their utterance. The second phase is designed to further validate and understand the user’s intent using visual and acoustic signals. If any signal in the first or second processing phases indicates that it isn’t an Assistant query, Assistant returns to standby mode. These two phases are the core Look and Talk functionality, and are discussed below. The third phase of query fulfillment is typical query flow, and is beyond the scope of this blog.

Phase One: Engaging with Assistant
The first phase of Look and Talk is designed to assess whether an enrolled user is intentionally engaging with Assistant. Look and Talk uses face detection to identify the user’s presence, filters for proximity using the detected face box size to infer distance, and then uses the existing Face Match system to determine whether they are enrolled Look and Talk users.

For an enrolled user within range, a custom eye gaze model determines whether they are looking at the device. This model estimates both the gaze angle and a binary gaze-on-camera confidence from image frames using a multi-tower convolutional neural network architecture, with one tower processing the whole face and another processing patches around the eyes. Since the device screen covers a region underneath the camera that would be natural for a user to look at, we map the gaze angle and binary gaze-on-camera prediction to the device screen area. To ensure that the final prediction is resilient to spurious individual predictions, involuntary eye blinks, and saccades, we apply a smoothing function to the individual frame-based predictions.

Eye-gaze prediction and post-processing overview.
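
The post does not describe the exact smoothing function used. Purely as an illustration, an exponential moving average over the per-frame gaze-on-camera confidences is one simple way to keep a brief blink or saccade from flipping the decision.

def smooth_gaze_confidence(frame_scores, alpha=0.3):
    """Illustrative exponential moving average over per-frame gaze-on-camera scores."""
    smoothed, running = [], None
    for score in frame_scores:
        running = score if running is None else alpha * score + (1 - alpha) * running
        smoothed.append(running)
    return smoothed

# A one-frame blink (0.1) barely dents the smoothed signal
print(smooth_gaze_confidence([0.90, 0.92, 0.88, 0.10, 0.91, 0.93]))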

We enforce stricter attention requirements before informing users that the system is ready for interaction to minimize false triggers, e.g., when a passing user briefly glances at the device. Once the user looking at the device starts speaking, we relax the attention requirement, allowing the user to naturally shift their gaze.

The final signal necessary in this processing phase checks that the Face Matched user is the active speaker. This is provided by a multimodal active speaker detection model that takes as input both video of the user’s face and the audio containing speech, and predicts whether they are speaking. A number of augmentation techniques (including RandAugment, SpecAugment, and augmenting with AudioSet sounds) help improve prediction quality for the in-home domain, boosting end-feature performance by over 10%. The final deployed model is a quantized, hardware-accelerated TFLite model, which uses five frames of context for the visual input and 0.5 seconds for the audio input.

Active speaker detection model overview: The two-tower audiovisual model provides the “speaking” probability prediction for the face. The visual network auxiliary prediction pushes the visual network to be as good as possible on its own, improving the final multimodal prediction.

Phase Two: Assistant Starts Listening
In phase two, the system starts listening to the content of the user’s query, still entirely on-device, to further assess whether the interaction is intended for Assistant using additional signals. First, Look and Talk uses Voice Match to further ensure that the speaker is enrolled and matches the earlier Face Match signal. Then, it runs a state-of-the-art automatic speech recognition model on-device to transcribe the utterance.

The next critical processing step is the intent understanding algorithm, which predicts whether the user’s utterance was intended to be an Assistant query. This has two parts: 1) a model that analyzes the non-lexical information in the audio (i.e., pitch, speed, hesitation sounds) to determine whether the utterance sounds like an Assistant query, and 2) a text analysis model that determines whether the transcript is an Assistant request. Together, these filter out queries not intended for Assistant. It also uses contextual visual signals to determine the likelihood that the interaction was intended for Assistant.

Overview of the semantic filtering approach to determine if a user utterance is a query intended for the Assistant.

Finally, when the intent understanding model determines that the user utterance was likely meant for Assistant, Look and Talk moves into the fulfillment phase where it communicates with the Assistant server to obtain a response to the user’s intent and query text.

Performance, Personalization and UX
Each model that supports Look and Talk was evaluated and improved in isolation and then tested in the end-to-end Look and Talk system. The huge variety of ambient conditions in which Look and Talk operates necessitates the introduction of personalization parameters for algorithm robustness. By using signals obtained during the user’s hotword-based interactions, the system personalizes parameters to individual users to deliver improvements over the generalized global model. This personalization also runs entirely on-device.

Without a predefined hotword as a proxy for user intent, latency was a significant concern for Look and Talk. Often, a strong enough interaction signal does not occur until well after the user has started speaking, which can add hundreds of milliseconds of latency, and existing models for intent understanding add to this since they require complete, not partial, queries. To bridge this gap, Look and Talk completely forgoes streaming audio to the server, with transcription and intent understanding being on-device. The intent understanding models can work off of partial utterances. This results in an end-to-end latency comparable with current hotword-based systems.

The UI experience is based on user research to provide well-balanced visual feedback with high learnability. This is illustrated in the figure below.

Left: The spatial interaction diagram of a user engaging with Look and Talk. Right: The User Interface (UI) experience.

We developed a diverse video dataset with over 3,000 participants to test the feature across demographic subgroups. Modeling improvements driven by diversity in our training data improved performance for all subgroups.

Conclusion
Look and Talk represents a significant step toward making user engagement with Google Assistant as natural as possible. While this is a key milestone in our journey, we hope this will be the first of many improvements to our interaction paradigms that will continue to reimagine the Google Assistant experience responsibly. Our goal is to make getting help feel natural and easy, ultimately saving time so users can focus on what matters most.

Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, UX, and cross-functional contributors. Key contributors from Google Assistant include Alexey Galata, Alice Chuang‎, Barbara Wang, Britanie Hall, Gabriel Leblanc, Gloria McGee, Hideaki Matsui, James Zanoni, Joanna (Qiong) Huang, Krunal Shah, Kavitha Kandappan, Pedro Silva, Tanya Sinha, Tuan Nguyen, Vishal Desai, Will Truong‎, Yixing Cai‎, Yunfan Ye; from Research including Hao Wu, Joseph Roth, Sagar Savla, Sourish Chaudhuri, Susanna Ricco. Thanks to Yuan Yuan and Caroline Pantofaru for their leadership, and everyone on the Nest, Assistant, and Research teams who provided invaluable input toward the development of Look and Talk.

Categories
Misc

Developing NLP Applications to Enhance Clinical Experiences and Accelerate Drug Discovery

Discover tools to translate unstructured data to structured data to help healthcare organizations harness relevant insights and improve healthcare delivery and patient experiences.

Natural language processing (NLP) can be defined as the combination of artificial intelligence (AI), computer science, and computational linguistics to understand human communication and extract meaning from unstructured spoken or written material. 

NLP use cases for healthcare have increased in the last few years to accelerate the development of therapeutics and improve quality of patient care through language understanding and predictive analytics. 

The healthcare industry generates vast amounts of unstructured data, but it is difficult to derive insights without finding ways to structure and represent that data in a computable form. Developers need the tools to translate unstructured data to structured data to help healthcare organizations harness relevant insights and improve healthcare delivery and patient care.

Transformer-based NLP has emerged as a paradigm shift in the performance of text-based healthcare workflows. Because of its versatility, NLP can structure virtually any proprietary or public data to spark insights in healthcare, leading to a wide variety of downstream applications that directly impact patient care or augment and accelerate drug discovery.

NLP for drug discovery

NLP is playing a critical role in accelerating small molecule drug discovery. Prior knowledge on the manufacturability or contraindications of a drug can be extracted from academic publications and proprietary data sets. NLP can also help with clinical trial analysis and accelerate the process of taking a drug to market.

Transformer architectures are popular in NLP, but these tools can also be used to understand the language of chemistry and biology. For example, text-based representations of chemical structure such as SMILES (Simplified Molecular Input Line Entry System) can be understood by transformer-based architectures, leading to incredible capabilities for drug property evaluation and generative chemistry.
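
To make this concrete, a SMILES string is just text, so it can be tokenized and fed to a language model like any sentence. The example below shows a naive character-level tokenization of the SMILES string for aspirin; production models such as MegaMolBART use learned, chemistry-aware tokenizers rather than this simple scheme.

# SMILES for aspirin (acetylsalicylic acid)
aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"

# Naive character-level tokenization; real chemistry models use richer tokenizers
tokens = list(aspirin)
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

print(tokens)
print(token_ids)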

MegaMolBART, a large transformer model developed by AstraZeneca and NVIDIA, is used for a wide range of tasks, including reaction prediction, molecular optimization, and de novo molecule generation.

Transformer-based NLP models are instrumental in understanding and predicting the structure and function of biomolecules like proteins. Much like they do for natural language, transformer-based representations of protein sequences provide powerful embeddings for use in downstream AI tasks, like predicting the final folded state of a protein, understanding the strength of protein-protein or protein-small molecule interactions, or in the design of protein structure provided a biological target.

NLP for clinical trial insights

Once a drug has been developed, patient data plays a large role in the process of taking it to market. Much of the patient data that is collected through the course of care is contained in free text, such as clinical notes from patient visits or procedural results. 

While these data are easily interpretable by a human, combining insights across clinical free text documents requires making information across diverse documents interoperable, such that the health of the patient is represented in a useful way. 

Modern NLP algorithms have accelerated our ability to derive these insights, helping to compare patients with similar symptoms, suggesting treatments, discovering diagnostic near-misses, and providing clinical care navigation and next-best-action prediction. 

NLP to enhance clinical experiences

Many patient interactions with the hospital system are remote, in part due to the growing use of telehealth services that stemmed from COVID-19. Those telehealth visits can be converted into structured information with the help of NLP. 

For physicians and surgeons, speech to text capabilities can turn verbal discussions with patients and clinical teams into text, which can then be stored in electronic health records (EHR). Applications include summarizing patient visits, catching near-misses, and predicting optimal treatment regimens.

Removing the burden of clinical documentation for each patient’s visit allows providers to spend more time and energy offering the best care for each patient, and simultaneously reduces physician burnout. NLP can also help hospitals predict patient outcomes such as readmission or sepsis. 

Learn more about NLP in healthcare

View on-demand sessions from the NVIDIA Healthcare and Life Sciences NLP Developer Summit to learn more about the use of NLP in healthcare. Session topics include best practices and insights for applications from speech AI in clinics to drug discovery.

Browse NVIDIA’s collection of biomedical pre-trained language models, as well as highly optimized pipelines for training NLP models on biomedical and clinical text, in the Clara NLP NGC Collection.

Categories
Misc

Closing the Sim2Real Gap with NVIDIA Isaac Sim and NVIDIA Isaac Replicator

NVIDIA Isaac Replicator, built on the Omniverse Replicator SDK, can help you develop a cost-effective and reliable workflow to train computer vision models using synthetic data.

Synthetic data is an important tool in training machine learning models for computer vision applications. Researchers from NVIDIA have introduced a structured domain randomization system within Omniverse Replicator that can help you train and refine models using synthetic data.

Omniverse Replicator is an SDK built on the NVIDIA Omniverse platform that enables you to build custom synthetic data generation tools and workflows. The NVIDIA Isaac Sim development team used Omniverse Replicator SDK to build NVIDIA Isaac Replicator, a robotics-specific synthetic data generation toolkit, exposed within the NVIDIA Isaac Sim app.

We explored using synthetic data generated from synthetic environments for a recent project. Trimble plans to deploy Boston Dynamics’ Spot in a variety of indoor settings and construction environments. But Trimble had to develop a cost-effective and reliable workflow to train ML-based perception models so that Spot could autonomously operate in different indoor settings.

By generating data from a synthetic indoor environment using structured domain randomization within NVIDIA Isaac Replicator, you can train an off-the-shelf object detection model to detect doors in the real indoor environment.

Collage of doors in indoor room and corridors
Figure 1. A few images from the test set of 1,000 real images

Sim2Real domain gap

Given that synthetic data sets are generated using simulation, it is critical to close the gap between the simulation and the real world. This gap is called the domain gap, which can be divided into two pieces:

  • Appearance gap: The pixel-level differences between two images. These differences can be a result of differences in object detail, materials, or, in the case of synthetic data, differences in the capabilities of the rendering system used.
  • Content gap: The difference between the domains. This includes factors like the number of objects in the scene, their diversity of type and placement, and similar contextual information.

A critical tool for overcoming these domain gaps is domain randomization (DR), which increases the size of the domain generated for a synthetic dataset. DR helps ensure that we include the range that best matches reality, including long-tail anomalies. By generating a wider range of data, a neural network may learn to generalize better across the full scope of the problem.

The appearance gap can be further closed with high-fidelity 3D assets and ray tracing or path tracing-based rendering, using physically based materials such as those defined with NVIDIA Material Definition Language (MDL). Validated sensor models and domain randomization of their parameters can also help here.

Creating the synthetic scene

We imported the BIM Model of the indoor scene into NVIDIA Isaac Sim from Trimble SketchUp through the NVIDIA Omniverse SketchUp Connector. However, it looked rough with a significant appearance gap between sim and reality. Video 1 shows Trimble_DR_v1.1.usd.

Video 1. Starting indoor scene after it was imported into NVIDIA Isaac Sim from Trimble SketchUp through SketchUp Omniverse Connector

To close the appearance gap, we used NVIDIA MDL to add some textures and materials to the doors, walls, and ceilings. That made the scene look more realistic.

Video 2. Indoor scene after adding MDL materials

To close the content gap between sim and reality, we added props such as desks, office chairs, computer devices, and cardboard boxes to the scene through Omniverse DeepSearch, an AI-enabled service. Omniverse DeepSearch enables you to use natural language inputs and imagery for searching through the entire catalog of untagged 3D assets, objects, and characters.

These assets are publicly available in NVIDIA Omniverse.

3D models of chairs of different colors and shapes
Figure 2. Omniverse DeepSearch makes it easy to tag and search across 3D models

We also added ceiling lights to the scene. To capture the variety in door orientation, a domain randomization (DR) component was added to randomize the rotation of the doors, and Xform was used to simulate door hinges. This enabled the doors to open, close, or stay ajar at different angles. Video 3 shows the resulting scene with all the props.

Video 3. Indoor scene after adding props

Synthetic data generation

At this point, we started the iterative process of synthetic data generation (SDG). For the object detection model, we used TAO DetectNet V2 with a ResNet-18 backbone for all the experiments.

We held all model hyperparameters at their default values, including the batch size, learning rate, and dataset augmentation config parameters. In synthetic data generation, you iteratively tune the dataset generation parameters instead of the model hyperparameters.

Diagram showing the workflow of synthetic dataset generation where synthetic data from 3D assets is improved by feedback obtained from training a model on it and then evaluating it on real data
Figure 3. The iterative process of synthetic data generation where dataset generation parameters are tuned based on feedback from model evaluation

The Trimble v1.3 scene contained 500 ray-traced images and environment props, with no DR components except for door rotation; the door texture was held fixed. Training on this scene resulted in 5% AP on the real test set (~1,000 images).

As you can see from the model’s predictions on real images, the model was failing to detect real doors adequately because it overfit to the texture of the simulated door. The model’s poor performance on a synthetic validation dataset with differently textured doors confirmed this.

Another observation was that the lighting was held steady and constant in simulation, whereas reality has a variety of lighting conditions.

To prevent overfitting to the texture of the doors, we applied randomization to the door texture, randomizing between 30 different wood-like textures. To vary the lighting, we added DR over the ceiling lights to randomize the movement, intensity, and color of lights. Now that we were randomizing the texture of the door, it was important to give the model some learning signal on what makes a door besides its rectangular shape. We added realistic metallic door handles, kick plates, and door frames to all the doors in the scene. Training on 500 images from this improved scene yielded 57% AP on the real test set.

Video 4. Indoor scene after adding DR components for door rotation, texture, and color and movement of lights

This model was doing better than before, but it was still making false positive predictions on potted plants and QR codes on the walls in real test images. It was also doing poorly on the corridor images where we had multiple doors lined up, and it had a lot of false positives in low-temperature lighting conditions (Figure 5).

To make the model robust to noise like QR codes on walls, we applied DR over the texture of the walls with different textures, including QR codes and other synthetic textures.

We added a few potted plants to the scene. We already had a corridor, so to generate synthetic data from it, two cameras were added along the corridor along with ceiling lights.

We added DR over light temperature, along with intensity, movement, and color, to have the model better generalize in different lighting conditions. We also noticed a variety of floors like shiny granite, carpet, and tiles in real images. To model these, we applied DR to randomize the material of the floor between different kinds of carpet, marble, tiles, and granite materials.

Similarly, we added a DR component to randomize the texture of the ceiling between different colors and different kinds of materials. We also added a DR visibility component to randomly add a few carts in the corridor in simulation, hoping to minimize the model’s false positives over carts in real images.
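
The DR components above were added through the NVIDIA Isaac Sim UI. For readers who prefer scripting, roughly equivalent randomizations can be sketched with the Omniverse Replicator Python API (omni.replicator.core); the prim paths, texture URLs, and attribute names below are placeholders, and the exact calls may vary between Replicator versions.

import omni.replicator.core as rep

# Placeholder prim paths; substitute the paths from your own scene
doors = rep.get.prims(path_pattern="/World/Doors/.*")
floor = rep.get.prims(path_pattern="/World/Floor")
lights = rep.get.prims(path_pattern="/World/CeilingLights/.*")

# Placeholder texture locations on a Nucleus server
wood_textures = [f"omniverse://localhost/Textures/wood_{i}.png" for i in range(30)]
floor_textures = [
    "omniverse://localhost/Textures/carpet.png",
    "omniverse://localhost/Textures/marble.png",
    "omniverse://localhost/Textures/granite.png",
]

with rep.trigger.on_frame(num_frames=4000):
    # Randomize door texture and swing angle (hinge approximated as a Z rotation)
    with doors:
        rep.randomizer.texture(textures=wood_textures)
        rep.modify.pose(rotation=rep.distribution.uniform((0, 0, 0), (0, 0, 90)))
    # Randomize the floor material between carpet-, marble-, and granite-like textures
    with floor:
        rep.randomizer.texture(textures=floor_textures)
    # Randomize light intensity and color; attribute names depend on the light prim type
    with lights:
        rep.modify.attribute("intensity", rep.distribution.uniform(500, 5000))
        rep.modify.attribute("color", rep.distribution.uniform((0.7, 0.7, 0.7), (1.0, 1.0, 1.0)))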

Training only on the synthetic dataset of 4,000 images generated from this scene achieved around 87% AP on the real test set, a solid Sim2Real result.

Video 5. Final scene with more DR components

Figure 6 shows a few inferences on real images from the final model.

Synthetic data generation in Omniverse

Using Omniverse connectors, MDL, and easy-to-use tools like DeepSearch, it’s possible for ML engineers and data scientists with no background in 3D design to create synthetic scenes.

NVIDIA Isaac Replicator makes it easy to bridge the Sim2Real gap by generating synthetic data with structured domain randomization. This way, Omniverse makes synthetic data generation accessible for you to bootstrap perception-based ML projects.

The approach presented here should scale: it should be possible to increase the number of objects of interest and easily generate new synthetic data whenever you want to detect additional objects.

For more information, see the following resources:

If you have any questions, post them in the Omniverse Synthetic Data, Omniverse Code, or Omniverse Isaac Sim forums.

Categories
Misc

Applying Inference over Specific Frame Regions with NVIDIA DeepStream

This tutorial shares how to apply inference over a predefined area of the incoming video frames.

Detecting objects in high-resolution input is a well-known problem in computer vision. When a certain area of the frame is of interest, inference over the complete frame is unnecessary. There are two ways to solve this issue:

  • Use a large model with a high input resolution.
  • Divide the large image into tiles and apply the smaller model to each one.

In many ways, the first approach is difficult. Training a model with large input often requires larger backbones, making the overall model bulkier. Training or deploying such a model also requires more computing resources. Larger models are deemed unfit for edge deployment on smaller devices.

The second method, dividing the entire image into tiles and applying smaller models to each tile, has obvious advantages. Smaller models are used, so less computation power is required for training and inference. No retraining is required to apply the model to high-resolution input. Smaller models are also considered edge-deployment-friendly.

In this post, we discuss how NVIDIA DeepStream can help apply smaller models to high-resolution input to detect a specific frame region.

Overview of video surveillance systems

Video surveillance systems are used to solve various problems, such as the identification of pedestrians, vehicles, and cars. Nowadays, 4K and 8K cameras are used to capture details of the scene. Aerial photography, used by the military for various purposes, also covers a large area.

With the increase in resolution, the number of pixels increases exponentially. It takes an enormous amount of computing power to process that many pixels, especially with a deep neural network.

Based on the input dimensions selected during model building, deep neural networks operate on a fixed-shape input. This fixed-size input is also known as the receptive field of the model. Typically, receptive fields vary from 256×256 to 1280×1280 and beyond in detection and segmentation networks.

You might find that the region of interest is a small area and not the entire frame. In this case, applying detection over the entire frame is an unnecessary use of compute resources. The DeepStream NvDsPreprocess plugin enables you to focus compute on a specific area of the frame.

DeepStream NvDsPreprocessing plugin

When tiling is applied to images or frames, especially over video feeds, you need an additional element in the inference pipeline. That element must provide a tiling mechanism that can be configured per stream, support batched inference over the tiles, and combine inference results from multiple tiles onto single frames.

Interestingly, all these functionalities are provided in DeepStream with the Gst-NvDsPreprocess customizable plugin. It provides a custom library interface for preprocessing input streams. Each stream can have its own preprocessing requirements.

The default plugin implementation provides the following functionality:

  • Streams with predefined regions of interest (ROIs) or tiles are scaled and format converted as per the network requirements for inference. Per-stream ROIs are specified in the config file.
  • It prepares a raw tensor from the scaled and converted ROIs and passes it to the downstream plugins through user metadata. Downstream plugins can access this tensor for inference.

DeepStream pipeline with tiling

Modifying the existing code to support tiling is next.

Using the NvdsPreprocessing plugin

Define the preprocess element inside the pipeline:

preprocess = Gst.ElementFactory.make("nvdspreprocess", "preprocess-plugin")

NvDsPreprocess requires a config file as input:

preprocess.set_property("config-file", "config_preprocess.txt")

Add the preprocess element to the pipeline:

pipeline.add(preprocess)

Link the element to the pipeline:

streammux.link(preprocess)
preprocess.link(pgie)

Let the NvdsPreprocess plugin do preprocessing

The inference is done with the NvDsInfer plugin, which has frame preprocessing capabilities.

When you use the NvdsPreprocess plugin before NvDsInfer, you want the preprocessing (scaling or format conversion) to be done by NvdsPreprocess and not NvDsInfer. To do this, set the input-tensor-meta property of NvDsInfer to true. This lets NvdsPreprocess do the preprocessing and makes NvDsInfer use the preprocessed input tensors attached as metadata instead of preprocessing internally.

The following steps are required to incorporate Gst-nvdspreprocess functionality into your existing pipeline.

Define and add the nvdspreprocess plugin to the pipeline:

preprocess = Gst.ElementFactory.make("nvdspreprocess", "preprocess-plugin")
pipeline.add(preprocess)

Set the input-tensor-meta property of NvDsInfer to true:

pgie.set_property("input-tensor-meta", True)

Define the config file property for the nvdspreprocess plugin:

preprocess.set_property("config-file", "config_preprocess.txt")

Link the preprocess plugin before the primary inference engine (pgie):

streammux.link(preprocess)
preprocess.link(pgie)

Creating the config file

The Gst-nvdspreprocess configuration file uses a key file format. For more information, see the config_preprocess.txt in the Python and C source code.

  • The [property] group configures the general behavior of the plugin.
  • The [group-<id>] group configures ROIs, tiles, and full-frames for a group of streams with the given src-id values and a custom-input-transformation-function from the custom library.
  • The [user-configs] group configures parameters required by the custom library, which are passed on to the custom library as key-value pairs; the custom library must parse the values accordingly.

The minimum required config_preprocess.txt looks like the following code example:

[property]
enable=1
target-unique-ids=1
    # 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=0
processing-width=960
processing-height=544
scaling-buf-pool-size=6
tensor-buf-pool-size=6
    # tensor shape based on network-input-order
network-input-shape=12;3;544;960
    # 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
    # 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=input_1
    # 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE 3=NVBUF_MEM_CUDA_UNIFIED
scaling-pool-memory-type=0
    # 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU 2=NvBufSurfTransformCompute_VIC
scaling-pool-compute-hw=0
    # Scaling Interpolation method
    # 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
    # 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
    # 6=NvBufSurfTransformInter_Default
scaling-filter=0
custom-lib-path=/opt/nvidia/deepstream/deepstream/lib/gst-plugins/libcustom2d_preprocess.so
custom-tensor-preparation-function=CustomTensorPreparation

[user-configs]
pixel-normalization-factor=0.003921568
#mean-file=
#offsets=

[group-0]
src-ids=0;1;2;3
custom-input-transformation-function=CustomAsyncTransformation
process-on-roi=1
roi-params-src-0=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-1=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-2=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-3=0;540;900;500;960;0;900;500;0;0;540;900;

The processing-width and processing-height values refer to the width and height of the slice taken from the full frame.

The network-input-shape parameter in the current config file is configured to run at most 12 ROIs. To increase the ROI count, increase the first dimension of network-input-shape=12;3;544;960 to the required number.

In the current config file, config_preprocess.txt, there are three ROIs per source and a total of 12 ROIs for all four sources. The total number of ROIs from all the sources must not exceed the first dimension specified in the network-input-shape parameter.

roi-params-src-<id> indicates the ROI coordinates for source <id>. If process-on-roi is enabled, specify left;top;width;height for each ROI. Gst-nvdspreprocess does not combine the detection and count of objects in overlapping tiles.
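
Because the flattened left;top;width;height lists are easy to get wrong by hand, a small helper (hypothetical, not part of DeepStream) can generate the roi-params-src-<id> lines from tuples:

def roi_params_line(src_id, rois):
    """Build a roi-params-src-<id> config line from (left, top, width, height) tuples."""
    flat = ";".join(f"{left};{top};{width};{height}" for left, top, width, height in rois)
    return f"roi-params-src-{src_id}={flat};"

# Three ROIs per source, matching the example config above
rois = [(0, 540, 900, 500), (960, 0, 900, 500), (0, 0, 540, 900)]
for src_id in range(4):
    print(roi_params_line(src_id, rois))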

Code

The C code is downloadable from /opt/nvidia/deepstream/deepstream-6.0/source/app/sample_app/deepstream-preprocess-test.

The Python code is downloadable from the NVIDIA-AI-IOT/deepstream_python_apps GitHub repo.

Results

Figure 1 shows that you can specify one or more tiles. An object within the tile is detected and detection is not applied to the remaining area of the frame.

Image showing the application of NVIDIA DeepStream Gst-nvdspreprocess plugin.
Figure 1. Detection applied over tiles using the Gst-nvdspreprocess plugin. The green box shows the tile boundary, and the red boxes show detected objects within the tile.

Gst-nvdspreprocess enables applying inference on a specific portion of the video (tile or region of interest). With Gst-nvdspreprocess, you can specify one or more tiles on a single frame.

We compared performance when YOLOv4 is applied over the entire frame and when it is applied over tiles. Performance metrics are collected by increasing the number of streams up to the point where either the decoder or compute saturates and adding streams shows no further gain in performance.

The video resolution of 1080p was used for performance benchmarks over the NVIDIA V100 GPU. Consider the tradeoff between performance and the number of tiles, as placing too many tiles increases the compute requirement.

Tiling with NvDsPreprocess helps apply inference selectively over the portion of the video where it is required. In Figure 1, for instance, inference can be applied only to the sidewalk rather than the entire frame.

Gst-nvdsanalytics performs analytics on metadata attached by nvinfer (primary detector) and nvtracker. Gst-nvdsanalytics can be applied to the tiles for ROI Filtering, Overcrowding Detection, Direction Detection, and Line Crossing.