Categories
Misc

Enhanced Image Analysis with Multidimensional Image Processing

Often, two dimensions are insufficient for analyzing image data. cuCIM is an open-source, accelerated computer vision and image-processing software library for multidimensional images.

Image data can generally be described through two dimensions (rows and columns), with a possible additional dimension for the colors red, green, blue (RGB). However, sometimes further dimensions are required for more accurate and detailed image analysis in specific applications and domains.

For example, you may want to study a three-dimensional (3D) volume, measuring the distance between two parts or modeling how that 3D volume changes over time (the fourth dimension). In these instances, you need more than two dimensions to make sense of what you are seeing.

Multidimensional image processing, or n-dimensional image processing, is the broad term for analyzing, extracting, and enhancing useful information from image data with two or more dimensions. It is particularly helpful and needed for medical imaging, remote sensing, material science, and microscopy applications.

Some methods in these applications may involve data from more channels than traditional grayscale, RGB, or red, green, blue, alpha (RGBA) images. N-dimensional image processing helps you study and make informed decisions using devices enabled with identification, filtering, and segmentation capabilities. 

Multidimensional image processing gives you the flexibility to apply traditional two-dimensional filtering operations to data with more dimensions in scientific applications. Within medical imaging specifically, computed tomography (CT) and magnetic resonance imaging (MRI) scans require multidimensional image processing to form images of the body and its functions. For example, multidimensional image processing is used in medical imaging to detect cancer or estimate tumor size (Figure 1). 

Multidimensional, high-resolution images of tissue.
Figure 1. Tissue can be rendered and examined quickly in multidimensional digital pathology use cases

Multidimensional image processing developer challenges

Outside of identifying, acquiring, and storing the image data itself, working with multidimensional image data comes with its own set of challenges.

First, multidimensional images are larger than their 2D counterparts and typically of high resolution, so loading them into memory and accessing them is time-consuming.

Second, processing each additional dimension of image data requires additional time and processing power, because each dimension enlarges the scope of what must be analyzed.

Third, computer-vision and image-processing algorithms, including their low-level operations and primitives, take longer to analyze each additional dimension. The complexity of multidimensional filters, gradients, and histograms grows with each added dimension.

Finally, once the data is manipulated, dataset visualization for multidimensional image processing is further complicated by the additional dimensions under consideration and the quality to which the data must be rendered. In biomedical imaging, the level of detail required can determine whether cancerous cells or damaged organ tissue are identified.

Multidimensional input/output

If you’re a data scientist or researcher working in multidimensional image processing, you need software that can make data loading and handling for large image files efficient. Popular multidimensional file formats include the following:

  • NumPy binary format (.npy)
  • Tag Image File Format (TIFF)
  • TFRecord (.tfrecord)
  • Zarr
  • Variants of the formats listed above

Because every pixel counts, you have to process image data accurately with all the processing power available. Graphics processing unit (GPU) hardware gives you the processing power and efficiency needed to handle and balance the workload of analyzing complex, multidimensional image data in real time.

cuCIM

Compute Unified Device Architecture Clara IMage (cuCIM) is an open-source, accelerated, computer-vision and image-processing software library that uses the processing power of GPUs to address the needs and pain points of developers working with multidimensional images.

Data scientists and researchers need software that is fast, easy to use, and reliable for an increasing workload. While specifically tuned for biomedical applications, cuCIM can be used for geospatial, material and life sciences, and remote sensing use cases.

cuCIM offers 200+ computer-vision and image-processing functions for color conversion, exposure, feature extraction, measuring, segmentation, restoration, and transforms. 

cuCIM is capable, fast image-processing software that requires minimal changes to your existing pipeline and equips you with enhanced digital image-processing capabilities:

You can integrate using either a C++ or Python application programming interface (API) that matches OpenSlide for I/O and scikit-image for processing in Python. 

The cuCIM Python bindings offer many commonly used computer-vision and image-processing functions that are easy to integrate into the developer workflow. 

You don’t have to learn a new interface or programming language to use cuCIM. In most instances, only one line of code is added for transferring images to the GPU. The cuCIM coding structure is nearly identical to that used for the CPU, so there’s little change needed to take advantage of the GPU-enabled capabilities.
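For instance, a minimal sketch of that pattern might look like the following, assuming cuCIM's scikit-image-compatible namespace (cucim.skimage) and CuPy for the GPU transfer; only the cp.asarray call is added relative to a CPU scikit-image workflow.

import cupy as cp
import numpy as np
from cucim.skimage import filters   # scikit-image-compatible API on the GPU

# A CPU image, for example a slice loaded from a TIFF or .npy volume
cpu_image = np.random.random((2048, 2048)).astype(np.float32)

# The one added line: transfer the image to the GPU
gpu_image = cp.asarray(cpu_image)

# Same call signature as skimage.filters.gaussian, executed on the GPU
smoothed = filters.gaussian(gpu_image, sigma=3)

# Copy the result back to host memory when needed
result = cp.asnumpy(smoothed)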

Because cuCIM is also enabled for GPUDirect Storage (GDS), you can efficiently transfer and write data directly from storage to the GPU without making an intermediate copy in host (CPU) memory. That saves time on I/O tasks.

With its quick setup, cuCIM provides the benefit of GPU-accelerated image processing and efficient I/O with minimal developer effort and no low-level CUDA programming required.

Free downloads and resources

cuCIM can be downloaded for free through Conda or PyPI. For more information, see the cuCIM developer page. You’ll learn about developer challenges, primitives, and use cases and get links to references and resources.

Categories
Offsites

Look and Talk: Natural Conversations with Google Assistant

In natural conversations, we don’t say people’s names every time we speak to each other. Instead, we rely on contextual signaling mechanisms to initiate conversations, and eye contact is often all it takes. Google Assistant, now available in more than 95 countries and over 29 languages, has primarily relied on a hotword mechanism (“Hey Google” or “OK Google”) to help more than 700 million people every month get things done across Assistant devices. As virtual assistants become an integral part of our everyday lives, we’re developing ways to initiate conversations more naturally.

At Google I/O 2022, we announced Look and Talk, a major development in our journey to create natural and intuitive ways to interact with Google Assistant-powered home devices. This is the first multimodal, on-device Assistant feature that simultaneously analyzes audio, video, and text to determine when you are speaking to your Nest Hub Max. Using eight machine learning models together, the algorithm can differentiate intentional interactions from passing glances in order to accurately identify a user’s intent to engage with Assistant. Once within 5ft of the device, the user may simply look at the screen and talk to start interacting with the Assistant.

We developed Look and Talk in alignment with our AI Principles. It meets our strict audio and video processing requirements, and like our other camera sensing features, video never leaves the device. You can always stop, review and delete your Assistant activity at myactivity.google.com. These added layers of protection enable Look and Talk to work just for those who turn it on, while keeping your data safe.

Google Assistant relies on a number of signals to accurately determine when the user is speaking to it. On the right is a list of signals used with indicators showing when each signal is triggered based on the user’s proximity to the device and gaze direction.

Modeling Challenges
The journey of this feature began as a technical prototype built on top of models developed for academic research. Deployment at scale, however, required solving real-world challenges unique to this feature. It had to:

  1. Support a range of demographic characteristics (e.g., age, skin tones).
  2. Adapt to the ambient diversity of the real world, including challenging lighting (e.g., backlighting, shadow patterns) and acoustic conditions (e.g., reverberation, background noise).
  3. Deal with unusual camera perspectives, since smart displays are commonly used as countertop devices and look up at the user(s), unlike the frontal faces typically used in research datasets to train models.
  4. Run in real-time to ensure timely responses while processing video on-device.

The evolution of the algorithm involved experiments with approaches ranging from domain adaptation and personalization to domain-specific dataset development, field-testing and feedback, and repeated tuning of the overall algorithm.

Technology Overview
A Look and Talk interaction has three phases. In the first phase, Assistant uses visual signals to detect when a user is demonstrating an intent to engage with it and then “wakes up” to listen to their utterance. The second phase is designed to further validate and understand the user’s intent using visual and acoustic signals. If any signal in the first or second processing phases indicates that it isn’t an Assistant query, Assistant returns to standby mode. These two phases are the core Look and Talk functionality, and are discussed below. The third phase of query fulfillment is typical query flow, and is beyond the scope of this blog.

Phase One: Engaging with Assistant
The first phase of Look and Talk is designed to assess whether an enrolled user is intentionally engaging with Assistant. Look and Talk uses face detection to identify the user’s presence, filters for proximity using the detected face box size to infer distance, and then uses the existing Face Match system to determine whether they are enrolled Look and Talk users.

For an enrolled user within range, a custom eye gaze model determines whether they are looking at the device. This model estimates both the gaze angle and a binary gaze-on-camera confidence from image frames using a multi-tower convolutional neural network architecture, with one tower processing the whole face and another processing patches around the eyes. Since the device screen covers a region underneath the camera that would be natural for a user to look at, we map the gaze angle and binary gaze-on-camera prediction to the device screen area. To ensure that the final prediction is resilient to involuntary eye blinks and saccades, we apply a smoothing function to the individual frame-based predictions to remove spurious results.

Eye-gaze prediction and post-processing overview.
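As a simplified, hypothetical illustration of that kind of post-processing (not the production implementation), the sketch below applies an exponential moving average with hysteresis thresholds to per-frame gaze-on-camera confidences, so that single-frame blips such as blinks do not flip the attention state:

def smooth_gaze(confidences, alpha=0.6, on_threshold=0.7, off_threshold=0.4):
    """Smooth per-frame gaze-on-camera confidences and apply hysteresis."""
    state = False                                    # is the user currently "looking"?
    smoothed = confidences[0] if confidences else 0.0
    states = []
    for c in confidences:
        smoothed = alpha * c + (1 - alpha) * smoothed   # exponential moving average
        # Stricter bar to turn attention on, relaxed bar to keep it on
        state = smoothed > on_threshold if not state else smoothed > off_threshold
        states.append(state)
    return states

# A brief dip in confidence (for example, a blink) does not end the attention state
print(smooth_gaze([0.9, 0.95, 0.2, 0.9, 0.92]))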

We enforce stricter attention requirements before informing users that the system is ready for interaction to minimize false triggers, e.g., when a passing user briefly glances at the device. Once the user looking at the device starts speaking, we relax the attention requirement, allowing the user to naturally shift their gaze.

The final signal necessary in this processing phase checks that the Face Matched user is the active speaker. This is provided by a multimodal active speaker detection model that takes as input both video of the user’s face and the audio containing speech, and predicts whether they are speaking. A number of augmentation techniques (including RandAugment, SpecAugment, and augmenting with AudioSet sounds) helps improve prediction quality for the in-home domain, boosting end-feature performance by over 10%. The final deployed model is a quantized, hardware-accelerated TFLite model, which uses five frames of context for the visual input and 0.5 seconds for the audio input.

Active speaker detection model overview: The two-tower audiovisual model provides the “speaking” probability prediction for the face. The visual network auxiliary prediction pushes the visual network to be as good as possible on its own, improving the final multimodal prediction.

Phase Two: Assistant Starts Listening
In phase two, the system starts listening to the content of the user’s query, still entirely on-device, to further assess whether the interaction is intended for Assistant using additional signals. First, Look and Talk uses Voice Match to further ensure that the speaker is enrolled and matches the earlier Face Match signal. Then, it runs a state-of-the-art automatic speech recognition model on-device to transcribe the utterance.

The next critical processing step is the intent understanding algorithm, which predicts whether the user’s utterance was intended to be an Assistant query. This has two parts: 1) a model that analyzes the non-lexical information in the audio (for example, pitch, speed, and hesitation sounds) to determine whether the utterance sounds like an Assistant query, and 2) a text analysis model that determines whether the transcript is an Assistant request. Together, these filter out queries not intended for Assistant. The algorithm also uses contextual visual signals to determine the likelihood that the interaction was intended for Assistant.

Overview of the semantic filtering approach to determine if a user utterance is a query intended for the Assistant.
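As a purely hypothetical illustration of how such signals could be fused (the weights, thresholds, and signal names below are assumptions, not the production system), consider:

def is_assistant_query(acoustic_score, text_score, user_still_attending,
                       base_threshold=0.5, attention_bonus=0.1):
    """Fuse non-lexical audio and transcript-based intent scores.

    acoustic_score: probability the utterance *sounds* like a query
    text_score: probability the transcript *reads* like a query
    user_still_attending: contextual visual signal (gaze on the device)
    """
    # Visual context makes the filter slightly more permissive
    threshold = base_threshold - (attention_bonus if user_still_attending else 0.0)
    fused = 0.5 * acoustic_score + 0.5 * text_score
    return fused >= threshold

# A side conversation while glancing away is filtered out
print(is_assistant_query(0.3, 0.2, user_still_attending=False))   # False
print(is_assistant_query(0.8, 0.9, user_still_attending=True))    # True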

Finally, when the intent understanding model determines that the user utterance was likely meant for Assistant, Look and Talk moves into the fulfillment phase where it communicates with the Assistant server to obtain a response to the user’s intent and query text.

Performance, Personalization and UX
Each model that supports Look and Talk was evaluated and improved in isolation and then tested in the end-to-end Look and Talk system. The huge variety of ambient conditions in which Look and Talk operates necessitates the introduction of personalization parameters for algorithm robustness. By using signals obtained during the user’s hotword-based interactions, the system personalizes parameters to individual users to deliver improvements over the generalized global model. This personalization also runs entirely on-device.

Without a predefined hotword as a proxy for user intent, latency was a significant concern for Look and Talk. Often, a strong enough interaction signal does not occur until well after the user has started speaking, which can add hundreds of milliseconds of latency, and existing models for intent understanding add to this since they require complete, not partial, queries. To bridge this gap, Look and Talk completely forgoes streaming audio to the server, with transcription and intent understanding being on-device. The intent understanding models can work off of partial utterances. This results in an end-to-end latency comparable with current hotword-based systems.

The UI experience is based on user research to provide well-balanced visual feedback with high learnability. This is illustrated in the figure below.

Left: The spatial interaction diagram of a user engaging with Look and Talk. Right: The User Interface (UI) experience.

We developed a diverse video dataset with over 3,000 participants to test the feature across demographic subgroups. Modeling improvements driven by diversity in our training data improved performance for all subgroups.

Conclusion
Look and Talk represents a significant step toward making user engagement with Google Assistant as natural as possible. While this is a key milestone in our journey, we hope this will be the first of many improvements to our interaction paradigms that will continue to reimagine the Google Assistant experience responsibly. Our goal is to make getting help feel natural and easy, ultimately saving time so users can focus on what matters most.

Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, UX, and cross-functional contributors. Key contributors from Google Assistant include Alexey Galata, Alice Chuang‎, Barbara Wang, Britanie Hall, Gabriel Leblanc, Gloria McGee, Hideaki Matsui, James Zanoni, Joanna (Qiong) Huang, Krunal Shah, Kavitha Kandappan, Pedro Silva, Tanya Sinha, Tuan Nguyen, Vishal Desai, Will Truong‎, Yixing Cai‎, Yunfan Ye; from Research including Hao Wu, Joseph Roth, Sagar Savla, Sourish Chaudhuri, Susanna Ricco. Thanks to Yuan Yuan and Caroline Pantofaru for their leadership, and everyone on the Nest, Assistant, and Research teams who provided invaluable input toward the development of Look and Talk.

Categories
Misc

Developing NLP Applications to Enhance Clinical Experiences and Accelerate Drug Discovery

Discover tools to translate unstructured data to structured data to help healthcare organizations harness relevant insights and improve healthcare delivery and patient experiences.

Natural language processing (NLP) can be defined as the combination of artificial intelligence (AI), computer science, and computational linguistics to understand human communication and extract meaning from unstructured spoken or written material. 

NLP use cases for healthcare have increased in the last few years to accelerate the development of therapeutics and improve quality of patient care through language understanding and predictive analytics. 

The healthcare industry generates vast amounts of unstructured data, but it is difficult to derive insights without finding ways to structure and represent that data in a computable form. Developers need the tools to translate unstructured data to structured data to help healthcare organizations harness relevant insights and improve healthcare delivery and patient care.

Transformer-based NLP has emerged as a paradigm shift in the performance of text-based healthcare workflows. Because of its versatility, NLP can structure virtually any proprietary or public data to spark insights in healthcare, leading to a wide variety of downstream applications that directly impact patient care or augment and accelerate drug discovery.

NLP for drug discovery

NLP is playing a critical role in accelerating small molecule drug discovery. Prior knowledge on the manufacturability or contraindications of a drug can be extracted from academic publications and proprietary data sets. NLP can also help with clinical trial analysis and accelerate the process of taking a drug to market.

Transformer architectures are popular in NLP, but these tools can also be used to understand the language of chemistry and biology. For example, text-based representations of chemical structure such as SMILES (Simplified Input Molecular Line Entry System) can be understood by transformer-based architectures leading to incredible capabilities for drug property evaluation and generative chemistry. 
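To make “treating chemistry as text” concrete, the toy sketch below tokenizes a SMILES string at the character level so it could be fed to a transformer, just as words or sub-words would be; production models typically use more elaborate, chemistry-aware tokenizers, so this is purely illustrative:

# Character-level tokenization of a SMILES string (aspirin)
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
tokens = list(smiles)

# Build a small vocabulary and map tokens to integer IDs, as a language model would
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]

print(tokens[:8])      # ['C', 'C', '(', '=', 'O', ')', 'O', 'C']
print(token_ids[:8])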

MegaMolBART, a large transformer model developed by AstraZeneca and NVIDIA, is used for a wide range of tasks, including reaction prediction, molecular optimization, and de novo molecule generation.

Transformer-based NLP models are instrumental in understanding and predicting the structure and function of biomolecules like proteins. Much like they do for natural language, transformer-based representations of protein sequences provide powerful embeddings for use in downstream AI tasks, like predicting the final folded state of a protein, understanding the strength of protein-protein or protein-small molecule interactions, or in the design of protein structure provided a biological target.

NLP for clinical trial insights

Once a drug has been developed, patient data plays a large role in the process of taking it to market. Much of the patient data that is collected through the course of care is contained in free text, such as clinical notes from patient visits or procedural results. 

While these data are easily interpretable by a human, combining insights across clinical free text documents requires making information across diverse documents interoperable, such that the health of the patient is represented in a useful way. 

Modern NLP algorithms have accelerated our ability to derive these insights, helping to compare patients with similar symptoms, suggesting treatments, discovering diagnostic near-misses, and providing clinical care navigation and next-best-action prediction. 

NLP to enhance clinical experiences

Many patient interactions with the hospital system are remote, in part due to the growing use of telehealth services that stemmed from COVID-19. Those telehealth visits can be converted into structured information with the help of NLP. 

For physicians and surgeons, speech to text capabilities can turn verbal discussions with patients and clinical teams into text, which can then be stored in electronic health records (EHR). Applications include summarizing patient visits, catching near-misses, and predicting optimal treatment regimens.

Removing the burden of clinical documentation for each patient’s visit allows providers to spend more time and energy offering the best care for each patient, and simultaneously reduces physician burnout. NLP can also help hospitals predict patient outcomes such as readmission or sepsis. 

Learn more about NLP in healthcare

View on-demand sessions from the NVIDIA Healthcare and Life Sciences NLP Developer Summit to learn more about the use of NLP in healthcare. Session topics include best practices and insights for applications from speech AI in clinics to drug discovery.

Browse NVIDIA’s collection of biomedical pre-trained language models, as well as highly optimized pipelines for training NLP models on biomedical and clinical text, in the Clara NLP NGC Collection.

Categories
Misc

Closing the Sim2Real Gap with NVIDIA Isaac Sim and NVIDIA Isaac Replicator

NVIDIA Isaac Replicator, built on the Omniverse Replicator SDK, can help you develop a cost-effective and reliable workflow to train computer vision models using synthetic data.

Synthetic data is an important tool in training machine learning models for computer vision applications. Researchers from NVIDIA have introduced a structured domain randomization system within Omniverse Replicator that can help you train and refine models using synthetic data.

Omniverse Replicator is an SDK built on the NVIDIA Omniverse platform that enables you to build custom synthetic data generation tools and workflows. The NVIDIA Isaac Sim development team used Omniverse Replicator SDK to build NVIDIA Isaac Replicator, a robotics-specific synthetic data generation toolkit, exposed within the NVIDIA Isaac Sim app.

We explored using synthetic data generated from synthetic environments for a recent project. Trimble plans to deploy Boston Dynamics’ Spot in a variety of indoor settings and construction environments. But Trimble had to develop a cost-effective and reliable workflow to train ML-based perception models so that Spot could autonomously operate in different indoor settings.

By generating data from a synthetic indoor environment using structured domain randomization within NVIDIA Isaac Replicator, you can train an off-the-shelf object detection model to detect doors in the real indoor environment.

Collage of doors in indoor room and corridors
Figure 1. A few images from the test set of 1,000 real images

Sim2Real domain gap

Given that synthetic data sets are generated using simulation, it is critical to close the gap between the simulation and the real world. This gap is called the domain gap, which can be divided into two pieces:

  • Appearance gap: The pixel-level differences between two images. These differences can result from differences in object detail, materials, or, in the case of synthetic data, in the capabilities of the rendering system used.
  • Content gap: The difference between the domains themselves. This includes factors like the number of objects in the scene, their diversity of type and placement, and similar contextual information.

A critical tool for overcoming these domain gaps is domain randomization (DR), which increases the size of the domain generated for a synthetic dataset. DR helps ensure that we include the range that best matches reality, including long-tail anomalies. By generating a wider range of data, the neural network may learn to generalize better across the full scope of the problem.
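Conceptually, DR amounts to sampling scene parameters from broad distributions for every generated frame. The sketch below is a library-agnostic illustration of that idea; the parameter names and ranges are examples, not the Isaac Replicator API:

import random

def sample_scene_parameters():
    """Sample one structured-domain-randomized configuration for a frame."""
    return {
        "door_rotation_deg": random.uniform(0.0, 90.0),          # closed to fully open
        "door_texture": random.choice(["oak", "walnut", "painted_white"]),
        "light_intensity": random.uniform(300.0, 1500.0),
        "light_color_temp_k": random.uniform(2700.0, 6500.0),
        "camera_height_m": random.uniform(0.5, 0.8),              # countertop-style viewpoints
    }

# Each generated frame gets its own randomized configuration
for params in (sample_scene_parameters() for _ in range(3)):
    print(params)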

The appearance gap can be further closed with high fidelity 3D assets, and ray tracing or path tracing-based rendering, using physically based materials, such as those defined with the MDL. Validated sensor models and domain randomization of their parameters can also help here.

Creating the synthetic scene

We imported the BIM Model of the indoor scene into NVIDIA Isaac Sim from Trimble SketchUp through the NVIDIA Omniverse SketchUp Connector. However, it looked rough with a significant appearance gap between sim and reality. Video 1 shows Trimble_DR_v1.1.usd.

Video 1. Starting indoor scene after it was imported into NVIDIA Isaac Sim from Trimble SketchUp through SketchUp Omniverse Connector

To close the appearance gap, we used NVIDIA MDL to add some textures and materials to the doors, walls, and ceilings. That made the scene look more realistic.

Video 2. Indoor scene after adding MDL materials

To close the content gap between sim and reality, we added props such as desks, office chairs, computer devices, and cardboard boxes to the scene through Omniverse DeepSearch, an AI-enabled service. Omniverse DeepSearch enables you to use natural language inputs and imagery for searching through the entire catalog of untagged 3D assets, objects, and characters.

These assets are publicly available in NVIDIA Omniverse.

3D models of chairs of different colors and shapes
Figure 2. Omniverse DeepSearch makes it easy to tag and search across 3D models

We also added ceiling lights to the scene. To capture the variety in door orientation, a domain randomization (DR) component was added to randomize the rotation of the doors, and Xform was used to simulate door hinges. This enabled the doors to open, close, or stay ajar at different angles. Video 3 shows the resulting scene with all the props.

Video 3. Indoor scene after adding props

Synthetic data generation

At this point, the iterative process of synthetic data generation (SDG) started. For the object detection model, we used TAO DetectNet V2 with a ResNet-18 backbone for all the experiments.

We held all model hyperparameters constant at their default values, including the batch size, learning rate, and dataset augmentation config parameters. In synthetic data generation, you iteratively tune the dataset generation parameters instead of the model hyperparameters.

Diagram showing the workflow of synthetic dataset generation where synthetic data from 3D assets is improved by feedback obtained from training a model on it and then evaluating it on real data
Figure 3. The iterative process of synthetic data generation where dataset generation parameters are tuned based on feedback from model evaluation

The Trimble v1.3 scene contained environment props and no DR components except for door rotation; the door texture was held fixed. Training on 500 ray-traced images from this scene resulted in 5% AP on the real test set (~1,000 images).

As you can see from the model’s predictions on real images, the model failed to detect real doors adequately because it overfit to the texture of the simulated door. The model’s poor performance on a synthetic validation dataset with differently textured doors confirmed this.

Another observation was that the lighting was held steady and constant in simulation, whereas reality has a variety of lighting conditions.

To prevent overfitting to the texture of the doors, we applied randomization to the door texture, randomizing between 30 different wood-like textures. To vary the lighting, we added DR over the ceiling lights to randomize the movement, intensity, and color of lights. Now that we were randomizing the texture of the door, it was important to give the model some learning signal on what makes a door besides its rectangular shape. We added realistic metallic door handles, kick plates, and door frames to all the doors in the scene. Training on 500 images from this improved scene yielded 57% AP on the real test set.

Video 4. Indoor scene after adding DR components for door rotation, texture, and color and movement of lights

This model was doing better than before, but it was still making false positive predictions on potted plants and QR codes on the walls in real test images. It was also doing poorly on the corridor images, where multiple doors were lined up, and it produced many false positives in low-temperature lighting conditions (Figure 5).

To make the model robust to noise like QR codes on walls, we applied DR over the texture of the walls with different textures, including QR codes and other synthetic textures.

We added a few potted plants to the scene. We already had a corridor, so to generate synthetic data from it, two cameras were added along the corridor along with ceiling lights.

We added DR over light temperature, along with intensity, movement, and color, to have the model better generalize in different lighting conditions. We also noticed a variety of floors like shiny granite, carpet, and tiles in real images. To model these, we applied DR to randomize the material of the floor between different kinds of carpet, marble, tiles, and granite materials.

Similarly, we added a DR component to randomize the texture of the ceiling between different colors and different kinds of materials. We also added a DR visibility component to randomly add a few carts in the corridor in simulation, hoping to minimize the model’s false positives over carts in real images.

The synthetic dataset of 4,000 images generated from this scene got around 87% AP on the real test set by training only on synthetic data, achieving decent Sim2Real performance.

Video 5. Final scene with more DR components

Figure 6 shows a few inferences on real images from the final model.

Synthetic data generation in Omniverse

Using Omniverse connectors, MDL, and easy-to-use tools like DeepSearch, it’s possible for ML engineers and data scientists with no background in 3D design to create synthetic scenes.

NVIDIA Isaac Replicator makes it easy to bridge the Sim2Real gap by generating synthetic data with structured domain randomization. This way, Omniverse makes synthetic data generation accessible for you to bootstrap perception-based ML projects.

The approach presented here should scale: you can increase the number of objects of interest and generate new synthetic data whenever you want to detect additional objects.

For more information, see the following resources:

If you have any questions, post them in the Omniverse Synthetic Data, Omniverse Code, or Omniverse Isaac Sim forums.

Categories
Misc

Applying Inference over Specific Frame Regions with NVIDIA DeepStream

This tutorial shares how to apply inference over a predefined area of the incoming video frames.

Detecting objects in high-resolution input is a well-known problem in computer vision. When a certain area of the frame is of interest, inference over the complete frame is unnecessary. There are two ways to solve this issue:

  • Use a large model with a high input resolution.
  • Divide the large image into tiles and apply the smaller model to each one.

In many ways, the first approach is difficult. Training a model with large input often requires larger backbones, making the overall model bulkier. Training or deploying such a model also requires more computing resources. Larger models are deemed unfit for edge deployment on smaller devices.

The second method, dividing the entire image into tiles and applying smaller models to each tile, has obvious advantages. Smaller models are used, so less computing power is required for training and inference. No retraining is required to apply the model to high-resolution input. Smaller models are also considered edge-deployment-friendly. 

In this post, we discuss how NVIDIA DeepStream can help apply smaller models to high-resolution input to detect a specific frame region.

Overview of video surveillance systems

Video surveillance systems are used to solve various problems, such as the identification of pedestrians, vehicles, and cars. Nowadays, 4K and 8K cameras are used to capture details of the scene. Aerial photography, used by the military for various purposes, also covers large areas.

With the increase in resolution, the number of pixels to process grows rapidly, and it takes a huge amount of computing power to process such a large number of pixels, especially with a deep neural network.

Deep neural networks operate on a fixed input shape, selected during model building. This fixed-size input is also known as the receptive field of the model. Typically, receptive fields vary from 256×256 to 1280×1280 and beyond in detection and segmentation networks.

You might find that the region of interest is a small area and not the entire frame. In this case, applying detection over the entire frame is an unnecessary use of compute resources. The DeepStream NvDsPreprocess plugin enables you to focus compute on a specific area of the frame.

DeepStream NvDsPreprocessing plugin

When tiling is applied to images or frames, especially over video feeds, you need an additional element in the inference pipeline. Such an element must perform a tiling mechanism that can be configured per stream, batch inference over the tiles, and combine inference results from multiple tiles onto single frames.

Interestingly, all these functionalities are provided in DeepStream with the Gst-NvDsPreprocess customizable plugin. It provides a custom library interface for preprocessing input streams. Each stream can have its own preprocessing requirements.

The default plugin implementation provides the following functionality:

  • Streams with predefined regions of interest (ROIs) or tiles are scaled and format converted as per the network requirements for inference. Per-stream ROIs are specified in the config file.
  • It prepares a raw tensor from the scaled and converted ROIs and is passed to the downstream plugins through user metadata. Downstream plugins can access this tensor for inference.

DeepStream pipeline with tiling

Modifying the existing code to support tiling is next.

Using the NvdsPreprocessing plugin

Define the preprocess element inside the pipeline:

preprocess = Gst.ElementFactory.make("nvdspreprocess", "preprocess-plugin")

NvDsPreprocess requires a config file as input:

preprocess.set_property("config-file", "config_preprocess.txt")

Add the preprocess element to the pipeline:

pipeline.add(preprocess)

Link the element to the pipeline:

streammux.link(preprocess)
preprocess.link(pgie)

Let the NvdsPreprocess plugin do preprocessing

The inference is done with the NvDsInfer plugin, which has frame preprocessing capabilities.

When you use the NvdsPreprocess plugin before NvDsInfer, you want the preprocessing (scaling or format conversion) to be done by NvdsPreprocess and not NvDsInfer. To do this, set the input-tensor-meta property of NvDsInfer to true. This lets NvdsPreprocess do the preprocessing, and the preprocessed input tensors attached as metadata are used instead of preprocessing inside NvDsInfer itself.

The following steps are required to incorporate Gst-nvdspreprocess functionality into your existing pipeline.

Define and add the nvdspreprocess plugin to the pipeline:

preprocess = Gst.ElementFactory.make("nvdspreprocess", "preprocess-plugin")
pipeline.add(preprocess)

Set the input-tensor-meta property of NvDsInfer to true:

pgie.set_property("input-tensor-meta", True)

Define the config file property for the nvdspreprocess plugin:

preprocess.set_property("config-file", "config_preprocess.txt")

Link the preprocess plugin before the primary inference engine (pgie):

streammux.link(preprocess)
preprocess.link(pgie)

Creating the config file

The Gst-nvdspreprocess configuration file uses a key file format. For more information, see the config_preprocess.txt in the Python and C source code.

  • The [property] group configures the general behavior of the plugin.
    • The [group-<id>] group configures ROIs, tiles, and full-frames for a group of streams with src-id values and a custom-input-transformation-function from the custom lib.
    • The [user-configs] group configures parameters required by the custom library, which are passed on to the custom lib through a map of key-value pairs. The custom lib must then parse the values accordingly.

The minimum required config_preprocess.txt looks like the following code example:

[property]
enable=1
target-unique-ids=1
    # 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=0
processing-width=960
processing-height=544
scaling-buf-pool-size=6
tensor-buf-pool-size=6
    # tensor shape based on network-input-order
network-input-shape=12;3;544;960
    # 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
    # 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=input_1
    # 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE 3=NVBUF_MEM_CUDA_UNIFIED
scaling-pool-memory-type=0
    # 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU 2=NvBufSurfTransformCompute_VIC
scaling-pool-compute-hw=0
    # Scaling Interpolation method
    # 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
    # 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
    # 6=NvBufSurfTransformInter_Default
scaling-filter=0
custom-lib-path=/opt/nvidia/deepstream/deepstream/lib/gst-plugins/libcustom2d_preprocess.so
custom-tensor-preparation-function=CustomTensorPreparation

[user-configs]
pixel-normalization-factor=0.003921568
#mean-file=
#offsets=

[group-0]
src-ids=0;1;2;3
custom-input-transformation-function=CustomAsyncTransformation
process-on-roi=1
roi-params-src-0=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-1=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-2=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-3=0;540;900;500;960;0;900;500;0;0;540;900;

Processing-width and processing-height refer to the width and height of the slice applied to the entire frame.

For network-input-shape, the current config file is configured to run at most 12 ROIs. To increase the ROI count, increase the first dimension to the required number, for example, network-input-shape=12;3;544;960.

In the current config file, config_preprocess.txt, there are three ROIs per source and a total of 12 ROIs for all four sources. The total ROIs from all the sources must not exceed the first dimension specified in the network-input-shape parameter.

Roi-params-src-<id> indicates the ROI coordinates for source <id>. For each ROI, specify left;top;width;height, defining the ROI when process-on-roi is enabled. Gst-nvdspreprocess does not merge detections or object counts across overlapping tiles.
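To make the layout of that flat list concrete, the short helper below (an illustrative sketch, not part of DeepStream) splits a roi-params-src-<id> value into (left, top, width, height) tuples:

def parse_roi_params(value):
    """Split a roi-params-src-<id> string into (left, top, width, height) tuples."""
    numbers = [int(v) for v in value.strip(";").split(";")]
    assert len(numbers) % 4 == 0, "ROI list must be a multiple of 4 values"
    return [tuple(numbers[i:i + 4]) for i in range(0, len(numbers), 4)]

# The value used for each source in config_preprocess.txt above: three ROIs
print(parse_roi_params("0;540;900;500;960;0;900;500;0;0;540;900;"))
# [(0, 540, 900, 500), (960, 0, 900, 500), (0, 0, 540, 900)]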

Code

The C code is downloadable from /opt/nvidia/deepstream/deepstream-6.0/source/app/sample_app/deepstream-preprocess-test.

The Python code is downloadable from the NVIDIA-AI-IOT/deepstream_python_apps GitHub repo.

Results

Figure 1 shows that you can specify one or more tiles. An object within the tile is detected and detection is not applied to the remaining area of the frame.

Image showing the application of NVIDIA DeepStream Gst-nvdspreprocess plugin.
Figure 1. Detection applied over tiles using the Gst-nvdspreprocess plugin. The green box shows the tile boundary, and the red boxes show detected objects within the tile.

Gst-nvdspreprocess enables applying inference on a specific portion of the video (tile or region of interest). With Gst-nvdspreprocess, you can specify one or more tiles on a single frame.

Here are the performance metrics for YOLOv4 applied over the entire frame compared to over a tile. Performance metrics are collected by increasing the number of streams until either the decoder or compute saturates, at which point adding more streams shows no further gain.

The video resolution of 1080p was used for performance benchmarks over the NVIDIA V100 GPU. Consider the tradeoff between performance and the number of tiles, as placing too many tiles increases the compute requirement.

Tiling with NvDsPreprocess helps in the selective inference over the portion of the video where it is required. In Figure 1, for instance, the inference can only be used on the sidewalk and not the entire frame.

Gst-nvdsanalytics performs analytics on metadata attached by nvinfer (primary detector) and nvtracker. Gst-nvdsanalytics can be applied to the tiles for ROI Filtering, Overcrowding Detection, Direction Detection, and Line Crossing.

Categories
Offsites

ML-Enhanced Code Completion Improves Developer Productivity

The increasing complexity of code poses a key challenge to productivity in software engineering. Code completion has been an essential tool that has helped mitigate this complexity in integrated development environments (IDEs). Conventionally, code completion suggestions are implemented with rule-based semantic engines (SEs), which typically have access to the full repository and understand its semantic structure. Recent research has demonstrated that large language models (e.g., Codex and PaLM) enable longer and more complex code suggestions, and as a result, useful products have emerged (e.g., Copilot). However, the question of how code completion powered by machine learning (ML) impacts developer productivity, beyond perceived productivity and accepted suggestions, remains open.

Today we describe how we combined ML and SE to develop a novel Transformer-based hybrid semantic ML code completion, now available to internal Google developers. We discuss how ML and SEs can be combined by (1) re-ranking SE single token suggestions using ML, (2) applying single and multi-line completions using ML and checking for correctness with the SE, or (3) using single and multi-line continuation by ML of single token semantic suggestions. We compare the hybrid semantic ML code completion of 10k+ Googlers (over three months across eight programming languages) to a control group and see a 6% reduction in coding iteration time (time between builds and tests) and a 7% reduction in context switches (i.e., leaving the IDE) when exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is now generated from accepting ML completion suggestions.

Transformers for Completion
A common approach to code completion is to train transformer models, which use a self-attention mechanism for language understanding, to enable code understanding and completion predictions. We treat code similar to language, represented with sub-word tokens and a SentencePiece vocabulary, and use encoder-decoder transformer models running on TPUs to make completion predictions. The input is the code that is surrounding the cursor (~1000-2000 tokens) and the output is a set of suggestions to complete the current or multiple lines. Sequences are generated with a beam search (or tree exploration) on the decoder.

During training on Google’s monorepo, we mask out the remainder of a line and some follow-up lines, to mimic code that is being actively developed. We train a single model on eight languages (C++, Java, Python, Go, Typescript, Proto, Kotlin, and Dart) and observe improved or equal performance across all languages, removing the need for dedicated models. Moreover, we find that a model size of ~0.5B parameters gives a good tradeoff for high prediction accuracy with low latency and resource cost. The model strongly benefits from the quality of the monorepo, which is enforced by guidelines and reviews. For multi-line suggestions, we iteratively apply the single-line model with learned thresholds for deciding whether to start predicting completions for the following line.

Encoder-decoder transformer models are used to predict the remainder of the line or lines of code.
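As a deliberately simplified sketch of that iterative multi-line scheme (the predict_line function below is a stand-in for the single-line model, not the production system), the loop might look like this:

def complete_multiline(prefix, predict_line, threshold=0.6, max_lines=4):
    """Iteratively extend a completion one line at a time.

    predict_line(context) -> (line_text, confidence) stands in for the
    single-line model; iteration stops when confidence drops below threshold.
    """
    lines = []
    context = prefix
    for _ in range(max_lines):
        line, confidence = predict_line(context)
        if confidence < threshold:
            break
        lines.append(line)
        context += line + "\n"
    return lines

# Toy stand-in model: confident about the first line only
def toy_model(context):
    return ("    return a + b", 0.9 if context.count("\n") < 2 else 0.3)

print(complete_multiline("def add(a, b):\n", toy_model))   # ['    return a + b']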

Re-rank Single Token Suggestions with ML
While a user is typing in the IDE, code completions are interactively requested from the ML model and the SE simultaneously in the backend. The SE typically only predicts a single token. The ML models we use predict multiple tokens until the end of the line, but we only consider the first token to match predictions from the SE. We identify the top three ML suggestions that are also contained in the SE suggestions and boost their rank to the top. The re-ranked results are then shown as suggestions for the user in the IDE.
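A minimal sketch of that re-ranking step, based only on the description above (the helper names are hypothetical), could look like this:

import re

def first_identifier(suggestion):
    """Extract the leading identifier of a suggestion string."""
    match = re.match(r"\w+", suggestion)
    return match.group(0) if match else suggestion

def rerank(se_suggestions, ml_suggestions, max_boosted=3):
    """Boost SE suggestions whose token also starts one of the ML suggestions."""
    boosted = []
    for suggestion in ml_suggestions:
        token = first_identifier(suggestion)
        if token in se_suggestions and token not in boosted:
            boosted.append(token)
        if len(boosted) == max_boosted:
            break
    remainder = [s for s in se_suggestions if s not in boosted]
    return boosted + remainder

se = ["fetch_rows", "fetch_all", "filter", "flush"]          # single-token SE suggestions
ml = ["fetch_all(query)", "filter(predicate)", "format()"]   # multi-token ML suggestions
print(rerank(se, ml))   # ['fetch_all', 'filter', 'fetch_rows', 'flush']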

In practice, our SEs are running in the cloud, providing language services (e.g., semantic completion, diagnostics, etc.) with which developers are familiar, and so we collocated the SEs to run on the same locations as the TPUs performing ML inference. The SEs are based on an internal library that offers compiler-like features with low latencies. Due to the design setup, where requests are done in parallel and ML is typically faster to serve (~40 ms median), we do not add any latency to completions. We observe a significant quality improvement in real usage. For 28% of accepted completions, the rank of the completion is higher due to boosting, and in 0.4% of cases it is worse. Additionally, we find that users type >10% fewer characters before accepting a completion suggestion.

Check Single / Multi-line ML Completions for Semantic Correctness
At inference time, ML models are typically unaware of code outside of their input window, and code seen during training might miss recent additions needed for completions in actively changing repositories. This leads to a common drawback of ML-powered code completion whereby the model may suggest code that looks correct, but doesn’t compile. Based on internal user experience research, this issue can lead to the erosion of user trust over time while reducing productivity gains.

We use SEs to perform fast semantic correctness checks within a given latency budget (<100ms for end-to-end completion) and use cached abstract syntax trees to enable a “full” structural understanding. Typical semantic checks include reference resolution (i.e., does this object exist), method invocation checks (e.g., confirming the method was called with a correct number of parameters), and assignability checks (to confirm the type is as expected).

For example, for the coding language Go, ~8% of suggestions contain compilation errors before semantic checks. However, the application of semantic checks filtered out 80% of uncompilable suggestions. The acceptance rate for single-line completions improved by 1.9x over the first six weeks of incorporating the feature, presumably due to increased user trust. As a comparison, for languages where we did not add semantic checking, we only saw a 1.3x increase in acceptance.

Language servers with access to source code and the ML backend are collocated on the cloud. They both perform semantic checking of ML completion suggestions.

Results
With 10k+ Google-internal developers using the completion setup in their IDE, we measured a user acceptance rate of 25-34%. We determined that the transformer-based hybrid semantic ML code completion completes >3% of code, while reducing the coding iteration time for Googlers by 6% (at a 90% confidence level). The size of the shift corresponds to typical effects observed for transformational features (e.g., key framework) that typically affect only a subpopulation, whereas ML has the potential to generalize for most major languages and engineers.

Key metrics for single-line code completion measured in production for 10k+ Google-internal developers using it in their daily development across eight languages:

  • Fraction of all code added by ML: 2.6%
  • Reduction in coding iteration duration: 6%
  • Reduction in number of context switches: 7%
  • Acceptance rate (for suggestions visible for >750ms): 25%
  • Average characters per accept: 21

Key metrics for multi-line code completion measured in production for 5k+ Google-internal developers using it in their daily development across eight languages:

  • Fraction of all code added by ML (with >1 line in suggestion): 0.6%
  • Average characters per accept: 73
  • Acceptance rate (for suggestions visible for >750ms): 34%

Providing Long Completions while Exploring APIs
We also tightly integrated the semantic completion with full line completion. When the dropdown with semantic single token completions appears, we display inline the single-line completions returned from the ML model. The latter represent a continuation of the item that is the focus of the dropdown. For example, if a user looks at possible methods of an API, the inline full line completions show the full method invocation also containing all parameters of the invocation.

Integrated full line completions by ML continuing the semantic dropdown completion that is in focus.
Suggestions of multiple line completions by ML.

Conclusion and Future Work
We demonstrate how the combination of rule-based semantic engines and large language models can be used to significantly improve developer productivity with better code completion. As a next step, we want to utilize SEs further, by providing extra information to ML models at inference time. One example can be for long predictions to go back and forth between the ML and the SE, where the SE iteratively checks correctness and offers all possible continuations to the ML model. When adding new features powered by ML, we want to be mindful to go beyond just “smart” results, but ensure a positive impact on productivity.

Acknowledgements
This research is the outcome of a two-year collaboration between Google Core and Google Research, Brain Team. Special thanks to Marc Rasi, Yurun Shen, Vlad Pchelin, Charles Sutton, Varun Godbole, Jacob Austin, Danny Tarlow, Benjamin Lee, Satish Chandra, Ksenia Korovina, Stanislav Pyatykh, Cristopher Claeys, Petros Maniatis, Evgeny Gryaznov, Pavel Sychev, Chris Gorgolewski, Kristof Molnar, Alberto Elizondo, Ambar Murillo, Dominik Schulz, David Tattersall, Rishabh Singh, Manzil Zaheer, Ted Ying, Juanjo Carin, Alexander Froemmgen and Marcus Revaj for their contributions.

Categories
Misc

Accelerating GPU Applications with NVIDIA Math Libraries

NVIDIA Math Libraries are available to boost your application’s performance, from GPU-accelerated implementations of BLAS to random number generation.

There are three main ways to accelerate GPU applications: compiler directives, programming languages, and preprogrammed libraries. Compiler directives such as OpenACC allow you to smoothly port your code to the GPU for acceleration with a directive-based programming model. While this approach is simple to use, it may not provide optimal performance in certain scenarios. 

Programming languages such as CUDA C and C++ give you greater flexibility when accelerating your applications, but it is also the user’s responsibility to write code that takes advantage of new hardware features to achieve optimal performance on the latest hardware. This is where preprogrammed libraries fill in the gap. 

In addition to enhancing code reusability, the NVIDIA Math Libraries are optimized to make best use of GPU hardware for the greatest performance gain. If you’re looking for a straightforward way to speed up your application, continue reading to learn about using libraries to improve your application’s performance. 

The NVIDIA math libraries, available as part of the CUDA Toolkit and the high-performance computing (HPC) software development kit (SDK), offer high-quality implementations of functions encountered in a wide range of compute-intensive applications. These applications include the domains of machine learning, deep learning, molecular dynamics, computational fluid dynamics (CFD), computational chemistry, medical imaging, and seismic exploration. 

These libraries are designed to replace the common CPU libraries such as OpenBLAS, LAPACK, and Intel MKL, as well as accelerate applications on NVIDIA GPUs with minimal code changes. To show the process, we created an example of the double precision general matrix multiplication (DGEMM) functionality to compare the performance of cuBLAS with OpenBLAS. 

The code example below demonstrates the use of the OpenBLAS DGEMM call.

// Init Data
…
// Execute GEMM
cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, m, n, k, alpha, A.data(), lda, B.data(), ldb, beta, C.data(), ldc);

Code example 2 below shows the cuBLAS dgemm call.

// Init Data
…
// Data movement to GPU
…
// Execute GEMM
cublasDgemm(cublasH, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha, d_A, lda, d_B, ldb, &beta, d_C, ldc);

As shown in the example above, you can simply replace the OpenBLAS CPU code with the cuBLAS API functions. See the full code for both the cuBLAS and OpenBLAS examples. This cuBLAS example was run on an NVIDIA V100 Tensor Core GPU with a nearly 20x speed-up. The graph below displays the speedup and specs when running these examples.

Bar graph showing the 19.2x speed-up in performance gained by replacing the OpenBLAS CPU code with the cuBLAS API function on the GPU
Figure 1. Replacing the OpenBLAS CPU code with the cuBLAS API function on the GPU yields a 19.2x speed-up in the DGEMM computation, where A, B, and C are 4K x 4K matrices.

Fun fact: These libraries are invoked in higher-level Python APIs such as CuPy, cuDNN, and RAPIDS, so if you have experience with those, then you have already been using these NVIDIA Math Libraries. 
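For example, assuming CuPy is installed with CUDA support, a matrix multiplication like the rough sketch below is dispatched to cuBLAS under the hood, so you get the library's GEMM performance without calling the C API directly:

import cupy as cp

# 4K x 4K double-precision matrices on the GPU
a = cp.random.random((4096, 4096))
b = cp.random.random((4096, 4096))

# The matmul operator dispatches to a cuBLAS DGEMM on the device
c = a @ b
cp.cuda.Stream.null.synchronize()   # wait for the GPU to finish before timing or copying

print(c.shape)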

The remainder of this post covers all of the math libraries available. For the latest updates and information, watch Recent Developments in NVIDIA Math Libraries.

Delivering better performance compared to CPU-only alternatives

There are many NVIDIA Math Libraries to take advantage of, from GPU-accelerated implementations of BLAS to random number generation. Take a look below at an overview of the NVIDIA Math Libraries and learn how to get started to easily boost your application’s performance.

Speed up Basic Linear Algebra Subprograms with cuBLAS

General Matrix Multiplication (GEMM) is one of the most popular Basic Linear Algebra Subprograms (BLAS) deployed in AI and scientific computing. GEMMs also form the foundational blocks for deep learning frameworks. To learn more about the use of GEMMs in deep learning frameworks, see Why GEMM Is at the Heart of Deep Learning.

The cuBLAS Library is an implementation of BLAS that leverages GPU capabilities to achieve great speed-ups. It comprises routines for vector-vector operations such as dot products (Level 1), matrix-vector operations (Level 2), and matrix-matrix operations such as matrix multiplication (Level 3).

Additionally, if you would like to parallelize your matrix-matrix multiplies, cuBLAS supports versatile batched GEMMs, which find use in tensor computations, machine learning, and LAPACK. For more details about improving efficiency in machine learning and tensor contractions, see Tensor Contractions with Extended BLAS Kernels on CPU and GPU.

cuBLASXt

If the problem size is too big to fit on the GPU, or your application needs single-node, multi-GPU support, cuBLASXt is a great option. cuBLASXt allows for hybrid CPU-GPU computation and supports BLAS Level 3 operations that perform matrix-to-matrix operations, such as herk, which performs a Hermitian rank-k update.

cuBLASLt

cuBLASLt is a lightweight library that covers GEMM. cuBLASLt uses fused kernels to speed up applications by combining two or more kernels into a single kernel, which allows for reuse of data and reduced data movement. cuBLASLt also allows users to set post-processing options for the epilogue (for example, applying a bias and then a ReLU transform, or applying a bias gradient to an input matrix).

cuBLASMg: CUDA Math Library Early Access Program

For large-scale problems, check out cuBLASMg for state-of-the-art multi-GPU, multi-node matrix-matrix multiplication support. It is currently a part of the CUDA Math Library Early Access Program. Apply for access.

Process sparse matrices with cuSPARSE

Sparse-matrix, dense-matrix multiplication (SpMM) is fundamental to many complex algorithms in machine learning, deep learning, CFD, and seismic exploration, as well as economic, graph, and data analytics. Efficiently processing sparse matrices is critical to many scientific simulations.

The growing size of neural networks and the associated increase in cost and resources have led to the need for sparsification. Sparsity has gained popularity in the context of both deep learning training and inference to optimize the use of resources. For more insight into this school of thought and the need for a library such as cuSPARSE, see The Future of Sparsity in Deep Neural Networks.

cuSPARSE provides a set of basic linear algebra subprograms for handling sparse matrices that can be used to build GPU-accelerated solvers. The library routines fall into four categories: 

  • Level 1 operates between a sparse vector and a dense vector, such as the dot product between two vectors. 
  • Level 2 operates between a sparse matrix and a dense vector, such as a matrix-vector product (see the sketch after this list). 
  • Level 3 operates between a sparse matrix and a set of dense vectors, such as a matrix-matrix product. 
  • Level 4 allows conversion between different matrix formats and compression of compressed sparse row (CSR) matrices.
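
As a minimal Level 2 sketch using the generic cuSPARSE API of recent CUDA toolkits, the code below computes y = alpha*A*x + beta*y for a CSR matrix. The CSR arrays d_csrOffsets, d_cols, and d_vals and the dense vectors d_x and d_y are assumptions that must already live in device memory.

#include <cusparse.h>
#include <cuda_runtime.h>

void csr_spmv(int rows, int cols, int nnz,
              int* d_csrOffsets, int* d_cols, float* d_vals,
              float* d_x, float* d_y)
{
  cusparseHandle_t handle;
  cusparseCreate(&handle);

  // Wrap the raw device arrays in sparse-matrix and dense-vector descriptors
  cusparseSpMatDescr_t matA;
  cusparseDnVecDescr_t vecX, vecY;
  cusparseCreateCsr(&matA, rows, cols, nnz, d_csrOffsets, d_cols, d_vals,
                    CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                    CUSPARSE_INDEX_BASE_ZERO, CUDA_R_32F);
  cusparseCreateDnVec(&vecX, cols, d_x, CUDA_R_32F);
  cusparseCreateDnVec(&vecY, rows, d_y, CUDA_R_32F);

  float alpha = 1.0f, beta = 0.0f;
  size_t bufferSize = 0;
  void* dBuffer = nullptr;
  cusparseSpMV_bufferSize(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                          &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
                          CUSPARSE_SPMV_ALG_DEFAULT, &bufferSize);
  cudaMalloc(&dBuffer, bufferSize);

  // y = alpha * A * x + beta * y
  cusparseSpMV(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               &alpha, matA, vecX, &beta, vecY, CUDA_R_32F,
               CUSPARSE_SPMV_ALG_DEFAULT, dBuffer);

  cudaFree(dBuffer);
  cusparseDestroySpMat(matA);
  cusparseDestroyDnVec(vecX);
  cusparseDestroyDnVec(vecY);
  cusparseDestroy(handle);
}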

cuSPARSELt

For a lightweight version of the cuSPARSE library that performs sparse matrix-dense matrix multiplication, along with helper functions for pruning and compressing matrices, try cuSPARSELt. For a better understanding of the cuSPARSELt library, see Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt.

Accelerate tensor applications with cuTENSOR

The cuTENSOR library is a tensor linear algebra library. Tensors are core to machine learning applications and are an essential mathematical tool used to derive the governing equations for applied problems. cuTENSOR provides routines for direct tensor contractions, tensor reductions, and element-wise tensor operations. cuTENSOR is used to improve performance in deep learning training and inference, computer vision, quantum chemistry, and computational physics applications. 

cuTENSORMg

If you want cuTENSOR features but with support for large tensors distributed across multiple GPUs in a single node, such as the NVIDIA DGX A100, cuTENSORMg is the library of choice. It provides broad mixed-precision support, and its main computational routines include direct tensor contractions, tensor reductions, and element-wise tensor operations. 

GPU-accelerated LAPACK features with cuSOLVER

The cuSOLVER library is a high-level package useful for linear algebra functions based on the cuBLAS and cuSPARSE libraries. cuSOLVER provides LAPACK-like features, such as matrix factorization, triangular solve routines for dense matrices, a sparse least-squares solver, and an eigenvalue solver.

There are three separate components of cuSOLVER: 

  • cuSolverDN is used for dense matrix factorization (see the sketch after this list).
  • cuSolverSP provides a set of sparse routines based on sparse QR factorization.
  • cuSolverRF is a sparse re-factorization package, useful for solving sequences of matrices with a shared sparsity pattern. 
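
A minimal cuSolverDN sketch follows; it computes an LU factorization with partial pivoting. The n x n column-major matrix d_A is an assumption and must already be filled in device memory.

#include <cusolverDn.h>
#include <cuda_runtime.h>

void lu_factorize(double* d_A, int n)
{
  cusolverDnHandle_t handle;
  cusolverDnCreate(&handle);

  // Query workspace size for the factorization
  int lwork = 0;
  cusolverDnDgetrf_bufferSize(handle, n, n, d_A, n, &lwork);

  double* d_work;  int* d_ipiv;  int* d_info;
  cudaMalloc(&d_work, sizeof(double) * lwork);
  cudaMalloc(&d_ipiv, sizeof(int) * n);
  cudaMalloc(&d_info, sizeof(int));

  // LU factorization with partial pivoting: A = P * L * U (in place in d_A)
  cusolverDnDgetrf(handle, n, n, d_A, n, d_work, d_ipiv, d_info);

  cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
  cusolverDnDestroy(handle);
}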

cuSOLVERMg

For GPU-accelerated ScaLAPACK features, a symmetric eigensolver, 1-D column block cyclic layout support, and single-node, multi-GPU support for cuSOLVER features, consider cuSOLVERMg.

cuSOLVERMp 

Solving large systems of linear equations calls for multi-node, multi-GPU support. cuSOLVERMp provides exactly that and is known for its lower-upper (LU) factorization and Cholesky factorization features. 

Large-scale generation of random numbers with cuRAND

The cuRAND library focuses on the generation of random numbers through pseudorandom or quasirandom number generators, exposed through both a host (CPU) API and a device (GPU) API. With the host API, random numbers can be generated purely on the host and stored in host memory, or they can be generated on the device: the library calls are made from the host, but the random number generation runs on the device and the results are stored in global memory. 

The device API defines functions for setting up random number generator states and generating sequences of random numbers which can be immediately used by user kernels without having to read and write to global memory. Several physics-based problems have shown the need for large-scale random number generation.
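
A minimal host-API sketch looks like the following; the generator runs on the GPU and writes directly to global memory. The device allocation d_data of n floats is an assumption made elsewhere.

#include <curand.h>

void generate_uniform(float* d_data, size_t n)
{
  curandGenerator_t gen;
  curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
  curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

  // Generation runs on the device; results land in d_data (global memory)
  curandGenerateUniform(gen, d_data, n);

  curandDestroyGenerator(gen);
}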

Monte Carlo simulation is one such use case for random number generators on the GPU. The Development of GPU-Based Parallel PRNG for Monte Carlo Applications in CUDA Fortran highlights the application of cuRAND in large-scale generation of random numbers. 

Calculate fast Fourier transforms with cuFFT

cuFFT, the CUDA Fast Fourier Transform (FFT) library, provides a simple interface for computing FFTs on an NVIDIA GPU. The FFT is a divide-and-conquer algorithm for efficiently computing discrete Fourier transforms of complex or real-valued data sets. It is one of the most widely used numerical algorithms in computational physics and general signal processing. 

cuFFT can be used for a wide range of applications, including medical imaging and fluid dynamics. Parallel Computing for Quantitative Blood Flow Imaging in Photoacoustic Microscopy illustrates the use of cuFFT in physics-based applications. Users with existing FFTW applications should use cuFFTW to easily port code to NVIDIA GPUs with minimal effort. The cuFFTW library provides the FFTW3 API to facilitate porting of existing FFTW applications.
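
A minimal sketch computes an in-place 1D complex-to-complex forward transform. The device array d_signal of NX complex samples is an assumption, allocated and filled elsewhere.

#include <cufft.h>

void forward_fft(cufftComplex* d_signal, int NX)
{
  cufftHandle plan;
  cufftPlan1d(&plan, NX, CUFFT_C2C, 1);                   // 1 batch of NX points
  cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place forward FFT
  cufftDestroy(plan);
}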

cuFFTXt

To distribute FFT calculations across GPUs in a single node, check out cuFFTXt. This library includes functions to help users manipulate data on multiple GPUs and keep track of data ordering, which allows data to be processed in the most efficient way possible.

cuFFTMp

Beyond multi-GPU support within a single system, cuFFTMp provides support for multiple GPUs across multiple nodes. The library can be used with any MPI application because its performance is independent of the quality of the MPI implementation: it uses NVSHMEM, a communication library based on the OpenSHMEM standard and designed for NVIDIA GPUs.

cuFFTDx

To improve performance by avoiding unnecessary trips to global memory and allowing fusion of FFT kernels with other operations, check out the cuFFT device extensions (cuFFTDx). Part of the Math Libraries Device Extensions, it allows applications to compute FFTs inside user kernels. 

Optimize standard mathematical functions with CUDA Math API

The CUDA Math API is a collection of standard mathematical functions optimized for every NVIDIA GPU architecture. All of the CUDA libraries rely on the CUDA Math Library. The CUDA Math API supports all C99 standard float and double math functions in all rounding modes, along with additional functions such as trigonometric and exponential variants (cospi, sincos) and inverse error functions (erfinv, erfcinv). 
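
For illustration, the hypothetical kernel below uses a few of these device functions; the input and output arrays are assumptions and must be allocated by the caller.

#include <cuda_runtime.h>

// sincospi computes sin(pi*x) and cos(pi*x) in one call; erfinv is the inverse error function
__global__ void math_demo(const double* in, double* s_out, double* c_out,
                          double* erf_out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    double s, c;
    sincospi(in[i], &s, &c);
    s_out[i] = s;
    c_out[i] = c;
    erf_out[i] = erfinv(in[i]);  // defined for inputs in (-1, 1)
  }
}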

Customize code using C++ templates with CUTLASS

Matrix multiplications are the foundation of many scientific computations. These multiplications are particularly important in efficient implementation of deep learning algorithms. Similar to cuBLAS, CUDA Templates for Linear Algebra Subroutines (CUTLASS) comprises a set of linear algebra routines to carry out efficient computation and scaling. 

It incorporates strategies for hierarchical decomposition and data movement similar to those used to implement cuBLAS and cuDNN. However, unlike cuBLAS, CUTLASS is highly modular and reconfigurable. It decomposes the moving parts of GEMM into fundamental components available as C++ template classes, giving you the flexibility to customize your algorithms.

The software is pipelined to hide latency and maximize data reuse. Shared memory is accessed without bank conflicts to maximize data throughput, keep the memory footprint small, and let you design your application exactly the way you want. To learn more about using CUTLASS to improve the performance of your application, see CUTLASS: Fast Linear Algebra in CUDA C++.
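
The sketch below mirrors the basic single-precision GEMM pattern from the CUTLASS examples; the device pointers, leading dimensions, and sizes are assumptions prepared elsewhere.

#include <cutlass/gemm/device/gemm.h>

// A column-major SGEMM instantiated from CUTLASS template classes
using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C

cutlass::Status run_gemm(int M, int N, int K,
                         float alpha, const float* d_A, int lda,
                         const float* d_B, int ldb,
                         float beta, float* d_C, int ldc)
{
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},        // problem size
                       {d_A, lda},       // tensor-ref for A
                       {d_B, ldb},       // tensor-ref for B
                       {d_C, ldc},       // tensor-ref for source C
                       {d_C, ldc},       // tensor-ref for destination D
                       {alpha, beta});   // scalars
  return gemm_op(args);                  // launches the GEMM kernel
}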

Compute differential equations with AmgX

AmgX provides a GPU-accelerated algebraic multigrid (AMG) library and is supported on a single GPU or multiple GPUs across distributed nodes. It allows users to create complex nested solvers, smoothers, and preconditioners. The library implements classical and aggregation-based algebraic multigrid methods with different smoothers, such as block-Jacobi, Gauss-Seidel, and dense LU.

This library also contains preconditioned Krylov subspace iterative methods such as PCG and BICGStab. AmgX provides up to 10x acceleration to the computationally intense linear solver portion of simulations and is well-suited for implicit unstructured methods.

AmgX was specifically developed for CFD applications and can be used in domains such as energy, physics, and nuclear safety. A real-life example of the AmgX library is in solving the Poisson Equation for small-scale to large-scale computing problems.

The flying snake simulation example shows the reduction in time and cost when using the AmgX wrapper on GPUs to accelerate CFD codes: a 21x speed-up with 3 million mesh points on one K20 GPU compared to one 12-core CPU node. 

Get started with NVIDIA Math Libraries 

We continue working to improve the NVIDIA Math Libraries. If you have questions or a new feature request, contact Product Manager Matthew Nicely.

Acknowledgements

We would like to thank Matthew Nicely for his guidance and active feedback. A special thank you to Anita Weemaes for all her feedback and her continued support throughout.

Categories
Misc

A Guide to Understanding Essential Speech AI Terms

How is speech AI related to AI, ML, and DL? A quick guide to need-to-know speech AI terminology like automatic speech recognition and text-to-speech.


Speech AI is the technology that makes it possible to communicate with computer systems using your voice. Commanding an in-car assistant or handling a smart home device? An AI-enabled voice interface helps you interact with devices without having to type or tap on a screen.

The field of speech AI is relatively new. But as voice interaction matures and expands to new devices and platforms, it’s important for developers to keep up with the evolving terminology.

In this explainer, I present key concepts from the world of speech AI, describe where it is situated in the bigger universe of AI, and discuss how it relates to other fields of science and technology.

Foundational concepts

You might have heard of, or even be familiar with, these technologies, but for the sake of completeness, here are the basics:

  • Artificial intelligence (AI) refers to the broad discipline of creating intelligent machines that either match or exceed human-level cognitive abilities.
  • Machine learning (ML) is a subfield of AI that involves creating methods and systems that learn how to carry out specific tasks using past data.
  • Deep learning (DL) is a family of ML methods based on artificial neural networks with many layers and usually trained with massive amounts of data.

How are speech AI systems related to AI, ML, and DL?

Speech AI is the use of AI for voice-based technologies. Core components of a speech AI system include:

  • An automatic speech recognition (ASR) system, also known as speech-to-text, speech recognition, or voice recognition. This converts the speech audio signal into text. 
  • A text-to-speech (TTS) system, also known as speech synthesis. This turns a text into a verbal, audio form.

Speech AI is a subfield within conversational AI, drawing its techniques primarily from the fields of DL and ML. The relationship between AI, ML, DL, and speech AI can be represented by the Venn diagram in Figure 1.

Venn diagram illustrating the relationship between AI, machine learning, deep learning, conversational AI, speech AI, and NLP.
Figure 1. The relationship between AI, ML, DL, and speech AI

Figure 1 shows that conversational AI is the larger universe of language-based applications, not all of which include a voice (speech) component.

Here’s how speech AI technologies work side by side with other tools and techniques to form a complete conversational AI system.

Conversational AI

Conversational AI is the discipline that involves designing intelligent systems capable of interacting with human users through natural language in a conversational fashion. Commercial examples include home assistants and chatbots (for example, an insurance claim chatbot or travel agent chatbot).

There can be multiple modalities for conversation, including audio, text, and sign language, but when the input and output are spoken natural language, you have a voice-based conversational AI system (Figure 2).

Diagram showing the components of a conversational AI system with a voice interface added.
Figure 2. A voice-based conversational AI system

The components of a typical voice-based conversational AI system include the following:

  • A speech interface, enabled by speech AI technologies, enables the system to interact with users through a spoken natural-language format.
  • A dialog system manages the conversation with the user while interacting with external fulfillment systems to satisfy the user’s needs. It consists of two components:
    • A natural language understanding (NLU) module parses the text and identifies relevant information, such as the intent of the user and any relevant parameters for that intent. For example, if the user asks, “What’s the weather tomorrow morning?”, then “weather information” is the intent, while time is a relevant parameter to extract from the request, which is “tomorrow morning” in this case.
    • NLU is part of natural language processing (NLP), a subfield of linguistics and artificial intelligence concerned with computational methods to process and analyze natural language data.
    • A dialog manager monitors the state of the conversation and decides which action to take next. The dialog manager takes information from the NLU module, remembers the context, and fulfills the user’s request.
  • The fulfillment engines execute the tasks that are functional to the conversational AI system, for instance: retrieving weather information, reading news, booking tickets, providing stock market information, answering trivia Q&A and much more. In general, they are not considered part of the conversational AI system, but work closely together to satisfy the user’s needs.

Speech AI concepts

In this section, we dive into concepts specific to speech AI: automatic speech recognition and text-to-speech.

Automatic speech recognition

A typical deep learning-based ASR pipeline includes five main components (Figure 3).

Diagram showing how a speech input is filtered through various stages like feature extraction and inverse text normalization to produce text output in ASR.
Figure 3. Anatomy of a deep learning-based ASR pipeline

Feature extractor

A feature extractor segments the audio signal into fixed-length blocks (also known as time steps) and then converts the blocks from the temporal domain to the frequency domain. 

Acoustic model

This machine learning model (usually a multi-layer deep neural network) predicts the probabilities over characters at each time step of the audio data.

Decoder and language model

A decoder converts the matrix of probabilities given by the acoustic model into a sequence of characters, which in turn make words and sentences.

A language model (LM) can give a score indicating the likelihood of a sentence appearing in its training corpus. For example, an LM trained on an English corpus will judge “Recognize speech” as more likely than “Wreck a nice peach,” while also judging “Je suis un étudiant” as quite unlikely (because it is a French sentence).

When coupled with an LM, a decoder can correct what it “hears” (“I’ve got rose beef for lunch”) to what makes more common sense (“I’ve got roast beef for lunch”), because the LM gives a higher score to the latter sentence than the former.

Punctuation and capitalization model

The punctuation and capitalization model adds punctuation to, and capitalizes, the text produced by the decoder.

Inverse text normalization model

Lastly, inverse text normalization (ITN) rules are applied to transform the text in verbal format into a desired written format, for example, “ten o’clock” to “10:00,” or “ten dollars” to “$10”.

Other ASR concepts

Word error rate (WER) and character error rate (CER) are typical performance metrics of ASR systems.

WER is the number of errors divided by the total number of spoken words. For example, if there are five errors in a total of 50 spoken words, the WER would be 10%.
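
Written out, with S, D, and I denoting substitution, deletion, and insertion errors and N the number of words in the reference transcript (labels added here for clarity), the usual formula is:

\mathrm{WER} = \frac{S + D + I}{N}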

CER operates similarly, except on characters instead of words. It is particularly useful for languages like Japanese and Mandarin that do not have “words” separated by a specific marker or delimiter (like spaces in English).

Diagram of a two-stage deep learning-based TTS pipeline with a spectrogram generator and a vocoder.
Figure 4. Anatomy of a two-stage deep-learning-based TTS pipeline

Text-to-speech (TTS) 

Text-to-speech is commonly achieved using one of two approaches:

  • A two-stage pipeline: Two networks are trained separately for converting text to speech: a spectrogram generator network and a vocoder network.
  • An end-to-end pipeline: Uses one model to generate audio straight from text.

The components of a two-stage pipeline are:

  • Text normalization model: Transforms the text in written format into a verbal format, for example, “10:00” to “ten o’clock”, “$10” to “ten dollars”. This is the opposite process of ITN.
  • Spectrogram generator network: The first stage of the TTS pipeline uses a neural network to generate a spectrogram from text.
  • Vocoder network: The second stage of the TTS pipeline takes the spectrogram from the spectrogram generator network as input and generates natural-sounding speech.

Speech Synthesis Markup Language

Other TTS concepts include Speech Synthesis Markup Language (SSML), which is an XML-based markup language that lets you specify how input text is converted into synthesized speech. Your configuration can make the generated synthetic speech more expressive using parameters such as pitch, pronunciation, speaking rate, and volume.

Common SSML tags include the following:

  • Prosody is used to customize the pitch, speaking rate, and volume of the generated speech.
  • Phoneme is used to manually override the pronunciation of words in the generated synthetic voice.

Mean opinion score

To assess the quality of TTS engines, the mean opinion score (MOS) is frequently used. Originating from the telecommunications field, MOS is defined as the arithmetic mean of ratings given by human evaluators for a provided stimulus in a subjective quality evaluation test. 

For example, a common TTS evaluation setup is a group of people listening to generated samples and giving each sample a score from 0 to 5. MOS is then calculated as the average score over all evaluators and test samples.

How to get started with speech AI

Speech AI has become mainstream and an integral part of consumers’ everyday lives. Businesses are discovering new ways of adding value to their products by incorporating speech AI capabilities.

The best way to gain expertise in speech AI is to experience it. For more information about how to build and deploy real-time speech AI pipelines for your conversational AI application, see the free Building Speech AI Applications ebook.

Categories
Misc

Synchronizing Present Calls Between Applications on Distributed Systems with DirectX 12

Present barrier provides an easy way of synchronizing present calls between application windows on the same machine, as well as on distributed systems.

Swap groups and swap barriers are well-known methods to synchronize buffer swaps between different windows on the same system and on distributed systems, respectively. Initially introduced for OpenGL, they were later extended through public NvAPI interfaces and supported in DirectX 9 through 12.

NVIDIA now introduces the concept of present barriers. They combine swap groups and swap barriers and provide a simple way to set up synchronized present calls within and between systems.

When an application requests to join the present barrier, the driver tries to set up either a swap group or a combination of a swap group and a swap barrier, depending on the current system configuration. The functions are again provided through public NvAPI interfaces.

The present barrier is only effective when an application is in a full-screen state with no window borders, as well as no desktop scaling or taskbar composition. If at least one of these requirements is not met, the present barrier disengages and reverts to a pending state until they all are. When the present barrier is in the pending state, no synchronization across displays happens.

Similarly, the present barrier works correctly only when displays are attached to the same GPU and set to the same timing. Displays can also be synchronized with either the Quadro Sync card or the NVLink connector.

Display synchronization occurs in one of two ways:

  • The displays have been configured to form a synchronized group or synchronized to an external sync source, or both, using the Quadro Sync add-on board.
  • The displays have been synchronized by creating a Mosaic display surface spanning the displays.

When the display timings have been synchronized through one of these methods, then the DX12 present barrier is available to use.

NvAPI interfaces

To set up synchronized present calls through the present barrier extension in NvAPI, the app must first make sure that the present barrier is supported. If it is, it must create a present barrier client, register the needed DirectX resources, and join the present barrier.

Query present barrier support

Before any attempt to synchronize present calls, the application should first check whether present barrier synchronization is supported on the current OS, driver, and hardware configuration. This is done by calling the corresponding function with the desired D3D12 device as a parameter.

ID3D12Device* device;
... // initialize the device
bool supported;
assert(NvAPI_D3D12_QueryPresentBarrierSupport(device, &supported) == NVAPI_OK);
if(supported) {
  LOG("D3D12 present barrier is supported on this system.");
  ...
}

Create a present barrier client handle

If the system offers present barrier support, the app can create a present barrier client by supplying the D3D12 device and DXGI swap chain. The handle is used to register needed resources, join or leave the present barrier, and query frame statistics.

IDXGISwapChain* swapChain;
... // initialize the swap chain
NvPresentBarrierClientHandle pbClientHandle = nullptr;
assert(NvAPI_D3D12_CreatePresentBarrierClient(device, swapChain, &pbClientHandle) == NVAPI_OK);

Register present barrier resources

After client creation, the present barrier needs access to the swap chain’s buffer resources and a fence object for proper frame synchronization. The fence value is incremented by the present barrier at each frame and must not be changed by the app. However, the app may use it to synchronize command allocator usage between the host and device. The function must be called again whenever the swap chain’s buffers change.

ID3D12Fence* pbFence; // the app may wait on the fence but must not signal it
assert(SUCCEEDED(device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&pbFence))));
ID3D12Resource** backBuffers;
unsigned int backBufferCount;
... // query buffers from swap chain
assert(NvAPI_D3D12_RegisterPresentBarrierResources(pbClientHandle, pbFence, backBuffers, backBufferCount) == NVAPI_OK);

Join the present barrier

After creating the present barrier client handle and registering the scanout resources, the application can join present barrier synchronization. Future present calls are then synchronized with other clients.

NV_JOIN_PRESENT_BARRIER_PARAMS params = {};
params.dwVersion = NV_JOIN_PRESENT_BARRIER_PARAMS_VER1;
assert(NvAPI_JoinPresentBarrier(pbClientHandle, &params) == NVAPI_OK);

Leave the present barrier

A similar function exists to leave present barrier synchronization. The client is left intact such that the app can easily join again.

assert(NvAPI_LeavePresentBarrier(pbClientHandle) == NVAPI_OK);

Application’s main loop

When everything is set up, the app can execute its main loop without any changes, including the present call. The present barrier handles synchronization by itself. While the app can choose to use the fence provided to the present barrier for host and device synchronization, it is also practical to use its own dedicated one.
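
A minimal frame-loop sketch, assuming the command queue, swap chain, fence event, and command recording are set up elsewhere, could look like the following. The dedicated fence frameFence and the helper RecordAndExecuteCommands are hypothetical names used only for illustration.

ID3D12Fence* frameFence;
HANDLE fenceEvent;
... // create frameFence (via CreateFence) and fenceEvent elsewhere
UINT64 frameIndex = 0;

while (running) {
  RecordAndExecuteCommands();   // app-specific rendering, assumed elsewhere

  swapChain->Present(1, 0);     // the present barrier synchronizes this call

  // Host/device synchronization with the app's own fence, leaving pbFence untouched
  ++frameIndex;
  commandQueue->Signal(frameFence, frameIndex);
  if (frameFence->GetCompletedValue() < frameIndex) {
    frameFence->SetEventOnCompletion(frameIndex, fenceEvent);
    WaitForSingleObject(fenceEvent, INFINITE);
  }
}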

Query statistics

While the client is registered to the present barrier, the app can query frame and synchronization statistics at any time to make sure that everything works as intended.

NV_PRESENT_BARRIER_FRAME_STATISTICS stats = {};
stats.dwVersion = NV_PRESENT_BARRIER_FRAME_STATICS_VER1;
assert(NvAPI_QueryPresentBarrierFrameStatistics(pbClientHandle, &stats) == NVAPI_OK);

The present barrier statistics object filled by the function call supplies several useful values.

  • SyncMode: The present barrier mode of the client from the last present call. Possible values:
    • PRESENT_BARRIER_NOT_JOINED: The client has not joined the present barrier.
    • PRESENT_BARRIER_SYNC_CLIENT: The client joined the present barrier but is not synchronized with any other clients.
    • PRESENT_BARRIER_SYNC_SYSTEM: The client joined the present barrier and is synchronized with other clients within the system.
    • PRESENT_BARRIER_SYNC_CLUSTER: The client joined the present barrier and is synchronized with other clients within the system and across systems.
  • PresentCount: The total count of times that a frame has been presented from the client after it joined the present barrier successfully.
  • PresentInSyncCount: The total count of frames presented from the client while the returned SyncMode has been PRESENT_BARRIER_SYNC_SYSTEM or PRESENT_BARRIER_SYNC_CLUSTER. It resets to 0 if SyncMode switches away from those values.
  • FlipInSyncCount: The total count of flips from the client while the returned SyncMode has been PRESENT_BARRIER_SYNC_SYSTEM or PRESENT_BARRIER_SYNC_CLUSTER. It resets to 0 if SyncMode switches away from those values.
  • RefreshCount: The total count of v-blanks while the returned SyncMode of the client has been PRESENT_BARRIER_SYNC_SYSTEM or PRESENT_BARRIER_SYNC_CLUSTER. It resets to 0 if SyncMode switches away from those values.

Sample application

A dedicated sample app is available in the NVIDIA DesignWorks Samples GitHub repo. It features an adjustable, moving pattern of colored bars and columns for a visual check of synchronization quality (Figure 1). The app also supports alternate frame rendering on multi-GPU setups and stereoscopic rendering. At runtime, it can join or leave present barrier synchronization.

For source code and usage details, see the project at nvpro-samples/dx12_present_barrier.

Screenshot of a black screen with a vertical red bar, horizontal green bar, and statistics in the upper left corner.
Figure 1. Sample application with moving bars and lines, and real-time statistics.

Conclusion

Present barrier synchronization is an easy, high-level way to realize synchronized present calls on multiple displays, in both single-system and distributed multi-system scenarios. The interface is fully contained in the NvAPI library and consists of only six setup functions, while the complex management concepts are hidden from the user-facing code.

Categories
Misc

What Is an Exaflop?

Computers are crunching more numbers than ever to crack the most complex problems of our time — how to cure diseases like COVID and cancer, mitigate climate change, and more. These and other grand challenges ushered computing into today’s exascale era, when top performance is often measured in exaflops. So, what’s an exaflop?
