Categories
Misc

NVIDIA Studio Laptops Offer Students AI, Creative Capabilities That Are Best in… Class

Selecting the right laptop is a lot like trying to pick the right major. Both can be challenging tasks where choosing wrongly costs countless hours. But pick the right one, and graduation is just around the corner. The tips below can help the next generation of artists select the ideal NVIDIA Studio laptop to maximize performance for the critical workload demands of their unique creative fields — all within budget.

Categories
Misc

Welcome Back, Commander: ‘Command & Conquer Remastered Collection’ Joins GeForce NOW

Take a trip down memory lane this week with an instantly recognizable classic, Command & Conquer Remastered Collection, joining the nearly 20 Electronic Arts games streaming from the GeForce NOW library. Speaking of remastered, GeForce NOW members can enhance their gameplay further with improved resolution scaling in the 2.0.43 app update.

Categories
Misc

How’s That? Startup Ups Game for Cricket, Football and More With Vision AI

Sports produce a slew of data. In a game of cricket, for example, each play generates millions of video-frame data points for a sports analyst to scrutinize, according to Masoumeh Izadi, managing director of deep-tech startup TVConal. The Singapore-based company uses NVIDIA AI and computer vision to power its sports video analytics platform.

Categories
Misc

Just Released: HPC SDK v22.7 with AWS Graviton3 C7g Support

Enhancements, fixes, and new support for AWS Graviton3 C7g instances, Arm SVE, Rocky Linux OS, OpenMP Tools visibility in Nsight Developer Tools, and more.

Categories
Misc

Enhanced Image Analysis with Multidimensional Image Processing

Two dimensions are often insufficient for analyzing image data. cuCIM is an open-source, accelerated computer vision and image-processing software library for multidimensional images.

Image data can generally be described through two dimensions (rows and columns), with a possible additional dimension for the colors red, green, blue (RGB). However, sometimes further dimensions are required for more accurate and detailed image analysis in specific applications and domains.

For example, you may want to study a three-dimensional (3D) volume, measuring the distance between two parts or modeling how that 3D volume changes over time (the fourth dimension). In these instances, you need more than two dimensions to make sense of what you are seeing.

Multidimensional image processing, or n-dimensional image processing, is the broad term for analyzing, extracting, and enhancing useful information from image data with two or more dimensions. It is particularly helpful and needed for medical imaging, remote sensing, material science, and microscopy applications.

Some methods in these applications may involve data from more channels than traditional grayscale, RGB, or red, green, blue, alpha (RGBA) images. N-dimensional image processing helps you study and make informed decisions using devices enabled with identification, filtering, and segmentation capabilities. 

Multidimensional image processing extends traditional two-dimensional filtering functions to higher-dimensional data in scientific applications. Within medical imaging specifically, computed tomography (CT) and magnetic resonance imaging (MRI) scans require multidimensional image processing to form images of the body and its functions. For example, multidimensional image processing is used in medical imaging to detect cancer or estimate tumor size (Figure 1).

Figure 1. Tissue can be rendered and examined quickly in multidimensional digital pathology use cases

Multidimensional image processing developer challenges

Outside of identifying, acquiring, and storing the image data itself, working with multidimensional image data comes with its own set of challenges.

First, multidimensional images are larger than their 2D counterparts and typically high resolution, so loading them into memory and accessing them is time-consuming.

Second, each additional dimension of image data requires more time and processing power, because analyzing more dimensions enlarges the scope of the computation.

Third, computer-vision and image-processing algorithms, including their low-level operations and primitives, take longer to run for each additional dimension. The complexity of multidimensional filters, gradients, and histograms grows with each added dimension.
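As a back-of-the-envelope example, a dense filter of width 7 touches 7² = 49 samples per output element in 2D, 7³ = 343 in 3D, and 7⁴ = 2,401 in 4D, so the per-element cost multiplies with every added dimension.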

Finally, once the data is manipulated, visualizing multidimensional datasets is further complicated by the additional dimensions under consideration and the quality to which the data must be rendered. In biomedical imaging, the required level of detail can make the difference in identifying cancerous cells or damaged organ tissue.

Multidimensional input/output

If you’re a data scientist or researcher working in multidimensional image processing, you need software that can make data loading and handling for large image files efficient. Popular multidimensional file formats include the following:

  • NumPy binary format (.npy)
  • Tag Image File Format (TIFF)
  • TFRecord (.tfrecord)
  • Zarr
  • Variants of the formats listed above

Because every pixel counts, you have to process image data accurately with all the processing power available. Graphics processing unit (GPU) hardware gives you the processing power and efficiency needed to handle the workload of analyzing complex, multidimensional image data in real time.

cuCIM

Compute Unified Device Architecture Clara IMage (cuCIM) is an open-source, accelerated, computer-vision and image-processing software library that uses the processing power of GPUs to address the needs and pain points of developers working with multidimensional images.

Data scientists and researchers need software that is fast, easy to use, and reliable for an increasing workload. While specifically tuned for biomedical applications, cuCIM can be used for geospatial, material and life sciences, and remote sensing use cases.

cuCIM offers 200+ computer-vision and image-processing functions for color conversion, exposure, feature extraction, measuring, segmentation, restoration, and transforms. 

cuCIM is capable, fast image-processing software that requires minimal changes to your existing pipeline. It equips you with enhanced digital image-processing capabilities that can be integrated into existing workflows:

You can integrate using either a C++ or Python application programming interface (API) that matches OpenSlide for I/O and scikit-image for processing in Python. 

The cuCIM Python bindings offer many commonly used computer-vision and image-processing functions that can be easily integrated into developer workflows.

You don’t have to learn a new interface or programming language to use cuCIM. In most instances, only one line of code is added for transferring images to the GPU. The cuCIM coding structure is nearly identical to that used for the CPU, so there’s little change needed to take advantage of the GPU-enabled capabilities.
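As a rough sketch of that workflow (the specific filter and test image here are illustrative; cuCIM's Python API mirrors scikit-image, and CuPy handles the array transfer):

import cupy as cp
from skimage import data, filters as cpu_filters        # CPU path (scikit-image)
from cucim.skimage import filters as gpu_filters        # GPU path (cuCIM)

image = data.camera()                 # example 2D image as a NumPy array

# CPU: scikit-image
edges_cpu = cpu_filters.sobel(image)

# GPU: one extra line transfers the image to the GPU; the call itself is unchanged
image_gpu = cp.asarray(image)
edges_gpu = gpu_filters.sobel(image_gpu)

# Copy the result back to host memory when needed
edges_host = cp.asnumpy(edges_gpu)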

Because cuCIM is also enabled for GPUDirect Storage (GDS), you can efficiently transfer and write data directly from storage to the GPU without making an intermediate copy in host (CPU) memory. That saves time on I/O tasks.

With its quick set-up, cuCIM provides the benefit of GPU-accelerated image processing and efficient I/O with minimal developer effort and with no low-level compute unified device architecture (CUDA) programming required.

Free downloads and resources

cuCIM can be downloaded for free through Conda or PyPI. For more information, see the cuCIM developer page. You’ll learn about developer challenges, primitives, and use cases and get links to references and resources.

Categories
Offsites

Look and Talk: Natural Conversations with Google Assistant

In natural conversations, we don’t say people’s names every time we speak to each other. Instead, we rely on contextual signaling mechanisms to initiate conversations, and eye contact is often all it takes. Google Assistant, now available in more than 95 countries and over 29 languages, has primarily relied on a hotword mechanism (“Hey Google” or “OK Google”) to help more than 700 million people every month get things done across Assistant devices. As virtual assistants become an integral part of our everyday lives, we’re developing ways to initiate conversations more naturally.

At Google I/O 2022, we announced Look and Talk, a major development in our journey to create natural and intuitive ways to interact with Google Assistant-powered home devices. This is the first multimodal, on-device Assistant feature that simultaneously analyzes audio, video, and text to determine when you are speaking to your Nest Hub Max. Using eight machine learning models together, the algorithm can differentiate intentional interactions from passing glances in order to accurately identify a user’s intent to engage with Assistant. Once within 5ft of the device, the user may simply look at the screen and talk to start interacting with the Assistant.

We developed Look and Talk in alignment with our AI Principles. It meets our strict audio and video processing requirements, and like our other camera sensing features, video never leaves the device. You can always stop, review and delete your Assistant activity at myactivity.google.com. These added layers of protection enable Look and Talk to work just for those who turn it on, while keeping your data safe.

Google Assistant relies on a number of signals to accurately determine when the user is speaking to it. On the right is a list of signals used with indicators showing when each signal is triggered based on the user’s proximity to the device and gaze direction.

Modeling Challenges
The journey of this feature began as a technical prototype built on top of models developed for academic research. Deployment at scale, however, required solving real-world challenges unique to this feature. It had to:

  1. Support a range of demographic characteristics (e.g., age, skin tones).
  2. Adapt to the ambient diversity of the real world, including challenging lighting (e.g., backlighting, shadow patterns) and acoustic conditions (e.g., reverberation, background noise).
  3. Deal with unusual camera perspectives, since smart displays are commonly used as countertop devices and look up at the user(s), unlike the frontal faces typically used in research datasets to train models.
  4. Run in real-time to ensure timely responses while processing video on-device.

The evolution of the algorithm involved experiments with approaches ranging from domain adaptation and personalization to domain-specific dataset development, field-testing and feedback, and repeated tuning of the overall algorithm.

Technology Overview
A Look and Talk interaction has three phases. In the first phase, Assistant uses visual signals to detect when a user is demonstrating an intent to engage with it and then “wakes up” to listen to their utterance. The second phase is designed to further validate and understand the user’s intent using visual and acoustic signals. If any signal in the first or second processing phases indicates that it isn’t an Assistant query, Assistant returns to standby mode. These two phases are the core Look and Talk functionality, and are discussed below. The third phase of query fulfillment is typical query flow, and is beyond the scope of this blog.

Phase One: Engaging with Assistant
The first phase of Look and Talk is designed to assess whether an enrolled user is intentionally engaging with Assistant. Look and Talk uses face detection to identify the user’s presence, filters for proximity using the detected face box size to infer distance, and then uses the existing Face Match system to determine whether they are enrolled Look and Talk users.

For an enrolled user within range, a custom eye gaze model determines whether they are looking at the device. This model estimates both the gaze angle and a binary gaze-on-camera confidence from image frames using a multi-tower convolutional neural network architecture, with one tower processing the whole face and another processing patches around the eyes. Because the device screen covers a region underneath the camera that is natural for a user to look at, we map the gaze angle and binary gaze-on-camera prediction to the device screen area. To make the final prediction resilient to spurious individual predictions, involuntary blinks, and saccades, we apply a smoothing function to the individual frame-based predictions.

Eye-gaze prediction and post-processing overview.
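As an illustration only (the production smoothing function is not described in detail), a simple exponential moving average over per-frame gaze-on-camera scores shows how a single low-confidence frame can be prevented from flipping the decision:

def smooth_gaze_on_camera(frame_scores, history_weight=0.8, threshold=0.7):
    # frame_scores: per-frame gaze-on-camera confidences in [0, 1]
    decisions = []
    state = frame_scores[0] if frame_scores else 0.0
    for score in frame_scores:
        # Weight history heavily so one blink or saccade cannot flip the decision
        state = history_weight * state + (1.0 - history_weight) * score
        decisions.append(state >= threshold)
    return decisions

# A momentary blink (the 0.1 frame) does not break an otherwise steady gaze:
print(smooth_gaze_on_camera([0.9, 0.95, 0.1, 0.9, 0.92]))  # [True, True, True, True, True]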

To minimize false triggers, e.g., when a passing user briefly glances at the device, we enforce stricter attention requirements before informing users that the system is ready for interaction. Once the user looking at the device starts speaking, we relax the attention requirement, allowing the user to naturally shift their gaze.

The final signal necessary in this processing phase checks that the Face Matched user is the active speaker. This is provided by a multimodal active speaker detection model that takes as input both video of the user’s face and the audio containing speech, and predicts whether they are speaking. A number of augmentation techniques (including RandAugment, SpecAugment, and augmenting with AudioSet sounds) help improve prediction quality for the in-home domain, boosting end-feature performance by over 10%. The final deployed model is a quantized, hardware-accelerated TFLite model, which uses five frames of context for the visual input and 0.5 seconds for the audio input.

Active speaker detection model overview: The two-tower audiovisual model provides the “speaking” probability prediction for the face. The visual network auxiliary prediction pushes the visual network to be as good as possible on its own, improving the final multimodal prediction.

Phase Two: Assistant Starts Listening
In phase two, the system starts listening to the content of the user’s query, still entirely on-device, to further assess whether the interaction is intended for Assistant using additional signals. First, Look and Talk uses Voice Match to further ensure that the speaker is enrolled and matches the earlier Face Match signal. Then, it runs a state-of-the-art automatic speech recognition model on-device to transcribe the utterance.

The next critical processing step is the intent understanding algorithm, which predicts whether the user’s utterance was intended to be an Assistant query. This has two parts: 1) a model that analyzes the non-lexical information in the audio (i.e., pitch, speed, hesitation sounds) to determine whether the utterance sounds like an Assistant query, and 2) a text analysis model that determines whether the transcript is an Assistant request. Together, these filter out queries not intended for Assistant. The algorithm also uses contextual visual signals to determine the likelihood that the interaction was intended for Assistant.

Overview of the semantic filtering approach to determine if a user utterance is a query intended for the Assistant.
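As a purely hypothetical sketch (the model objects, their interfaces, and the equal-weight fusion are assumptions for illustration, not the production system), the two-part filter can be pictured as a late fusion of an acoustic score and a text score:

def is_assistant_query(audio_features, transcript,
                       acoustic_model, text_model, threshold=0.5):
    # Hypothetical models: one scores prosody (pitch, speed, hesitations),
    # the other scores whether the transcript reads like an Assistant request.
    p_acoustic = acoustic_model.predict_proba(audio_features)
    p_text = text_model.predict_proba(transcript)
    combined = 0.5 * p_acoustic + 0.5 * p_text   # simple late fusion for illustration
    return combined >= threshold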

Finally, when the intent understanding model determines that the user utterance was likely meant for Assistant, Look and Talk moves into the fulfillment phase where it communicates with the Assistant server to obtain a response to the user’s intent and query text.

Performance, Personalization and UX
Each model that supports Look and Talk was evaluated and improved in isolation and then tested in the end-to-end Look and Talk system. The huge variety of ambient conditions in which Look and Talk operates necessitates the introduction of personalization parameters for algorithm robustness. By using signals obtained during the user’s hotword-based interactions, the system personalizes parameters to individual users to deliver improvements over the generalized global model. This personalization also runs entirely on-device.

Without a predefined hotword as a proxy for user intent, latency was a significant concern for Look and Talk. Often, a strong enough interaction signal does not occur until well after the user has started speaking, which can add hundreds of milliseconds of latency, and existing models for intent understanding add to this since they require complete, not partial, queries. To bridge this gap, Look and Talk completely forgoes streaming audio to the server, with transcription and intent understanding being on-device. The intent understanding models can work off of partial utterances. This results in an end-to-end latency comparable with current hotword-based systems.

The UI experience is based on user research to provide well-balanced visual feedback with high learnability. This is illustrated in the figure below.

Left: The spatial interaction diagram of a user engaging with Look and Talk. Right: The User Interface (UI) experience.

We developed a diverse video dataset with over 3,000 participants to test the feature across demographic subgroups. Modeling improvements driven by diversity in our training data improved performance for all subgroups.

Conclusion
Look and Talk represents a significant step toward making user engagement with Google Assistant as natural as possible. While this is a key milestone in our journey, we hope this will be the first of many improvements to our interaction paradigms that will continue to reimagine the Google Assistant experience responsibly. Our goal is to make getting help feel natural and easy, ultimately saving time so users can focus on what matters most.

Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, UX, and cross-functional contributors. Key contributors from Google Assistant include Alexey Galata, Alice Chuang‎, Barbara Wang, Britanie Hall, Gabriel Leblanc, Gloria McGee, Hideaki Matsui, James Zanoni, Joanna (Qiong) Huang, Krunal Shah, Kavitha Kandappan, Pedro Silva, Tanya Sinha, Tuan Nguyen, Vishal Desai, Will Truong‎, Yixing Cai‎, Yunfan Ye; from Research including Hao Wu, Joseph Roth, Sagar Savla, Sourish Chaudhuri, Susanna Ricco. Thanks to Yuan Yuan and Caroline Pantofaru for their leadership, and everyone on the Nest, Assistant, and Research teams who provided invaluable input toward the development of Look and Talk.

Categories
Misc

Developing NLP Applications to Enhance Clinical Experiences and Accelerate Drug Discovery

Discover tools to translate unstructured data to structured data to help healthcare organizations harness relevant insights and improve healthcare delivery and patient experiences.

Natural language processing (NLP) can be defined as the combination of artificial intelligence (AI), computer science, and computational linguistics to understand human communication and extract meaning from unstructured spoken or written material. 

NLP use cases for healthcare have increased in the last few years to accelerate the development of therapeutics and improve quality of patient care through language understanding and predictive analytics. 

The healthcare industry generates vast amounts of unstructured data, but it is difficult to derive insights without finding ways to structure and represent that data in a computable form. Developers need the tools to translate unstructured data to structured data to help healthcare organizations harness relevant insights and improve healthcare delivery and patient care.

Transformer-based NLP has emerged as a paradigm shift in the performance of text-based healthcare workflows. Because of its versatility, NLP can structure virtually any proprietary or public data to spark insights in healthcare, leading to a wide variety of downstream applications that directly impact patient care or augment and accelerate drug discovery.

NLP for drug discovery

NLP is playing a critical role in accelerating small molecule drug discovery. Prior knowledge on the manufacturability or contraindications of a drug can be extracted from academic publications and proprietary data sets. NLP can also help with clinical trial analysis and accelerate the process of taking a drug to market.

Transformer architectures are popular in NLP, but these tools can also be used to understand the language of chemistry and biology. For example, text-based representations of chemical structure such as SMILES (Simplified Molecular Input Line Entry System) can be understood by transformer-based architectures, enabling powerful capabilities for drug property evaluation and generative chemistry.
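To make the "chemistry as language" idea concrete, here is a toy tokenizer (a simplified sketch, not the vocabulary used by any particular model) that splits a SMILES string into tokens a transformer could consume:

import re

# Two-character elements first, then bracket atoms, single atoms, and
# bonds/branches/ring-closure digits; simplified for illustration.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOPSFI]|[bcnops]|[=#\-+\(\)/\\\d]")

def tokenize_smiles(smiles: str):
    return SMILES_TOKEN.findall(smiles)

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1',
#  'C', '(', '=', 'O', ')', 'O']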

MegaMolBART, a large transformer model developed by AstraZeneca and NVIDIA, is used for a wide range of tasks, including reaction prediction, molecular optimization, and de novo molecule generation.

Transformer-based NLP models are instrumental in understanding and predicting the structure and function of biomolecules like proteins. Much as they do for natural language, transformer-based representations of protein sequences provide powerful embeddings for use in downstream AI tasks, like predicting the final folded state of a protein, understanding the strength of protein-protein or protein-small molecule interactions, or designing protein structures given a biological target.

NLP for clinical trial insights

Once a drug has been developed, patient data plays a large role in the process of taking it to market. Much of the patient data that is collected through the course of care is contained in free text, such as clinical notes from patient visits or procedural results. 

While these data are easily interpretable by a human, combining insights across clinical free-text documents requires making the information in diverse documents interoperable, so that the patient’s health is represented in a useful way.

Modern NLP algorithms have accelerated our ability to derive these insights, helping to compare patients with similar symptoms, suggest treatments, discover diagnostic near-misses, and provide clinical care navigation and next-best-action prediction. 

NLP to enhance clinical experiences

Many patient interactions with the hospital system are remote, in part due to the growing use of telehealth services that stemmed from COVID-19. Those telehealth visits can be converted into structured information with the help of NLP. 

For physicians and surgeons, speech-to-text capabilities can turn verbal discussions with patients and clinical teams into text, which can then be stored in electronic health records (EHRs). Applications include summarizing patient visits, catching near-misses, and predicting optimal treatment regimens.

Removing the burden of clinical documentation for each patient’s visit allows providers to spend more time and energy offering the best care for each patient, and simultaneously reduces physician burnout. NLP can also help hospitals predict patient outcomes such as readmission or sepsis. 

Learn more about NLP in healthcare

View on-demand sessions from the NVIDIA Healthcare and Life Sciences NLP Developer Summit to learn more about the use of NLP in healthcare. Session topics include best practices and insights for applications from speech AI in clinics to drug discovery.

Browse NVIDIA’s collection of biomedical pre-trained language models, as well as highly optimized pipelines for training NLP models on biomedical and clinical text, in the Clara NLP NGC Collection.

Categories
Misc

Closing the Sim2Real Gap with NVIDIA Isaac Sim and NVIDIA Isaac Replicator

NVIDIA Isaac Replicator, built on the Omniverse Replicator SDK, can help you develop a cost-effective and reliable workflow to train computer vision models using synthetic data.

Synthetic data is an important tool in training machine learning models for computer vision applications. Researchers from NVIDIA have introduced a structured domain randomization system within Omniverse Replicator that can help you train and refine models using synthetic data.

Omniverse Replicator is an SDK built on the NVIDIA Omniverse platform that enables you to build custom synthetic data generation tools and workflows. The NVIDIA Isaac Sim development team used Omniverse Replicator SDK to build NVIDIA Isaac Replicator, a robotics-specific synthetic data generation toolkit, exposed within the NVIDIA Isaac Sim app.

We explored using synthetic data generated from synthetic environments for a recent project. Trimble plans to deploy Boston Dynamics’ Spot in a variety of indoor settings and construction environments. But Trimble had to develop a cost-effective and reliable workflow to train ML-based perception models so that Spot could autonomously operate in different indoor settings.

By generating data from a synthetic indoor environment using structured domain randomization within NVIDIA Isaac Replicator, you can train an off-the-shelf object detection model to detect doors in the real indoor environment.

Figure 1. A few images from the test set of 1,000 real images

Sim2Real domain gap

Given that synthetic data sets are generated using simulation, it is critical to close the gap between the simulation and the real world. This gap is called the domain gap, which can be divided into two pieces:

  • Appearance gap: The pixel-level differences between two images. These differences can be a result of differences in object detail, materials, or, in the case of synthetic data, differences in the capabilities of the rendering system used.
  • Content gap: The difference between the domains. This includes factors like the number of objects in the scene, their diversity of type and placement, and similar contextual information.

A critical tool for overcoming these domain gaps is domain randomization (DR), which increases the size of the domain generated for a synthetic dataset. DR helps ensure that we include the range that best matches reality, including long-tail anomalies. By generating a wider range of data, we might find that a neural network could learn to better generalize across the full scope of the problem​.

The appearance gap can be further closed with high fidelity 3D assets, and ray tracing or path tracing-based rendering, using physically based materials, such as those defined with the MDL. Validated sensor models and domain randomization of their parameters can also help here.

Creating the synthetic scene

We imported the BIM Model of the indoor scene into NVIDIA Isaac Sim from Trimble SketchUp through the NVIDIA Omniverse SketchUp Connector. However, it looked rough with a significant appearance gap between sim and reality. Video 1 shows Trimble_DR_v1.1.usd.

Video 1. Starting indoor scene after it was imported into NVIDIA Isaac Sim from Trimble SketchUp through SketchUp Omniverse Connector

To close the appearance gap, we used NVIDIA MDL to add some textures and materials to the doors, walls, and ceilings. That made the scene look more realistic.

Video 2. Indoor scene after adding MDL materials

To close the content gap between sim and reality, we added props such as desks, office chairs, computer devices, and cardboard boxes to the scene through Omniverse DeepSearch, an AI-enabled service. Omniverse DeepSearch enables you to use natural language inputs and imagery for searching through the entire catalog of untagged 3D assets, objects, and characters.

These assets are publicly available in NVIDIA Omniverse.

Figure 2. Omniverse DeepSearch makes it easy to tag and search across 3D models

We also added ceiling lights to the scene. To capture the variety in door orientation, a domain randomization (DR) component was added to randomize the rotation of the doors, and Xform was used to simulate door hinges. This enabled the doors to open, close, or stay ajar at different angles. Video 3 shows the resulting scene with all the props.

Video 3. Indoor scene after adding props

Synthetic data generation

At this point, the iterative process of synthetic data generation (SDG) started. For the object detection model, we used TAO DetectNet V2 with a ResNet-18 backbone for all the experiments.

We held all model hyperparameters constant at their default values, including the batch size, learning rate, and dataset augmentation config parameters. In synthetic data generation, you iteratively tune the dataset generation parameters instead of the model hyperparameters.

Figure 3. The iterative process of synthetic data generation, where dataset generation parameters are tuned based on feedback from model evaluation

The Trimble v1.3 scene was used to generate 500 ray-traced images; it includes environment props and no DR components except for door rotation. The door texture was held fixed. Training on this dataset resulted in 5% AP on the real test set (~1,000 images).

As you can see from the model’s predictions on real images, the model failed to detect real doors adequately because it had overfit to the texture of the simulated door. The model’s poor performance on a synthetic validation dataset with differently textured doors confirmed this.

Another observation was that the lighting was held steady and constant in simulation, whereas reality has a variety of lighting conditions.

To prevent overfitting to the texture of the doors, we applied randomization to the door texture, randomizing between 30 different wood-like textures. To vary the lighting, we added DR over the ceiling lights to randomize the movement, intensity, and color of lights. Now that we were randomizing the texture of the door, it was important to give the model some learning signal on what makes a door besides its rectangular shape. We added realistic metallic door handles, kick plates, and door frames to all the doors in the scene. Training on 500 images from this improved scene yielded 57% AP on the real test set.
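For readers who want to script this kind of randomization, the hedged sketch below uses the Omniverse Replicator Python API (API names follow the Replicator documentation but should be treated as assumptions; the texture paths and semantic labels are placeholders, not the assets used in this project):

import omni.replicator.core as rep

# Placeholder texture list; the project used ~30 wood-like textures.
WOOD_TEXTURES = [f"omniverse://localhost/Textures/wood_{i:02d}.png" for i in range(30)]

with rep.new_layer():
    # Assumes the door prims carry a "class=door" semantic label.
    doors = rep.get.prims(semantics=[("class", "door")])

    with rep.trigger.on_frame(num_frames=500):      # randomize once per rendered frame
        with doors:
            rep.randomizer.texture(textures=WOOD_TEXTURES)
            # Swing each door to a random angle around its hinge axis (Z here).
            rep.modify.pose(rotation=rep.distribution.uniform((0, 0, 0), (0, 0, 90)))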

Video 4. Indoor scene after adding DR components for door rotation, texture, and color and movement of lights

This model was doing better than before, but it was still making false positive predictions on potted plants and QR codes on the walls in real test images. It was also doing poorly on corridor images where multiple doors were lined up, and it produced many false positives in low-temperature lighting conditions (Figure 5).

To make the model robust to noise like QR codes on walls, we applied DR over the texture of the walls with different textures, including QR codes and other synthetic textures.

We added a few potted plants to the scene. We already had a corridor, so to generate synthetic data from it, two cameras were added along the corridor along with ceiling lights.

We added DR over light temperature, along with intensity, movement, and color, to have the model better generalize in different lighting conditions. We also noticed a variety of floors like shiny granite, carpet, and tiles in real images. To model these, we applied DR to randomize the material of the floor between different kinds of carpet, marble, tiles, and granite materials.

Similarly, we added a DR component to randomize the texture of the ceiling between different colors and different kinds of materials. We also added a DR visibility component to randomly add a few carts in the corridor in simulation, hoping to minimize the model’s false positives over carts in real images.

The synthetic dataset of 4,000 images generated from this scene got around 87% AP on the real test set by training only on synthetic data, achieving decent Sim2Real performance.

Video 5. Final scene with more DR components

Figure 6 shows a few inferences on real images from the final model.

Synthetic data generation in Omniverse

Using Omniverse connectors, MDL, and easy-to-use tools like DeepSearch, it’s possible for ML engineers and data scientists with no background in 3D design to create synthetic scenes.

NVIDIA Isaac Replicator makes it easy to bridge the Sim2Real gap by generating synthetic data with structured domain randomization. This way, Omniverse makes synthetic data generation accessible for you to bootstrap perception-based ML projects.

The approach presented here should be scalable, and it should be possible to increase the number of objects of interest and easily generate new synthetic data every time you want to detect additional new objects.

For more information, see the following resources:

If you have any questions, post them in the Omniverse Synthetic Data, Omniverse Code, or Omniverse Isaac Sim forums.

Categories
Misc

Applying Inference over Specific Frame Regions with NVIDIA DeepStream

This tutorial shares how to apply inference over a predefined area of the incoming video frames.

Detecting objects in high-resolution input is a well-known problem in computer vision. When a certain area of the frame is of interest, inference over the complete frame is unnecessary. There are two ways to solve this issue:

  • Use a large model with a high input resolution.
  • Divide the large image into tiles and apply the smaller model to each one.

In many ways, the first approach is difficult. Training a model with large input often requires larger backbones, making the overall model bulkier. Training or deploying such a model also requires more computing resources. Larger models are deemed unfit for edge deployment on smaller devices.

The second method, dividing the entire image into tiles and applying smaller models to each tile, has obvious advantages. Smaller models require less computing power for training and inference, no retraining is required to apply the model to high-resolution input, and smaller models are considered edge-deployment-friendly. 

In this post, we discuss how NVIDIA DeepStream can help in applying smaller models onto high-resolution input to detect a specific frame region.

Overview of video surveillance systems

Video surveillance systems are used to solve various problems, such as identifying pedestrians and vehicles. Nowadays, 4K and 8K cameras are used to capture details of a scene. The military uses aerial photography for various purposes, and such imagery also covers a large area.

As resolution increases, the number of pixels grows rapidly. It takes a huge amount of computing power to process such a large number of pixels, especially with a deep neural network.

Deep neural networks operate on a fixed-shape input based on the input dimensions selected during model building. This fixed-size input is also known as the receptive field of the model. Typically, receptive fields vary from 256×256 to 1280×1280 and beyond in detection and segmentation networks.

You might find that the region of interest is a small area rather than the entire frame. In this case, applying detection over the entire frame is an unnecessary use of compute resources. The DeepStream NvDsPreprocess plugin enables you to focus compute on a specific area of the frame.

DeepStream NvDsPreprocessing plugin

However, when tiling is applied to images or frames, especially over video feeds, an additional element is required in the inference pipeline. Such an element is expected to perform tiling that can be configured per stream, run batched inference over the tiles, and combine inferences from multiple tiles onto single frames.

Interestingly, all these functionalities are provided in DeepStream with the Gst-NvDsPreprocess customizable plugin. It provides a custom library interface for preprocessing input streams. Each stream can have its own preprocessing requirements.

The default plugin implementation provides the following functionality:

  • Streams with predefined regions of interest (ROIs) or tiles are scaled and format converted as per the network requirements for inference. Per-stream ROIs are specified in the config file.
  • It prepares a raw tensor from the scaled and converted ROIs, which is passed to the downstream plugins through user metadata. Downstream plugins can access this tensor for inference.

DeepStream pipeline with tiling

Modifying the existing code to support tiling is next.

Using the NvdsPreprocessing plugin

Define the preprocess element inside the pipeline:

preprocess = Gst.ElementFactory.make("nvdspreprocess", "preprocess-plugin")

NvDsPreprocess requires a config file as input:

preprocess.set_property("config-file", "config_preprocess.txt")

Add the preprocess element to the pipeline:

pipeline.add(preprocess)

Link the element to the pipeline:

streammux.link(preprocess)
preprocess.link(pgie)

Let the NvdsPreprocess plugin do preprocessing

The inference is done with the NvDsInfer plugin, which has frame preprocessing capabilities.

When you use the NvdsPreprocess plugin before NvDsInfer, you want the preprocessing (scaling or format conversion) to be done by NvdsPreprocess and not NvDsInfer. To do this, set the input-tensor-meta property of NvDsInfer to true. This lets NvdsPreprocess do the preprocessing, and NvDsInfer then uses the preprocessed input tensors attached as metadata instead of preprocessing inside NvDsInfer itself.

The following steps are required to incorporate Gst-nvdspreprocess functionality into your existing pipeline.

Define and add the nvdspreprocess plugin to the pipeline:

preprocess = Gst.ElementFactory.make("nvdspreprocess", "preprocess-plugin")
pipeline.add(preprocess)

Set the input-tensor-meta property of NvDsInfer to true:

pgie.set_property("input-tensor-meta", True)

Define the config file property for the nvdspreprocess plugin:

preprocess.set_property("config-file", "config_preprocess.txt")

Link the preprocess plugin before the primary inference engine (pgie):

streammux.link(preprocess)
preprocess.link(pgie)

Creating the config file

The Gst-nvdspreprocess configuration file uses a key file format. For more information, see the config_preprocess.txt in the Python and C source code.

  • The [property] group configures the general behavior of the plugin.
  • The [group-<id>] group configures ROIs, tiles, and full-frames for a group of streams with src-ids values and a custom-input-transformation-function from the custom lib.
  • The [user-configs] group configures parameters required by the custom library, which are passed to the custom lib through a map of key-value pairs. The custom lib must then parse the values accordingly.

The minimum required config_preprocess.txt looks like the following code example:

[property]
enable=1
target-unique-ids=1
    # 0=NCHW, 1=NHWC, 2=CUSTOM
network-input-order=0
processing-width=960
processing-height=544
scaling-buf-pool-size=6
tensor-buf-pool-size=6
    # tensor shape based on network-input-order
network-input-shape=12;3;544;960
    # 0=RGB, 1=BGR, 2=GRAY
network-color-format=0
    # 0=FP32, 1=UINT8, 2=INT8, 3=UINT32, 4=INT32, 5=FP16
tensor-data-type=0
tensor-name=input_1
    # 0=NVBUF_MEM_DEFAULT 1=NVBUF_MEM_CUDA_PINNED 2=NVBUF_MEM_CUDA_DEVICE 3=NVBUF_MEM_CUDA_UNIFIED
scaling-pool-memory-type=0
    # 0=NvBufSurfTransformCompute_Default 1=NvBufSurfTransformCompute_GPU 2=NvBufSurfTransformCompute_VIC
scaling-pool-compute-hw=0
    # Scaling Interpolation method
    # 0=NvBufSurfTransformInter_Nearest 1=NvBufSurfTransformInter_Bilinear 2=NvBufSurfTransformInter_Algo1
    # 3=NvBufSurfTransformInter_Algo2 4=NvBufSurfTransformInter_Algo3 5=NvBufSurfTransformInter_Algo4
    # 6=NvBufSurfTransformInter_Default
scaling-filter=0
custom-lib-path=/opt/nvidia/deepstream/deepstream/lib/gst-plugins/libcustom2d_preprocess.so
custom-tensor-preparation-function=CustomTensorPreparation

[user-configs]
pixel-normalization-factor=0.003921568
#mean-file=
#offsets=

[group-0]
src-ids=0;1;2;3
custom-input-transformation-function=CustomAsyncTransformation
process-on-roi=1
roi-params-src-0=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-1=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-2=0;540;900;500;960;0;900;500;0;0;540;900;
roi-params-src-3=0;540;900;500;960;0;900;500;0;0;540;900;

processing-width and processing-height refer to the width and height of the slice (tile) taken from the entire frame.

For network-input-shape, the current config file is configured to run 12 ROIs at most. To increase the ROI count, increase the first dimension of network-input-shape to the required number (12 in network-input-shape=12;3;544;960).

In the current config file config_preprocess.txt, there are three ROIs per source and a total of 12 ROIs for all four sources. The total number of ROIs from all sources must not exceed the first dimension specified in the network-input-shape parameter.

roi-params-src-<id> indicates the ROI coordinates for source <id>. If process-on-roi is enabled, each ROI is specified as left;top;width;height. For example, roi-params-src-0=0;540;900;500;960;0;900;500;0;0;540;900 defines three ROIs for source 0: (left=0, top=540, width=900, height=500), (960, 0, 900, 500), and (0, 0, 540, 900). Gst-nvdspreprocess does not combine detections and counts of objects in overlapping tiles.

Code

The C code is downloadable from /opt/nvidia/deepstream/deepstream-6.0/sources/apps/sample_apps/deepstream-preprocess-test.

The Python code is downloadable from the NVIDIA-AI-IOT/deepstream_python_apps GitHub repo.

Results

Figure 1 shows that you can specify one or more tiles. An object within the tile is detected and detection is not applied to the remaining area of the frame.

Figure 1. Detection applied over tiles using the Gst-nvdspreprocess plugin. The green box shows the tile boundary, and the red boxes show detected objects within the tile.

Gst-nvdspreprocess enables applying inference on a specific portion of the video (tile or region of interest). With Gst-nvdspreprocess, you can specify one or more tiles on a single frame.

Performance metrics were collected for YOLOv4 applied over the entire frame compared to over the tiles. Metrics were gathered by increasing the number of streams up to the point where either the decoder or compute saturates; increasing the stream count any further shows no gain in performance.

The video resolution of 1080p was used for performance benchmarks over the NVIDIA V100 GPU. Consider the tradeoff between performance and the number of tiles, as placing too many tiles increases the compute requirement.

Tiling with NvDsPreprocess enables selective inference over only the portion of the video where it is required. In Figure 1, for instance, inference can be applied only to the sidewalk rather than the entire frame.

Gst-nvdsanalytics performs analytics on metadata attached by nvinfer (primary detector) and nvtracker. Gst-nvdsanalytics can be applied to the tiles for ROI Filtering, Overcrowding Detection, Direction Detection, and Line Crossing.

Categories
Offsites

ML-Enhanced Code Completion Improves Developer Productivity

The increasing complexity of code poses a key challenge to productivity in software engineering. Code completion has been an essential tool that has helped mitigate this complexity in integrated development environments (IDEs). Conventionally, code completion suggestions are implemented with rule-based semantic engines (SEs), which typically have access to the full repository and understand its semantic structure. Recent research has demonstrated that large language models (e.g., Codex and PaLM) enable longer and more complex code suggestions, and as a result, useful products have emerged (e.g., Copilot). However, the question of how code completion powered by machine learning (ML) impacts developer productivity, beyond perceived productivity and accepted suggestions, remains open.

Today we describe how we combined ML and SE to develop a novel Transformer-based hybrid semantic ML code completion, now available to internal Google developers. We discuss how ML and SEs can be combined by (1) re-ranking SE single token suggestions using ML, (2) applying single and multi-line completions using ML and checking for correctness with the SE, or (3) using single and multi-line continuation by ML of single token semantic suggestions. We compare the hybrid semantic ML code completion of 10k+ Googlers (over three months across eight programming languages) to a control group and see a 6% reduction in coding iteration time (time between builds and tests) and a 7% reduction in context switches (i.e., leaving the IDE) when exposed to single-line ML completion. These results demonstrate that the combination of ML and SEs can improve developer productivity. Currently, 3% of new code (measured in characters) is now generated from accepting ML completion suggestions.

Transformers for Completion
A common approach to code completion is to train transformer models, which use a self-attention mechanism for language understanding, to enable code understanding and completion predictions. We treat code similar to language, represented with sub-word tokens and a SentencePiece vocabulary, and use encoder-decoder transformer models running on TPUs to make completion predictions. The input is the code that is surrounding the cursor (~1000-2000 tokens) and the output is a set of suggestions to complete the current or multiple lines. Sequences are generated with a beam search (or tree exploration) on the decoder.

During training on Google’s monorepo, we mask out the remainder of a line and some follow-up lines, to mimic code that is being actively developed. We train a single model on eight languages (C++, Java, Python, Go, Typescript, Proto, Kotlin, and Dart) and observe improved or equal performance across all languages, removing the need for dedicated models. Moreover, we find that a model size of ~0.5B parameters gives a good tradeoff for high prediction accuracy with low latency and resource cost. The model strongly benefits from the quality of the monorepo, which is enforced by guidelines and reviews. For multi-line suggestions, we iteratively apply the single-line model with learned thresholds for deciding whether to start predicting completions for the following line.
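As a simplified sketch of that masking (the real pipeline, tokenization, and masking heuristics are more involved), a training example can be built by cutting the code at a random cursor position and treating the rest of the line plus a few follow-up lines as the prediction target:

import random

def make_completion_example(code: str, follow_up_lines: int = 2):
    lines = code.splitlines()
    row = random.randrange(len(lines))
    col = random.randrange(len(lines[row]) + 1)             # random cursor position
    prefix = "\n".join(lines[:row] + [lines[row][:col]])    # code surrounding the cursor
    target = "\n".join([lines[row][col:]] + lines[row + 1 : row + 1 + follow_up_lines])
    return prefix, target                                   # the model learns prefix -> target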

Encoder-decoder transformer models are used to predict the remainder of the line or lines of code.

Re-rank Single Token Suggestions with ML
While a user is typing in the IDE, code completions are interactively requested from the ML model and the SE simultaneously in the backend. The SE typically only predicts a single token. The ML models we use predict multiple tokens until the end of the line, but we only consider the first token to match predictions from the SE. We identify the top three ML suggestions that are also contained in the SE suggestions and boost their rank to the top. The re-ranked results are then shown as suggestions for the user in the IDE.
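A minimal sketch of that boosting step (illustrative only; the first_token helper and the data shapes are hypothetical simplifications):

def rerank(se_suggestions, ml_completions, first_token, max_boosted=3):
    # se_suggestions: single-token suggestions from the semantic engine, ranked.
    # ml_completions: ML completions, ranked; only their first token is compared.
    ml_tokens = [first_token(c) for c in ml_completions]
    boosted = [t for t in ml_tokens if t in se_suggestions][:max_boosted]
    rest = [s for s in se_suggestions if s not in boosted]
    return boosted + rest    # boosted matches first, then the remaining SE order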

In practice, our SEs are running in the cloud, providing language services (e.g., semantic completion, diagnostics, etc.) with which developers are familiar, and so we collocated the SEs to run on the same locations as the TPUs performing ML inference. The SEs are based on an internal library that offers compiler-like features with low latencies. Due to the design setup, where requests are done in parallel and ML is typically faster to serve (~40 ms median), we do not add any latency to completions. We observe a significant quality improvement in real usage. For 28% of accepted completions, the rank of the completion is higher due to boosting, and in 0.4% of cases it is worse. Additionally, we find that users type >10% fewer characters before accepting a completion suggestion.

Check Single / Multi-line ML Completions for Semantic Correctness
At inference time, ML models are typically unaware of code outside of their input window, and code seen during training might miss recent additions needed for completions in actively changing repositories. This leads to a common drawback of ML-powered code completion whereby the model may suggest code that looks correct, but doesn’t compile. Based on internal user experience research, this issue can lead to the erosion of user trust over time while reducing productivity gains.

We use SEs to perform fast semantic correctness checks within a given latency budget (<100ms for end-to-end completion) and use cached abstract syntax trees to enable a “full” structural understanding. Typical semantic checks include reference resolution (i.e., does this object exist), method invocation checks (e.g., confirming the method was called with a correct number of parameters), and assignability checks (to confirm the type is as expected).
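Conceptually (a hedged sketch, not the internal service), the checks act as a filter over candidate completions within the latency budget:

import time

def filter_completions(suggestions, semantic_checks, budget_seconds=0.1):
    # semantic_checks: callables such as reference resolution, method invocation,
    # and assignability checks; each returns True if the suggestion passes.
    deadline = time.monotonic() + budget_seconds
    kept = []
    for suggestion in suggestions:
        if time.monotonic() > deadline:
            break                                   # stay inside the latency budget
        if all(check(suggestion) for check in semantic_checks):
            kept.append(suggestion)
    return kept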

For example, for the coding language Go, ~8% of suggestions contain compilation errors before semantic checks. However, the application of semantic checks filtered out 80% of uncompilable suggestions. The acceptance rate for single-line completions improved by 1.9x over the first six weeks of incorporating the feature, presumably due to increased user trust. As a comparison, for languages where we did not add semantic checking, we only saw a 1.3x increase in acceptance.

Language servers with access to source code and the ML backend are collocated on the cloud. They both perform semantic checking of ML completion suggestions.

Results
With 10k+ Google-internal developers using the completion setup in their IDE, we measured a user acceptance rate of 25-34%. We determined that the transformer-based hybrid semantic ML code completion completes >3% of code, while reducing the coding iteration time for Googlers by 6% (at a 90% confidence level). The size of the shift corresponds to typical effects observed for transformational features (e.g., key framework) that typically affect only a subpopulation, whereas ML has the potential to generalize for most major languages and engineers.

Key metrics for single-line code completion, measured in production for 10k+ Google-internal developers using it in their daily development across eight languages:

  • Fraction of all code added by ML: 2.6%
  • Reduction in coding iteration duration: 6%
  • Reduction in number of context switches: 7%
  • Acceptance rate (for suggestions visible for >750 ms): 25%
  • Average characters per accept: 21

Key metrics for multi-line code completion, measured in production for 5k+ Google-internal developers using it in their daily development across eight languages:

  • Fraction of all code added by ML (with >1 line in suggestion): 0.6%
  • Average characters per accept: 73
  • Acceptance rate (for suggestions visible for >750 ms): 34%

Providing Long Completions while Exploring APIs
We also tightly integrated the semantic completion with full line completion. When the dropdown with semantic single token completions appears, we display inline the single-line completions returned from the ML model. The latter represent a continuation of the item that is the focus of the dropdown. For example, if a user looks at possible methods of an API, the inline full line completions show the full method invocation also containing all parameters of the invocation.

Integrated full line completions by ML continuing the semantic dropdown completion that is in focus.
Suggestions of multiple line completions by ML.

Conclusion and Future Work
We demonstrate how the combination of rule-based semantic engines and large language models can be used to significantly improve developer productivity with better code completion. As a next step, we want to utilize SEs further, by providing extra information to ML models at inference time. One example can be for long predictions to go back and forth between the ML and the SE, where the SE iteratively checks correctness and offers all possible continuations to the ML model. When adding new features powered by ML, we want to be mindful to go beyond just “smart” results, but ensure a positive impact on productivity.

Acknowledgements
This research is the outcome of a two-year collaboration between Google Core and Google Research, Brain Team. Special thanks to Marc Rasi, Yurun Shen, Vlad Pchelin, Charles Sutton, Varun Godbole, Jacob Austin, Danny Tarlow, Benjamin Lee, Satish Chandra, Ksenia Korovina, Stanislav Pyatykh, Cristopher Claeys, Petros Maniatis, Evgeny Gryaznov, Pavel Sychev, Chris Gorgolewski, Kristof Molnar, Alberto Elizondo, Ambar Murillo, Dominik Schulz, David Tattersall, Rishabh Singh, Manzil Zaheer, Ted Ying, Juanjo Carin, Alexander Froemmgen and Marcus Revaj for their contributions.