GPU-Accelerated Deep Learning Can Spot Signs of Early Alzheimer’s With 99% Accuracy

Speedy diagnoses are critical, especially when a loved one seems to be slowly losing their cognitive abilities. Researchers from the Kaunas University of Technology in Lithuania report they've developed a deep learning-based method able to predict the possible onset of Alzheimer's disease from brain images with an accuracy of over 99 percent.

Developing End-to-End Real-time Applications with the NVIDIA Clara AGX Developer Kit

The NVIDIA Clara AGX development kit, combined with the us4R ultrasound development system, makes it possible to quickly develop and test a real-time AI processing system for ultrasound imaging. The Clara AGX development kit pairs an Arm CPU with a high-performance RTX 6000 GPU. Together with the us4R, it gives ultrasound system designers the ability to develop, prototype, and test end-to-end, software-defined ultrasound systems. Clara AGX is launching the era of software-defined medical instruments, with pipelines that can be reconfigured without changes to the hardware.

The us4us hardware and SDK provide an end-to-end ultrasound algorithm development and RF processing platform, while the high-end Clara AGX GPU enables real-time deep learning for AI image reconstruction and inferencing. With this approach, the whole systems engineering team benefits: beamforming experts can create optimal beam strategies, and AI experts can design and deploy the next generation of real-time algorithms.

This combined hardware and software platform democratizes ultrasound development, enabling both research labs and commercial vendors to develop novel features without the massive capital budgets previously required to design, prototype, and test functional hardware. Each stage of the device pipeline can be modified: data acquisition, data processing, image reconstruction, image processing, AI analysis, and visualization are all defined in software and executed in real time with low latency. The system is completely configurable, and new RF transmission waveforms and beamforming algorithms can be created using AI or traditional approaches.

Ultrafast, low-latency, end-to-end data transfer is possible with the NVIDIA ConnectX-6 SmartNIC, which provides 100 Gb/s Ethernet and RDMA data transfer directly to the GPU. The GPU can run circles around existing legacy, premium-cost systems, enabling improved signal-to-noise ratio and real-time processing of highly advanced and complex image reconstruction and denoising pipelines.

The GPU has enough headroom for multiple real-time clinical inferencing predictions to run simultaneously, including measurement, operator guidance, image interpretation, tissue and organ identification, advanced visualization, and clinical overlays. 

For commercial clinical applications, Clara AGX will be available as medical-grade hardware from third-party vendors in a compact, energy-efficient CPU+GPU SoC form factor similar to that used in self-driving automotive applications.

The Clara AGX development kit is a high-performance workstation built with the development of medical applications in mind. The system includes an NVIDIA RTX 6000 GPU, delivering 200+ INT8 AI TOPS and 16.3 FP32 TFLOPS at peak performance, with 24 GB of VRAM. This leaves plenty of headroom for running multiple models. High-bandwidth I/O communication with sensors is possible with the 100G Ethernet Mellanox ConnectX-6 network interface card (NIC).

NVIDIA partners are currently using the Clara AGX development kit to develop ultrasound, endoscopy, and genomics applications.

The Clara AGX Developer Kit showing the inside of the case with key components highlighted, NVIDIA Jetson AGX Xavier, NVIDIA Mellanox ConnectX-6, and NVIDIA RTX 6000 GPU.
Figure 1. Clara AGX Developer Kit

us4R and NVIDIA Clara AGX

A front-facing photo of the Ultrasound Research System
Figure 2. us4R-lite Ultrasound Research System

us4us Ltd. offers two systems:

  • The advanced us4R with up to 1024 TX / 256 RX channels
  • The portable us4R-lite with 256 TX / 64 RX channels

Both use a PCIe streaming architecture for low-latency data transfer and a GPU for scalable processing of raw ultrasound echo signals. The us4OEM ultrasound front-end modules support 128 TX / 32 RX analog channels and a high-throughput, 3 GB/s PCIe Gen3 x4 data interface (Figure 2).

A diagram showing the connections between four pieces of hardware. From left to right, the ultrasound probe connector uses 128-channel TX/RX connections to the us4R-lite system. Next, the us4R-lite connects over PCIe Gen3 x4 to the NVIDIA Clara AGX. Last, the NVIDIA Clara AGX uses the dGPU to output to the display.
Figure 3. us4R-lite and Clara AGX platform

End-to-end, software-defined ultrasound design

The ARRUS package is an SDK for the us4R that provides a high-level hardware abstraction layer, enabling the system to be programmed in Python, C++, or MATLAB. The hardware is programmed by defining the RF module parameters, including the following:

  • Active transmit (TX) probe elements
  • Transmission parameters, such as TX voltage, TX waveform, and TX delays
  • Receive (RX) aperture and acquisition parameters such as gain, filters, and time-gain compensation

Commonly used TX/RX sequences, such as classical linear scanning, plane wave imaging (PWI), and synthetic transmit aperture (STA), are preconfigured and can be quickly implemented. Custom sequences are configured with user-defined, low-level parameters such as TX/RX aperture masks and TX delays.
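
To make the low-level parameters concrete, the following minimal sketch (plain NumPy, not the ARRUS API; the probe geometry values are hypothetical) computes the per-element TX delays that steer a plane wave by a given angle across a linear array:

import numpy as np

n_elements = 192         # hypothetical linear-array probe
pitch = 0.2e-3           # element spacing [m]
c = 1450.0               # speed of sound [m/s]
angle = np.deg2rad(5.0)  # plane-wave steering angle

# Element x-positions centered around the array midpoint
x = (np.arange(n_elements) - (n_elements - 1) / 2) * pitch
# A linear delay profile steers the wavefront by the requested angle
tx_delays = x * np.sin(angle) / c
tx_delays -= tx_delays.min()  # shift so the earliest element fires at t = 0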

The ARRUS package also includes a Python implementation of many standard ultrasound processing algorithms for reconstructing images from raw RF data, including RF data preprocessing (data filtering, quadrature demodulation, and so on), beamforming (PWI, STA, and classical schemes), and post-processing of B-mode images.

These algorithms are building blocks used to construct an arbitrary imaging pipeline that can handle the RF data stream produced by the us4R system. GPU-accelerated numerical routines are provided by CuPy. DLPack specifies a common in-memory tensor structure that enables data sharing between machine learning frameworks and GPU processing libraries without the overhead of copying data between them. The DLPack interface provides access to predefined or user-developed deep learning models in TensorFlow, PyTorch, Chainer, and MXNet.
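
The following minimal sketch (not part of the release; the array shape is just a stand-in) shows the kind of zero-copy hand-off DLPack enables between CuPy and TensorFlow on the GPU:

import cupy as cp
import tensorflow as tf

# Stand-in for a beamformed frame already resident in GPU memory
frame = cp.random.randn(2048, 192).astype(cp.float32)

capsule = frame.toDlpack()                            # export the CuPy buffer as a DLPack capsule
tensor = tf.experimental.dlpack.from_dlpack(capsule)  # import into TensorFlow without a copy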

A diagram showing how the us4OEM Drivers and NVIDIA GPU Drivers interact with the US4US McCoy Docker container.
Figure 4. NGC container software schematic for this release

US4US ultrasound demo

By combining the software and hardware stack, you can quickly implement an ultrasound workflow with configurable parameters in less than one page of easy-to-read Python code.  In this section, we show you how to use the ARRUS APIs, a us4R-lite platform, and the Clara AGX DevKit to create your own ultrasound imaging pipeline in minutes.

The following code example should work with the proper environment. However, we recommend using the Docker container available directly through NGC. There is an interactive Jupyter notebook available to help guide you through this demo in the container at /us4us_examples/mimicknet-example.ipynb.

Start by importing the relevant libraries, including ARRUS, Numpy, TensorFlow, and CuPy:

# Imports for ARRUS, NumPy, TensorFlow, and CuPy
import arrus
import arrus.session
import arrus.utils.imaging
import arrus.utils.us4r
import numpy as np
from arrus.ops.us4r import (Scheme, Pulse, DataBufferSpec)
from arrus.utils.imaging import (Pipeline, Transpose, BandpassFilter, Decimation,
                                 QuadratureDemodulation, EnvelopeDetection,
                                 LogCompression, Enqueue, RxBeamformingImg,
                                 ReconstructLri, Sum, Lambda, Squeeze)
from arrus.ops.imaging import (PwiSequence)
from arrus.utils.us4r import (RemapToLogicalOrder)
from arrus.utils.gui import (Display2D)
from utilities import RunForDlPackCapsule, Reshape
import tensorflow as tf
import cupy as cp

Next, instantiate the PWI TX/RX sequence. The parameters of the PwiSequence function define the data that you pull from the us4us ultrasound system.

seq = PwiSequence(
    angles=np.linspace(-5, 5, 7)*np.pi/180,  # plane-wave steering angles [rad]; use np.asarray([0])*np.pi/180 for a single angle
    pulse=Pulse(center_frequency=6e6, n_periods=2, inverse=False),
    rx_sample_range=(0, 2048),
    downsampling_factor=2,
    speed_of_sound=1450,   # [m/s]
    pri=200e-6,            # pulse repetition interval [s]
    sri=20e-3,             # sequence repetition interval [s]
    tgc_start=14,          # time-gain compensation start
    tgc_slope=2e2)         # time-gain compensation slope

After defining the sequence, load the deep learning model parameters. For this, you have two different deep neural network options for improving B-mode image output quality, both available for download through NGC.

The NN_Bmode model, from researchers at Stanford, uses a neural network to produce despeckled images from beamformed low-resolution images (LRIs). An LRI is created after a single synthetic aperture transmission; in this case, a single plane wave insonification. A sequence of LRIs can be compounded into a high-resolution image (HRI) by coherently summing them together.
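
As a quick illustration of coherent compounding (a minimal CuPy sketch with hypothetical inputs, not the NN_Bmode code), the LRIs from successive plane-wave transmissions are summed as complex data before envelope detection:

import cupy as cp

# Hypothetical stand-ins for complex-valued LRIs, one per plane-wave transmission
lri_frames = [cp.random.randn(512, 256) + 1j * cp.random.randn(512, 256) for _ in range(7)]

hri = cp.sum(cp.stack(lri_frames, axis=0), axis=0)  # coherent (complex) sum across transmissions
bmode = 20 * cp.log10(cp.abs(hri) + 1e-12)          # envelope detection and log compression for display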

The generative adversarial network (GAN) model is used to imitate the B-mode image post-processing found in commercial ultrasound systems. This algorithm combines the standard delay-and-sum (DAS) reconstruction and B-mode post-processing pipeline with the MimickNet CycleGAN. For more information, see MimickNet, Mimicking Clinical Image Post-Processing Under Black-Box Constraints.

For this example, you load the MimickNet CycleGAN. In addition to loading the weights, you implement simple normalize and mimicknet_predict wrapper functions, which are required when you define the scheme in the next step.

# Load the MimickNet model weights
model = tf.keras.models.load_model(model_weights)
# Run a dummy batch once to build and warm up the model
model.predict(np.zeros((1, z_size, x_size, 1), dtype=np.float32))

def normalize(img):
    # Scale the image to the [0, 1] range
    data = img - cp.min(img)
    data = data / cp.max(data)
    return data

def mimicknet_predict(capsule):
    # Zero-copy import of the DLPack capsule into TensorFlow
    data = tf.experimental.dlpack.from_dlpack(capsule)
    result = model.predict_on_batch(data)

    # Compensate for the large variance in mean image brightness
    result = result - np.mean(result)
    result = result - np.min(result)
    result = result / np.max(result)
    return result

You can put all of the pieces together using the Scheme function. Scheme takes the TX/RX sequence definition, an output data buffer, the ultrasound device work mode, and a data processing pipeline as parameters. These parameters define the workflow for data acquisition, data processing, and displaying the inference results.

The following code example shows the Scheme definition, which includes the sequence, MimickNet preprocessing, and inference wrapper function defined earlier. The placement parameter indicates that the processing pipeline runs on GPU:0, which provides GPU acceleration on the Clara AGX Dev Kit.

scheme = Scheme(
    tx_rx_sequence=seq,
    rx_buffer_size=2,
    output_buffer=DataBufferSpec(type="FIFO", n_elements=4),
    work_mode="HOST",
    processing=Pipeline(
        steps=(
            ...
            ReconstructLri(x_grid=x_grid, z_grid=z_grid),
            # Image preprocessing
            Lambda(normalize),
            Reshape(shape=(1, z_size, x_size, 1)),
            # Deep learning inference wrapper
            RunForDlPackCapsule(mimicknet_predict),
            ...
            Enqueue(display_input_queue, block=False, ignore_full=True)
        ),
        placement="/GPU:0"
    )
)

Connect to the us4R device through the session object (sess), upload your scheme, and start your display queue.

us4r = sess.get_device("/Us4R:0")
us4r.set_hv_voltage(30)

# Upload the sequence on the us4R-lite device.
buffer, const_metadata = sess.upload(scheme)
display = Display2D(const_metadata, cmap="gray", value_range=(0.3, 0.9),
                    title="NNBmode", xlabel="Azimuth (mm)", ylabel="Depth (mm)",
                    show_colorbar=True, extent=extent)
sess.start_scheme()
display.start(display_input_queue)
print("Display closed, stopping the script.")

Your device now displays the results of the ultrasound imaging pipeline. You can also easily modify this pipeline to implement your own state-of-the-art deep learning algorithms. Figure 5 shows the example output from the demo comparing a conventional delay and sum algorithm (left) and the MimickNet model (right). 

The code for this demo is available in a downloadable Docker image through NGC at https://ngc.nvidia.com/catalog/containers/nvidia:clara-agx:agx-us4us-ultrasound.

Conclusion

Clara AGX is launching the era of software-defined medical instruments, with pipelines that can be reconfigured without changes to the hardware. Connecting the Clara AGX development kit with the us4R ultrasound development system creates a combination that helps you develop a real-time AI processing system easily and quickly. With the high performance of an RTX 6000 GPU and an Arm CPU, you get the best of the embedded hardware ecosystem to develop your own state-of-the-art, task-specific algorithms.

For more information about the us4R-lite system, contact us4us. The Clara AGX Developer Kit is currently available exclusively for members of the NVIDIA Clara Developer Partner Program.

Personalized ASR Models from a Large and Diverse Disordered Speech Dataset

Speech impairments affect millions of people, with underlying causes ranging from neurological or genetic conditions to physical impairment, brain damage or hearing loss. Similarly, the resulting speech patterns are diverse, including stuttering, dysarthria, apraxia, etc., and can have a detrimental impact on self-expression, participation in society and access to voice-enabled technologies. Automatic speech recognition (ASR) technologies have the potential to help individuals with such speech impairments by improving access to dictation and home automation and by enhancing communication. However, while the increased computational power of deep learning systems and the availability of large training datasets has improved the accuracy of ASR systems, their performance is still insufficient for many people with speech disorders, rendering the technology unusable for many of the speakers who could benefit the most.

In 2019, we introduced Project Euphonia and discussed how we could use personalized ASR models of disordered speech to achieve accuracies on par with non-personalized ASR on typical speech. Today we share the results of two studies, presented at Interspeech 2021, that aim to expand the availability of personalized ASR models to more users. In “Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia”, we present a greatly expanded collection of disordered speech data, composed of over 1 million utterances. Then, in “Automatic Speech Recognition of Disordered Speech: Personalized models outperforming human listeners on short phrases”, we discuss our efforts to generate personalized ASR models based on this corpus. This approach leads to highly accurate models that can achieve up to 85% improvement to the word error rate in select domains compared to out-of-the-box speech models trained on typical speech.

Impaired Speech Data Collection
Since 2019, speakers with speech impairments of varying degrees of severity across a variety of conditions have provided voice samples to support Project Euphonia’s research mission. This effort has grown Euphonia’s corpus to over 1 million utterances, comprising over 1400 hours from 1330 speakers (as of August 2021).

Distribution of severity of speech disorder and condition across all speakers with more than 300 utterances recorded. For conditions, only those with > 5 speakers are shown (all others aggregated into “OTHER” for k-anonymity).
ALS = amyotrophic lateral sclerosis; DS = Down syndrome; PD = Parkinson’s disease; CP = cerebral palsy; HI = hearing impaired; MD = muscular dystrophy; MS = multiple sclerosis

To simplify the data collection, participants used an at-home recording system on their personal hardware (laptop or phone, with and without headphones), instead of an idealized lab-based setting that would collect studio quality recordings.

To reduce transcription cost, while still maintaining high transcript conformity, we prioritized scripted speech. Participants read prompts shown on a browser-based recording tool. Phrase prompts covered use-cases like home automation (“Turn on the TV.”), caregiver conversations (“I am hungry.”) and informal conversations (“How are you doing? Did you have a nice day?”). Most participants received a list of 1500 phrases, which included 1100 unique phrases along with 100 phrases that were each repeated four more times.

Speech professionals conducted a comprehensive auditory-perceptual speech assessment while listening to a subset of utterances for every speaker, providing the following speaker-level metadata: speech disorder type (e.g., stuttering, dysarthria, apraxia), rating of 24 features of abnormal speech (e.g., hypernasality, articulatory imprecision, dysprosody), as well as recording quality assessments of both technical (e.g., signal dropouts, segmentation problems) and acoustic (e.g., environmental noise, secondary speaker crosstalk) features.

Personalized ASR Models
This expanded impaired speech dataset is the foundation of our new approach to personalized ASR models for disordered speech. Each personalized model uses a standard end-to-end, RNN-Transducer (RNN-T) ASR model that is fine-tuned using data from the target speaker only.

Architecture of RNN-Transducer. In our case, the encoder network consists of 8 layers and the predictor network consists of 2 layers of uni-directional LSTM cells.

To accomplish this, we focus on adapting the encoder network, i.e. the part of the model dealing with the specific acoustics of a given speaker, as speech sound disorders were most common in our corpus. We found that only updating the bottom five (out of eight) encoder layers while freezing the top three encoder layers (as well as the joint layer and decoder layers) led to the best results and effectively avoided overfitting. To make these models more robust against background noise and other acoustic effects, we employ a configuration of SpecAugment specifically tuned to the prevailing characteristics of disordered speech. Further, we found that the choice of the pre-trained base model was critical. A base model trained on a large and diverse corpus of typical speech (multiple domains and acoustic conditions) proved to work best for our scenario.
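
The sketch below is illustrative only: it uses a stand-in Keras encoder with hypothetical layer sizes rather than the actual RNN-T implementation, and simply shows the idea of freezing the top three of eight encoder layers while leaving the bottom five trainable:

import tensorflow as tf

# Stand-in encoder: 8 stacked uni-directional LSTM layers (sizes are hypothetical)
encoder = tf.keras.Sequential(
    [tf.keras.layers.LSTM(640, return_sequences=True) for _ in range(8)]
)

for i, layer in enumerate(encoder.layers):
    layer.trainable = i < 5  # fine-tune only the bottom five layers; freeze the top three
# In the full RNN-T, the joint layer and the prediction (decoder) network would also be
# frozen before fine-tuning on the target speaker's utterances.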

Results
We trained personalized ASR models for ~430 speakers who recorded at least 300 utterances. 10% of utterances were held out as a test set (with no phrase overlap) on which we calculated the word error rate (WER) for the personalized model and the unadapted base model.
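
For reference, WER is the word-level edit distance between the hypothesis and the reference transcript, divided by the number of reference words; a minimal sketch:

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)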

Overall, our personalization approach yields significant improvements across all severity levels and conditions. Even for severely impaired speech, the median WER for short phrases from the home automation domain dropped from around 89% to 13%. Substantial accuracy improvements were also seen across other domains such as conversational and caregiver.

WER of unadapted and personalized ASR models on home automation phrases.

To understand when personalization does not work well, we analyzed several subgroups:

  • HighWER and LowWER: Speakers with high and low personalized model WERs based on the 1st and 5th quintiles of the WER distribution.
  • SurpHighWER: Speakers with a surprisingly high WER (participants with typical speech or mild speech impairment of the HighWER group).

Different pathologies and speech disorder presentations are expected to impact ASR non-uniformly. The distribution of speech disorder types within the HighWER group indicates that dysarthria due to cerebral palsy was particularly difficult to model. Not surprisingly, median severity was also higher in this group.

To identify the speaker-specific and technical factors that impact ASR accuracy, we examined the differences (Cohen’s D) in the metadata between the participants that had poor (HighWER) and excellent (LowWER) ASR performance. As expected, overall speech severity was significantly lower in the LowWER group than in the HighWER group (p < 0.01). Intelligibility and severity were the most prominent atypical speech features in the HighWER group; however, other speech features also emerged, including abnormal prosody, articulation, and phonation. These speech features are known to degrade overall speech intelligibility.
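
Cohen's d is the difference between the two group means divided by their pooled standard deviation; a minimal sketch of how such an effect size can be computed for any metadata feature:

import numpy as np

def cohens_d(group_a, group_b):
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    na, nb = len(a), len(b)
    # Pooled variance across the two groups
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)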

The SurpHighWER group had fewer training utterances and lower SNR compared with the LowWER group (p < 0.01) resulting in large (negative) effect sizes, with all other factors having small effect sizes, except fastness. In contrast, the HighWER group exhibited medium to large differences across all factors.

Speech disorder and technical metadata effect sizes for the HighWER-vs-LowWER and SurpHighWER-vs-LowWER pairs. Positive effect sizes indicate that values for the HighWER group were greater than those for the LowWER group.

We then compared personalized ASR models to human listeners. Three speech professionals independently transcribed 30 utterances per speaker. We found that WERs were, on average, lower for personalized ASR models compared to the WERs of human listeners, with gains increasing by severity.

Delta between the WERs of the personalized ASR models and the human listeners. Negative values indicate that personalized ASR performs better than human (expert) listeners.

Conclusions
With over 1 million utterances, Euphonia’s corpus is one of the largest and most diversely disordered speech corpora (in terms of disorder types and severities) and has enabled significant advances in ASR accuracy for these types of atypical speech. Our results demonstrate the efficacy of personalized ASR models for recognizing a wide range of speech impairments and severities, with potential for making ASR available to a wider population of users.

Acknowledgements
Key contributors to this project include Michael Brenner, Julie Cattiau, Richard Cave, Jordan Green, Rus Heywood, Pan-Pan Jiang, Anton Kast, Marilyn Ladewig, Bob MacDonald, Phil Nelson, Katie Seaver, Jimmy Tobin, and Katrin Tomanek. We gratefully acknowledge the support Project Euphonia received from members of many speech research teams across Google, including Françoise Beaufays, Fadi Biadsy, Dotan Emanuel, Khe Chai Sim, Pedro Moreno Mengibar, Arun Narayanan, Hasim Sak, Suzan Schwartz, Joel Shor, and many others. And most importantly, we wanted to say a huge thank you to the over 1300 participants who recorded speech samples and the many advocacy groups who helped us connect with these participants.

A new sub for ml engineering

Good day everyone, and we hope you're all doing ok. We felt a vacuum for a sub dedicated to ML engineering on Reddit. ML engineering as in the application of ML in the real world. We are sure that a lot of people here do ML engineering as a job, and they will be interested in having a place to share articles, ask questions, and in general, have a chill time with their passion. /r/ML_Eng focuses on the following:

– Application of Machine learning

– Implementing papers

– SMACK Stack and similar data pipeline tools

– Databases

– Model deployment

– DevOps related to ML

– Creating frontends for your model

/r/ML_Eng is a place where intermediate to advanced programmers who aren't PhDs in ML can feel welcome. Anything regarding the application of ML is welcome. So join us there, and we hope it all will be for the better!

submitted by /u/themeansquare

Is there a pre-trained model for detecting the throat (neck) of a person? I could only find models for other facial features like the nose and eyes, but not what I'm looking for.

submitted by /u/ColonelHugeCum

GFN Thursday to Stream Ubisoft’s ‘Far Cry 6’ and ‘Riders Republic’ at Launch

Two of Ubisoft's biggest upcoming games will join GeForce NOW the day they're released, and you can get ready to breach in a new season of Rainbow Six Siege with a free-to-play weekend. Plus, it's time to find your true colors in the newest entry in the Life Is Strange series.

Explore the Latest NVIDIA SIGGRAPH Breakthroughs with the Demos

It’s one thing to hear about something new, amazing, or downright mind-blowing. It’s a completely different experience when you can see those breakthroughs visualized and demonstrated. At SIGGRAPH 2021, NVIDIA introduced new and stunning demos showcasing how the latest technologies are transforming workflows across industries. 

From award-winning research demos to photorealistic graphics created with NVIDIA RTX and Omniverse, see how NVIDIA is breaking boundaries in AI, graphics, and virtual collaboration. 

Watch some of the exciting demos featured at SIGGRAPH:

Real-Time Live! Demo: I AM AI: AI-Driven Digital Avatar Made Easy​

This demo won the Best in Show award at SIGGRAPH. It showcases the latest AI tools that can generate digital avatars from a single photo, animate avatars with natural 3D facial motion, and convert text to speech.

Interactive Volumes with NanoVDB in Blender Cycles

See how NanoVDB makes volume rendering more GPU memory-efficient, so larger and more complex scenes can be interactively adjusted and rendered with NVIDIA RTX-accelerated ray tracing and AI denoising.

Interactive Visualization of Galactic Winds with NVIDIA Omniverse

Learn more about NVIDIA IndeX, a volumetric visualization tool for researchers to visualize large scientific datasets interactively, for deeper insights. With Omniverse, users can virtually collaborate in real time, from any location, while using multiple apps simultaneously.

Accelerating AI in Photoshop Neural Filters with NVIDIA RTX A2000

Watch how AI-enhanced Neural Filters in Adobe Photoshop, accelerated with the NVIDIA RTX A2000, take photo editing to the next level. Combining the NVIDIA RTX A2000 with Photoshop AI tools like Skin Smoothing and Smart Portrait gives photo editors the power of AI for creating stunning portraits.

Multiple Artists, One Server

Discover how to accelerate visual effects production with the NVIDIA EGX Platform, which enables multiple artists to work together on a powerful, secure server from anywhere.

Visit the SIGGRAPH page to watch all the newest demos, catch up on the latest announcements, and explore on-demand content.

Register for the NVIDIA Metropolis Developer Webinars on Sept. 22

Sign up for webinars with NVIDIA experts and Metropolis partners on Sept. 22, featuring developer SDKs, GPUs, go-to-market opportunities, and more.

Join NVIDIA experts and Metropolis partners on Sept. 22 for webinars exploring developer SDKs, GPUs, go-to-market opportunities, and more. All three sessions, each with unique speakers and content, will be recorded and available for on-demand viewing later.

Register Now >>


Session 1: NVIDIA Fleet Command | Best Practices for Vision AI Development  

Wednesday, September 22, 2021, 1 PM PDT

  • Learn how to securely deploy, manage, and scale your AI applications with NVIDIA Fleet Command.
  • Hear from Data Monsters, a solution development partner, on how they use Metropolis to solve some of the world’s most challenging Vision AI problems.

Session 2: NVIDIA Ampere GPUs | Synthetic Data to Accelerate AI Training 

Wednesday, September 22, 2021, 4 PM CEST

  • Explore how the latest NVIDIA Ampere GPUs significantly reduce deployment costs and provide more flexible vision AI app options.
  • No data, no problem! Our synthetic data partner, SKY ENGINE AI, shows how to use synthetic data to fast-track your AI development.

Session 3: NVIDIA Pre-Trained Models | Go-To-Market Opportunities with Dell 

Wednesday, September 22, 2021, 1 PM JST

  • Learn how the Dell OEM team can help bring your Metropolis solutions to market faster.
  • Go from zero to world-class AI in days with NVIDIA pretrained models. NVIDIA experts will show how to use the PeopleNet model to build an application in minutes and harness TAO Toolkit to adapt the application to different environments.

Register Now >>

Analyzing the RNA-Sequence of 1.3M Mouse Brain Cells with RAPIDS on an NVIDIA DGX Server

Learn how the use of RAPIDS to accelerate the analysis of single-cell RNA-sequence on a single NVIDIA V100 GPU shows a massive performance increase.

Single-cell genomics research continues to advance drug discovery for disease prevention. For example, it has been pivotal in developing treatments for the current COVID-19 pandemic, identifying cells susceptible to infection, and revealing changes in the immune systems of infected patients. However, with the growing availability of large-scale single-cell datasets, it’s clear that computing inefficiencies are significantly impacting the speed at which science is done. Offloading these compute bottlenecks to the GPU has demonstrated intriguing results.

In a recent blog post, NVIDIA benchmarked the analysis on one million mouse brain cells sequenced by 10X Genomics. Results demonstrated that the end-to-end workflow took over three hours to run on a GCP CPU instance while the entire dataset was processed in 11 minutes on a single NVIDIA V100 GPU. In addition, running the RAPIDS analysis on the GCP GPU instance also costs 3x less than the CPU version. Read the blog here.

Follow this Jupyter notebook for RAPIDS analysis of this dataset. For the notebook to run, the files rapids_scanpy_funcs.py and utils.py must be in the same folder as the notebook. We provide a second notebook with the CPU version of this analysis here. In collaboration with the Google Dataproc team, we've built out a getting started guide to help developers run this transcriptomics use case quickly. Finally, check out this NVIDIA and Google Cloud co-authored blog post that showcases the impact of the work.

Performing single-cell RNA analysis on the GPU

A typical workflow to perform single-cell analysis often begins with a matrix that maps the counts of each gene transcript measured in each cell. Preprocessing steps are performed to filter out noise, and the data are normalized to obtain expressions of every gene measured in every individual cell of the dataset. Machine learning is also commonly used in this step to correct unwanted artifacts from data collection. The number of genes can often be quite large, which can create many different variations and add a lot of noise when computing similarities between the cells. Feature selection and dimensionality reduction decrease this noise before identifying and visualizing clusters of cells with similar gene expression. The transcript expression of these cell clusters can also be compared to understand why different types of cells behave and respond differently.

Figure 1: Pipeline showing the steps in the analysis of single-cell RNA sequencing data. Starting from a matrix of gene activity in every cell, RAPIDS libraries can be used to convert the matrix into gene expressions, cluster and lay the cells out for visualization, and aid in analyzing genes with different activity across clusters.

The analysis demonstrates the use of RAPIDS to accelerate the analysis of single-cell RNA-sequence data from 1 million cells using a single GPU. However, that experiment only processed the first 1M cells, not the entire 1.3M cells. Processing all 1.3M cells in the workflow took almost twice as long on a single V100 GPU, while the same workflow took only 11 minutes on a single NVIDIA A100 40GB GPU. The nearly 2x degradation in performance on the V100 is due mainly to oversubscribing the GPU's memory, which causes it to spill to host memory when needed. We will cover this behavior in more detail in the following section, but what's clear is that the GPU's memory is a limiting factor to scale. Processing larger workloads faster requires beefier GPUs like the A100 and/or spreading the processing over multiple GPUs.

Benefits of scaling preprocessing to multiple GPUs

When a workflow’s memory usage grows beyond the capacity of a single GPU, unified virtual memory (UVM) can be used to oversubscribe the GPU and automatically spill to main memory. This approach can be advantageous during the exploratory data analysis process because moderate oversubscription ratios can eliminate the need to rerun a workflow when the GPU runs out of memory.
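
A minimal sketch of enabling this with the RAPIDS Memory Manager (RMM); the allocator hook shown matches the RMM API current around the time of this post and has moved in newer releases:

import cupy as cp
import rmm

# Back GPU allocations with CUDA managed (unified) memory so they can
# oversubscribe device memory and spill to host RAM when needed.
rmm.reinitialize(managed_memory=True)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)  # route CuPy allocations through RMM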

However, relying strictly on UVM to oversubscribe a GPU’s memory by 2x or more can lead to poor performance. Even worse, it can cause the execution to hang indefinitely when any single computation requires more memory than is available on the GPU. Spreading computation across multiple GPUs affords the benefit of both increased parallelism and reduced memory footprint on each GPU. In some cases, it may eliminate the need for oversubscription. Figure 2 demonstrates that we can achieve linear scaling by spreading the preprocessing computations across multiple GPUs, with 8 GPUs resulting in slightly over 8x speedup compared to a single NVIDIA V100 GPU. Putting that into perspective, it takes less than 2 minutes to reduce the dataset of 1.3M cells and 18k genes down to approximately 1.29M cells and the 4k most highly variable genes on 8 GPUs. That’s over an 8.55x speedup as a single V100 took over 16 minutes to run the same preprocessing steps.

Figure 2: Comparison of runtime in seconds for a typical single-cell RNA workflow on 1.3M mouse brain cells with different hardware configurations. Performing these computations on the GPU shows a massive performance increase.
Figure 3: The runtimes of the single GPU configurations are dominated by the preprocessing steps, taking over 75% of the end-to-end runtime on a single V100 and 70% of the runtime on a single A100. Utilizing all of the GPUs on a DGX1 lowers the ratio to just over 32%.

Scaling single-cell RNA notebooks to multiple GPUs with Dask and RAPIDS

Many preprocessing steps, such as loading the dataset, filtering noisy transcripts and cells, normalizing counts into expressions, and feature selection, are inherently parallel, leaving each GPU independently responsible for its subset. A common step that corrects the noisy effects of data collection uses proportions of contributions from unwanted genes, such as ribosomal genes, and fits many small linear regression models, one for each transcript in the dataset. Since the number of transcripts can often be in the tens of thousands, only a few thousand of the top best-represented genes are often selected, using a measure of dispersion or variability.
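
A minimal CuPy sketch (hypothetical variable names, not the notebook's exact routine) of selecting the top-k most highly variable genes by a simple dispersion measure:

import cupy as cp

def top_variable_genes(expr, k=4000):
    # expr: cells x genes expression matrix resident on the GPU
    mean = expr.mean(axis=0)
    var = expr.var(axis=0)
    dispersion = var / (mean + 1e-12)   # variance-to-mean ratio per gene
    return cp.argsort(dispersion)[-k:]  # indices of the k most variable genes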

Dask is an excellent library for distributing data processing workflows over a set of workers. RAPIDS has enabled Dask to also use GPUs by mapping each worker process to its own GPU. In addition, Dask provides a distributed array object, much like a distributed version of a NumPy array (or CuPy, its GPU-accelerated look-alike), which allows users to distribute the steps of the above preprocessing operations over multiple GPUs, even across multiple physical machines, manipulating and transforming the data in much the same way as a NumPy or CuPy array.
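
A minimal sketch of this pattern (assuming dask_cuda is installed; the synthetic matrix is a small stand-in for the real cell-by-gene counts), in which per-cell normalization runs chunk-by-chunk on each GPU worker:

import cupy as cp
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()  # one Dask worker per visible GPU
client = Client(cluster)

# Stand-in counts matrix: 100k cells x 2k genes, chunked by rows of cells
counts = da.random.random((100_000, 2_000), chunks=(25_000, 2_000)).map_blocks(cp.asarray)

# Per-cell normalization and log transform run independently on each chunk/GPU
per_cell = counts.sum(axis=1, keepdims=True)
log_norm = da.log1p(counts / per_cell * 10_000).persist()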

After preprocessing, we also distribute the Principal Components Analysis (PCA) reduction step by training on a subset of the data and distributing the inference, lowering the communication cost by bringing only the first 50 principal components back to a single GPU for the remaining clustering and visualization steps. The PCA-reduced matrix of cells is only 260 MB for this dataset, allowing the remaining clustering and visualization steps to be performed on a single GPU. With this design, even a dataset containing 5M cells would only require 1 GB of memory.
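
A minimal sketch of this fit-on-a-subset, transform-everywhere pattern (continuing the hypothetical log_norm array from the sketch above; the notebook's actual implementation may differ):

from cuml.decomposition import PCA

pca = PCA(n_components=50)
sample = log_norm[:25_000].compute()  # bring a subset to a single GPU to fit
pca.fit(sample)

# Distribute the inference: each worker reduces its own chunk to 50 components
reduced = log_norm.map_blocks(
    pca.transform,
    dtype=log_norm.dtype,
    chunks=(log_norm.chunks[0], (50,)),
)
pcs = reduced.compute()  # 50 components per cell fit comfortably on one GPU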

Visualization of the gene expressions for the 1.3M mouse brain cells clustered with Louvain and laid out in 2 dimensions with UMAP.
Figure 4: A sample visualization of the 1.3M mouse brain cells, reduced to 2 dimensions with UMAP from cuML and clustered with Louvain from cuGraph

Conclusion

At the rate in which our computational tools are advancing, we can assume it won’t be long before the data processing volumes catch up, especially for single-cell analysis workloads, forcing the need to scale ever higher. In the meantime, there are still opportunities to decrease the iteration times of the exploratory data analysis process even further by distributing the clustering and visualization steps over multiple GPUs. Faster iteration means better models, reduced time to insight, and more informed results. Except for T-SNE, all of the clustering and visualization steps of the multi-GPU workflow notebook can already be distributed over Dask workers on GPUs with RAPIDS cuML and cuGraph.

Performing Live: How AI-Based Perception Helps AVs Better Detect Speed Limits

Understanding speed limit signs may seem like a straightforward task, but it can quickly become more complex in situations in which different restrictions apply to different lanes (for example, a highway exit) or when driving in a new country. 

The post Performing Live: How AI-Based Perception Helps AVs Better Detect Speed Limits appeared first on The Official NVIDIA Blog.