Categories
Misc

Accelerating JPEG 2000 Decoding for Digital Pathology and Satellite Images Using the nvJPEG2000 Library

JPEG 2000 (.jp2, .jpg2, .j2k) is an image compression standard defined by the Joint Photographic Experts Group (JPEG) as the more flexible successor to the still popular JPEG standard. Part 1 of the JPEG 2000 standard, which forms the core coding system, was first approved in August 2002. To date, the standard has expanded to 17 parts, covering areas like Motion JPEG 2000 (Part 3), which extends the standard to video, extensions for three-dimensional data (Part 10), and so on.

Features like mathematically lossless compression and higher precision and dynamic range per component helped JPEG 2000 find adoption in digital cinema applications. JPEG 2000 is also widely used in digital pathology and geospatial imaging, where image dimensions exceed 4K but regions of interest (ROI) stay small.

GPU acceleration using the nvJPEG2000 library

The JPEG 2000 feature set provides ample opportunities for GPU acceleration when compared to its predecessor, JPEG. With GPU acceleration, images can be decoded in parallel and larger images can be processed more quickly. nvJPEG2000 is a new library that accelerates the decoding of JPEG 2000 images on NVIDIA GPUs. It supports codec features commonly used in geospatial imaging, remote sensing, and digital pathology. Figure 1 gives an overview of the decoding stages that nvJPEG2000 accelerates.

Figure 1. GPU-accelerated JPEG 2000 decode process. JPEG 2000 stream parsing and Tier 2 decode run on the CPU (the first two blue boxes); Tier 1 decode, dequantization, the inverse DWT, and the inverse component transform are offloaded to the GPU (green).

The Tier 1 decode (entropy decode) stage is the most compute-intensive stage of the entire decode process. The entropy decode algorithm used in the legacy JPEG codec is serial in nature and hard to parallelize.

In JPEG 2000, the entropy decode stage is applied at a block-based granularity (typical block sizes are 64×64 and 32×32) that makes it possible to offload the entropy decode stage entirely to the GPU. For more information about the entropy decode process, see Section C of the JPEG 2000 Core coding system specification.

The JPEG 2000 core coding system allows for two types of wavelet transforms (5-3 Reversible and 9-7 Irreversible), both of which benefit from GPU acceleration. For more information about the wavelet transforms, see Section F of the JPEG 2000 Core coding system specification.
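
Before getting to the domain-specific APIs, it helps to see the shape of a basic single-image decode. The following sketch follows the pattern used in the nvJPEG2000 samples; the exact signatures, error handling, and the device-buffer setup for nvjpeg2kImage_t are simplified here, so treat it as an outline rather than a drop-in implementation.

 // Minimal single-image decode outline (error checking omitted).
 // Assumes bitstream_data/bitstream_size hold the contents of a .jp2/.j2k file
 // and that nvjpeg2k_out has been set up with device buffers sized from the
 // image info returned by the parsed stream.
 nvjpeg2kHandle_t nvjpeg2k_handle;
 nvjpeg2kDecodeState_t nvjpeg2k_decode_state;
 nvjpeg2kStream_t jpeg2k_stream;
 nvjpeg2kImage_t nvjpeg2k_out;
 cudaStream_t cuda_stream;

 cudaStreamCreate(&cuda_stream);
 nvjpeg2kCreateSimple(&nvjpeg2k_handle);
 nvjpeg2kDecodeStateCreate(nvjpeg2k_handle, &nvjpeg2k_decode_state);
 nvjpeg2kStreamCreate(&jpeg2k_stream);

 // Parse the compressed bitstream (CPU), then decode it (GPU).
 nvjpeg2kStreamParse(nvjpeg2k_handle, bitstream_data, bitstream_size, 0, 0, jpeg2k_stream);
 nvjpeg2kDecode(nvjpeg2k_handle, nvjpeg2k_decode_state, jpeg2k_stream, &nvjpeg2k_out, cuda_stream);
 cudaStreamSynchronize(cuda_stream);

 nvjpeg2kStreamDestroy(jpeg2k_stream);
 nvjpeg2kDecodeStateDestroy(nvjpeg2k_decode_state);
 nvjpeg2kDestroy(nvjpeg2k_handle);

The same handle, decode state, and parsed stream objects reappear in the tile-level APIs discussed in the following sections.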

Decoding geospatial images

In this section, we concentrate on the new nvJPEG2000 API tailored for the geospatial domain, which enables decoding specific tiles within an image instead of decoding the full image. 

Figure 2. Sentinel-2 image (S2B_17RQK_20190908_0_L2A) stored as JPEG 2000: image size 10980×10980, tile size 1024×1024, 11×11 tiles, 1 component. One band of the 12-band dataset used to verify the geospatial decode acceleration.

Imaging data captured by the European Space Agency’s Sentinel-2 satellites is stored as JPEG 2000 bitstreams. Sentinel-2 Level-2A data downloaded from the Copernicus hub can be used with the nvJPEG2000 decoding examples. The imaging data has 12 bands, or channels, each stored as an independent JPEG 2000 bitstream. The image in Figure 2 is subdivided into 121 tiles. To speed up the decode of multitile images, a new API called nvjpeg2kDecodeTile has been added in nvJPEG2000 v0.2, which enables you to decode each tile independently.

For multitile images, decoding each tile sequentially would be suboptimal. The multitile decode sample on GitHub demonstrates how to decode each tile on a separate cudaStream_t. With this approach, you can decode multiple tiles on the GPU simultaneously. The Nsight Systems trace in Figure 3 shows the decoding of the Sentinel-2 dataset consisting of 12 bands. With 10 CUDA streams, up to 10 tiles are decoded in parallel at any point during the decode process.

Figure 3. Nsight Systems trace demonstrating the decoding of multiple tiles on separate CUDA streams
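
The structure of that multitile decode sample can be sketched as follows. The per-stream decode states and output buffers, the nvjpeg2kImageInfo_t query, and the array names are assumptions made for illustration; see the GitHub sample for the exact buffer management.

 // Sketch: decode the tiles of a parsed multitile image on several CUDA streams.
 // Assumes one nvjpeg2kDecodeState_t and one output buffer per CUDA stream;
 // decode_params is an nvjpeg2kDecodeParams_t created with nvjpeg2kDecodeParamsCreate.
 #define NUM_STREAMS 10

 nvjpeg2kImageInfo_t image_info;
 nvjpeg2kStreamGetImageInfo(jpeg2k_stream, &image_info);
 uint32_t num_tiles = image_info.num_tiles_x * image_info.num_tiles_y;

 for (uint32_t tile_id = 0; tile_id < num_tiles; tile_id++)
 {
     int s = tile_id % NUM_STREAMS; // round-robin over the CUDA streams
     nvjpeg2kDecodeTile(nvjpeg2k_handle, decode_states[s], jpeg2k_stream,
                        decode_params, tile_id, 0,
                        &outputs[s], cuda_streams[s]);
     // Consume or copy outputs[s] asynchronously on cuda_streams[s] before reusing the slot.
 }

 for (int s = 0; s < NUM_STREAMS; s++)
 {
     cudaStreamSynchronize(cuda_streams[s]);
 }

Because all work for a given tile is enqueued on a single CUDA stream, tiles assigned to different streams can overlap on the GPU, which is what produces the concurrency visible in the trace above.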

Table 1 shows performance data comparing a single stream and multiple streams on a GV100 GPU.

# of CUDA streams    Average decode time (ms)    Speedup over single CUDA stream decode
1                    0.888854                    –
10                   0.227408                    75%
Table 1. Single-stream vs. multi-stream decode performance on a Quadro GV100 for the Sentinel-2 dataset

Using 10 CUDA streams reduces the total decode time of the entire dataset by about 75% on a Quadro GV100 GPU. For more information, see the Accelerating Geospatial Remote Sensing Workflows Using NVIDIA SDKs [S32150] GTC’21 talk. It discusses geospatial image-processing workflows in more detail and the role nvJPEG2000 plays there.

Decoding digital pathology images

JPEG 2000 is used in digital pathology to store whole slide images (WSI). Figure 4 gives an overview of various deep learning techniques that can be applied to WSI. Deep learning models can be used to distinguish between cancerous and healthy cells. Image segmentation methods can be used to identify a tumor location in the WSI. For more information, see Deep neural network models for computational histopathology: A survey.

Figure 4. Digital pathology workflows

Table 2 lists the key parameters, and their commonly used values, of a whole slide image (WSI) compressed using JPEG 2000.

Image size         92000×201712
Tile size          92000×201712
# of tiles         1
# of components    3
Precision          8 bits
Table 2. Key JPEG 2000 parameters and their values used in digital pathology.

The image in question is large, and it is not possible to decode it in one shot because of the amount of memory required: the decoded output alone is around 53 GB (92000 × 201712 pixels × 3 components), excluding the decoder’s internal memory requirements.

There are several approaches to handling such large images. In this post, we describe two of them:

  • Decoding an area of interest
  • Decoding the image at lower resolution

Both approaches can be easily performed using specific nvJPEG2000 APIs.

Decoding an area of interest in an image

The nvJPEG2000 library supports decoding a specific area of interest in an image through the nvjpeg2kDecodeTile API. The following code example shows how to set the area of interest in terms of image coordinates. The nvjpeg2kDecodeParams_t type controls the decode output settings, such as the area of interest to decode.

 nvjpeg2kDecodeParams_t decode_params;
 nvjpeg2kDecodeParamsCreate(&decode_params);

 // All coordinate values are relative to the top-left corner of the image.
 uint32_t top_coordinate, bottom_coordinate, left_coordinate, right_coordinate;
 uint32_t tile_id;

 nvjpeg2kDecodeParamsSetDecodeArea(decode_params, left_coordinate, right_coordinate,
                                   top_coordinate, bottom_coordinate);

 nvjpeg2kDecodeTile(nvjpeg2k_handle, nvjpeg2k_decode_state,
                    jpeg2k_stream, decode_params, tile_id, 0,
                    &nvjpeg2k_out, cuda_stream);

For more information about how to partially decode an image with multiple tiles, see the tile decode sample on GitHub.

Decoding lower resolutions of an image

The second approach to decoding a large image is to decode it at a lower resolution. The ability to decode only the lower resolutions is a benefit of JPEG 2000’s use of wavelet transforms. In Figure 5, the wavelet transform is applied up to two levels, which gives you access to the image at three resolutions. By controlling how the inverse wavelet transform is applied, you can decode only the lower resolutions of an image.

Figure 5. Output of a 2D wavelet transform with two-level decomposition

The digital pathology image described in Table 2 has 12 resolutions. This information can be retrieved on a per-tile basis:

 uint32_t num_res;
 uint32_t tile_id = 0;
 nvjpeg2kStreamGetResolutionsInTile(jpeg2k_stream, tile_id, &num_res);

The image has a size of 92000×201712 with 12 resolutions. If you choose to discard the four highest resolutions and decode the image up to eight resolutions, you can extract an image of size 5750×12607. Dropping four resolution levels scales each dimension down by a factor of 16 (2^4).

 uint32_t num_res_to_decode = 8;
 // If num_res_to_decode > num_res, nvjpeg2kDecodeTile returns an invalid-parameter error.

 nvjpeg2kDecodeTile(nvjpeg2k_handle, nvjpeg2k_decode_state, jpeg2k_stream,
     decode_params, tile_id, num_res_to_decode, &nvjpeg2k_out, cuda_stream);

Performance benchmarks

To show the performance improvement that decoding JPEG 2000 on the GPU brings, we compare the GPU-based nvJPEG2000 library with the CPU-based OpenJPEG library.

Figures 6 and 7 show the average speedup when decoding one image at a time. The following images are used in the measurements:

  • 1920×1080 8-bit image with 444 chroma subsampling
  • 3840×2160 8-bit image with 444 chroma subsampling
  • 3328×4096 12-bit grayscale

Figure 6. Speedup for lossless decode (5-3 DWT) over the CPU implementation using 16 threads, measured on RTX A6000, A100, V100, RTX 8000, RTX 4000, and T4 GPUs.
Figure 7. Speedup for lossy decode (9-7 DWT) over the CPU implementation using 16 threads, measured on the same GPUs.

The CPU results in Figures 6 and 7 were measured with OpenJPEG on an Intel Xeon Gold 6240 @ 2 GHz (3.9 GHz turbo, Cascade Lake) with hyperthreading on, using 16 CPU threads per image.

On NVIDIA Ampere Architecture GPUs such as NVIDIA RTX A6000, the speedup factor is more than 8x for decoding. This speedup is measured for single-image latency.

Even higher speedups can be achieved by batching the decode of multiple images. Figures 8 and 9 compare the speed of decoding a 1920×1080 8-bit image with 444 chroma subsampling (Full HD) in both lossless and lossy modes respectively across multiple GPUs.

Figure 8. Decode throughput comparison for a 1920×1080 8-bit 444 image using the 5-3 wavelet transform (lossless decode) on A100, RTX A6000, V100, RTX 8000, RTX 4000, and T4 GPUs.
Figure 9. Decode throughput comparison for a 1920×1080 8-bit 444 image using the 9-7 wavelet transform (lossy decode) on the same GPUs.

Figures 8 and 9 demonstrate the benefits of batched decode using the nvJPEG2000 library. The performance gain is significantly larger on GPUs with a large number of streaming multiprocessors (SMs), such as the A100 and NVIDIA RTX A6000, than on GPUs with fewer SMs, such as the NVIDIA RTX 4000 and T4. Batching ensures that the available compute resources are used efficiently.

As observed in Figure 8, the decode speed on an NVIDIA RTX A6000 is 232 images per second at a batch size of 20, about 3x faster than at a batch size of 1. This was measured on a benchmark image with a low compression ratio: the compressed bitstream is only about 3x smaller than the uncompressed image. At higher compression ratios, the speedup is even greater.
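
At the API level, batching can be thought of as cycling a small pool of decode states and CUDA streams over the images of the batch so that several images are in flight at once. The pool size and the array names below are illustrative, and the parse step is shown synchronously for brevity; the batched GitHub samples show the full pipelined version.

 // Sketch: pipeline a batch of images over a pool of decode states and CUDA streams.
 // bitstreams[i]/bitstream_sizes[i] hold the i-th compressed image; names are illustrative.
 #define PIPELINE_DEPTH 4

 for (int i = 0; i < batch_size; i++)
 {
     int s = i % PIPELINE_DEPTH;
     // Wait until the decode state and output buffer assigned to this slot are free again.
     cudaStreamSynchronize(streams[s]);
     nvjpeg2kStreamParse(handle, bitstreams[i], bitstream_sizes[i], 0, 0, jpeg2k_streams[s]);
     nvjpeg2kDecode(handle, states[s], jpeg2k_streams[s], &outputs[s], streams[s]);
 }

 for (int s = 0; s < PIPELINE_DEPTH; s++)
 {
     cudaStreamSynchronize(streams[s]);
 }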

The following GitHub samples show how to achieve this speedup both at image and tile granularity:

Conclusion

The nvJPEG2000 library uses NVIDIA GPUs to accelerate JPEG 2000 decoding for both large images and large volumes of images, with APIs that target the specific regions and resolutions of interest. Decoding JPEG 2000 images using the nvJPEG2000 library can be as much as 8x faster on the GPU (NVIDIA RTX A6000) than on the CPU. A further 3x speedup (24x faster than the CPU) is achieved by batching the decode of multiple images.

The simple nvJPEG2000 APIs make the library easy to include in your applications and workflows. It is also integrated into the NVIDIA Data Loading Library (DALI), a data loading and preprocessing library for accelerating deep learning applications. Using nvJPEG2000 and DALI together makes it easy to use JPEG 2000 images in deep learning training workflows.

For more information, see the following resources:

Categories
Misc

Test data generator – model.evaluate()

Hello, I’m trying to measure the performance (accuracy and loss) of my model and I discovered the evaluate() function for this.

My test data (34 pictures) is saved in a ‘test’ folder, so I tried to create an ImageDataGenerator and then to generate my data using flow_from_directory.

I receive a “Found 34 images belonging to 1 classes.” message. However, the result I get in the terminal for this code line result = seqModel.evaluate(data, batch_size=1, verbose=1) is a very weird one: 2/2 [==============================] – 0s 5ms/step – loss: 282.6923 – accuracy: 0.7353

Why do I receive a “2/2” every time when running the script, no matter what batch_size I choose? And why is my loss 282.6923 while accuracy is 0.7353? Doesn’t it look super weird? I know I’m doing something wrong, but I just can’t figure it out – maybe when creating the data generator or maybe when using flow_from_directory? (When I add the validationDataGenerator as the first argument – in order to test it – it all seems fine, but here I just can’t figure it out.)

A little bit of help would be appreciated. 🙂

submitted by /u/burgundicorn

Categories
Misc

What is the shape of the C object corresponding to this TFLite output?

I have a YOLOv5 trained model converted to .tflite format having used this guide.

I use this code to print the input and output shapes in Python:

 interpreter = tf.lite.Interpreter(
     # model_path="models/exported_resnet640.tflite")  # centernet_512x512 works correctly
     model_path="models/yolov5_working.tflite")  # centernet_512x512 works correctly

 interpreter.allocate_tensors()

 # Get input and output tensors.
 input_details = interpreter.get_input_details()
 output_details = interpreter.get_output_details()
 print("======================================================")
 print(input_details)
 print("======================================================")
 print(output_details)

 for detail in output_details:
     print(detail)
     print(" ")

and the output looks like this:

 [{'name': 'input_1', 'index': 0, 'shape': array([  1, 480, 480,   3], dtype=int32), 'shape_signature': array([  1, 480, 480,   3], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}]

 {'name': 'Identity', 'index': 422, 'shape': array([    1, 14175,     9], dtype=int32), 'shape_signature': array([    1, 14175,     9], dtype=int32), 'dtype': <class 'numpy.float32'>, 'quantization': (0.0, 0), 'quantization_parameters': {'scales': array([], dtype=float32), 'zero_points': array([], dtype=int32), 'quantized_dimension': 0}, 'sparsity_parameters': {}}

After invoking the interpreter with some input, I get an output looking like this:

 [[[0.01191081 0.01366316 0.02800988 … 0.1661754  0.31489396 0.4217688 ]
   [0.02396268 0.01650745 0.0442626  … 0.24655405 0.35853994 0.2839473 ]
   [0.04218047 0.01613732 0.0548977  … 0.13136038 0.25760946 0.5338376 ]
   …
   [0.82626414 0.9669814  0.4534862  … 0.18754318 0.11680853 0.18492043]
   [0.8983849  0.9680944  0.64181983 … 0.19781056 0.16431764 0.16926363]
   [0.9657682  0.9869368  0.5452545  … 0.13321301 0.12015155 0.15937251]]]

Using the TensorFlow Lite c_api.h, I am trying to get the same output in C, but I cannot understand how to create the object that receives the data.

I have tried using a float*** with size 1 * 14175 * 9 * sizeof(float) and getting the output like so:

 int number_of_detections = 14175;
 struct filedata o_boxes;
 float ***box_coords = (float ***)malloc(sizeof(float **) * 1);

 box_coords[0] = (float **)malloc(sizeof(float *) * (int)number_of_detections);
 for (int i = 0; i < (int)number_of_detections; i++)
 {
     box_coords[0][i] = (float *)calloc(sizeof(float), 9); // box has 9 coordinates
 }

 o_boxes.data = box_coords;
 o_boxes.size = 1 * (int)number_of_detections * 9 + 1;

 const TfLiteTensor *output_tensor_boxes = TfLiteInterpreterGetOutputTensor(interpreter, 0);
 TfLiteTensorCopyToBuffer(output_tensor_boxes, o_boxes.data, o_boxes.size * sizeof(float));

 box_coords = (float ***)&o_boxes.data;

 for (int i = 0; i < o_boxes.size; i++)
 {
     for (int j = 0; j < 9; j++)
     {
         printf("%f ", box_coords[0][i][j]);
         fflush(stdout);
     }
     printf("\n");
 }

where `struct filedata` is a simple struct:

 struct filedata
 {
     void *data;
     size_t size;
 };

The result is some garbage big floats: 39688651931648.000000 0.000000 39805756899328.000000 0.000000 39807166185472.000000 0.000000 39807367512064.000000 0.000000 39807568838656.000000 and after the first iteration I get a Segmentation Fault.

How should I create/allocate my float array to get my data?
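
For reference, one way to read a [1, 14175, 9] float tensor with the TF Lite C API is to copy it into a single contiguous buffer and index it row-major as out[i * 9 + j]. The following is a minimal sketch along those lines, not a tested fix for the code above:

 // Sketch: copy the [1, 14175, 9] float output into one contiguous buffer.
 // TfLiteTensorCopyToBuffer expects the destination size in bytes to match the tensor.
 const TfLiteTensor *output_tensor = TfLiteInterpreterGetOutputTensor(interpreter, 0);
 size_t byte_size = TfLiteTensorByteSize(output_tensor); // 1 * 14175 * 9 * sizeof(float)
 float *out = (float *)malloc(byte_size);
 TfLiteTensorCopyToBuffer(output_tensor, out, byte_size);

 for (int i = 0; i < 14175; i++)
 {
     for (int j = 0; j < 9; j++)
     {
         // Element [0][i][j] of the tensor sits at flat index i * 9 + j.
         printf("%f ", out[i * 9 + j]);
     }
     printf("\n");
 }
 free(out);

The tensor data is contiguous, so a flat buffer avoids the pointer-to-pointer layout, which TfLiteTensorCopyToBuffer cannot fill correctly.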

submitted by /u/morphinnas

Categories
Misc

NVIDIA Launches Morpheus Early Access Program to Enable Advanced Cybersecurity Solution Development

NVIDIA Morpheus gives security teams complete visibility into security threats with unmatched AI processing and real-time monitoring to protect every server and screen every packet in the data center.

NVIDIA is opening early access to its Morpheus AI development framework for cybersecurity applications. Selected developers have access to Morpheus starting today with more developers joining the program over the next few months.

Just announced at NVIDIA GTC in April 2021, NVIDIA Morpheus gives security teams complete visibility into security threats with unmatched AI processing and real-time monitoring to protect every server and screen every packet in the data center. Security applications built on Morpheus help security teams respond to anomalies and update policies immediately as threats are identified, building on NVIDIA deep learning and data science tools including RAPIDS, CLX, Streamz, Triton Inference Server, and TensorRT. Data analysis runs on NVIDIA-Certified servers built on the NVIDIA EGX platform or in qualified cloud instances that support NVIDIA GPUs, while traffic collection and telemetry can run on a variety of servers or switches plus the NVIDIA BlueField-2 data processing unit (DPU).

Figure 1. NVIDIA Morpheus leverages NVIDIA data science frameworks and the NVIDIA EGX platform for data analysis, and the NVIDIA DPU for telemetry and pervasive traffic scanning.

Developers in the Morpheus early access program have immediate access to components through the NGC catalog and can load the components into an Amazon Web Services Elastic Compute Cloud (AWS EC2) G4 instance — featuring an NVIDIA T4 or A100 GPU — to begin immediate development of cybersecurity applications and solutions. Early access will soon support the use of Red Hat Enterprise Linux (RHEL) and Red Hat OpenShift on NVIDIA-Certified servers built on NVIDIA EGX for on-premises development and deployment, and RHEL on NVIDIA BlueField DPUs for enhanced data collection and traffic screening that can protect every server. Support for running Morpheus on Ubuntu is expected soon afterwards, followed by additional OS options.

Developers accepted to early access are being notified this week and NVIDIA plans to expand the early access program quickly to include more security ISV partners, end users, academics, and other security professionals who wish to develop scalable, adaptive, AI-powered cybersecurity solutions.

If you are a customer, partner or researcher interested in joining the Morpheus early access program, please apply here.

Additional Resources:

Categories
Misc

GFN Thursday Heats Up with ‘LEGO Builder’s Journey’ and ‘Phantom Abyss’ Game Launches, Plus First Look at Kena: Bridge of Spirits

It’s getting hot in here, so get your game on this GFN Thursday with 13 new games joining the GeForce NOW library, including LEGO Builder’s Journey, Phantom Abyss and the Dual Universe beta. Plus, get a sneak peek at Kena: Bridge of Spirits, coming to the cloud later this year.

The post GFN Thursday Heats Up with ‘LEGO Builder’s Journey’ and ‘Phantom Abyss’ Game Launches, Plus First Look at Kena: Bridge of Spirits appeared first on The Official NVIDIA Blog.

Categories
Misc

More Than Meets the AI: How GANs Research Is Reshaping Video Conferencing

Roll out of bed, fire up the laptop, turn on the webcam — and look picture-perfect in every video call, with the help of AI developed by NVIDIA researchers. Vid2Vid Cameo, one of the deep learning models behind the NVIDIA Maxine SDK for video conferencing, uses generative adversarial networks (known as GANs) to synthesize realistic…

The post More Than Meets the AI: How GANs Research Is Reshaping Video Conferencing appeared first on The Official NVIDIA Blog.

Categories
Misc

Fast-Track Production AI with Pretrained Models and Transfer Learning Toolkit 3.0

NVIDIA announced new pre-trained models and general availability of Transfer Learning Toolkit (TLT) 3.0, a core component of NVIDIA’s Train, Adapt and Optimize (TAO) platform guided workflow for creating AI.

Today, NVIDIA announced new pretrained models and general availability of Transfer Learning Toolkit (TLT) 3.0, a core component of NVIDIA’s Train, Adapt, and Optimize (TAO) platform guided workflow for creating AI. The new release includes a variety of highly accurate and performant pretrained models in computer vision and conversational AI, as well as a set of powerful productivity features that boost AI development by up to 10x. 

As enterprises race to bring AI-enabled solutions to market, your competitiveness relies on access to the best development tools. The development journey to deploy custom, high-accuracy, and performant AI models in production can be treacherous for many engineering and research teams attempting to train with open-source models for AI product creation. NVIDIA offers high-quality pretrained models and TLT to help reduce the cost of large-scale data collection and labeling, and to eliminate the burden of training AI/ML models from scratch. New entrants to the computer vision and speech-enabled service market can now deploy production-class AI without a massive AI development team.

Highlights of the new release include:

  • A pose-estimation model that supports real-time inference on the edge, with 9x faster inference performance than the OpenPose model.
  • PeopleSemSegNet, a semantic segmentation network for people detection.
  • A variety of computer vision pretrained models in various industry use cases, such as license plate detection and recognition, heart rate monitoring, emotion recognition, facial landmarks, and more.
  • CitriNet, a new speech-recognition model that is trained on various proprietary domain-specific and open-source datasets.
  • A new Megatron Uncased model for Question Answering, plus many other pretrained models that support speech-to-text, named-entity recognition, punctuation, and text classification.
  • Training support on AWS, GCP, and Azure.
  • Out-of-the-box deployment on NVIDIA Triton and DeepStream SDK for vision AI, and NVIDIA Jarvis for conversational AI.

Get Started Fast

  • Download Transfer Learning Toolkit and access to developer resources: Get started
  • Download models from NGC: Computer vision | Conversational AI 
  • Check out the latest developer tutorial: Training and Optimizing a 2D Pose-Estimation Model with the NVIDIA Transfer Learning Toolkit. Part 1 | Part 2 

Integration with Data-Generation and Labeling Tools for Faster and More Accurate AI

TLT 3.0 is also now integrated with platforms from several leading partners who provide large, diverse, and high-quality labeled data—enabling faster end-to-end AI/ML workflows. You can now use these partners’ services to generate and annotate data, seamlessly integrate with TLT for model training and optimization, and deploy the model using DeepStream SDK or Jarvis to create reliable applications in computer vision and conversational AI. 

Check out more partner blog posts and tutorials about synthetic data and data annotation with TLT:

Learn more about NVIDIA pretrained models and Transfer Learning Toolkit > >

Categories
Misc

New on NGC: PyTorch Lightning Container Speeds Up Deep Learning Research

With PyTorch Lightning, you can scale your models to multiple GPUs and leverage state-of-the-art training features such as 16-bit precision, early stopping, logging, pruning and quantization, while enabling faster iteration and reproducibility.

Deep learning research requires working at scale. Training on massive data sets or multilayered deep networks is computationally intensive and can take an impractically long time as deep learning models are bound by memory. The key here is to compose the deep learning models in a structured way so that they are decoupled from the engineering and data, enabling researchers to conduct fast research.

PyTorch Lightning, developed by Grid.AI, is now available as a container on the NGC catalog, NVIDIA’s hub of GPU-optimized AI and HPC software. PyTorch Lightning was designed to remove the roadblocks in deep learning research and allow researchers to focus on science. Lightning is more of a style guide than a framework, enabling you to structure and organize your code while providing utilities for common functions. With PyTorch Lightning, you can scale your models to multiple GPUs and leverage state-of-the-art training features such as 16-bit precision, early stopping, logging, pruning, and quantization, while enabling faster iteration and reproducibility.

Figure 1. PyTorch Lightning Philosophy

A Lightning model is composed of the following:

  • A LightningModule that encapsulates the model code
  • A Lightning DataModule that encapsulates transforms, dataset, and DataLoaders
  • A Lightning trainer that automates the training routine with 70+ flags to make advanced features trivial
  • Callbacks for users to customize Lightning using hooks

The Lightning objects are implemented as hooks that can be overridden, making every single aspect of deep learning training highly configurable. With Lightning, you have full control over every detail:

  • Change how the backward step is done.
  • Change how 16-bit is initialized.
  • Add your own way of doing distributed training.
  • Add learning rate schedulers.
  • Use multiple optimizers.
  • Change the frequency of optimizer updates.

Get started today with the PyTorch Lightning Docker container from the NGC catalog.

Categories
Misc

Achieve up to 75% Performance Improvement for Communication Intensive HPC Applications with NVTAGS

NVTAGS automates intelligent GPU assignment by profiling HPC applications and launching them with a custom GPU assignment tailored to an application and system to minimize communication costs.

Many GPU-accelerated HPC applications spend a substantial portion of their time in non-uniform, GPU-to-GPU communications. Additionally, in many HPC systems, different GPU pairs share communication links with varying bandwidth and latency. As a result, GPU assignment can substantially impact time to solution. Furthermore, on multi-node / multi-socket systems, communication performance can degrade when GPUs communicate with CPUs and NICs outside their system affinity. Because resource selection is system dependent, it is challenging to select resources such that communication costs are minimized.

NVIDIA Topology-Aware GPU Selection (NVTAGS) abstracts away the complexity of efficient resource selection. NVTAGS automates intelligent GPU assignment by profiling HPC applications and launching them with a custom GPU assignment tailored to an application and system to minimize communication costs. NVTAGS ensures that, regardless of a system’s communication topology, MPI processes communicate with the CPUs and NICs or HCAs within their own affinity. 

NVTAGS improves the performance of Chroma, MILC, and LAMMPS by 2% to 75% on one to 16 nodes.

Key NVTAGS Features:

  • Automated topology detection along with CPU and NIC/HCA binding, independent of the system and HPC application
  • Support for single- and multi-node, PCIe, and NVIDIA NVLink with NVIDIA Pascal, Volta, and Ampere architecture GPUs
  • Automatic caching of efficient GPU selection for future simulations
  • Straightforward integration with Slurm and Singularity

Download NVTAGS 1.0.0 today. 

Additional Resources:

NVTAGS Product Page
Blog: Overcoming Communication Congestion for HPC Applications with NVIDIA NVTAGS

Categories
Offsites

Improving Genomic Discovery with Machine Learning

Each person’s genome, which collectively encodes the biochemical machinery they are born with, is composed of over 3 billion letters of DNA. However, only a small subset of the genome (~4-5 million positions) varies between two people. Nonetheless, each person’s unique genome interacts with the environment they experience to determine the majority of their health outcomes. A key method of understanding the relationship between genetic variants and traits is a genome-wide association study (GWAS), in which each genetic variant present in a cohort is individually examined for correlation with the trait of interest. GWAS results can be used to identify and prioritize potential therapeutic targets by identifying genes that are strongly associated with a disease of interest, and can also be used to build a polygenic risk score (PRS) to predict disease predisposition based on the combined influence of variants present in an individual. However, while accurate measurement of traits in an individual (called phenotyping) is essential to GWAS, it often requires painstaking expert curation and/or subjective judgment calls.

In “Large-scale machine learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology”, we demonstrate how using machine learning (ML) models to classify medical imaging data can be used to improve GWAS. We describe how models can be trained for phenotypes to generate trait predictions and how these predictions are used to identify novel genetic associations. We then show that the novel associations discovered improve PRS accuracy and, using glaucoma as an example, that the improvements for anatomical eye traits relate to human disease. We have released the model training code and detailed documentation for its use on our Genomics Research GitHub repository.

Identifying genetic variants associated with eye anatomical traits
Previous work has demonstrated that ML models can identify eye diseases, skin diseases, and abnormal mammogram results with accuracy approaching or exceeding state-of-the-art methods by domain experts. Because identifying disease is a subset of phenotyping, we reasoned that ML models could be broadly used to improve the speed and quality of phenotyping for GWAS.

To test this, we chose a model that uses a fundus image of the eye to accurately predict whether a patient should be referred for assessment for glaucoma. This model uses the fundus images to predict the diameters of the optic disc (the region where the optic nerve connects to the retina) and the optic cup (a whitish region in the center of the optic disc). The ratio of the diameters of these two anatomical features (called the vertical cup-to-disc ratio, or VCDR) correlates strongly with glaucoma risk.

A representative retinal fundus image showing the vertical cup-to-disc ratio, which is an important diagnostic measurement for glaucoma.

We applied this model to predict VCDR in all fundus images from individuals in the UK Biobank, which is the world’s largest dataset available to researchers worldwide for health-related research in the public interest, containing extensive phenotyping and genetic data for ~500,000 pseudonymized (the UK Biobank’s standard for de-identification) individuals. We then performed GWAS in this dataset to identify genetic variants that are associated with the model-based predictions of VCDR.

Applying a VCDR prediction model trained on clinical data to generate predicted values for VCDR to enable discovery of genetic associations for the VCDR trait.

The ML-based GWAS identified 156 distinct genomic regions associated with VCDR. We compared these results to a VCDR GWAS conducted by another group on the same UK Biobank data, Craig et al. 2020, where experts had painstakingly labeled all images for VCDR. The ML-based GWAS replicates 62 of the 65 associations found in Craig et al., which indicates that the model accurately predicts VCDR in the UK Biobank images. Additionally, the ML-based GWAS discovered 93 novel associations.

Number of statistically significant GWAS associations discovered by exhaustive expert labeling approach (Craig et al., left), and by our ML-based approach (right), with shared associations in the middle.

The ML-based GWAS improves polygenic model predictions
To validate that the novel associations discovered in the ML-based GWAS are biologically relevant, we developed independent PRSes using the Craig et al. and ML-based GWAS results, and tested their ability to predict human-expert-labeled VCDR in a subset of UK Biobank as well as a fully independent cohort (EPIC-Norfolk). The PRS developed from the ML-based GWAS showed greater predictive ability than the PRS built from the expert labeling approach in both datasets, providing strong evidence that the novel associations discovered by the ML-based method influence VCDR biology, and suggesting that the improved phenotyping accuracy (i.e., more accurate VCDR measurement) of the model translates into a more powerful GWAS.

The correlation between a polygenic risk score (PRS) for VCDR generated from the ML-based approach and the exhaustive expert labeling approach (Craig et al.). In these plots, higher values on the y-axis indicate a greater correlation and therefore greater prediction from only the genetic data. [* — p ≤ 0.05; *** — p ≤ 0.001]

As a second validation, because we know that VCDR is strongly correlated with glaucoma, we also investigated whether the ML-based PRS was correlated with individuals who had either self-reported that they had glaucoma or had medical procedure codes suggestive of glaucoma or glaucoma treatment. We found that the PRS for VCDR determined using our model predictions were also predictive of the probability that an individual had indications of glaucoma. Individuals with a PRS 2.5 or more standard deviations higher than the mean were more than 3 times as likely to have glaucoma in this cohort. We also observed that the VCDR PRS from ML-based phenotypes was more predictive of glaucoma than the VCDR PRS produced from the extensive manual phenotyping.

The odds ratio of glaucoma (self-report or ICD code) stratified by the PRS for VCDR determined using the ML-based phenotypes (in standard deviations from the mean). In this plot, the y-axis shows the probability that the individual has glaucoma relative to the baseline rate (represented by the dashed line). The x-axis shows standard deviations from the mean for the PRS. Data are visualized as a standard box plot, which illustrates values for the mean (the orange line), first and third quartiles, and minimum and maximum.

Conclusion
We have shown that ML models can be used to quickly phenotype large cohorts for GWAS, and that these models can increase statistical power in such studies. Although these examples were shown for eye traits predicted from retinal imaging, we look forward to exploring how this concept could generally apply to other diseases and data types.

Acknowledgments
We would like to especially thank co-author Dr. Anthony Khawaja of Moorfields Eye Hospital for contributing his extensive medical expertise. We also recognize the efforts of Professor Jamie Craig and colleagues for their exhaustive labeling of UK Biobank images, which allowed us to make comparisons with our method. Several authors of that work, as well as Professor Stuart MacGregor and collaborators in Australia and at Max Kelsen have independently replicated these findings, and we value these scientific contributions as well.