
GPU-Accelerated Tools Added to NVIDIA Clara Parabricks v3.6 for Cancer and Germline Analyses


The release of NVIDIA Clara Parabricks v3.6 brings new applications for variant calling, annotation, filtering, and quality control to its suite of powerful genomic analysis tools. Now featuring over 33 accelerated tools for every stage of genomic analysis, NVIDIA Clara Parabricks provides GPU-accelerated bioinformatic pipelines that can scale for any workload.

As genomes and exomes are sequenced faster than ever before, growing volumes of raw instrument data must be mapped, aligned, and interpreted to identify variants and their significance to disease. Bioinformatics pipelines need to keep pace with this output. CPU-based analysis pipelines often take weeks or months to deliver results, while GPU-based pipelines can analyze a 30X whole human genome in 22 minutes and a whole human exome in 4 minutes.

These fast turnaround times are necessary to keep pace with next generation sequencing (NGS) genomic instrument outputs. This is imperative for large-scale population, cancer center, pharmaceutical drug development, and genomic research projects that require quick results for publications.

NVIDIA Clara Parabricks v3.6 Incorporates:

  1. New GPU-accelerated variant callers 
  2. An easy-to-use vote-based VCF merging tool (VBVM)
  3. A database annotation tool (VCFANNO)
  4. A new tool for quickly filtering a VCF by allele frequency (FrequencyFiltration)
  5. Tools for VCF quality control (VCFQC and VCFQCbyBAM) for both somatic and germline pipelines.

Figure 1: Analysis runtimes for open-source CPU-based somatic variant calling tools compared to GPU-accelerated NVIDIA Clara Parabricks. Relative to the community versions, NVIDIA Clara Parabricks accelerates LoFreq by 6x, SomaticSniper by 16x, and Mutect2 by 42x. These benchmarks were run on 50X WGS matched tumor-normal data from the SEQC-II benchmark set on 4x V100s.

Accelerating LoFreq and Other Somatic Callers

With the addition of LoFreq alongside Strelka2, Mutect2, and SomaticSniper, Clara Parabricks now includes four somatic callers for cancer workflows. LoFreq is a fast and sensitive variant caller for inferring SNVs and indels from NGS data. LoFreq runs on a variety of aligned sequencing data, such as Illumina, IonTorrent, and PacBio. It can automatically adapt to changes in coverage and sequencing quality, and can be applied to somatic, viral/quasispecies, metagenomic, and bacterial datasets.

The LoFreq somatic caller in Clara Parabricks is 10x faster than its native instance and is ideal for calling low-frequency mutations. Using base-call qualities and other sources of error inherent in NGS data, LoFreq improves the accuracy of calling somatic mutations below the 10% allele frequency threshold.

The accelerated LoFreq supports only SNV calling in v3.6, with Indel calling coming in a subsequent release.

Figure 2: Runtimes for open-source DeepVariant (blue) and GPU-accelerated NVIDIA Clara Parabricks (green). Runtimes for 30X Illumina short read data are on the left; runtimes for PacBio 35X long read data are on the right. NVIDIA Clara Parabricks’ DeepVariant is 10-15x faster than the open-source version (blue “DeepVariant” bars compared to green “DeepVariant” bars).

From Months to Hours with New Accelerated Tools

NVIDIA Clara Parabricks v3.6 also includes a bam2fastq tool, the addition of the smoove structural variant caller, support for de novo mutation detection, and new tools for VCF processing (for example, annotation, filtering, and merging). A standard WGS analysis for a 30X human genome finishes in 22 minutes on a DGX A100, which is over 80 times faster than CPU-based workflows on the same server. With this acceleration, projects that once took months can now be done in hours.

Bam2Fastq is an accelerated version of GATK SamToFastq. It converts a BAM or CRAM file to FASTQ. This is useful when samples need to be realigned to a new reference but the original FASTQs were deleted to save storage space. Now they can be regenerated from the BAMs and aligned to a new reference more quickly than ever before.

Detecting de novo variants (DNVs), germline variants present in an offspring but in neither parent, by comparing sequence data for an offspring with that of its parents (trio analysis) is critical for studies of disease-related variation and for establishing a baseline for generational mutation rates.

A GPU-based workflow to call DNVs is now included in NVIDIA Clara Parabricks v3.6 and utilizes Google’s DeepVariant, which has been tested on trio analyses and other pedigree sequencing projects.

For structural variant calling, NVIDIA Clara Parabricks already includes Manta, and now smoove has been added. Smoove simplifies and speeds up calling and genotyping structural variants from short reads. It also improves specificity by removing alignment signals that are indicative of low-level noise and that often contribute to spurious calls.

Figure 3: GPU-accelerated genomics analysis tools in NVIDIA Clara Parabricks v3.6.

NVIDIA Clara Parabricks v3.6 also focuses on steps of the genomic pipeline after variant calling. BamBasedVCFQC is an NVIDIA-developed tool that helps QC VCF outputs using samtools mpileup results generated from the original BAM. Vcfanno lets users annotate VCF outputs with third-party data sources such as dbSNP, for example adding allele frequencies to the VCF.

FrequencyFiltration filters variants within a VCF based on numeric fields containing allele frequency and read count information. Finally, the vote-based VCF merging tool (VBVM) merges two or more VCF files and then filters variants using a simple voting mechanism, based on the number of somatic callers that identified a given variant.
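
Purely as a conceptual illustration of allele-frequency filtering, not the Parabricks FrequencyFiltration tool itself, here is a minimal sketch with pysam, assuming the VCF carries an INFO/AF field; the file names and the 0.05 cutoff are placeholders:

import pysam

# Conceptual sketch only: keep variants whose INFO/AF meets a threshold.
vcf_in = pysam.VariantFile("calls.vcf.gz")
vcf_out = pysam.VariantFile("filtered.vcf", "w", header=vcf_in.header)

for rec in vcf_in:
    af = rec.info.get("AF")
    if af is None:
        continue
    values = af if isinstance(af, tuple) else (af,)  # AF is a tuple at multi-allelic sites
    if max(values) >= 0.05:
        vcf_out.write(rec)

vcf_in.close()
vcf_out.close()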


Jupyter on tf 2.5 not compatible?

I'm trying to create a conda env with tf-gpu 2.5. When conda installing jupyter, I get all these package conflicts. I didn't have this problem with tf 2.3. Does jupyter not support the later versions of tensorflow?

submitted by /u/Inevitable_Charge828


Put text/image on empty spaces in image


I want to place my own objects in empty spaces. I can detect objects and empty spaces in the image, but I cannot place my objects in the empty spaces. Which method should I use?

https://preview.redd.it/maa7y4id8bj71.jpg?width=820&format=pjpg&auto=webp&s=4156f7772b3ab2af1581e87e25b37cf337eb776a
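
One simple approach is to resize your object to fit the detected empty region and paste it in with array slicing, or draw text there with OpenCV. A minimal sketch, assuming you already have a bounding box (x, y, w, h) for an empty region and an overlay image loaded as a NumPy array; all names here are assumptions:

import cv2

def paste_overlay(scene, overlay, x, y, w, h):
    # Resize the overlay to fit the empty region, then paste it in place.
    resized = cv2.resize(overlay, (w, h))
    scene[y:y + h, x:x + w] = resized  # simple paste; no blending
    return scene

def put_text(scene, text, x, y, h):
    # Alternatively, draw text roughly centered in the empty region.
    cv2.putText(scene, text, (x, y + h // 2), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
    return scene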

submitted by /u/koalabey


Global Availability of NVIDIA AI Enterprise Makes AI Accessible for Every Industry

NVIDIA today announced the availability of NVIDIA AI Enterprise, a comprehensive software suite of AI tools and frameworks that enables the hundreds of thousands of companies running VMware vSphere to virtualize AI workloads on NVIDIA-Certified Systems™.


TFLite quantization with representative_dataset

I am trying to quantize my model for tflite using a representative dataset built from a section of my training dataset (shape (7000, 51, 300, 1)). For the generator, let's say data_rep = np.array(data_prepped_train[0:100]).

The trouble is I am getting the error below.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

can anyone give me an insight on the matter?
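
One common cause of this error is passing the NumPy array itself as representative_dataset instead of a generator that yields lists of input tensors. A minimal sketch, assuming a Keras model named model and the data_prepped_train array from the question:

import numpy as np
import tensorflow as tf

data_rep = np.array(data_prepped_train[0:100], dtype=np.float32)

def representative_dataset():
    # Yield one sample at a time, wrapped in a list, with a leading batch dimension.
    for sample in data_rep:
        yield [sample[np.newaxis, ...]]  # shape (1, 51, 300, 1)

converter = tf.lite.TFLiteConverter.from_keras_model(model)  # 'model' is assumed
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()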

submitted by /u/tagiyevv


How to use numpy functions in tensorflow custom loss?

I am trying to design a model to minimize the output value of a certain function, which takes an input array, performs certain math operations on each element of the input array, and returns a final result. I have written this function using numpy and am trying to define a loss like this:

function = function_using_numpy(input_array)  # returns scalar float

def loss_function(truth, prediction):
    loss = k.abs(function(truth) - function(prediction))
    return loss

The problem is that tensorflow cannot convert a tensor to a numpy array to compute the loss. Is there a way around this? Would be grateful for some pointers. Thanks in advance.
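
The usual workaround is to express the same math with TensorFlow ops so the loss stays differentiable (tf.py_function can wrap NumPy code, but gradients will not flow through the NumPy part). A minimal sketch, with a placeholder computation standing in for function_using_numpy:

import tensorflow as tf

def function_using_tf(x):
    # Placeholder for the real math; replace tf.reduce_sum/tf.square with the
    # TensorFlow equivalents of the NumPy operations used in function_using_numpy.
    return tf.reduce_sum(tf.square(x))

def loss_function(truth, prediction):
    return tf.abs(function_using_tf(truth) - function_using_tf(prediction))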

submitted by /u/LatePenguins


Join the Virtual MONAI Bootcamp, Sept. 22-24

Apply for the Sept. 22-24 MONAI virtual bootcamp featuring presentations, hands-on labs, and a mini-challenge day.

Due to the success of the 2020 MONAI Virtual Bootcamp, MONAI is hosting another Bootcamp this year from September 22 to September 24, 2021—the week before MICCAI.

The MONAI Bootcamp will be a three-day virtual event with presentations, hands-on labs, and a mini-challenge day. Applicants are encouraged but not required to have some basic knowledge in deep learning and Python programming. 

Everyone is welcome to join and learn more about MONAI!

With the growth of MONAI, there will be content for everyone. 

  • Day one begins with a beginner-friendly introduction to MONAI, for those just getting started with deep learning in medical imaging.  
  • Day two focuses on more advanced topics in MONAI and expands on other releases, like MONAI Label and the upcoming MONAI Deploy project. 
  • Day three consists of a series of increasingly difficult mini-challenges, with beginner-friendly challenges and some that hopefully challenge even experienced researchers.

Find a tentative schedule below. A more detailed agenda will be available closer to the event.

Agenda

Day 1: September 22, 2021, 7:30am – 12:30 pm PST
Welcome and Introductions
What is MONAI? 
Lab 1 – Getting started with MONAI
Lightning Talks
Lab 2 – MONAI Deep Dive
Lab 3 – End-to-End Workflow with MONAI

Day 2: September 23, 2021, 7:30am – 12:30 pm PST
Opening Remarks and Overview
MONAI – Advanced Topics on Medical Imaging
MONAI Label
MONAI Deploy

Day 3: September 24, 2021, 7:30am – 12:30 pm PST
MONAI Mini-Challenges Day


The deadline to register is September 8. Apply today!


One-click Deployment of NVIDIA Triton Inference Server to Simplify AI Inference on Google Kubernetes Engine (GKE)

NVIDIA and Google Cloud have collaborated to make it easier for enterprises to take AI to production by combining the power of NVIDIA Triton Inference Server with Google Kubernetes Engine (GKE).

The rapid growth in artificial intelligence is driving up the size of data sets, as well as the size and complexity of networks. AI-enabled applications like e-commerce product recommendations, voice-based assistants, and contact center automation require tens to hundreds of trained AI models. Inference serving helps infrastructure managers deploy, manage and scale these models with a guaranteed real-time quality-of-service (QoS) in production. Additionally, infrastructure managers look to provision and manage the right compute infrastructure on which to deploy these AI models, with maximum utilization of compute resources and flexibility to scale up or down to optimize operational costs of deployment. Taking AI to production is both an inference serving and infrastructure management challenge.

NVIDIA and Google Cloud have collaborated to make it easier for enterprises to take AI to production by combining the power of NVIDIA Triton Inference Server, a universal inference serving platform for CPUs and GPUs, with Google Kubernetes Engine (GKE), a managed environment to deploy, scale, and manage containerized AI applications on secure Google infrastructure.

Inference Serving on CPUs and GPUs on Google Cloud with NVIDIA Triton Inference Server

Operationalizing AI models within enterprise applications poses a number of challenges: serving models trained in multiple frameworks, handling different types of inference queries, and building a serving solution that can optimize across multiple deployment platforms such as CPUs and GPUs.

Triton Inference Server addresses these challenges by providing a single standardized inference platform that can deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, OpenVINO or a custom C++/Python framework), from local storage or Google Cloud’s managed storage on any GPU- or CPU-based infrastructure.

Figure 1. Triton Inference Server deployed on Google Kubernetes Engine (GKE)

One-Click Deployment of NVIDIA Triton Inference Server on GKE Clusters

Triton on Google Kubernetes Engine (GKE) delivers the benefit of a universal inference serving platform for AI models deployed on both CPUs and GPUs, combined with the ease of Kubernetes cluster management, load balancing, and autoscaling compute based on demand.

Triton can be seamlessly deployed as a containerized microservice on a Google Kubernetes Engine (GKE) managed cluster using the new One-Click Triton Inference Server App for GKE on Google Marketplace.

The Triton Inference Server App for GKE is a helm chart deployer that automatically installs and configures Triton for use on a GKE cluster with NVIDIA GPU node pools, including NVIDIA A100 Tensor Core GPUs and NVIDIA T4 Tensor Core GPUs, and leverages Istio on Google Cloud for traffic ingress and load balancing. It also includes a horizontal pod autoscaler (HPA), which relies on the Stackdriver custom metrics adapter to monitor GPU duty cycle and automatically scale the GPU nodes in the GKE cluster based on inference queries and SLA requirements.
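
Once the Triton service is exposed through the GKE ingress, clients can send inference requests over HTTP or gRPC. A minimal sketch using the tritonclient Python package; the endpoint, model name, and tensor names are placeholder assumptions for your deployment:

import numpy as np
import tritonclient.http as httpclient

# Endpoint, model name, and tensor names below are placeholders.
client = httpclient.InferenceServerClient(url="INGRESS_IP:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input__0", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [httpclient.InferRequestedOutput("output__0")]

result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("output__0").shape)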

To learn more about the One-Click Triton Inference Server App for GKE, check out this in-depth blog by Google Cloud and NVIDIA and see how the solution scales to meet stringent latency budgets and optimizes operational costs for your AI deployments.

You can also register for the “Building a Computer Vision Service Using NVIDIA NGC and Google Cloud” webinar on August 25 to learn how to build an end-to-end computer vision service on Google Cloud by combining NVIDIA GPU-optimized pretrained models and the Transfer Learning Toolkit (TLT) from the NGC Catalog with the Triton Inference Server App for GKE.


Accelerating IO in the Modern Data Center: Magnum IO Storage


This is the fourth post in the Accelerating IO series. It addresses storage issues and shares recent results and directions with our partners. We cover the new GPUDirect Storage release, benefits, and implementation.

Accelerated computing needs accelerated IO. Otherwise, computing resources get starved for data. Given that the fraction of all workflows for which data fits in memory is shrinking, optimizing storage IO is of increasing importance. The value of stored data, efforts to pilfer or corrupt data, and regulatory requirements to protect it are also all ratcheting up. To that end, there is growing demand for data center infrastructure that can provide greater isolation of users from data that they shouldn’t access.

GPUDirect Storage

GPUDirect Storage streamlines the flow of data between storage and GPU buffers for applications that consume or produce data on the GPU without needing CPU processing. No extra copies that add latency and impede bandwidth are needed. This simple optimization leads to game-changing role reversals where data can be fed to GPUs faster from remote storage rather than CPU memory.

The newest member of the GPUDirect family

The GPUDirect family of technologies enables access and efficient data movement into and out of the GPU. Until recently, it was focused on memory-to-memory transfers. With the addition of GPUDirect Storage (GDS), access and data movement with storage are also accelerated. GPUDirect Storage makes the significant step of adding file IO between local and remote storage to CUDA.

Release v1.0 with CUDA 11.4

GPUDirect Storage has been vetted for more than two years and is currently available as production software. Previously available only through a separate installation, GDS is now incorporated into CUDA version 11.4 and later, and it can be either part of the CUDA installation or installed separately. For an installation of CUDA version X-Y, the libcufile-X-Y.so user library and gds-tools-X-Y are installed by default, and the nvidia-fs.ko kernel driver is an optional install. For more information, see the GDS troubleshooting and installation documentation.

GDS is now available in RAPIDS. It is also available in a PyTorch container and an MXNet container.

GDS description and benefits

GPUDirect Storage enables a direct datapath between storage and GPU memory. Data is moved using the direct memory access (DMA) engine in local NVMe drives or in a NIC that communicates with remote storage.

Use of that DMA engine means that, although the setup of the DMA is a CPU operation, the CPU and GPU are totally uninvolved in the datapath, leaving them free and unencumbered (Figure 1). On the left, data from storage comes in through a PCIe switch, goes through the CPU to system memory and all the way back down to the GPU. On the right, the datapath skips the CPU and system memory. The benefits are summarized at the bottom.

Figure 1. Datapaths without GPUDirect Storage (left) and with GPUDirect Storage (right). Without GDS, transfers are limited by bandwidth into and out of the CPU, incur the latency of a CPU bounce buffer, are capped by O(1TB) of memory capacity, storage is not part of CUDA, and there is no topology-based optimization. With GDS, bandwidth into GPUs is limited only by the NICs, latency is lower due to the direct copy, O(PB) of capacity is accessible, the CUDA programming model stays simple, and transfers can be routed adaptively through NVLink and GPU buffers.

GPUDirect Storage offers three basic performance benefits:

  • Increased bandwidth: By removing the need to go through a bounce buffer in the CPU, alternate paths become available on some platforms, including those that may offer higher bandwidth through a PCIe switch or over NVLink. While DGX platforms have both PCIe switches and NVLinks, not all platforms do. We recommend using both to maximize performance. The Mars lander example achieved an 8x bandwidth gain.
  • Decreased latency: Reduce latency by avoiding the delay of an extra copy through CPU memory and the overhead of managing memory, which can be severe in extreme cases. A 3x reduction in latency is common.
  • Decreased CPU utilization: Use of a bounce buffer introduces extra operations on the CPU, both to perform the extra copy and to manage the memory buffers. When CPU utilization becomes the bottleneck, effective bandwidth can drop significantly. We’ve measured 3x improvements in CPU utilization with multiple file systems.

Without GDS, there’s only one available datapath: from storage to the CPU and from the CPU to the relevant GPU with cudaMemcpy. With GDS, there are additional optimizations available:

  • The CPU threads used to interact with the DMA engine are affinitized to the closest CPU core.
  • If the storage and GPU hang off different sockets and NVLink is an available connection, then data may be staged through a fast bounce buffer in the memory of a GPU near the storage, and then transferred using CUDA to the final GPU memory target buffer. This can be considerably faster than using the intersocket path, for example, UPI.
  • There is no cudaMemcpy involved to take care of segmenting the IO transfer to fit in the GPU BAR1 aperture, whose size varies by GPU SKU, or into prepinned buffers in case the target buffer is not pinned with cuFileBufRegister. These operations are managed with the libcufile.so user library code.
  • Unaligned accesses, where the offset of the data to be transferred within the file does not align with a page boundary, are handled.
  • In future GDS releases, the cuFile APIs will support asynchronous and batched operations. This enables a CUDA kernel to be sequenced after a read in the CUDA stream that provides inputs to that kernel, and a write to be sequenced after a kernel that produces data to be written. In time, cuFile APIs will be usable in the context of CUDA Graphs as well.

Table 1 shows the peak and measured bandwidths on NVIDIA DGX-2 and DGX A100 systems. This data shows that the achievable bandwidth into GPUs from local storage exceeds the maximum bandwidth from up to 1 TB of CPU memory under ideal conditions. Commonly measured bandwidths from petabytes of remote memory can be well more than double the bandwidth that CPU memory provides in practice.

Spilling data that won’t fit in GPU memory out to even petabytes of remote storage can exceed the achievable performance of paging it back to the 1 TB of memory in the CPU. This is a remarkable reversal of history.

Endpoint                 | Capacity  | DGX-2 (Gen3), GB/s   | DGX A100 (Gen4), GB/s
CPU (peak)               | -         | 50                   | 100
Switch/GPU (peak)        | -         | 100                  | 200*
CPU sysmem (measured)    | O(1TB)    | 48-50 @ 4 PCIe       | 96-100 @ 4 PCIe
Local storage (measured) | O(100TB)  | 53+ @ 16 drives      | 53+ @ 8 drives
RAID cards (measured)    | O(100TB)  | 112 (MicroChip) @ 8  | N/A
NICs (measured)          | O(1PB)    | 90+ @ 8 NICs         | 185+ @ 8 NICs

Table 1. Access to petabytes of data is possible at bandwidths that exceed those to only 1 TB of CPU memory.

* Performance numbers shown here with NVIDIA GPUDirect Storage on NVIDIA DGX A100 slots 0-3 and 6-9 are not the officially supported network configuration and are for experimental use only. Sharing the same network adapters for both compute and storage may impact the performance of standard or other benchmarks previously published by NVIDIA on DGX A100 systems.

How GDS works

NVIDIA seeks to embrace existing standards wherever possible, and to judiciously extend them where necessary. The POSIX standard’s pread and pwrite provide copies between storage and CPU buffers, but do not yet enable copies to GPU buffers. This shortcoming of not supporting GPU buffers in the Linux kernel will be addressed over time.

A solution, called dma_buf, that enables copies among devices like a NIC or NVMe and GPU, which are peers on the PCIe bus, is in progress to address that gap. In the meantime, the performance upside from GDS is too large to wait for an upstreamed solution to propagate to all users. Alternate GDS-enabled solutions have been provided by a variety of vendors, including MLNX_OFED (Table 2). The GDS solution involves new APIs, cuFileRead or cuFileWrite, that are similar to POSIX pread and pwrite.

Optimizations like dynamic routing, use of NVLink, and async APIs for use in CUDA streams, which are only available from GDS, make the cuFile APIs an enduring feature of the CUDA programming model, even after gaps in the Linux file system are addressed.
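
For Python users, the kvikio bindings that ship with RAPIDS expose this same cuFile path. A minimal sketch, assuming kvikio and CuPy are installed and the file lives on a GDS-enabled file system; the path and buffer size are placeholders:

import cupy as cp
import kvikio

# Allocate a GPU buffer and read file contents straight into it via cuFile.
buf = cp.empty(1 << 20, dtype=cp.uint8)          # 1 MiB GPU buffer (placeholder size)
f = kvikio.CuFile("/mnt/nvme/sample.bin", "r")   # placeholder path
nbytes = f.read(buf)                             # DMA from storage into GPU memory
f.close()
print(f"read {nbytes} bytes into GPU memory")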

Here’s what the GDS implementation does. First, the fundamental problem with the current Linux implementation is passing a GPU buffer address as a DMA target down through the virtual file system (VFS) so that the DMA engine in a local NVMe or a network adapter can perform a transfer to or from GPU memory. This leads to an error condition. We have a way around this problem for now: Pass down an address for a buffer in CPU memory instead.

When the cuFile APIs like cuFileRead or cuFileWrite are used, the libcufile.so user-level library captures the GPU buffer address and substitutes a proxy CPU buffer address that’s passed to VFS. Just before the buffer address is used for a DMA, a call from a GDS-enabled driver to nvidia-fs.ko identifies the CPU buffer address and provides a substitute GPU buffer address again so that the DMA can proceed correctly.

The logic in libcufile.so performs the various optimizations described earlier, like dynamic routing, use of prepinned buffers, and alignment. Figure 2 shows the stack used for this optimization. The cuFile APIs are an example of the Magnum IO architectural principles of flexible abstraction that enable platform-specific innovation and optimization, like selective buffering and use of NVLink.

The software stack to enable GDS includes the application, cuFile user library, NVIDIA kernel driver, and standard or proprietary storage drivers.
Figure 2. GDS software stack, where the applications use cuFile APIs, and the GDS-enabled storage drivers call out to the nvidia-fs.ko kernel driver to obtain the correct DMA address.

To learn more

The GPUDirect Storage post was the original introduction to GPUDirect Storage. We recommend the NVIDIA GPUDirect Storage Design Guide to end customers and OEMs, and the NVIDIA GPUDirect Storage Overview Guide to end customers, OEMs, and partners. For more information about programming with GDS, see the cuFile API Reference Guide.


Software Ate the World — That Means Hardware Matters Again

Software ate the world; now new silicon is taking a seat at the table. Ten years ago, venture capitalist Marc Andreessen proclaimed that “software is eating the world.” His once-radical concept, now a truism, is that innovation and corporate value creation lie in software. That led some to believe that hardware matters less.
