New on NGC: NVIDIA NeMo, HPC SDK, DOCA, PyTorch Lightning, and More 

Learn about the latest additions and software updates to the NVIDIA NGC catalog, a hub of GPU-optimized software that simplifies and accelerates workflows.

The NVIDIA NGC catalog is a hub for GPU-optimized deep learning, machine learning, and HPC applications. With highly performant software containers, pretrained models, industry-specific SDKs, and Helm charts, the content available in the catalog helps simplify and accelerate end-to-end workflows.

A few additions and software updates to the NGC catalog include:

NVIDIA NeMo

NVIDIA NeMo (Neural Modules) is an open-source toolkit for conversational AI. It is designed to let data scientists and researchers easily build new state-of-the-art speech and NLP networks from API-compatible building blocks that can be connected together.

The latest version of NeMo adds support for Conformer ONNX conversion and streaming inference of long audio files, and improves the performance of speaker clustering, verification, and diarization. For neural machine translation (NMT), it adds multiple datasets, right-to-left models, noisy channel re-ranking, and ensembling, improves training efficiency, and adds tutorial notebooks for NMT data cleaning and preprocessing.
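
As a taste of these building blocks, the following minimal sketch loads a pretrained ASR model from NGC and transcribes an audio file. The model name and file path are placeholders, and call signatures can vary slightly between NeMo releases, so treat this as illustrative rather than definitive.

    import nemo.collections.asr as nemo_asr

    # Placeholder model name; any pretrained ASR model from NGC works similarly.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_small")

    # Placeholder path to a local audio file.
    transcripts = asr_model.transcribe(["sample.wav"])
    print(transcripts)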

NVIDIA HPC SDK

The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools essential to maximizing developer productivity, performance, and portability of HPC applications.

The latest version includes full support for the NVIDIA Arm HPC Developer Kit and CUDA 11.4. It also offers HPC compilers with Arm-specific performance enhancements, including improved vectorization and optimized math functions.

NVIDIA Data Center Infrastructure-on-a-Chip Architecture (NVIDIA DOCA)

The NVIDIA DOCA SDK enables developers to rapidly create applications and services on top of BlueField data processing units (DPUs).

The NVIDIA DOCA container and resource help deploy NVIDIA DOCA applications and development setups on the BlueField DPU. The deployment is based on Kubernetes, and the resource bundles ready-to-use .yaml configuration files for the different DOCA containers.

NVIDIA System Management (NVSM)

NVSM is a software framework for monitoring DGX nodes in a data center, providing active health monitoring, system alerts, and log generation. On DGX Station systems, NVSM also reports system health and diagnostic information.

Deep Learning Software

Our most popular deep learning frameworks for training and inference are updated monthly. Pull the latest version (v21.07) of each container from the NGC catalog.

PyTorch Lightning

PyTorch Lightning is a lightweight framework for training models at scale on multi-GPU, multi-node configurations. It does so without changing your code, turning on advanced training optimizations with the flip of a flag.

Version 1.4.0 adds support for fully sharded parallelism, which shards model parameters across multiple GPUs so that much larger models fit in memory, reaching over 40 billion parameters on an A100.

Additionally, it supports the new DeepSpeed Infinity plug-in and new cluster environments including KubeflowEnvironment and LSFEnvironment. 
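
Scaling out with these features looks roughly like the sketch below. MyLightningModule and train_loader are hypothetical stand-ins for your own module and dataloader, and the "fsdp" plugin alias is an assumption based on the 1.4.0 documentation; check the release notes for the exact names.

    import pytorch_lightning as pl

    # MyLightningModule and train_loader are placeholders for your own code.
    model = MyLightningModule()
    trainer = pl.Trainer(
        gpus=8,           # multi-GPU training on a single node
        precision=16,     # mixed precision
        plugins="fsdp",   # assumed alias for the fully sharded plugin in 1.4
    )
    trainer.fit(model, train_loader)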

See the entire list of new v1.4.0 features >>

NVIDIA Magnum IO Developer Environment

NVIDIA Magnum IO is the collection of I/O technologies that make up the I/O subsystem of the modern data center and enable applications to run at scale.

The Magnum IO Developer Environment container serves two primary purposes: 

  1. Allows developers to begin scaling applications on a laptop, desktop, workstation, or in the cloud.
  2. Serves as the basis for a build container, either locally or in a CI/CD system.

Visit the NGC catalog to see how GPU-optimized software can help simplify workflows and speed up solution times.

Free of Charge: GE Renewable Energy Integrates Wind Energy Into the Power Grid

At GE Renewable Energy, CTO Danielle Merfeld and technical leader Arvind Rangarajan are making wind, er, mind-blowing advances throughout renewable energy. Merfeld and Rangarajan spoke with NVIDIA AI Podcast host Noah Kravitz about how the company uses AI and a human-in-the-loop process to make renewable energy more widespread. Read article >

NVIDIA Turbocharges Extreme-Scale AI for Argonne National Laboratory’s Polaris Supercomputer

Largest GPU-Powered Supercomputer for U.S. Department of Energy’s Argonne Lab Will Enable Scientific Breakthroughs in Era of Exascale AI. SANTA CLARA, Calif., Aug. 25, 2021 (GLOBE NEWSWIRE) — …

Accelerating SE(3)-Transformers Training Using an NVIDIA Open-Source Model Implementation

SE(3)-Transformers are versatile graph neural networks unveiled at NeurIPS 2020. NVIDIA just released an open-source optimized implementation that uses 9x less memory and is up to 21x faster than the baseline official implementation.

SE(3)-Transformers are useful in dealing with problems with geometric symmetries, like small molecules processing, protein refinement, or point cloud applications. They can be part of larger drug discovery models, like RoseTTAFold and this replication of AlphaFold2. They can also be used as standalone networks for point cloud classification and molecular property prediction (Figure 1).

Figure 1. Architecture of a typical SE(3)-Transformer used for molecular property prediction.

In the /PyTorch/DrugDiscovery/SE3Transformer repository, NVIDIA provides a recipe to train the optimized model for molecular property prediction tasks on the QM9 dataset. The QM9 dataset contains more than 100k small organic molecules and associated quantum chemical properties.

A 21x higher training throughput

The NVIDIA implementation provides much faster training and inference overall compared with the baseline implementation. This implementation introduces optimizations to the core component of SE(3)-Transformers, namely tensor field networks (TFN), as well as to the self-attention mechanism in graphs.

These optimizations mostly take the form of operation fusion, provided that certain conditions on the hyperparameters of the attention layers are met.

Thanks to these optimizations, and by taking advantage of Tensor Cores on recent NVIDIA GPUs, training throughput is increased by up to 21x compared with the baseline implementation.

Figure 2. Training throughput on an A100 GPU (QM9 dataset, batch size of 100), in molecules per second: baseline 83, NVIDIA 1,680, NVIDIA with AMP 1,780.

In addition, the NVIDIA implementation allows the use of multiple GPUs to train the model in a data-parallel way, fully using the compute power of a DGX A100 (8x A100 80GB).

Putting everything together, on an NVIDIA DGX A100, SE(3)-Transformers can now be trained in 27 minutes on the QM9 dataset. As a comparison, the authors of the original paper state that the training took 2.5 days on their hardware (NVIDIA GeForce GTX 1080 Ti).

Faster training enables you to iterate quickly during the search for the optimal architecture. Together with the lower memory usage, you can now train bigger models with more attention layers or hidden channels, and feed larger inputs to the model.

A 9x lower memory footprint

SE(3)-Transformers were known to be memory-heavy models, meaning that feeding large inputs like large proteins or many batched small molecules was challenging. This was a bottleneck for users with limited GPU memory.

This has now changed with the NVIDIA implementation, open-sourced on DeepLearningExamples. Figure 3 shows that, thanks to NVIDIA optimizations and support for mixed precision, the training memory usage is reduced by up to 9x compared to the baseline implementation.

Figure 3. Training peak memory consumption of the baseline and NVIDIA implementations of SE(3)-Transformers (QM9 dataset, 100 molecules per batch, V100 32-GB GPU): baseline 27 GB, NVIDIA 5.7 GB, NVIDIA with AMP 3.8 GB, NVIDIA with AMP and low-memory mode 3 GB.

In addition to the improvements for single and mixed precision, a low-memory mode is provided. When this flag is enabled and the model runs in either TF32 (NVIDIA Ampere architecture) or FP16 (NVIDIA Ampere, Turing, and Volta architectures) precision, the model switches to a mode that trades throughput for extra memory savings.

In practice, on the QM9 dataset with a V100 32-GB GPU, the baseline implementation can scale up to a batch size of 100 before running out of memory. The NVIDIA implementation can fit up to 1000 molecules per batch (mixed precision, low-memory mode).

For researchers handling proteins with amino acid residues as nodes, this means that you can feed longer sequences and increase the receptive field of each residue.

SE(3)-Transformer optimizations

Here are some of the optimizations that the NVIDIA implementation provides compared to the baseline. For more information, see the source code and documentation on the DeepLearningExamples/PyTorch/DrugDiscovery/SE3Transformer repository.

Fused keys and values computation

Inside the self-attention layers, keys, queries, and values tensors are computed. Queries are graph node features and are a linear projection of the input features. Keys and values, on the other hand, are graph edge features. They are computed using TFN layers. This is where most computation happens in SE(3)-Transformers and where most of the parameters live.

The baseline implementation uses two separate TFN layers to compute keys and values. In the NVIDIA implementation, these are fused into one TFN with the number of channels doubled, which halves the number of small CUDA kernels launched and better exploits GPU parallelism. Radial profiles, the fully connected networks inside TFNs, are also fused with this optimization. Figure 4 shows an overview.

Figure 4. Keys, queries, and values computation inside the NVIDIA implementation. Keys and values are computed together and then chunked along the channel dimension.
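
The following PyTorch sketch illustrates the fusion pattern in isolation, with plain linear projections standing in for the TFN layers; it is not the repository's actual code.

    import torch
    import torch.nn as nn

    class FusedKeyValueProjection(nn.Module):
        """One projection with doubled output channels replaces two separate
        key/value projections, halving the number of kernel launches."""
        def __init__(self, in_channels: int, channels: int):
            super().__init__()
            self.proj = nn.Linear(in_channels, 2 * channels)  # fused keys and values

        def forward(self, edge_feats: torch.Tensor):
            kv = self.proj(edge_feats)
            keys, values = kv.chunk(2, dim=-1)  # split along the channel dimension
            return keys, values

    # Example: project 128 edge features of width 32 into 64-channel keys and values.
    keys, values = FusedKeyValueProjection(32, 64)(torch.randn(128, 32))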

Fused TFNs

Features inside SE(3)-Transformers have, in addition to their number of channels, a degree d, which is a non-negative integer. A feature of degree d has a dimensionality of 2d + 1; for example, a degree-1 feature is a three-dimensional vector. A TFN takes in features of different degrees, combines them using tensor products, and outputs features of different degrees.

For a layer with 4 degrees as input and 4 degrees as output, all combinations of degrees are considered: there are in theory 4×4=16 sublayers that must be computed.

These sublayers are called pairwise TFN convolutions. Figure 5 shows an overview of the sublayers involved, along with the input and output dimensionality for each. Contributions to a given output degree (columns) are summed together to obtain the final features.

Figure 5. Pairwise convolutions involved in a TFN layer with 4 degrees as input and 4 degrees as output.

NVIDIA provides multiple levels of fusion to accelerate these convolutions when certain conditions on the TFN layers are met. Fused layers enable Tensor Cores to be used more effectively by creating shapes with dimensions that are multiples of 16. Fused convolutions are applied in three cases:

  • Output features have the same number of channels
  • Input features have the same number of channels
  • Both conditions are true

The first case is when all the output features have the same number of channels, and output degrees span the range from 0 to the maximum degree. In this case, fused convolutions that output fused features are used. This fusion level is used for the first TFN layer of SE(3)-Transformers.

Figure 6. Partially fused TFN per output degree.

The second case is when all the input features have the same number of channels, and input degrees span the range from 0 to the maximum degree. In this case, fused convolutions that operate on fused input features are used. This fusion level is used for the last TFN layer of SE(3)-Transformers.

Figure 7. Partially fused TFN per input degree.

In the last case, fully fused convolutions are used when both conditions are met. These convolutions take as input fused features, and output fused features. This means that only one sublayer is necessary per TFN layer. Internal TFN layers use this fusion level.

Figure 8. Fully fused TFN.

Basis precomputation

In addition to input node features, TFNs need basis matrices as input. There exists a set of matrices for each graph edge, and these matrices depend on the relative positions between the destination and source nodes.

In the baseline implementation, these matrices are computed at the beginning of the forward pass and shared across all TFN layers. They depend on spherical harmonics, which can be expensive to compute. Because the input graphs do not change with the QM9 dataset (no data augmentation, no iterative position refinement), this introduces redundant computation across epochs.

The NVIDIA implementation provides the option to precompute these bases at the beginning of training. The full dataset is iterated over once and the bases are cached in RAM. The computation of bases at the beginning of each forward pass is then replaced by a faster CPU-to-GPU memory copy.
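
The caching pattern looks roughly like the sketch below; compute_basis is a hypothetical helper standing in for the spherical-harmonics computation in the repository.

    import torch

    def precompute_bases(dataset, compute_basis):
        # Iterate over the full dataset once and keep the per-graph bases in CPU RAM.
        # Assumes compute_basis(graph) returns a dict of tensors keyed by degree pair.
        return [compute_basis(graph) for graph in dataset]

    def load_bases(cache, idx, device):
        # During training, a cheap host-to-device copy replaces recomputing the bases.
        return {name: tensor.to(device, non_blocking=True)
                for name, tensor in cache[idx].items()}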

Conclusion

I encourage you to check the implementation of the SE(3)-Transformer model in the NVIDIA /PyTorch/DrugDiscovery/SE3Transformer GitHub repository. In the comments, share how you plan to adopt and extend this project.

GPU-Accelerated Tools Added to NVIDIA Clara Parabricks v3.6 for Cancer and Germline Analyses

The release of NVIDIA Clara Parabricks v3.6 brings new applications for variant calling, annotation, filtering, and quality control to its suite of powerful genomic analysis tools. Now featuring over 33 accelerated tools for every stage of genomic analysis, NVIDIA Clara Parabricks provides GPU-accelerated bioinformatic pipelines that can scale for any workload.

As genomes and exomes are sequenced at faster speeds than ever before, increasing loads of raw instrument data must be mapped, aligned, and interpreted to decipher variants and their significance to disease. Bioinformatics pipelines need to keep up with the instruments generating this data. CPU-based analysis pipelines often take weeks or months to glean results, while GPU-based pipelines can analyze 30X whole human genomes in 22 minutes and whole human exomes in 4 minutes.

These fast turnaround times are necessary to keep pace with next-generation sequencing (NGS) instrument outputs. This is imperative for large-scale population studies, cancer centers, pharmaceutical drug development, and genomic research projects that require quick results for publication.

NVIDIA Clara Parabricks v3.6 Incorporates:

  1. New GPU-accelerated variant callers 
  2. An easy-to-use vote-based VCF merging tool (VBVM)
  3. A database annotation tool (VCFANNO)
  4. A new tool for quickly filtering a VCF by allele frequency (FrequencyFiltration)
  5. Tools for VCF quality control (VCFQC and VCFQCbyBAM) for both somatic and germline pipelines

Figure 1: Analysis runtimes for open-source CPU-based somatic variant calling tools compared to GPU-accelerated NVIDIA Clara Parabricks. Relative to the community versions, NVIDIA Clara Parabricks accelerates LoFreq by 6x, SomaticSniper by 16x, and Mutect2 by 42x. These benchmarks were run on 50X WGS matched tumor-normal data from the SEQC-II benchmark set on 4x V100s.

Accelerating LoFreq and Other Somatic Callers

With the addition of LoFreq alongside Strelka2, Mutect2, and SomaticSniper, Clara Parabricks now includes four somatic callers for cancer workflows. LoFreq is a fast and sensitive variant caller for inferring SNVs and indels from NGS data. LoFreq runs on a variety of aligned sequencing data, such as Illumina, Ion Torrent, and PacBio. It can automatically adapt to changes in coverage and sequencing quality, and can be applied to somatic, viral/quasispecies, metagenomic, and bacterial datasets.

The LoFreq somatic caller in Clara Parabricks is 10x faster than its native CPU counterpart and is ideal for calling low-frequency mutations. By modeling base-call qualities and other sources of error inherent in NGS data, LoFreq improves the accuracy of calling somatic mutations below the 10% allele frequency threshold.

The accelerated LoFreq supports only SNV calling in v3.6, with Indel calling coming in a subsequent release.
Read more >>

Figure 2: Runtimes for open-source DeepVariant (blue) and GPU-accelerated NVIDIA Clara Parabricks (green). Runtimes for 30X Illumina short read data are on the left; runtimes for PacBio 35X long read data are on the right. NVIDIA Clara Parabricks’ DeepVariant is 10-15x faster than the open-source version (blue “DeepVariant” bars compared to green “DeepVariant” bars).

From Months to Hours with New Accelerated Tools

NVIDIA Clara Parabricks v3.6 also includes a bam2fastq tool, the smoove structural variant caller, support for de novo mutation detection, and new tools for VCF processing (for example, annotation, filtering, and merging). A standard WGS analysis of a 30X human genome finishes in 22 minutes on a DGX A100, which is over 80 times faster than CPU-based workflows on the same server. With this acceleration, projects that used to take months can now be done in hours.

Bam2Fastq is an accelerated version of GATK SamToFastq. It converts a BAM or CRAM file to FASTQ. This is useful for scenarios where samples need to be realigned to a new reference but the original FASTQs were deleted to save storage space. Now they can be regenerated from the BAMs and aligned to a new reference more quickly than ever before.

Detection of de novo variants (DNVs) that occur in the germline genome, found by comparing sequence data for an offspring with that of its parents (trio analysis), is critical for studies of disease-related variation and for establishing a baseline for generational mutation rates.

A GPU-based workflow to call DNVs is now included in NVIDIA Clara Parabricks v3.6 and utilizes Google’s DeepVariant, which has been tested on trio analyses and other pedigree sequencing projects.
Learn more >>

For structural variant calling, NVIDIA Clara Parabricks already includes Manta, and now smoove has been added. Smoove simplifies and speeds up calling and genotyping structural variants for short reads. It also improves specificity by removing alignment signals that are indicative of low-level noise and that often contribute to spurious calls.
Learn more >>

Figure 3: GPU-accelerated genomics analysis tools in NVIDIA Clara Parabricks v3.6.

NVIDIA Clara Parabricks v3.6 also addresses steps of the genomic pipeline after variant calling. BamBasedVCFQC is an NVIDIA-developed tool that helps QC VCF outputs using samtools mpileup results computed from the original BAM. Vcfanno lets users annotate VCF outputs with third-party data sources such as dbSNP, for example adding allele frequencies to the VCF.

FrequencyFiltration allows variants within a VCF to be filtered on numeric fields containing allele frequency and read count information. Finally, the vote-based VCF merger (VBVM) merges two or more VCF files and then filters variants with a simple voting mechanism based on how many somatic callers identified each variant.
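
Conceptually, the voting step works like the Python sketch below. This is an illustration of the idea only, not the VBVM implementation, and all names in it are hypothetical.

    from collections import Counter

    def vote_filter(calls_per_caller, min_votes=2):
        # calls_per_caller maps a caller name to the set of variant keys it reported,
        # for example (chrom, pos, ref, alt). A variant is kept only if at least
        # min_votes callers reported it.
        votes = Counter()
        for variants in calls_per_caller.values():
            votes.update(variants)
        return {variant for variant, count in votes.items() if count >= min_votes}

    # Example: keep variants reported by at least 2 of 3 somatic callers.
    merged = vote_filter({
        "mutect2": {("chr1", 12345, "A", "T"), ("chr2", 999, "G", "C")},
        "strelka2": {("chr1", 12345, "A", "T")},
        "somaticsniper": {("chr2", 999, "G", "C"), ("chr3", 42, "T", "G")},
    })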

Jupyter on tf 2.5 not compatible?

I'm trying to create a conda env with tf-gpu 2.5. When I conda install jupyter, I get all these package conflicts. I didn't have this problem with tf 2.3. Does jupyter not support the later versions of tensorflow?

submitted by /u/Inevitable_Charge828

Put text/image on empty spaces in image

I want to place my own objects in the empty spaces. I can detect the objects and the empty spaces in the image, but I cannot figure out how to place new content into those empty spaces. Which method should I use?

https://preview.redd.it/maa7y4id8bj71.jpg?width=820&format=pjpg&auto=webp&s=4156f7772b3ab2af1581e87e25b37cf337eb776a

submitted by /u/koalabey

Global Availability of NVIDIA AI Enterprise Makes AI Accessible for Every Industry

NVIDIA today announced the availability of NVIDIA AI Enterprise, a comprehensive software suite of AI tools and frameworks that enables the hundreds of thousands of companies running VMware vSphere to virtualize AI workloads on NVIDIA-Certified Systems™.

TFLite quantization with representative_dataset

I am trying to quantize my model for TFLite with a representative dataset taken from a section of my training data (shape (7000, 51, 300, 1)), using as the generator, let's say, data_rep = np.array(data_prepped_train[0:100]).

The trouble is that I get the error below.

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

can anyone give me an insight on the matter?
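
For reference, the converter expects representative_dataset to be a callable generator that yields one batch at a time as a list of float32 arrays, rather than a single NumPy array; passing the array directly is a common cause of this error. A minimal sketch, assuming model is the trained Keras model and data_prepped_train is the float32 training array from the post:

    import numpy as np
    import tensorflow as tf

    def representative_data_gen():
        # Yield one sample at a time, batched to shape (1, 51, 300, 1).
        for sample in data_prepped_train[:100]:
            yield [np.expand_dims(sample, axis=0).astype(np.float32)]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data_gen
    tflite_model = converter.convert()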

submitted by /u/tagiyevv

How to use numpy functions in tensorflow custom loss?

I am trying to design a model to minimize the output value of a certain function that takes an input array, performs certain math operations on each element of the input array, and returns a final result. I have written this function using numpy and am trying to define a loss like this:

function = function_using_numpy  # takes an input array, returns a scalar float

def loss_function(truth, prediction):
    loss = k.abs(function(truth) - function(prediction))
    return loss

The problem is that TensorFlow cannot convert a tensor to a numpy array to compute the loss. Is there a way around this? I would be grateful for some pointers. Thanks in advance.
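
One possible workaround, sketched below under the assumption that the math can either be bridged with tf.py_function or re-expressed directly with TensorFlow ops; the function names here are hypothetical stand-ins:

    import numpy as np
    import tensorflow as tf

    def function_using_numpy(x):
        # Hypothetical stand-in for the numpy-based function from the post.
        return np.float32(np.sum(np.square(x)))

    def numpy_wrapped_loss(y_true, y_pred):
        # tf.py_function lets eager numpy code run inside the graph, but gradients
        # do not flow through it, so the model cannot learn from this loss directly.
        t = tf.py_function(lambda a: function_using_numpy(a.numpy()), [y_true], tf.float32)
        p = tf.py_function(lambda a: function_using_numpy(a.numpy()), [y_pred], tf.float32)
        return tf.abs(t - p)

    def tf_native_loss(y_true, y_pred):
        # Preferred: re-express the math with TensorFlow ops so gradients exist.
        f = lambda x: tf.reduce_sum(tf.square(x))
        return tf.abs(f(y_true) - f(y_pred))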

submitted by /u/LatePenguins