Category: Misc

Misc

Accelerating ReLu and GeLu Activation Functions, and Batched Sparse GEMM in cuSPARSELt v0.2.0

Post author By
Post date November 15, 2021
No Comments on Accelerating ReLu and GeLu Activation Functions, and Batched Sparse GEMM in cuSPARSELt v0.2.0

NVIDIA cuSPARSELt v0.2 now supports ReLu and GeLu activation functions, bias vector, and batched Sparse GEMM.

Today, NVIDIA is announcing the availability of cuSPARSELt, version 0.2.0, which increases performance on activation functions, bias vectors, and Batched Sparse GEMM. This software can be downloaded now free of charge.

Download the cuSPARSELt software.

What’s New?

Support for activation functions and bias vector:
- ReLU + upper bound and threshold setting for all kernels.
- GeLU for INT8 I/O, INT32 Tensor Core compute kernels.
Support for Batched Sparse GEMM:
- Single sparse matrix / Multiple dense matrices (Broadcast).
- Multiple sparse and dense matrices.
- Batched bias vector.
Compatibility notes:
- cuSPARSELt does not require the nvrtc library anymore.
- Support for Ubuntu 16.04 (gcc-5) is now deprecated and it will be removed in future releases.

For more technical information, see the cuSPARSELt Release Notes.

About cuSPARSELt

NVIDIA cuSPARSELt is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a sparse matrix:

$D=alpha op(A)*op(B)+beta op(C)$

In this equation, $op(A)$ and $op(B)$ refer to in-place operations such as transpose and nontranspose.

The cuSPARSELt APIs provide flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.

Key Features

NVIDIA Sparse MMA Tensor Core support.
Mixed-precision computation support:
- FP16 I/O, FP32 Tensor Core accumulate.
- BFLOAT16 I/O, FP32 Tensor Core accumulate.
- INT8 I/O, INT32 Tensor Core compute.
- FP32 I/O, TF32 Tensor Core compute.
- TF32 I/O, TF32 Tensor Core compute.

Matrix pruning and compression functionalities.
Auto-tuning functionality (see cusparseLtMatmulSearch()).

Learn more

For more about Math Libraries, see Recent Developments in NVIDIA Math Libraries (GTC 2021 #S31754).
To get the latest on HPC software, see A Deep Dive into the latest HPC software (GTC 2021 #S31286).
Catch up on Tensor Core-Accelerated Math Libraries for Dense and Sparse Linear Algebra in AI and HPC (GTC 2021 #CWES1098).
Read technical details in our cuSPARSELt Product Documentation.

Recent Developer posts

For advanced matrix multiply techniques, read Accelerating Matrix Multiplication with Block Sparse Format and NVIDIA Tensor Cores.
To leverage NVIDIA Ampere architecture performance, read Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt.
To benefit from A100 acceleration, read Getting Immediate Speedups with NVIDIA A100 TF32.
To gain AI training benefits, see Accelerating AI Training with NVIDIA TF32 Tensor Cores.

Misc

Develop the Next Generation of HPC Applications with the NVIDIA Arm HPC Developer Kit

Post author By
Post date November 15, 2021
No Comments on Develop the Next Generation of HPC Applications with the NVIDIA Arm HPC Developer Kit

Graphic of HPC SDK. NVIDIA and partners have been working hard to get the NVIDIA Arm HPC Developer Kit units into the hands of developers and enhance the software stack.

In July of 2021, NVIDIA announced the availability of the NVIDIA Arm HPC Developer Kit for preordering, along with the NVIDIA HPC SDK. Since then NVIDIA and its partners have been working hard to get units into the hands of developers, to increase global availability, and enhance the software stack.

Global Availability

The NVIDIA Arm HPC Developer Kit is based on the GIGABYTE G242-P32 2U server. It includes an Arm CPU, two A100 GPUs, two NVIDIA BlueField-2 data processing units (DPUs), and the NVIDIA HPC SDK suite of tools.

This delivers support for both single and multinode configurations. Units are available to order for global delivery through GIGABYTE.

Users Already Running HPC codes

The first units are already being used at sites, including Los Alamos National Laboratory (LANL), the University of Leicester, Oak Ridge National Laboratory (ORNL), and the National Center for High-performance Computing (NCHC) in Taiwan. They have successfully deployed multinode configurations and opened the systems to users to run HPC codes.

Los Alamos National Laboratory

“Los Alamos National Laboratory has a broad set of requirements related to our national security mission spaces. With this as a backdrop, we evaluate, deploy, and integrate many advanced technologies into our ecosystem. The consistent goal of these technologies is to improve our responses to mission requirements.

“As part of our 2023 HPE/NVIDIA system, which will utilize NVIDIA’s Grace Arm-based CPU, Los Alamos has been working with the Arm ecosystem software and hardware. With that in mind, we have already deployed early development test systems where we see good success migrating and developing new codes. One such code, which we are actively codesigning both HW and SW, is an astrophysics code-named Phoebus.” – Steve Poole, Chief Architect at LANL.

University of Leicester

“The University of Leicester, thanks to the contribution of the ExCALIBUR Hardware and Enabling Software Programme and STFC DiRAC HPC facility, has recently completed the deployment of 4x NVIDIA Arm HPC Developer Kit accessible by all UK developers interested in testing, porting, and optimizing strategic UK applications on NVIDIA Ampere Architecture Computing Altra CPU and NVIDIA A100 GPU.

“The UK remains at the forefront of computing thanks to initiatives like ExCALIBUR. The addition of this accelerated Arm-based system opens new opportunities to evaluate the role of accelerators in a fast-growing and diversified Arm HPC ecosystem. We welcome the close partnership of NVIDIA in pushing the ecosystem forward into the next era of accelerated computing.” – Mark Wilkinson, Professor of Theoretical Astrophysics and Director at DiRAC HPC Facility.

Oak Ridge National Laboratory

“Here at ORNL, we are looking forward to working with NVIDIA to explore the deployment of a wide array of applications on the NVIDIA Arm HPC developer kit as performance portability continues to gain prominence in HPC.” – Ross Miller, Systems Integration Programmer in the National Center for Computational Sciences at ORNL.

Software Stack Enhancements

NVIDIA continues to make rapid progress on enhancing the HPC SDK and supporting its full stack of ML tools on Arm. Separate from the HPC SDK, NVIDIA is announcing support for two of the most popular deep learning frameworks: PyTorch and TensorFlow.

In addition, the RAPIDS suite of software libraries and the NVIDIA Triton Inference Server will be available on Arm by the end of year.

The NVIDIA Arm HPC Developer Kit is the first step in enabling an Arm HPC ecosystem with GPU acceleration. NVIDIA is committed to full support for Arm for HPC and AI applications.

Learn more

NVIDIA Arm Resources >>
ISC Session, A Deep Dive into the Latest HPC Software >>
GTC On-demand session, Inside the NVIDIA HPC SDK >>

Misc

Using Fully Redesigned Batch API and Performance Optimizations in nvCOMP v2.1.0

Post author By
Post date November 15, 2021
No Comments on Using Fully Redesigned Batch API and Performance Optimizations in nvCOMP v2.1.0

New nvCOMP v2.1.0 Library with Redesigned Batch API and Performance Optimizations

Today, NVIDIA is announcing the availability of nvCOMP, version 2.1.0. This software can be downloaded now free of charge.

Download now.

What’s New?

Enhancements to the low-level interface by adding configuration options, a new error-reporting mechanism, and functions that calculate the size of the decompressed output.
Redesigned Batch APIs
- Low-level is targeting advanced users — metadata and chunking must be handled outside of nvCOMP. Perform batch compression/decompression of multiple streams, lightweight, and fully asynchronous.
- High-level is provided for ease of use — metadata and chunking is handled internally by nvCOMP. The easiest way to ramp-up and use nvCOMP in applications.
All compressors are available through the updated low-level API (including Cascaded and Bitcomp – new in 2.1).
Performance optimizations for Snappy, LZ4, and GDeflate.
New high-throughput and high-compression-ratio GPU compressors in GDeflate.

See the nvCOMP Release Notes for more information.

About nvCOMP

nvCOMP is a CUDA library that features generic compression interfaces to enable developers to use high-performance GPU compressors in their applications.

Supported nvCOMP Compression algorithms:

Cascaded: Novel high-throughput compressor ideal for analytical or structured/tabular data.
LZ4: General-purpose no-entropy byte-level compressor well suited for a wide range of datasets.
Snappy: Similar to LZ4, this byte-level compressor is a popular existing format used for tabular data.
GDeflate: Proprietary compressor with entropy encoding and LZ77, high compression ratios on arbitrary data.
Bitcomp: Proprietary compressor designed for floating point data in Scientific Computing applications.

Learn more

Gain insights from Optimizing Lossless Compression Algorithms on the GPU (GTC 2021: S32401).
Increase your knowledge from Lossless Compression on the GPU (GTC 2021: CWES1174).
Get more technical details in our nvCOMP Project Documentation.

Recent Developer posts

Read about Optimizing Data Transfer Using Lossless Compression with NVIDIA nvCOMP.

Misc

Accelerated Molecular Simulation Using Deep Potential Workflow with NGC

Post author By
Post date November 15, 2021
No Comments on Accelerated Molecular Simulation Using Deep Potential Workflow with NGC

AI neural network deep potential to combine classical MD simulation with DFT calculation accuracy.

Molecular simulation communities have faced the accuracy-versus-efficiency dilemma in modeling the potential energy surface and interatomic forces for decades. Deep Potential, the artificial neural network force field, solves this problem by combining the speed of classical molecular dynamics (MD) simulation with the accuracy of density functional theory (DFT) calculation.¹ This is achieved by using the GPU-optimized package DeePMD-kit, which is a deep learning package for many-body potential energy representation and MD simulation.²

This post provides an end-to-end demonstration of training a neural network potential for the 2D material graphene and using it to drive MD simulation in the open-source platform Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS).³ Training data can be obtained either from the Vienna Ab initio Simulation Package (VASP)⁴, or Quantum ESPRESSO (QE).⁵

A seamless integration of molecular modeling, machine learning, and high-performance computing (HPC) is demonstrated with the combined efficiency of molecular dynamics with ab initio accuracy — that is entirely driven through a container-based workflow. Using AI techniques to fit the interatomic forces generated by DFT, the accessible time and size scales can be boosted several orders of magnitude with linear scaling.

Deep potential is essentially a combination of machine learning and physical principles, which start a new computing paradigm as shown in Figure 1.

The image shows the new computing paradigm that combines molecular modeling, machine learning and high-performance computing to understand the interatomic forces of molecules compared to the traditional methods. — *Figure 1. A new computing paradigm composed of molecular modeling, AI, and HPC. (Figure courtesy: Dr. Linfeng Zhang, DP Technology)*

The entire workflow is shown in Figure 2. The data generation step is done with VASP and QE. The data preparation, model training, testing, and compression steps are done using DeePMD-kit. The model deployment is in LAMMPS.

This figure displays the workflow of training and deploying a deep potential model. The workflow includes data generation, data preparation, model training, model testing, model compression, and model deployment. — *Figure 2. Diagram of the DeePMD workflow.*

Why Containers?

A container is a portable unit of software that combines the application, and all its dependencies, into a single package that is agnostic to the underlying host OS.

The workflow in this post involves AIMD, DP training, and LAMMPS MD simulation. It is nontrivial and time-consuming to install each software package from source with the correct setup of compiler, MPI, GPU library, and optimization flags.

Containers solve this problem by providing a highly optimized GPU-enabled computing environment for each step, and eliminates the time to install and test software.

The NGC catalog, a hub of GPU-optimized HPC and AI software, carries a whole of HPC and AI containers that can be readily deployed on any GPU system. The HPC and AI containers from the NGC catalog are updated frequently and are tested for reliability and performance — necessary to speed up the time to solution.

These containers are also scanned for Common Vulnerabilities and Exposure (CVEs), ensuring that they are devoid of any open ports and malware. Additionally, the HPC containers support both Docker and Singularity runtimes, and can be deployed on multi-GPU and multinode systems running in the cloud or on-premises.

Training data generation

The first step in the simulation is data generation. We will show you how you can use VASP and Quantum ESPRESSO to run AIMD simulations and generate training datasets for DeePMD. All input files can be downloaded from the GitHub repository using the following command:

git clone https://github.com/deepmodeling/SC21_DP_Tutorial.git

VASP

A two-dimensional graphene system with 98-atoms is used as shown in Figure 3.⁶ To generate the training datasets, 0.5ps NVT AIMD simulation at 300 K is performed. The time step chosen is 0.5fs. The DP model is created using 1000 time steps from a 0.5ps MD trajectory at a fixed temperature.

Due to the short simulation time, the training dataset contains consecutive system snapshots, which are highly correlated. Generally, the training dataset should be sampled from uncorrelated snapshots with various system conditions and configurations. For this example, we used a simplified training data scheme. For production DP training, using DP-GEN is recommended to utilize the concurrent learning scheme to efficiently explore more combinations of conditions.⁷

The projector-augmented wave pseudopotentials are employed to describe the interactions between the valence electrons and frozen cores. The generalized gradient approximation exchange−correlation functional of Perdew−Burke−Ernzerhof. Only the Γ-point was used for k-space sampling in all systems.

This figure displays the top view of a single layer graphene system with 98 carbon atoms. — *Figure 3. A graphene system composed of 98 carbon atoms is used in AIMD simulation.*

Quantum Espresso

The AIMD simulation can also be carried out using Quantum ESPRESSO, available as a container from the NGC Catalog. Quantum ESPRESSO is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale based on density-functional theory, plane waves, and pseudopotentials. The same graphene structure is used in the QE calculations. The following command can be used to start the AIMD simulation:

$ singularity exec --nv docker://nvcr.io/hpc/quantum_espresso:qe-6.8 cp.x

Training data preparation

Once the training data is obtained from AIMD simulation, we want to convert its format using dpdata so that it can be used as input to the deep neural network. The dpdata package is a format conversion toolkit between AIMD, classical MD, and DeePMD-kit.

You can use the convenient tool dpdata to convert data directly from the output of first-principles packages to the DeePMD-kit format. For deep potential training, the following information of a physical system has to be provided: atom type, box boundary, coordinate, force, viral, and system energy.

A snapshot, or a frame of the system, contains all these data points for all atoms at one-time step, which can be stored in two formats, that is raw and npy.

The first format raw is plain text with all information in one file, and each line of the file represents a snapshot. Different system information is stored in different files named as box.raw, coord.raw, force.raw, energy.raw, and virial.raw. We recommended you follow these naming conventions when preparing the training files.

An example of force.raw:

$ cat force.raw
-0.724  2.039 -0.951  0.841 -0.464  0.363
 6.737  1.554 -5.587 -2.803  0.062  2.222
-1.968 -0.163  1.020 -0.225 -0.789  0.343

This force.raw contains three frames, with each frame having the forces of two atoms, resulting in three lines and six columns. Each line provides all three force components of two atoms in one frame. The first three numbers are the three force components of the first atom, while the next three numbers are the force components of the second atom.

The coordinate file coord.raw is organized similarly. In box.raw, the nine components of the box vectors should be provided on each line. In virial.raw, the nine components of the virial tensor should be provided on each line in the order XX XY XZ YX YY YZ ZX ZY ZZ. The number of lines of all raw files should be identical. We assume that the atom types do not change in all frames. It is provided by type.raw, which has one line with the types of atoms written one by one.

The atom types should be integers. For example, the type.raw of a system that has two atoms with zero and one:

$ cat type.raw
0 1

It is not a requirement to convert the data format to raw, but this process should give a sense on the types of data that can be used as inputs to DeePMD-kit for training.

The easiest way to convert the first-principles results to the training data is to save them as numpy binary data.

For VASP output, we have prepared an outcartodata.py script to process the VASP OUTCAR file. By running the commands:


$ cd SC21_DP_Tutorial/AIMD/VASP/
$ singularity exec --nv docker://nvcr.io/hpc/deepmd-kit:v2.0.3 python outcartodata.py
$ mv deepmd_data ../../DP/

For QE output:

$ cd SC21_DP_Tutorial/AIMD/QE/
$ singularity exec --nv docker://nvcr.io/hpc/deepmd-kit:v2.0.3 python logtodata.py
$ mv deepmd_data ../../DP/

A folder called deepmd_data is generated and moved to the training directory. It generates five sets 0/set.000, 1/set.000, 2/set.000, 3/set.000, 4/set.000, with each set containing 200 frames. It is not required to take care of the binary data files in each of the set.* directories. The path containing the set.* folder and type.raw file is called a system. If you want to train a nonperiodic system, an empty nopbc file should be placed under the system directory. box.raw is not necessary as it is a nonperiodic system.

We are going to use three of the five sets for training, one for validating, and the remaining one for testing.

Deep Potential model training

The input of the deep potential model is a descriptor vector containing the system information mentioned previously. The neural network contains several hidden layers with a composition of linear and nonlinear transformations. In this post, a three layer-neural network with 25, 50 and 100 neurons in each layer is used. The target value, or the label, for the neural network to learn is the atomic energies. The training process optimizes the weights and the bias vectors by minimizing the loss function.

The training is initiated by the command where input.json contains the training parameters:

$ singularity exec --nv docker://nvcr.io/hpc/deepmd-kit:v2.0.3 dp train input.json

The DeePMD-kit prints detailed information on the training and validation data sets. The data sets are determined by training_data and validation_data as defined in the training section of the input script. The training dataset is composed of three data systems, while the validation data set is composed of one data system. The number of atoms, batch size, number of batches in the system, and the probability of using the system are all shown in Figure 4. The last column presents if the periodic boundary condition is assumed for the system.

This image is a screenshot of the DP training output. Summaries of the training and validation dataset are shown with detailed information on the number of atoms, batch size, number of batches in the system and the probability of using the system. — *Figure 4. Screenshot of the DP training output.*

During the training, the error of the model is tested every disp_freq training step with the batch used to train the model and with numb_btch batches from the validating data. The training error and validation error are printed correspondingly in the file disp_file (default is lcurve.out). The batch size can be set in the input script by the key batch_size in the corresponding sections for training and validation data set.

An example of the output:

#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
      0      3.33e+01    3.41e+01      1.03e+01    1.03e+01      8.39e-01    8.72e-01    1.0e-03
    100      2.57e+01    2.56e+01      1.87e+00    1.88e+00      8.03e-01    8.02e-01    1.0e-03
    200      2.45e+01    2.56e+01      2.26e-01    2.21e-01      7.73e-01    8.10e-01    1.0e-03
    300      1.62e+01    1.66e+01      5.01e-02    4.46e-02      5.11e-01    5.26e-01    1.0e-03
    400      1.36e+01    1.32e+01      1.07e-02    2.07e-03      4.29e-01    4.19e-01    1.0e-03
    500      1.07e+01    1.05e+01      2.45e-03    4.11e-03      3.38e-01    3.31e-01    1.0e-03

The training error reduces monotonically with training steps as shown in Figure 5. The trained model is tested on the test dataset and compared with the AIMD simulation results. The test command is:

$ singularity exec --nv docker://nvcr.io/hpc/deepmd-kit:v2.0.3 dp test -m frozen_model.pb -s deepmd_data/4/ -n 200 -d detail.out

This image shows the total training loss, energy loss, force loss and learning rate decay with training steps from 0 to 1,000,000. Both the training and validation loss decrease monotonically with training steps. — *Figure 5. Training loss with steps*

The results are shown in Figure 6.

This image displays the inferenced energy and force in the y-axis, and the ground true on the x-axis. The inferenced values soundly coincide with the ground truth with all data distributed in the diagonal direction. — *Figure 6. Test of the prediction accuracy of trained DP model with AIMD energies and forces.*

Model export and compression

After the model has been trained, a frozen model is generated for inference in MD simulation. The process of saving neural network from a checkpoint is called “freezing” a model:

$ singularity exec --nv docker://nvcr.io/hpc/deepmd-kit:v2.0.3 dp freeze -o graphene.pb

After the frozen model is generated, the model can be compressed without sacrificing its accuracy; while greatly speeding up the inference performance in MD. Depending on simulation and training setup, model compression can boost performance by 10X, and reduce memory consumption by 20X when running on GPUs.

The frozen model can be compressed using the following command where -i refers to the frozen model and -o points to the output name of the compressed model:

$ singularity exec --nv docker://nvcr.io/hpc/deepmd-kit:v2.0.3 dp compress -i graphene.pb -o graphene-compress.pb

Model deployment in LAMMPS

A new pair-style has been implemented in LAMMPS to deploy the trained neural network in prior steps. For users familiar with the LAMMPS workflow, only minimal changes are needed to switch to deep potential. For instance, a traditional LAMMPS input with Tersoff potential has the following setting for potential setup:

pair_style		tersoff
pair_coeff		* * BNC.tersoff C

To use deep potential, replace previous lines with:

pair_style		deepmd graphene-compress.pb
pair_coeff		* *

The pair_style command in the input file uses the DeePMD model to describe the atomic interactions in the graphene system.

The graphene-compress.pb file represents the frozen and compressed model for inference.
The graphene system in MD simulation contains 1,560 atoms.
Periodic boundary conditions are applied in the lateral x– and y-directions, and free boundary is applied to the z-direction.
The time step is set as 1 fs.
The system is placed under NVT ensemble at temperature 300 K for relaxation, which is consistent with the AIMD setup.

The system configuration after NVT relaxation is shown in Figure 7. It can be observed that the deep potential can describe the atomic structures with small ripples in the cross-plane direction. After 10ps NVT relaxation, the system is placed under NVE ensemble to check system stability.

The image displays the side view of the single layer graphene system after thermal relaxation in LAMMPS. — *Figure 7. Atomic configuration of the graphene system after relaxation with deep potential.*

The system temperature is shown in Figure 8.

The image displays the temperature profiles of the graphene system under NVT and NVE ensembles from 0 to 20 picoseconds. The first 10 picosecond is NVT and the second 10 picosecond is NVE. — *Figure 8. System temperature under NVT and NVE ensembles. The MD system driven by deep potential is very stable after relaxation.*

To validate the accuracy of the trained DP model, the calculated radial distribution function (RDF) from AIMD, DP and Tersoff, are plotted in Figure 9. The DP model-generated RDF is very close to that of AIMD, which indicates that the crystalline structure of graphene can be well presented by the DP model.

This image displays the plotted radial distribution function from three different methods, including DP, Tersoff and AIMD, which are denoted in black, red and blue solid lines respectively. — *Figure 9. Radial distribution function calculated by AIMD, DP and Tersoff potential, respectively. It can be observed that the RDF calculated by DP is very close to that of AIMD.*

Conclusion

This post demonstrates a simple case study of graphene under given conditions. The DeePMD-kit package streamlines the workflow from AIMD to classical MD with deep potential, providing the following key advantages:

Highly automatic and efficient workflow implemented in the TensorFlow framework.
APIs with popular DFT and MD packages such as VASP, QE, and LAMMPS.
Broad applications in organic molecules, metals, semiconductors, insulators, and more.
Highly efficient code for HPC with MPI and GPU support.
Modularization for easy adoption by other deep learning potential models.

Furthermore, the use of GPU-optimized containers from the NGC catalog simplifies and accelerates the overall workflow by eliminating the steps to install and configure software. To train a comprehensive model for other applications, download the DeepMD Kit Container from the NGC catalog.

Acknowledgements

We thank the helpful discussions with Dr. Chunyi Zhang from Temple University, Dan Han, Dr. Xinyu Wang from Shandong University, and Dr. Linfeng Zhang, Yuzhi Zhang, Jinzhe Zeng, Duo Zhang, and Fengbo Yuan from the DeepModeling community.

References

[1] Jia W, Wang H, Chen M, Lu D, Lin L, Car R, E W and Zhang L 2020 Pushing the limit of molecular dynamics with ab initio accuracy to 100 million atoms with machine learning IEEE Press 5 1-14

[2] Wang H, Zhang L, Han J and E W 2018 DeePMD-kit: A deep learning package for many-body potential energy representation and molecular dynamics Computer Physics Communications 228 178-84

[3] Plimpton S 1995 Fast Parallel Algorithms for Short-Range Molecular Dynamics Journal of Computational Physics 117 1-19

[4] Kresse G and Hafner J 1993 Ab initio molecular dynamics for liquid metals Physical Review B 47 558-61

[5] Giannozzi P, Baroni S, Bonini N, Calandra M, Car R, Cavazzoni C, Ceresoli D, Chiarotti G L, Cococcioni M, Dabo I, Dal Corso A, de Gironcoli S, Fabris S, Fratesi G, Gebauer R, Gerstmann U, Gougoussis C, Kokalj A, Lazzeri M, Martin-Samos L, Marzari N, Mauri F, Mazzarello R, Paolini S, Pasquarello A, Paulatto L, Sbraccia C, Scandolo S, Sclauzero G, Seitsonen A P, Smogunov A, Umari P and Wentzcovitch R M 2009 QUANTUM ESPRESSO: a modular and open-source software project for quantum simulations of materials Journal of Physics: Condensed Matter 21 395502

[6] Humphrey W, Dalke A and Schulten K 1996 VMD: Visual molecular dynamics Journal of Molecular Graphics 14 33-8

[7] Yuzhi Zhang, Haidi Wang, Weijie Chen, Jinzhe Zeng, Linfeng Zhang, Han Wang, and Weinan E, DP-GEN: A concurrent learning platform for the generation of reliable deep learning based potential energy models, Computer Physics Communications, 2020, 107206.

Misc

World’s Fastest Supercomputers Changing Fast

Post author By
Post date November 15, 2021
No Comments on World’s Fastest Supercomputers Changing Fast

Modern computing workloads — including scientific simulations, visualization, data analytics, and machine learning — are pushing supercomputing centers, cloud providers and enterprises to rethink their computing architecture. The processor or the network or the software optimizations alone can’t address the latest needs of researchers, engineers and data scientists. Instead, the data center is the new Read article >

The post World’s Fastest Supercomputers Changing Fast appeared first on The Official NVIDIA Blog.

Misc

Supporting Low-Latency Streaming Video for AI-Powered Medical Devices with Clara Holoscan

Post author By
Post date November 15, 2021
No Comments on Supporting Low-Latency Streaming Video for AI-Powered Medical Devices with Clara Holoscan

Closeup of a surgeon at work. NVIDIA Clara Holoscan and Clara AGX Developer Kit accelerates development of AI for endoscopy, laparoscopy, and other surgical procedures.

NVIDIA Clara Holoscan provides a scalable medical device computing platform for developers to create AI microservices and deliver insights in real time. The platform optimizes every stage of the data pipeline: from high-bandwidth data streaming and physics-based analysis to accelerated AI inference, and graphic visualizations.

The NVIDIA Clara AGX Developer Kit, which is now available, combines the efficient Arm-based embedded computing of the AGX Xavier SoC with the powerful NVIDIA RTX 6000 GPU and the 100 GbE connectivity of the NVIDIA ConnectX-6 network processor. This brings real-time AI acceleration to the next generation of intelligent, software-defined, embedded medical devices. Developers using the Clara AGX Developer Kit for surgical video applications—such as AI-enhanced endoscopy, laparoscopy, or other minimally invasive procedures—require the minimum possible end-to-end latency in their video processing path. Customers can use the Clara Holoscan SDK v0.1 on the Clara AGX Developer Kit today and on the next-generation developer kit in the second half of 2022.

The demands of surgical video necessitate consistent and reliable low-latency, between the image captured by the endoscope and the image projected on a monitor. This provides surgeons with real-time control of their tools and monitoring of the patient.

In a typical endoscopy system, the image is digitized at the camera sensor in the endoscope, serialized by an FPGA or ASIC and transmitted to a video processor where it is written to an input frame buffer, processed, written to an output frame buffer, and then transmitted serially to the monitor. Each of these steps adds latency to the video pipeline. Developers who wish to add advanced GPU-accelerated AI processing are then faced with additional transmission latency due to the need to write the data from the video capture card to system memory, then transfer it via the CPU and PCIe bus to the GPU.

GPU compute performance is a key component of the NVIDIA Clara Holoscan platform. To optimize GPU-based video processing applications, NVIDIA has partnered with AJA Video Systems to integrate their line of video capture cards with the Clara AGX Developer Kit. AJA provides a wide range of proven, professional video I/O devices. The partnership between NVIDIA and AJA has led to the addition of Clara AGX Developer Kit support in the AJA NTV2 SDK and device drivers as of the NTV2 SDK 16.1 release.

The AJA drivers and SDK now offer GPUDirect support for NVIDIA GPUs. This feature uses remote direct memory access (RDMA) to transfer video data directly from the capture card to GPU memory. This significantly reduces latency and system PCIe bandwidth for GPU video processing applications, as system memory to GPU copies are eliminated from the processing pipeline.

AJA devices now also incorporate RDMA support into the AJA GStreamer plug-in to enable zero-copy GPU buffer integration with the DeepStream SDK. DeepStream applications can now process video data along the entire pipeline, from the initial capture to final display, without leaving GPU memory.

NVIDIA Clara Holoscan SDK v0.1 builds on the features of the previous Clara AGX SDK and adds tools to allow for detailed measurement of video transfer latency between video I/O cards, the CPU, and the GPU. This will enable users to measure latency with various configurations, allowing them to focus on improving bottlenecks and optimizing their workflows for minimum end-to-end latency.

Data transfer latency was measured using the Clara AGX Developer Kit with an AJA capture card using the internal PCIe Gen3 x8 connection. The following tables demonstrate the latency reduction that can be achieved using GPUDirect.

Format	Width	Height	Bytes/pixel	Frames/sec
720p YUV	1280	720	2	60
1080p YUV	1920	1080	2	60
4K UHD YUV	3840	2160	2	60
720p RGBA	1280	720	4	60
1080p RGBA	1920	1080	4	60
4K UHD RGBA	3840	2160	4	60

Table 1. Results of video formats tested

The total time for video data transfer to and from the GPU, as well as time remaining for processing in the GPU, was then measured with and without GPUDirect enabled:

Format	Without GPUDirect		GPUDirect
	Transfer time, no processing (ms)	Time remaining for processing (ms)	Transfer time, no processing (ms)	Time remaining for processing (ms)
720p YUV	1.945	14.721	0.956	15.710
1080p YUV	3.865	12.801	1.723	14.943
4K UHD YUV	12.805	3.861	6.256	10.410
720p RGBA	3.451	13.215	1.548	15.118
1080p YUV	6.816	9.850	3.225	13.444
4K UHD RGBA	23.686	-7.020	12.406	4.260

Table 2. Latency (ms) with and without GPUDirect.

Note that GPUDirect cuts transfer time approximately in half by removing the need for writes to system memory. GPUDirect allows for the transfer and processing of 4K UHD RGBA inputs at 60 fps. This can now be transferred under the 16.666 ms frame time, whereas without GPUDirect this format could not be transferred at 60 fps. This allows for uncompressed high-resolution video to be natively alpha-blended with overlays from AI workflows. There is no need for conversion from YUV to RGBA formats, and no compromise in the 60 fps frame rate.

For instructions on how to set up and use an AJA device with the Clara AGX Developer Kit, including RDMA and DeepStream integration, go to Chapter 9 of the Clara HoloClara Holoscan SDK User Guide.

Misc

Close Knowledge Gaps and Elevate Training with Digital Twin NVIDIA Air

Post author By
Post date November 15, 2021
No Comments on Close Knowledge Gaps and Elevate Training with Digital Twin NVIDIA Air

4 people sitting around laptops in an office. Learn about the NVIDIA Air platform, a fully functional digital twin of a production environment.

Training resources are always a challenge for IT departments. There is a fine line between letting new team members do more without supervision and keeping the lights on by making sure no mistakes are made in the production environment. Leaning towards the latter method and limiting new team members’ access to production deployments may lead to knowledge gaps. How can new team members learn if they never get time on the network?

To close the knowledge gaps, IT teams can leverage a networking digital twin. A digital twin provides a fully functional replica of a production deployment, so each member of the team can learn in a safe and sandboxed environment. A digital twin eliminates the risk of making mistakes that could materially impact the business. Changes can be implemented and validated in a sandbox environment before pushing any changes to production, providing a new level of confidence.

Train with NVIDIA Air

NVIDIA has such an offering with the Air Infrastructure Simulation Platform (Air). The program supports organizations training their staff by leveraging a digital twin approach and providing a full production experience for all team members. With everyone able to contribute (either by training in Air or working directly in production), IT departments can use staff more effectively and boost operational efficiency.

With the use of NVIDIA Air, IT teams give team members their own replica of the production environment on which to learn. No more waiting for hardware resources to be racked and stacked, or balancing limited lab time across multiple users. Staff can use the platform for free, build an exact network digital twin, validate configurations, confirm security policies, and test CI/CD pipelines. In addition to CLI access, the platform provides full software functionality and access to the core system components such as Docker containers and APIs.

This allows less-skilled team members to be an integral part of the network operation, helping them to catch up with the team’s experience, enhance the sense of belonging within the team, and gain confidence to work on production when the time is right.

Get Started

Air is free to use and easy to work with. Build your own digital twin today, and help your team to learn the production environment, practice procedures, and test changes without introducing risk.

Misc

Gordon Bell Finalists Fight COVID, Advance Science With NVIDIA Technologies

Post author By
Post date November 15, 2021
No Comments on Gordon Bell Finalists Fight COVID, Advance Science With NVIDIA Technologies

Two simulations of a billion atoms, two fresh insights into how the SARS-CoV-2 virus works, and a new AI model to speed drug discovery. Those are results from finalists for Gordon Bell awards, considered a Nobel prize in high performance computing. They used AI, accelerated computing or both to advance science with NVIDIA’s technologies. A Read article >

The post Gordon Bell Finalists Fight COVID, Advance Science With NVIDIA Technologies appeared first on The Official NVIDIA Blog.

Misc

Universities Expand Research Horizons with NVIDIA Systems, Networks

Post author By
Post date November 15, 2021
No Comments on Universities Expand Research Horizons with NVIDIA Systems, Networks

Just as the Dallas/Fort Worth airport became a hub for travelers crisscrossing America, the north Texas region will be a gateway to AI if folks at Southern Methodist University have their way. SMU is installing an NVIDIA DGX SuperPOD, an accelerated supercomputer it expects will power projects in machine learning for its sprawling metro community Read article >

The post Universities Expand Research Horizons with NVIDIA Systems, Networks appeared first on The Official NVIDIA Blog.

Misc

Siemens Energy Taps NVIDIA to Develop Industrial Digital Twin of Power Plant in Omniverse

Post author By
Post date November 15, 2021
No Comments on Siemens Energy Taps NVIDIA to Develop Industrial Digital Twin of Power Plant in Omniverse

Siemens Energy, a leading supplier of power plant technology in the trillion-dollar worldwide energy market, is relying on the NVIDIA Omniverse platform to create digital twins to support predictive maintenance of power plants. In doing so, Siemens Energy joins a wave of companies across various industries that are using digital twins to enhance their operations. Read article >

The post Siemens Energy Taps NVIDIA to Develop Industrial Digital Twin of Power Plant in Omniverse appeared first on The Official NVIDIA Blog.