Categories
Misc

Deploying AI-Accelerated Medical Devices with NVIDIA Clara Holoscan

The ability to deploy real-time AI in clinics and research facilities is critical to enable the next frontiers in surgery, diagnostics, and drug discovery. From…

The ability to deploy real-time AI in clinics and research facilities is critical to enable the next frontiers in surgery, diagnostics, and drug discovery. From robotic surgery to studying new approaches in biology, doctors and scientists need medical devices to evolve into continuous sensing systems to research and treat disease. 

To realize the next generation of intelligent medical devices, a unique combination of AI, accelerated computing, and advanced visualization is needed. NVIDIA Clara Holoscan includes the Clara AGX Developer Kit and the Clara Holoscan SDK, which combine to provide a powerful development environment for creating AI-enabled medical devices. To deploy these devices at the clinical edge, production hardware based on NVIDIA IGX Orin and a software platform designed for medical-grade certification are highly desirable.

NVIDIA Clara Holoscan accelerates deployment of production-quality medical applications by providing a set of OpenEmbedded build recipes and reference configurations that can be leveraged to customize and build Clara Holoscan-compatible Linux4Tegra (L4T) embedded board support packages (BSP). With the release of Clara Holoscan SDK v0.3, developers can deploy medical AI even faster using customized OpenEmbedded distributions.

Creating customized Linux distributions with OpenEmbedded

OpenEmbedded is a build framework that allows developers to create fully customized Linux distributions for embedded systems. Developers can fully customize distributions using just the software components and configuration specific to the application. In contrast, commercial Linux distributions provide full operating systems from predefined software collections that often include graphical user interfaces, package management software, GNU tools and libraries, and system configuration tools. 

Customizability is particularly important for embedded deployments such that the memory, speed, safety, and security of the embedded device can be optimized while simultaneously simplifying the deployment process using a single preconfigured BSP. In the regulated medical device industry, this customizability is also important from a process overhead point of view, since it allows limiting analysis, testing, and documentation of Software of Unknown Provenance (SOUP), only to the minimal set of software components required for the essential performance of the medical device.

Comparison to HoloPack

HoloPack is the implementation of NVIDIA JetPack SDK specific to Clara Holoscan. It provides a full development environment for Clara Holoscan developer kits and includes Jetson Linux with bootloader, Linux Kernel, Ubuntu desktop environment, and a complete set of libraries for acceleration of GPU computing, multimedia, graphics, and computer vision. This is the Clara Holoscan development stack.

Using customized OpenEmbedded distributions allows you, as the developer, to include just the software components that are actually needed for your application’s deployment. The final runtime BSP can be easily optimized with respect to memory usage, speed, security, and power requirements. This is the Clara Holoscan deployment stack.

To illustrate this, the following tables compare various measurements of a HoloPack installation versus an OpenEmbedded-based Clara Holoscan build, both including the Clara Holoscan Embedded SDK available on GitHub.

Resource usage after initial boot (when idle):

Development Stack Deployment Stack Difference
Processes 408 198 210 (51.4% fewer processes)
Disk Used 22GB 7GB 15GB (68.1% less disk usage)
Memory Used 1,621MB 744MB 877MB (54.1% less memory usage)

RTX6000 measurements when running the tracking_replayer Clara Holoscan SDK application:

Development Stack Deployment Stack Difference
Power 71W 67W 4W (5.6% less power)
Temperature 50C 48C 2C (4% cooler)
GPU Usage 15% 11% 4% (26.7% less GPU usage)

Job runtime statistics (in milliseconds) as reported by the tracking_replayer Clara Holoscan SDK application:

Development Stack Deployment Stack Difference
Visualizer 4.51 3.18 29.4%
Visualizer Format Converter 1.13 0.85 24.7%
Inference 10.69 5.73 46.3%
Inference Format Converter 1.00 0.93 7%
Replayer 31.11 30.09 3.2%
Total 48.44 40.78 15.8%

The customized OpenEmbedded/Yocto distribution includes only the minimal set of packages that are actually needed to run the Clara Holoscan SDK application. It therefore saves disk space, memory, and CPU/GPU cycles, which results in higher overall performance when running the Clara Holoscan sample applications.

Although the flexibility of a full desktop experience with HoloPack is desirable during the early stages of development (easy installation of new apt packages, for example), this study shows some of the clear benefits of the customized OpenEmbedded/Yocto deployment stack for the later stages of productizing a medical device.

Get started with NVIDIA Clara Holoscan

The Clara Holoscan OpenEmbedded/Yocto recipes are open source and kept up to date alongside releases of the NVIDIA Clara Holoscan SDK.

The Clara Holoscan OpenEmbedded/Yocto recipes, and the BSP build in general, depend on other open-source OpenEmbedded components, including (but not limited to) OpenEmbedded-Core, BitBake, and the community-maintained meta-tegra layer.

If you are already familiar with OpenEmbedded or Yocto, check out the meta-tegra-clara-holoscan-mgx repo on GitHub. The README within that repo provides a guide and full list of requirements needed to build and flash a Clara Holoscan BSP.

NVIDIA also provides the Clara Holoscan OpenEmbedded Builder on the NVIDIA GPU Cloud (NGC) website to simplify the process of getting started with these recipes. It includes all the tools and dependencies that are needed either within the container or as part of a setup script that initializes a local build tree such that building and flashing a Clara Holoscan BSP can be done in just a few simple commands.

To build a Clara Holoscan BSP for IGX Orin Developer Kit using the default configuration, which includes the Clara Holoscan SDK and sample applications, first ensure that your Docker runtime is logged into NGC. Then run the following commands in a new directory:

$ export IMAGE=nvcr.io/nvidia/clara-holoscan/holoscan-mgx-oe-builder:v0.3.0
$ docker run --rm -v $(pwd):/workspace ${IMAGE} setup.sh ${IMAGE} $(id -u) $(id -g)
$ ./bitbake.sh core-image-x11

Note that this build will require at least 200 GB of free disk space, and a first full build will take three or more hours. Once the build is complete, the IGX Orin Developer Kit can be put into recovery mode and flashed with the following command:

$ ./flash.sh core-image-x11

One major feature of the Clara Holoscan deployment stack is the support of both iGPU and dGPU configurations for the developer kits. When using the iGPU configuration, the majority of the runtime components come from the standard Tegra packages used by the meta-tegra layer, which allows developers to use the onboard HDMI or DisplayPort connection on the developer kit. You can find more details by visiting meta-tegra-clara-holoscan-mgx on GitHub.

Develop custom medical AI utilizing ultra high speed frame rates

With customized OpenEmbedded distributions on Clara Holoscan SDK v0.3, it is easier than ever to deploy production-quality AI for unique medical applications at the clinical edge. The SDK provides a lightning-fast frame rate of 240 Hz for 4K video, enabling developers to combine data from more sensors for building accelerated AI pipelines.

To learn how to get started with NVIDIA Clara Holoscan, follow the instructions on the Clara Holoscan SDK page.

Categories
Misc

Scaling VASP with NVIDIA Magnum IO

You could make an argument that the history of civilization and technological advancement is the history of the search and discovery of materials. Ages are…

You could make an argument that the history of civilization and technological advancement is the history of the search and discovery of materials. Ages are named not for leaders or civilizations but for the materials that defined them: Stone Age, Bronze Age, and so on. The current digital or information age could be renamed the Silicon or Semiconductor Age and retain the same meaning.

Though silicon and other semiconductor materials may be the most significant materials driving change today, there are several other materials in research that could equally drive the next generation of changes, including any of the following:

  • High-temperature superconductors
  • Photovoltaics
  • Graphene batteries
  • Supercapacitors

Semiconductors are at the heart of building chips that enable the extensive and computationally complex search for such novel materials.

In 2011, the United States’ Materials Genome Initiative pushed for the identification of new materials using simulation. However, at that time and to an extent even today, calculating material properties from first principles can be painfully slow even on modern supercomputers.

The Vienna Ab initio Simulation Package (VASP) is one of the most popular software tools for such predictions, and it has been written to leverage acceleration technologies and to minimize the time to insight.

New material review: Hafnia

This post examines the computation of the properties of a material called hafnia or hafnium oxide (HfO2).

On its own, hafnia is an electric insulator. It is heavily used in semiconductor manufacturing, as it can serve as a high-κ dielectric film when building dynamic random-access memory (DRAM) storage. It can also act as a gate insulator in metal–oxide–semiconductor field-effect transistors (MOSFETs). Hafnia is of high interest for nonvolatile resistive RAM, which could make booting computers a thing of the past.

While an ideal, pure HfO2 crystal can be calculated affordably using only 12 atoms, it is nothing but a theoretical model. Such crystals have impurities in practice.

At times, a dopant must be added to yield the desired material properties beyond insulation. This doping is often done at the percent level, which means that out of 100 eligible atoms, one is replaced by a different element. The minimal cell, however, contains only 12 atoms, of which only four are Hf. It soon becomes apparent that such calculations easily call for hundreds of atoms.

This post demonstrates how such calculations can be parallelized efficiently over hundreds and even thousands of GPUs. Hafnia serves as an example, but the principles demonstrated here can of course be applied to similarly sized calculations just as well.

Term definitions

  • Speedup: A nondimensional measure of performance relative to a reference. For this post, the reference is single-node performance using 8x A100 80 GB SXM4 GPUs without NCCL enabled. Speedup is calculated by dividing the reference runtime by the elapsed runtime.
  • Linear scaling: The speedup curve for an application that is perfectly parallel. In Amdahl’s law terms, it is for an application that is 100% parallelized and whose interconnect is infinitely fast. In such a situation, 2x the compute resources results in half the run time and 10x the compute resources results in one-tenth the run time. When plotting speedup against the number of compute resources, the performance curve is a line sloping up and to the right at 45 degrees. Occasionally, a parallelized run outperforms this proportional relation; that is, the slope is steeper than 45 degrees. This is called super-linear scaling.
  • Parallel efficiency: A nondimensional measure in percent of how close a particular application execution is to the ideal linear scaling. Parallel efficiency is calculated by dividing the achieved speedup by the linear scaling speedup for that number of compute resources. To avoid wasting compute time, most data centers have policies on minimum parallel efficiency targets (50-70%).
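
As a minimal sketch of how these two quantities are computed (the runtimes below are hypothetical placeholders, not measured data):

def speedup(reference_runtime, runtime):
    # Speedup relative to the single-node reference runtime
    return reference_runtime / runtime

def parallel_efficiency(reference_runtime, runtime, nodes):
    # Achieved speedup divided by the ideal linear speedup for this node count
    return speedup(reference_runtime, runtime) / nodes

# Hypothetical job: 100 minutes on 1 node, 15 minutes on 8 nodes
ref, t, n = 100.0, 15.0, 8
print(f"Speedup: {speedup(ref, t):.2f}x")                            # 6.67x
print(f"Parallel efficiency: {parallel_efficiency(ref, t, n):.0%}")  # 83%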

VASP use cases and differentiation

VASP is one of the most widely used applications for electronic-structure calculations and first-principles molecular dynamics. It offers state-of-the-art algorithms and methods to predict material properties like the ones discussed earlier.

The GPU acceleration is implemented using OpenACC. GPU communications can be carried out using the Magnum IO MPI libraries in NVIDIA HPC-X or NVIDIA Collective Communications Library (NCCL).

Use cases and differentiation of hybrid DFT

This section focuses on a quantum-chemical method known as density functional theory (DFT). Mixing exact-exchange calculations into the approximations of standard DFT yields higher-accuracy predictions; the result is called hybrid DFT. This added accuracy helps determine band gaps in closer accordance with experimental results.

Band gaps are the property that classifies materials as insulators, semiconductors, or conductors. For materials based on hafnia, this extra accuracy is crucial, but comes at an increased computational complexity.

Combining this with the need for using many atoms demonstrates the demand for scaling to many nodes on GPU-accelerated supercomputers. Fortunately, even higher accuracy methods are available in VASP. For more information about the additional features, see VASP6.

At a higher level, VASP is a quantum-chemistry application that is different from other, and possibly even more familiar, high-performance computing (HPC) computational-chemistry applications like NAMD, GROMACS, LAMMPS, and AMBER. These codes focus on molecular dynamics (MD) using simplifications to the interactions between atoms such as treating them as point charges. This makes simulations of the movement of those atoms, say because of temperature, computationally inexpensive.

VASP, on the other hand, treats the interaction between atoms on the quantum level, in that it calculates how the electrons interact with each other and can form chemical bonds. It can also derive forces and move atoms for a quantum or ab-initio-MD (AIMD) simulation. This can indeed be interesting to the scientific problem discussed in this post.

However, such a simulation would consist of repeating the hybrid-DFT calculation step many times. While subsequent steps might converge faster, the computational profile of each individual step would not change. This is why we only show a single, ionic step here.

Running single-node or multi-node

Many VASP calculations employ chemical systems that are small enough not to require execution on HPC facilities. Some users might be uncomfortable with scaling VASP across multiple nodes and instead suffer through long times to solution, maybe even to the extent that a power outage or some other failure becomes probable. Others may limit their simulation sizes so that runtimes are not as onerous as they would be if better-suited system sizes were investigated.

There are multiple reasons that would drive you toward running simulations multi-node:

  • Simulations that would take an unacceptable amount of time to run on a single node, even though the latter might be more efficient.
  • Large calculations that require large amounts of memory and cannot fit on a single node require distributed parallelism. While some computational quantities must be replicated across the nodes, most of them can be decomposed. Therefore, the amount of memory required per node is cut roughly by the number of nodes participating in the parallel task.                                                                                
Diagram compares single node runs of VASP to multi-node, using the increasing memory required and the single-node runtime. Multi-node is required when the single node memory limit is reached. As runtimes grow beyond 1+ hours, multi-node is more desirable.
Figure 1. When to choose single node or multi-node

For more information about multi-node parallelism and compute efficiency, see the recent HPC for the Age of AI and Cloud Computing ebook.

NVIDIA published a study of multi-node parallelism using the dataset Si256_VJT_HSE06. In this study, NVIDIA asked the question, “For this dataset, and an HPC environment of V100 systems and InfiniBand networking, how far can we reasonably scale?”

Magnum IO communication tools for parallelism

VASP uses the NVIDIA Magnum IO libraries and technologies that optimize multi-GPU and multi-node programming to deliver scalable performance. These are part of the NVIDIA HPC SDK.

In this post, we look at two communication libraries:

  • Message Passing Interface (MPI): The standard for programming distributed-memory scalable systems.
  • NVIDIA Collective Communications Library (NCCL): Implements highly optimized multi-GPU and multi-node collective communication primitives using MPI-compatible all-gather, all-reduce, broadcast, reduce, reduce-scatter, and point-to-point routines to take advantage of all available GPUs within and across your HPC server nodes.  

VASP users can choose at runtime what communication library should be used. As performance most often improves significantly when MPI is replaced with NCCL, this is the default in VASP.

There are a couple of strong reasons for the observed differences when using NCCL over MPI.

With NCCL, communications are GPU-initiated and stream-aware. This eliminates the need for GPU-to-CPU synchronization, which is otherwise needed before each CPU-initiated MPI communication to ensure that all GPU operations have completed before the MPI library touches the buffers. NCCL communications can be enqueued on a CUDA stream just like a kernel and can facilitate asynchronous operation. The CPU can enqueue further operations to keep the GPU busy.

In the MPI case, the GPU is idle at least for the time that it takes the CPU to enqueue and launch the next GPU operation after the MPI communication is done. Minimizing GPU idle times contributes to higher parallel efficiencies.

With two separate CUDA streams, you can easily use one stream to do the GPU computations and the other one to communicate. Given that these streams are independent, the communication can take place in the background and potentially be hidden entirely behind the computation. Achieving the latter is a big step forward to high parallel efficiencies. This technique can be used in any program that enables a double-buffering approach.
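
The following CuPy sketch illustrates this two-stream idea in isolation. It is not VASP’s implementation, and the buffers are hypothetical stand-ins; a real code would enqueue an NCCL collective on the communication stream instead of a device copy.

import cupy as cp

compute_stream = cp.cuda.Stream(non_blocking=True)
comm_stream = cp.cuda.Stream(non_blocking=True)

block = cp.random.random((4096, 4096))   # stand-in for local computational work
halo_send = cp.random.random(4096)       # stand-in for data to be communicated
halo_recv = cp.empty_like(halo_send)

with compute_stream:
    # Heavy local work for the current buffer runs on one stream...
    result = block @ block

with comm_stream:
    # ...while the exchange for the next buffer proceeds on the other stream
    cp.copyto(halo_recv, halo_send)

compute_stream.synchronize()
comm_stream.synchronize()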

Nonblocking MPI communications can expose similar benefits. However, you still must handle the synchronizations between the GPU and CPU manually with the described performance downsides.

There is another layer of complexity added as the nonblocking MPI communications must be synchronized on the CPU side as well. This requires much more elaborate code from the outset, compared to using NCCL. However, with MPI communications being CPU-initiated, there often is no hardware resource that automatically makes the communications truly asynchronous.

You can spawn CPU threads to ensure communications progress if your application has CPU cores to spend, but that again increases code complexity. Otherwise, the communication might only take place when the process enters MPI_Wait, which offers no advantage over using blocking calls.

Another difference to be aware of is that for MPI reductions, the data is summed up on the CPU. If your single-threaded CPU memory bandwidth is lower than the network bandwidth, this can be an unexpected bottleneck as well.

NCCL, on the other hand, uses the GPU for summations and is aware of the topology. Intranode, it can use available NVLink connections and optimizes internode communication using Mellanox Ethernet, InfiniBand, or similar fabrics.

Computational modeling test case with HfO2

A hafnia crystal is built from two elements: hafnium (Hf) and oxygen (O). In an ideal system free from dopants or vacancies, for each Hf atom there will be two O atoms. The minimum number of atoms required to describe the structure of the infinitely extended crystal is four Hf (yellowish) and eight O (red) atoms. Figure 2 shows the structure.

A 3D diagram of a small portion of a hafnia (HfO2) crystal showing four hafnium atoms and eight oxygen atoms connected in a lattice.
Figure 2. Visualization of the unit cell for a hafnia (HfO2) crystal

The box wireframe designates the so-called unit cell. It is repeated in all three dimensions of space to yield the infinitely extended crystal. The picture alludes to that by duplicating the atoms O5, O6, O7, and O8 outside of the unit cell to show their respective bonds to the Hf atoms. This cell has dimensions of 0.514 by 0.519 by 0.532 nm. It is not a perfect cuboid because one of its angles is 99.7° instead of 90°.

The minimal model only treats the 12 atoms enclosed in the box in Figure 2 explicitly. However, you can also prolong the box in one or more directions of space by an integer multiple of the according edge and copy the structure of atoms into the newly created space. Such a result is called a supercell and can help to treat effects that are inaccessible within the minimal model, like a 1% vacancy of oxygen.

Of course, treating a larger cell with more atoms is computationally more demanding. When you add one more cell in the direction of a, so that there are two cells in total while leaving b and c as they are, the result is called a 2x1x1 supercell with 24 atoms.

For the purposes of this study, we only considered supercells that are costly enough to justify the usage of at least a handful of supercomputer nodes:

  • 2x2x2: 96 atoms, 512 orbitals
  • 3x3x2: 216 atoms, 1,280 orbitals
  • 3x3x3: 324 atoms, 1,792 orbitals
  • 4x4x3: 576 atoms, 3,072 orbitals
  • 4x4x4: 768 atoms, 3,840 orbitals

Keep in mind that computational effort is not directly proportional to the number of atoms or the volume of the unit cell. A rough estimate used in this case study is that it scales cubically with either.
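
Applying that rough cubic estimate to the supercells above gives an order-of-magnitude feel for the workloads involved (an illustration of the estimate, not a measurement):

# Rough, illustrative cost estimate: assume effort grows with the cube of the
# atom count, relative to the 96-atom (2x2x2) supercell as the baseline.
baseline_atoms = 96
for atoms in (96, 216, 324, 576, 768):
    relative_cost = (atoms / baseline_atoms) ** 3
    print(f"{atoms:4d} atoms: roughly {relative_cost:5.0f}x the 2x2x2 workload")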

A set of five 3D diagrams of the crystal lattice for hafnia (HfO2) crystals for 96, 216, 324, 576, 768 total atom counts representing the simulations being studied here.
Figure 3. Visualizations of hafnium oxide supercells for atom counts: 96, 216, 324, 576, 768

The hafnia system used here is only one example, of course. The lessons are transferable to other systems that employ similar-sized cells and hybrid DFT as well because the underlying algorithms and communication patterns do not change.

If you want to do some testing yourself with HfO2, you can download the input files used for this study. For copyright reasons, we may not redistribute the POTCAR file. This file is the same across all supercells. As a VASP licensee, you can easily create it yourself from the supplied files by the following Linux command:

# cat PAW_PBE_54/Hf_sv/POTCAR PAW_PBE_54/O/POTCAR > POTCAR

For these scaling experiments, we enforced a constant number of employed crystal orbitals, or bands. This slightly increases the workload beyond the minimum required but has no effect on computational accuracy.

If this wasn’t done, VASP would automatically select a number that is integer-divisible by the number of GPUs and this might increase the workload for certain node counts. We chose the number of orbitals that is integer-divisible by all GPU counts employed. Also, for better computational comparability, the number of k-points is kept fixed at 8, even though larger supercells might not require this in practice.
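
One way to pick such an orbital count, shown here purely as an illustration of the idea rather than VASP's internal logic, is to round the minimum number of orbitals up to a multiple of the least common multiple of all GPU counts used in the scaling study:

import math

def orbitals_for_all_gpu_counts(minimum_orbitals, gpu_counts):
    # Smallest orbital count >= minimum_orbitals divisible by every GPU count
    step = math.lcm(*gpu_counts)
    return math.ceil(minimum_orbitals / step) * step

# Hypothetical example: runs on 1, 2, 4, and 8 nodes with 8 GPUs each
print(orbitals_for_all_gpu_counts(500, [8, 16, 32, 64]))  # 512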

Supercell modeling test method with VASP

All benchmarks presented in the following are using the latest VASP release 6.3.2, which was compiled using the NVIDIA HPC SDK 22.5 and CUDA 11.7.

For full reference, the makefile.include file is available for download. The benchmarks were run on the NVIDIA Selene supercomputer, which consists of 560 DGX A100 nodes, each providing eight NVIDIA A100-SXM4-80GB GPUs, eight NVIDIA ConnectX-6 HDR InfiniBand network interface cards (NICs), and two AMD EPYC 7742 CPUs.

To ensure the best performance, the processes and threads were pinned to the NUMA nodes on the CPU that offer ideal connectivity to the respective GPUs and NICs that they will use. The reversed NUMA node numbering on AMD EPYC yields the following process binding for the best hardware locality.

Node local rank CPU NUMA node GPU ID NIC ID
0 3 0 mlx5_0
1 2 1 mlx5_1
2 1 2 mlx5_2
3 0 3 mlx5_3
4 7 4 mlx5_6
5 6 5 mlx5_7
6 5 6 mlx5_8
7 4 7 mlx5_9
Table 1. Compute node GPU and NIC ID mapping

Included in the set of downloadable files is a script called selenerun-ucx.sh. This script wraps the call to VASP and is invoked as follows in the workload manager (for example, Slurm) job script:

# export EXE=/your/path/to/vasp_std
# srun ./selenerun-ucx.sh

The selenerun-ucx.sh file must be customized to match your environment, depending on the resource configuration available. For example, the number of GPUs or number of NICs per node may be different from Selene and the script must reflect those differences.
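
Conceptually, the binding follows Table 1. The sketch below is only an illustration of that mapping, not the contents of selenerun-ucx.sh; the SLURM_LOCALID variable and the numactl command line are assumptions used for the example.

import os

# Node local rank -> (CPU NUMA node, GPU ID, NIC), following Table 1
BINDING = {
    0: (3, 0, "mlx5_0"), 1: (2, 1, "mlx5_1"),
    2: (1, 2, "mlx5_2"), 3: (0, 3, "mlx5_3"),
    4: (7, 4, "mlx5_6"), 5: (6, 5, "mlx5_7"),
    6: (5, 6, "mlx5_8"), 7: (4, 7, "mlx5_9"),
}

local_rank = int(os.environ.get("SLURM_LOCALID", 0))
numa, gpu, nic = BINDING[local_rank]

# Pin the process to the NUMA node closest to its GPU and NIC
print(f"numactl --cpunodebind={numa} --membind={numa} ./vasp_std  # GPU {gpu}, NIC {nic}")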

To keep computation time for benchmarking as low as possible, we restricted all calculations to only one electronic step by setting NELM=1 in the INCAR files. We can do this because we are not concerned with scientific results like the total energy, and running one electronic step suffices to project the performance of a full run. A full run took 19 iterations to converge with the 3x3x2 supercell.

Of course, each different cell setup could require a different number of iterations until convergence. To benchmark scaling behavior, you want to compare fixed numbers of iterations anyway to keep the workload comparable.

However, evaluating the performance of runs with only one electronic iteration would mislead you because the profile is lopsided. Initialization time would take a much larger share relative to the net iterations and so would the post-convergence parts like the force calculation.

Luckily, the electronic iterations all require the same effort and time. You can project the total runtime of a representative run using the following equation:

t_{total} = t_{init} + 19 \cdot t_{iter} + t_{post}

You can extract the time for one iteration t_{iter} from the VASP internal LOOP timer, while the time spent in post-iteration steps t_{post} is given by the difference between the LOOP+ and LOOP timers.

The initialization time t_{init}, on the other hand, is the difference between the total time reported in VASP as Elapsed time and LOOP+. There is a slight error in such a projection as the first iterations take a little longer due to instances such as one-time allocations. However, the error was checked to be less than 2%.
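
A minimal sketch of that projection, using the three timers described above (the numbers are placeholders, not benchmark results):

def projected_total_runtime(elapsed, loop, loop_plus, iterations=19):
    # elapsed:   total "Elapsed time" reported by VASP
    # loop:      LOOP timer, the time of the single electronic iteration
    # loop_plus: LOOP+ timer, iteration plus post-iteration work such as forces
    t_iter = loop
    t_post = loop_plus - loop
    t_init = elapsed - loop_plus
    return t_init + iterations * t_iter + t_post

# Placeholder timings in seconds
print(projected_total_runtime(elapsed=400.0, loop=300.0, loop_plus=360.0))  # 5800.0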

Parallel efficiency results for a hybrid DFT iteration in VASP

We first reviewed the smallest dataset with 96 atoms: the 2x2x2 supercell. This dataset hardly requires a supercomputer these days. Its full run, with 19 iterations, finishes in around 40 mins on one DGX A100.

Still, with MPI, it can scale to two nodes with 93% parallel efficiency before dropping to 83% on four and even 63% on eight nodes.

On the other hand, NCCL enables nearly ideal scaling of 97% on two nodes and 90% on four nodes, and even on eight nodes it still reaches 71%. However, the biggest advantage of NCCL is clearly demonstrated at 16 nodes, where you still see a >10x relative speedup, compared to 6x with MPI only.

The negative scaling beyond 64 nodes needs explanation. To run 128 nodes with 1024 GPUs, you must use 1024 orbitals as well. The other calculations used only 512, so here the workload increases. We didn’t want to include such an excessive orbital count for the lower node runs, though.

Line chart compares relative speedup to the number of compute nodes showing scalability curves for the 96 atom case. Curve #1 is with NCCL OFF with a maximum speedup of 10x at 64 nodes relative to the one node runtime. Curve #2 is with NCCL ON with a maximum speedup of 16x at 64 nodes.
Figure 4. Scaling and performance for 96-atom case. NCCL-enabled results have been scaled relative to the single node performance with NCCL disabled.

The next example is already a computationally challenging problem. The full calculation of the 3x3x2 supercell featuring 216 atoms takes more than 7.5 hours to complete on 8xA100 on a single node.

With more demand on computation, there is more time to conclude the communications asynchronously in the background using NCCL. VASP remains above 91% parallel efficiency up to 16 nodes and only just falls short of 50% on 128 nodes.

With MPI, VASP does not hide the communications as effectively: it does not reach 90% even on eight nodes and drops to 41% already on 64 nodes.

Figure 5 shows that the trends regarding the scaling behavior remain the same for the next bigger 3x3x3 supercell with 324 atoms, which would take a full day until the solution on a single node. However, the spreads between using NCCL and MPI increase significantly. On 128 nodes with NCCL, you gain a 2x better relative speedup.

Line chart of relative speedup vs the number of compute nodes showing two scalability curves each for the 216 and 324 atom cases. Curve #1 is for 216 atoms NCCL OFF with a maximum speedup of 30x at 128 nodes relative to the 1 node runtime. Curve #2 is with NCCL ON with a maximum speedup of 30x at 128 nodes. Curve #3 is for 324 atoms NCCL OFF with a maximum speedup of 42x at 128 nodes relative to the single-node runtime. Curve #4 is with NCCL ON with a maximum speedup of 84x at 128 nodes.
Figure 5. Scaling and performance for the 216-atom and 324-atom cases. NCCL-enabled results have been scaled relative to the single-node performance with NCCL disabled.

Going to an even larger, 4x4x3 supercell containing 576 atoms, you would have to wait more than 5 days for the full calculation using one DGX A100.

However, with such a demanding dataset, a new effect must be discussed: memory capacity and parallelization options. VASP can distribute the workload over k-points, at the cost of replicating memory across those groups. While this is much more effective for standard-DFT runs, it also helps performance in hybrid-DFT calculations, and there is no need to leave available memory unused.

For the smaller datasets, even parallelizing over all k-points fits easily into 8xA100 GPUs with 80 GB of memory each. With the 576-atom dataset, on a single node, this is no longer the case though and we must reduce the k-point parallelism. From two nodes onwards, we could fully employ it again.

While it is indistinguishable in Figure 6, there is minor super-linear scaling in the MPI case (102% parallel efficiency) on two nodes. This is because of the necessarily reduced parallelism on one node that is lifted on two or more nodes. However, that is what you would do in practice as well.

We face a similar situation for the 4x4x4 supercell with 768 atoms on one and two nodes, but the super-linear scaling effect is even less pronounced there.

We scaled the 4x4x3 and 4x4x4 supercells to 256 nodes, which equates to 2,048 A100 GPUs. With NCCL, these runs achieved 67% and 75% parallel efficiency, respectively. This enables you to obtain results in less than 1.5 hours for what would previously have taken almost 12 days on one node. The usage of NCCL enables an almost 3x higher relative speedup for such large calculations over MPI.

Line chart compares relative speedup to the number of compute nodes showing scalability curves for the 576 and 768 atom cases. Curve #1 is for 576 atoms NCCL OFF with a maximum speedup of 64x at 256 nodes relative to the single-node runtime. Curve #2 is with NCCL ON with a maximum speedup of 175x at 256 nodes. Curve #3 is for 768 atoms NCCL OFF with a maximum speedup of 78x at 256 nodes relative to the one-node runtime. Curve #4 is with NCCL ON with a maximum speedup of 198x at 256 nodes.
Figure 6. Scaling and performance for the 576- and 768-atom cases. NCCL-enabled results have been scaled relative to the single node performance with NCCL disabled.

Recommendations for using NCCL for a VASP simulation

When an NVIDIA GPU-accelerated HPC environment with NVIDIA InfiniBand networking is available, VASP 6.3.2 calculating HfO2 supercells ranging from 96 to 768 atoms achieves significant performance gains by using NVIDIA NCCL across many nodes.

A 2D diagram with the number of atoms on the vertical axis and the number of nodes on the horizontal axis, showing where NCCL makes a positive difference in scalability and where it does not. NCCL starts to make a difference beyond roughly 4 nodes for the 96-atom case and beyond roughly 16 nodes for the 768-atom case.
Figure 7. A general guideline for when NCCL is beneficial for a VASP simulation similar to HfO2 running on A100 GPUs with multiple HDR InfiniBand interconnectivity

Based on this testing, we recommend that users with access to capable HPC environments consider the following:

  • Run all but the smallest calculations using GPU acceleration.
  • Consider running larger systems of atoms using both GPUs and multiple nodes to minimize time to insight.
  • Launch all multi-node calculations using NCCL, as it can only increase efficiency when running large models.

The slight added overhead to initialize NCCL will be worth the tradeoff.

Summary

In conclusion, you’ve seen that scalability for hybrid DFT in VASP depends on the size of the dataset. This is somewhat expected given that the smaller the dataset is, the earlier each individual GPU will run out of computational load.

NCCL also helps to hide the required communications. Figure 7 shows the levels of parallel efficiency that you can expect for certain dataset sizes with varying node counts. For the most computationally intensive datasets, VASP reaches >80% parallel efficiency on 32 nodes. For the most demanding datasets, such as those some of our customers request, scale-out runs on 256 nodes are possible with good efficiency.

Line chart compares parallel efficiency to the number of compute nodes showing curves for all the executed cases NCCL ON and NCCL OFF. For small atom counts like 96, efficiency drops quickly to less than 10% at 128 nodes. For large atom counts like 576 and 768 with NCCL enabled, efficiency stays well above 60% out to 256 nodes.
Figure 8. Parallel efficiency as a function of node count (log scale)

VASP user experience

From our experience with VASP users, running VASP on GPU-accelerated infrastructure is a positive and productive experience that enables you to consider larger and more sophisticated models for your research.

In unaccelerated scenarios, you may be running models smaller than you’d like because you expect runtimes to grow to intolerable levels. Using high-performance, low-latency I/O infrastructure with GPUs, and InfiniBand with Magnum IO acceleration technologies like NCCL, makes efficient, multi-node parallel computing practical and puts larger models within reach for investigators.

HPC system administrator benefits

HPC centers, especially commercial ones, often have policies that prohibit users from running jobs at low parallel efficiency. This prevents users on short deadlines, or those who need high turnover rates, from consuming computational resources at the expense of other users’ job wait times. More often than not, a simple rule of thumb such as 50% parallel efficiency dictates the maximum number of nodes that a user may request, which in turn increases the time to solution.

We have shown here that, by using NCCL as part of NVIDIA Magnum IO, users of an accelerated HPC system can stay well within efficiency limits and scale their jobs significantly farther than possible when using MPI alone. This means that while keeping overall throughput at its highest across the HPC system, you can minimize runtime and maximize the number of simulations to get new and exciting science done.

HPC application developer advantages

As an application developer, you can benefit from the advantages observed here with VASP just as well. To get started, explore NCCL and the other NVIDIA Magnum IO libraries included in the NVIDIA HPC SDK.

Categories
Misc

Building the Future of Real-Time Graphics with NVIDIA and Unreal Engine 5.1

The Unreal Engine 5.1 release includes cutting-edge advancements that make it easier to incorporate realistic lighting and accelerate graphics workflows. Using…

The Unreal Engine 5.1 release includes cutting-edge advancements that make it easier to incorporate realistic lighting and accelerate graphics workflows. Using the NVIDIA RTX branch of Unreal Engine (NvRTX), you can accelerate hardware ray-traced and path-traced operations by up to 40%.

Unreal Engine 5.1 features Lumen, a real-time global illumination solution, which enables developers to create more dynamic scenes where indirect lighting changes on the fly. Realistic lighting is an essential component when creating scenes in games, and Lumen can provide high-quality, scalable global illumination and hardware ray-traced reflections.

Nanite, the Unreal Engine (UE) virtualized geometry system, enables film-quality art consisting of billions of polygons to be directly imported into UE, all while maintaining the highest image quality in real time. 

In addition to Lumen and Nanite, Unreal Engine 5.1 advances important features that speed up development cycles like Virtual Shadow Maps, Programmable Rasterizer, Virtual Assets, and automated pipeline state object caching for DX12.

NVIDIA is accelerating this new feature set through a combination of NVIDIA RTX 4090, Shader Execution Reordering, and hardware-accelerated ray tracing cores. Thousands of developers have already experienced the benefits of Unreal Engine with NVIDIA technologies. Over the past few years, NVIDIA has delivered GPUs, libraries, and APIs to support the latest features of Unreal Engine.

Next-generation RTX lighting

Achieving the most accurate lighting in computer graphics requires replicating how light behaves physically in the real world. Path-traced lighting has been used in offline rendering for films to achieve physically accurate results. However, that is an expensive and time-consuming process.

Continued advancements in hardware ray-traced shadows in Unreal Engine 5.1 improve shadow quality using an algorithm that more closely matches offline path tracing. This allows you to create more realistic scenes in real time.

RTX Direct Illumination (RTXDI), available through NvRTX, allows you to take dynamic light counts from single digits into the hundreds. RTXDI uses the same algorithm for direct lighting as the offline path tracer, taking a step closer to unlimited lighting and photorealism. 

The next evolution of this technology is in gaming and real-time rendering, which requires frames to be processed and rendered considerably faster.

A wood and bamboo entrance lit in real time compared to the same scene created using the offline path tracer in Unreal Engine 5.1.
Figure 1. A scene lit in real time (left) compared to the same scene created using the offline path tracer in Unreal Engine 5.1 (right)

Shader Execution Reordering

A new technology called Shader Execution Reordering (SER) can help solve the challenge of accurately simulating light. SER provides performance gains in ray tracing operations and optimization for specific use cases. NVIDIA is accelerating real-time ray tracing and offline path tracing by leveraging SER through NvRTX. 

Shader Execution Reordering diagram: ray bounce off an object in different directions, hitting different materials (left): reordering threads, grouping similar work together (center): SMs execute shaders with increased coherence (right).
Figure 2. Shader Execution Reordering in NvRTX enables significant performance gains in ray tracing operations 

NvRTX features SER integration to support optimization of many of its ray tracing paths. Developers will see additional frame rate optimization on 40 series cards with up to 40% speed increases in ray tracing operations, and zero impact on quality and content authoring. This improves the efficiency of complex ray tracing calculations, and provides greater gains in scenes that take advantage of ray tracing benefits. 

Offline path tracing, which is arguably the most complex tracing operation, will see the largest benefit from SER in Unreal Engine 5.1, with speed improvements of 40% or more. Hardware ray-traced reflections and translucency, which have complex interactions with materials and lighting, will also see benefits.

For more information about SER in Unreal Engine 5.1, see the Shader Execution Reordering Whitepaper and Improve Shader Performance and In-Game Frame Rates with Shader Execution Reordering.

Summary

Epic Games and NVIDIA are leading the way into the next generation of rendering, moving the industry toward the future of graphics. With improvement leaps made in each version release of Unreal Engine, developers can expect even more groundbreaking advancements in this space. 

Learn more about NVIDIA technologies and Unreal Engine. 

Categories
Misc

Attention, Sports Fans! WSC Sports’ Amos Berkovich on How AI Keeps the Highlights Coming

It doesn't matter if you love hockey, basketball or soccer. Thanks to the internet, there's never been a better time to be a sports fan. But editing together so many social media clips, long-form YouTube highlights and other videos from global sporting events is no easy feat. So how are all of these craveable video…


Categories
Misc

Explainer: What Is the Metaverse

The metaverse is the “next evolution of the internet, the 3D internet,” according to NVIDIA CEO Jensen Huang.

The metaverse is the “next evolution of the internet, the 3D internet,” according to NVIDIA CEO Jensen Huang.

Categories
Misc

Going Green: New Generation of NVIDIA-Powered Systems Show Way Forward

With the end of Moore's law, traditional approaches to meet the insatiable demand for increased computing performance will require disproportionate increases in costs and power. At the same time, the need to slow the effects of climate change will require more efficient data centers, which already consume more than 200 terawatt-hours of energy each year…


Categories
Misc

Nuance Communications and NVIDIA Bring Medical-Imaging AI Models Directly Into Clinical Settings

HLTH—Nuance Communications, Inc., and NVIDIA today announced a partnership that for the first time puts AI-based diagnostic tools directly into the hands of radiologists and other clinicians at scale, enabling the delivery of improved patient care at lower costs.

Categories
Misc

NVIDIA Omniverse Opens Portals for Scientists to Explore Our Universe

SC22 — NVIDIA today announced that NVIDIA Omniverse™ — an open computing platform for building and operating metaverse applications — now connects to leading scientific computing visualization software and supports new batch-rendering workloads on systems powered by NVIDIA A100 and H100 Tensor Core GPUs.

Categories
Misc

Supercomputing Superpowers: NVIDIA Brings Digital Twin Simulation to HPC Data Center Operators

The technologies powering the world's 7 million data centers are changing rapidly. The latest have allowed IT organizations to reduce costs even while dealing with exponential data growth. Simulation and digital twins can help data center designers, builders and operators create highly efficient and performant facilities. But building a digital twin that can accurately represent…


Categories
Misc

Going the Distance: NVIDIA Platform Solves HPC Problems at the Edge

Collaboration among researchers, like the scientific community itself, spans the globe. Universities and enterprises sharing work over long distances require a common language and secure pipeline to get every device — from microscopes and sensors to servers and campus networks — to see and understand the data each is transmitting. The increasing amount of data…
