DataBloom - Part 238

Mixture-of-Experts with Expert Choice Routing

Post date November 16, 2022

Posted by Yanqi Zhou, Research Scientist, Google Research Brain Team

The capacity of a neural network to absorb information is limited by the number of its parameters, and as a consequence, finding more effective ways to increase model parameters has become a trend in deep learning research. Mixture-of-experts (MoE), a type of conditional computation where parts of the network are activated on a per-example basis, has been proposed as a way of dramatically increasing model capacity without a proportional increase in computation. In sparsely-activated variants of MoE models (e.g., Switch Transformer, GLaM, V-MoE), a subset of experts is selected on a per-token or per-example basis, thus creating sparsity in the network. Such models have demonstrated better scaling in multiple domains and better retention capability in a continual learning setting (e.g., Expert Gate). However, a poor expert routing strategy can cause certain experts to be under-trained, leading to an expert being under- or over-specialized.

In “Mixture-of-Experts with Expert Choice Routing”, presented at NeurIPS 2022, we introduce a novel MoE routing algorithm called Expert Choice (EC). We discuss how this approach can achieve optimal load balancing in an MoE system while allowing heterogeneity in token-to-expert mapping. Compared to token-based routing and other routing methods in traditional MoE networks, EC demonstrates very strong training efficiency and downstream task scores. Our method resonates with one of the visions for Pathways, which is to enable heterogeneous mixture-of-experts via Pathways MPMD (multi program, multi data) support.

Overview of MoE Routing

MoE operates by adopting a number of experts, each as a sub-network, and activating only one or a few experts for each input token. A gating network must be chosen and optimized in order to route each token to the most suited expert(s). Depending on how tokens are mapped to experts, MoE can be sparse or dense. Sparse MoE only selects a subset of experts when routing each token, reducing computational cost as compared to a dense MoE. For example, recent work has implemented sparse routing via k-means clustering, linear assignment to maximize token-expert affinities, or hashing. Google also recently announced GLaM and V-MoE, both of which advance the state of the art in natural language processing and computer vision via sparsely gated MoE with top-k token routing, demonstrating better performance scaling with sparsely activated MoE layers. Many of these prior works used a token choice routing strategy in which the routing algorithm picks the best one or two experts for each token.

Token Choice Routing. The routing algorithm picks the top-1 or top-2 experts with highest affinity scores for each token. The affinity scores can be trained together with model parameters.

The independent token choice approach often leads to an imbalanced load across experts and to under-utilization. To mitigate this, previous sparsely gated networks introduced auxiliary losses as regularization to prevent too many tokens from being routed to a single expert, but the effectiveness was limited. As a result, token choice routing needs to overprovision expert capacity by a significant margin (2x–8x of the calculated capacity) to avoid dropping tokens when there is a buffer overflow.
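The overflow problem can be made concrete with a small sketch. It is not the implementation used in any of the models named above; the function name `token_choice_route`, the score values, and the drop-on-overflow policy are illustrative assumptions.

```python
def token_choice_route(scores, k, capacity):
    """Each token picks its k highest-affinity experts.

    scores[t][e] is the affinity of token t for expert e (hypothetical values).
    An expert that is already full drops any further token sent to it.
    """
    num_experts = len(scores[0])
    assigned = [[] for _ in range(num_experts)]
    dropped = 0
    for t, row in enumerate(scores):
        top_k = sorted(range(num_experts), key=lambda e: row[e], reverse=True)[:k]
        for e in top_k:
            if len(assigned[e]) < capacity:
                assigned[e].append(t)
            else:
                dropped += 1  # buffer overflow: this token-expert pair is lost
    return assigned, dropped


# Four tokens that all prefer expert 0 (illustrative numbers only).
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]

balanced_capacity = len(scores) * 1 // 2  # tokens * k / experts = 2
tight = token_choice_route(scores, k=1, capacity=balanced_capacity)
roomy = token_choice_route(scores, k=1, capacity=2 * balanced_capacity)
print(tight)  # ([[0, 1], []], 2)
print(roomy)  # ([[0, 1, 2, 3], []], 0)
```

With the calculated capacity, half the tokens are dropped while expert 1 sits idle; only a 2x overprovisioned buffer avoids the drops, and it does nothing for the idle expert.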

In addition to load imbalance, most prior works allocate a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. We argue that different tokens should be received by a variable number of experts, conditioned on token importance or difficulty.

Expert Choice Routing

To address the above issues, we propose a heterogeneous MoE that employs the expert choice routing method illustrated below. Instead of having tokens select the top-k experts, the experts with predetermined buffer capacity are assigned to the top-k tokens. This method guarantees even load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance. EC routing speeds up training convergence by over 2x in an 8B/64E (8 billion activated parameters, 64 experts) model, compared to the top-1 and top-2 gating counterparts in Switch Transformer, GShard, and GLaM.

Expert Choice Routing. Experts with predetermined buffer capacity are assigned top-k tokens, thus guaranteeing even load balancing. Each token can be received by a variable number of experts.

In EC routing, we set the expert capacity k to the average number of tokens per expert in a batch of input sequences multiplied by a capacity factor, which determines the average number of experts that receive each token. To learn the token-to-expert affinity, our method produces a token-to-expert score matrix that is used to make routing decisions. The score matrix indicates the likelihood of a given token in a batch of input sequences being routed to a given expert.
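As a worked example of the capacity rule, the sketch below computes k for a hypothetical batch. The function name, the batch size of 1,024 tokens, and the choice to round down are assumptions made for illustration.

```python
def expert_capacity(num_tokens, num_experts, capacity_factor):
    """Tokens each expert selects: average tokens per expert times the capacity factor."""
    # Rounding down to an integer is an assumption of this sketch.
    return int(num_tokens * capacity_factor / num_experts)


k = expert_capacity(num_tokens=1024, num_experts=64, capacity_factor=2)
avg_experts_per_token = k * 64 / 1024  # total expert slots divided by tokens
print(k, avg_experts_per_token)  # 32 2.0
```

With a capacity factor of 2, each of the 64 experts takes 32 tokens, so a token is received by two experts on average, although any individual token may be received by more or fewer.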

Similar to Switch Transformer and GShard, we apply an MoE and gating function in the dense feedforward (FFN) layer, as it is the most computationally expensive part of a Transformer-based network. After producing the token-to-expert score matrix, a top-k function is applied along the token dimension for each expert to pick the most relevant tokens. A permutation function is then applied based on the generated indexes of the tokens, to create a hidden value with an additional expert dimension. The data is split across multiple experts such that all experts can execute the same computational kernel concurrently on a subset of tokens. Because a fixed expert capacity can be determined, we no longer overprovision expert capacity due to load imbalance, thus reducing training and inference step time by around 20% compared to GLaM.
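The selection and gather steps described above can be sketched in plain Python. This is a toy illustration rather than the paper's implementation: the function names, the unnormalized score values, and the string stand-ins for hidden vectors are all assumptions.

```python
def expert_choice_route(scores, capacity):
    """For each expert, pick the `capacity` tokens with the highest score.

    scores[t][e] is the token-to-expert score (hypothetical values here).
    Returns one list of token indexes per expert, best first.
    """
    num_tokens = len(scores)
    chosen = []
    for e in range(len(scores[0])):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        chosen.append(ranked[:capacity])  # top-k along the token dimension
    return chosen


def dispatch(hidden, chosen):
    """Gather token representations into an [expert][slot] layout."""
    return [[hidden[t] for t in tokens] for tokens in chosen]


def experts_per_token(chosen, num_tokens):
    """Count how many experts received each token."""
    counts = [0] * num_tokens
    for tokens in chosen:
        for t in tokens:
            counts[t] += 1
    return counts


scores = [[0.9, 0.8], [0.7, 0.1], [0.2, 0.6], [0.1, 0.3]]
hidden = ["h0", "h1", "h2", "h3"]  # stand-ins for hidden vectors

chosen = expert_choice_route(scores, capacity=2)
print(chosen)                        # [[0, 1], [0, 2]]
print(dispatch(hidden, chosen))      # [['h0', 'h1'], ['h0', 'h2']]
print(experts_per_token(chosen, 4))  # [2, 1, 1, 0]
```

Every expert ends up with exactly two tokens, so load is balanced by construction, while token 0 is processed by two experts and token 3 by none.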

Evaluation

To illustrate the effectiveness of Expert Choice routing, we first look at training efficiency and convergence. We use EC with a capacity factor of 2 (EC-CF2) to match the activated parameter size and computational cost on a per-token basis to GShard top-2 gating and run both for a fixed number of steps. EC-CF2 reaches the same perplexity as GShard top-2 in less than half the steps and, in addition, we find that each GShard top-2 step is 20% slower than our method.

We also scale the number of experts while fixing the expert size to 100M parameters for both EC and GShard top-2 methods. We find that both work well in terms of perplexity on the evaluation dataset during pre-training — having more experts consistently improves training perplexity.

Evaluation results on training convergence: EC routing yields 2x faster convergence at 8B/64E scale compared to top-2 gating used in GShard and GLaM (top). EC training perplexity scales better as the number of experts increases (bottom).

To validate whether improved perplexity directly translates to better performance in downstream tasks, we perform fine-tuning on 11 selected tasks from GLUE and SuperGLUE. We compare three MoE methods including Switch Transformer top-1 gating (ST Top-1), GShard top-2 gating (GS Top-2) and a version of our method (EC-CF2) that matches the activated parameters and computational cost of GS Top-2. The EC-CF2 method consistently outperforms the related methods and yields an average accuracy increase of more than 2% in a large 8B/64E setting. Comparing our 8B/64E model against its dense counterpart, our method achieves better fine-tuning results, increasing the average score by 3.4 points.

Our empirical results indicate that capping the number of experts for each token hurts the fine-tuning score by 1 point on average. This study confirms that allowing a variable number of experts per token is indeed helpful. We also compute statistics on token-to-expert routing, particularly on the ratio of tokens that have been routed to a certain number of experts. We find that a majority of tokens have been routed to one or two experts, while 23% have been routed to three or four experts and only about 3% of tokens have been routed to more than four experts, thus verifying our hypothesis that expert choice routing learns to allocate a variable number of experts to tokens.

Final Thoughts

We propose a new routing method for sparsely activated mixture-of-experts models. This method addresses load imbalance and under-utilization of experts in conventional MoE methods, and enables the selection of different numbers of experts for each token. Our model demonstrates more than 2x training efficiency improvement when compared to the state-of-the-art GShard and Switch Transformer models, and achieves strong gains when fine-tuning on 11 datasets in the GLUE and SuperGLUE benchmarks.

Our approach for expert choice routing enables heterogeneous MoE with straightforward algorithmic innovations. We hope that this may lead to more advances in this space at both the application and system levels.

Acknowledgements

Many collaborators across Google Research supported this work. We particularly thank Nan Du, Andrew Dai, Yanping Huang, and Zhifeng Chen for the initial ground work on MoE infrastructure and Tarzan datasets. We greatly appreciate Hanxiao Liu and Quoc Le for contributing the initial ideas and discussions. Tao Lei, Vincent Zhao, Da Huang, Chang Lan, Daiyi Peng, and Yifeng Lu contributed significantly on implementations and evaluations. Claire Cui, James Laudon, Martin Abadi, and Jeff Dean provided invaluable feedback and resource support.

Misc

New Release: NVIDIA RTX Global Illumination 1.3

Post date November 16, 2022

NVIDIA RTX Global Illumination (RTXGI) 1.3 includes highly requested features such as dynamic library support, an increased maximum probe count per DDGI volume by 2x, support for Shader Model 6.6 Dynamic Resources in D3D12, and more.

Misc

Evaluating Applications Using the NVIDIA Arm HPC Development Kit

Post date November 16, 2022

The NVIDIA Arm HPC Developer Kit is an integrated hardware and software platform for creating, evaluating, and benchmarking HPC, AI, and scientific computing applications on a heterogeneous GPU- and CPU-accelerated computing system. NVIDIA announced its availability in March of 2021.

The kit is designed as a stepping stone to the next-generation NVIDIA Grace Hopper Superchip for HPC and AI applications. It can be used to identify non-obvious x86 dependencies and ensure software readiness ahead of NVIDIA Grace Hopper systems available in 1H23. For more details, see the NVIDIA Grace Hopper Superchip Architecture whitepaper.

The Oak Ridge National Laboratory Leadership Computing Facility (OLCF) integrated the NVIDIA Arm HPC Developer Kit into their existing Wombat Arm cluster. Application teams worked to build, validate, and benchmark several HPC applications to evaluate application readiness for the next generation of Arm- and GPU-based HPC systems. The teams have jointly submitted a paper for publication in IEEE Transactions on Parallel and Distributed Systems, demonstrating that the suite of software and tools available for GPU-accelerated Arm systems is ready for production environments. To learn more, see Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform.

OLCF Wombat Cluster

Wombat is an experimental cluster equipped with Arm-based processors from various vendors. It has been operational since 2018. The cluster is managed by the OLCF and is freely accessible to users and researchers.

At the time of the study, the cluster consisted of three types of compute nodes:

4 HPE Apollo 70 nodes, each equipped with dual Cavium (now Marvell) ThunderX2 CN9980 processors and two NVIDIA V100 Tensor Core GPUs
16 HPE Apollo 80 nodes, each equipped with a single Fujitsu A64FX processor
8 NVIDIA Arm HPC Developer Kit nodes, each equipped with a single Ampere Computing Altra Q80-30 CPU and 2 NVIDIA A100 GPUs

These three types of nodes share a common Arm-based (ThunderX2) login node, and all nodes are connected through InfiniBand EDR and HDR.

HPC application evaluation

Eleven different teams carried out the evaluation work. Teams included researchers from Oak Ridge National Laboratory, Sandia National Laboratories, University of Illinois at Urbana-Champaign, Georgia Institute of Technology, University of Basel, Swiss National Supercomputing Centre (CSCS), Helmholtz-Zentrum Dresden-Rossendorf, University of Delaware, and NVIDIA.

Table 1 summarizes the final list of applications and their various characteristics. The applications cover eight different scientific domains and include codes written in Fortran, C, and C++. The parallel programming models used were MPI, OpenMP/OpenACC, Kokkos, Alpaka, and CUDA. No changes were made to the application codes during the porting activities. The evaluation process primarily focused on application porting and testing, with less emphasis on absolute performance considering the experimental nature of the testbed.

App Name	Science Domain	Language	Parallel Programming Model
ExaStar	Stellar Astrophysics	Fortran	OpenACC, OpenMP offload
GPU-I-TASSER	Bioinformatics	C	OpenACC
LAMMPS	Molecular Dynamics	C++	OpenMP, KOKKOS
MFC	Fluid Dynamics	Fortran	OpenACC
MILC	QCD	C/C++	CUDA
MiniSweep	Sn Transport	C	OpenMP, CUDA
NAMD/VMD	Molecular Dynamics	C++	CUDA
PIConGPU	Plasma Physics	C++	Alpaka, CUDA
QMCPACK	Chemistry	C++	OpenMP offload, CUDA
SPECHPC 2021	Variety of Apps	C/C++/Fortran	OpenMP offload, OpenMP
SPH-EXA2	Hydrodynamics	C++	OpenMP, CUDA

Table 1. Applications evaluated on the Wombat test bed

This post covers results for four of the applications. To learn more about the other applications, see Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform.

Bioinformatics for protein structure and function prediction

GPU-I-TASSER is a GPU-capable bioinformatics method for protein structure and function prediction. The I-TASSER suite predicts protein structures through four main steps. These include threading template identification, iterative structure assembly simulation, model selection, and refinement. The final step is structure-based function annotation. The structure folding and reassembling stage is conducted by replica exchange Monte Carlo simulations.

Figure 1. Performance of GPU-I-TASSER on Wombat and Summit

Figure 1 shows the performance of Wombat’s ThunderX2 and Ampere Altra processors and NVIDIA A100 and V100 GPUs relative to the POWER9 processor on Summit. For Ampere Altra, NVIDIA V100, and A100, speedups of 1.8x, 6.9x, and 13.3x, respectively, were observed.

Fluid flow solver for physical problems

Multi-component Flow Code (MFC) is an open-source fluid flow solver that provides high-order accurate solutions to a wide variety of physical problems, including multi-phase compressible flows and sub-grid dispersions.

Table 2 shows average wall-clock times and relative performance metrics for the different hardware. The Time column has little absolute meaning, with the relative performance being the most meaningful (also shown in the last column). All comparisons use either the NVHPC v22.1 or GCC v11.1 compilers as indicated. The CPU wall-clock times are normalized by the number of CPU cores per chip. The results show that the A100 GPU is 1.72x faster than the V100 on Summit.

Hardware	Compiler	Time (sec)	Speedup
NVIDIA A100	NVHPC	0.28	15.71
NVIDIA V100	NVHPC	0.5	8.80
2xXeon 6248	NVHPC	2.7	1.63
2xXeon 6248	GCC	2.1	2.10
Ampere Altra	NVHPC	3.9	1.13
Ampere Altra	GCC	2.7	1.63
2xPOWER9	NVHPC	4.4	1.00
2xPOWER9	GCC	3.5	1.26
2xThunderX2	NVHPC	21	0.21
2xThunderX2	GCC	5.4	0.81
A64FX	NVHPC	4.3	1.02
A64FX	GCC	13	0.34

Table 2. Comparison of wall-clock times per time step on various architectures. Bold indicates use of NVIDIA Arm HPC Development Kit hardware.
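The Speedup column in Table 2 can be reproduced from the Time column by dividing the 2xPOWER9 NVHPC time by each row's time. The sketch below does this with the rounded times from the table; the dictionary layout and variable names are illustrative, and taking 2xPOWER9 with NVHPC as the baseline is inferred from its speedup of 1.00.

```python
# Wall-clock times (sec) as listed in Table 2, keyed by (hardware, compiler).
times = {
    ("NVIDIA A100", "NVHPC"): 0.28,
    ("NVIDIA V100", "NVHPC"): 0.5,
    ("2xXeon 6248", "NVHPC"): 2.7,
    ("2xXeon 6248", "GCC"): 2.1,
    ("Ampere Altra", "NVHPC"): 3.9,
    ("Ampere Altra", "GCC"): 2.7,
    ("2xPOWER9", "NVHPC"): 4.4,
    ("2xPOWER9", "GCC"): 3.5,
    ("2xThunderX2", "NVHPC"): 21,
    ("2xThunderX2", "GCC"): 5.4,
    ("A64FX", "NVHPC"): 4.3,
    ("A64FX", "GCC"): 13,
}

baseline = times[("2xPOWER9", "NVHPC")]  # the row with speedup 1.00
speedup = {key: round(baseline / t, 2) for key, t in times.items()}

for (hardware, compiler), s in speedup.items():
    print(f"{hardware:12s} {compiler:6s} {s:6.2f}")

# Ratio of the two GPU rows, computed from the rounded times.
a100_vs_v100 = times[("NVIDIA V100", "NVHPC")] / times[("NVIDIA A100", "NVHPC")]
```

From the rounded table entries, the A100-to-V100 ratio comes out near 1.79, slightly above the 1.72x quoted in the text. The quoted figure presumably derives from unrounded measurements, though that is an assumption.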

NAMD and VMD for biomolecular dynamics simulation and visualization

NAMD and VMD are biomolecular modeling applications for molecular dynamics simulation (NAMD) and for preparation, analysis, and visualization (VMD). Researchers use NAMD and VMD to study biomolecular systems ranging from individual proteins and large multiprotein complexes to photosynthetic organelles and entire viruses.

Table 3 shows that the simulations on A100 for NAMD are as much as 50% faster than on the V100. Similar performance is demonstrated between Cavium ThunderX2 and IBM POWER9, with the latter benefiting from its low latency NVIDIA NVLink connection between CPU and GPU.

CPU	GPU	Compiler	Perf (ns/day)
2x EPYC 7742	A100-SXM4	GCC	187.5
1x Ampere Altra	A100-PCIe	GCC	182.2
2x Xeon 6134	A100-PCIe	ICC	181.4
2x POWER9	V100-NVLINK	XLC	125.7
2x ThunderX2	V100-PCIe	GCC	124.9

Table 3. NAMD single-GPU performance for 1M-atom STMV simulation, NVE ensemble with 12 Å cutoff, rigid bond constraints, multiple time stepping with 2 fs fast time step, and 4 fs for PME. Bold indicates use of NVIDIA Arm HPC Development Kit hardware.

For VMD, the GPU-accelerated results in Table 4 showcase the performance gains that come from the much higher peak arithmetic throughput and memory bandwidth of GPUs relative to existing CPU platforms. The GPU molecular orbital results highlight GPU performance and host-GPU interconnect bandwidth.

CPU	Compiler	SIMD	Time (sec)
AMD TR 3975WX	ICC	AVX2	1.32
AMD TR 3975WX	ICC	SSE2	2.89
1x Ampere Altra	ArmClang	NEON	1.35
2x ThunderX2	ArmClang	NEON	3.02
A64FX	ArmClang	SVE	4.15
A64FX	ArmClang	NEON	13.89
2x POWER9	ArmClang	VSX	6.43

Table 4. Comparison of VMD molecular orbital runtime on each platform. Bold indicates use of NVIDIA Arm HPC Development Kit hardware.

QMCPACK

QMCPACK is an open-source, high-performance Quantum Monte Carlo (QMC) package that solves the many-body Schrödinger equation using a variety of statistical approaches. The few approximations made in QMC can be systematically tested and reduced, potentially allowing the uncertainties in the predictions to be quantified, at the cost of significant computational expense compared to more widely used methods such as density functional theory.

Applications include weakly bound molecules, two-dimensional nanomaterials, and solid-state materials such as metals, semiconductors, and insulators.

Figure 2. QMCPACK DMC throughput for Wombat and Summit nodes as a function of the number of electrons in the NiO benchmark

As shown in Figure 2, single A100 GPU runs on Wombat outperform those on V100s, with significantly larger throughput for nearly all problem sizes. Wombat’s two A100 GPUs are significantly more performant for the largest and most computationally challenging case. For these system sizes, greater GPU memory is the most significant factor in increased performance.

NVIDIA Arm HPC Developer Kit evaluation results

The research teams working with the NVIDIA Arm HPC Developer Kit as part of the Wombat cluster said, “In our deployment of Wombat testbed nodes incorporating NVIDIA VGPUs, we found that general cluster setup was made easier by contributions across the stack from Arm Server Ready firmware OSes, software, libraries, and end-user packages.”

“Many of the GPU-accelerated applications tested in this study derived most of their performance from application kernels optimized for the GPU architecture,” they added. “This does not negate the importance of testing new Arm and GPU platforms. We noted that the biggest limitations seemed to be related to limited GPU memory sizes and the mechanisms used to migrate and keep data near the GPU accelerators.”

The path to NVIDIA Grace Hopper systems

The NVIDIA Arm HPC Developer Kit was developed to offer customers a stable hardware and software platform for development and performance analysis of accelerated HPC, AI, and scientific computing applications in the Arm Ecosystem. The NVIDIA Grace Hopper Superchip combines the very high single-threaded performance of 72 Arm Neoverse V2 CPU cores with the next generation of NVIDIA Hopper H100 GPUs to offer unparalleled performance for HPC and AI applications. The NVIDIA Grace Hopper Superchip innovates by connecting the CPU to the GPU through NVLink-C2C, which is 7x faster than PCIe Gen5, and by supporting 3.5 TB/s of memory bandwidth through LPDDR5X and HBM3 memory.

The NVIDIA Grace Hopper Superchip has already been adopted by leading HPC customers, including the Swiss National Supercomputing Centre (CSCS), Los Alamos National Laboratory (LANL), and King Abdullah University of Science and Technology (KAUST).

Systems based on the NVIDIA Grace Hopper Superchip will be available from leading Original Equipment Manufacturers in the first half of 2023. Customers interested in getting a head start on moving applications to the Arm Ecosystem can still purchase an NVIDIA Arm HPC Developer Kit from Gigabyte Systems.

To learn more about how the NVIDIA Grace Hopper Architecture delivers next-generation performance and ease of programming, see the NVIDIA Grace Hopper Superchip Architecture whitepaper.

Misc

NVIDIA Teams With Microsoft to Build Massive Cloud AI Computer

Post date November 16, 2022

NVIDIA today announced a multi-year collaboration with Microsoft to build one of the most powerful AI supercomputers in the world, powered by Microsoft Azure’s advanced supercomputing infrastructure combined with NVIDIA GPUs, networking and full stack of AI software to help enterprises train, deploy and scale AI, including large, state-of-the-art models.

Misc

GeForce RTX 4080 GPU Launches, Unlocking 1.6x Performance for Creators This Week ‘In the NVIDIA Studio’

Post date November 16, 2022

Content creators can now pick up the GeForce RTX 4080 GPU, available from top add-in card providers including ASUS, Colorful, Gainward, Galaxy, GIGABYTE, INNO3D, MSI, Palit, PNY and ZOTAC, as well as from system integrators and builders worldwide.

The post GeForce RTX 4080 GPU Launches, Unlocking 1.6x Performance for Creators This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.

Misc

Upcoming Webinar: Kick-Start Deep Learning on the Cloud

Post date November 15, 2022

Join experts from NVIDIA and Microsoft on November 30 to discover the latest development in deep learning through hands-on demos and practical guidance for getting started on the cloud.

Misc

Deploying AI-Accelerated Medical Devices with NVIDIA Clara Holoscan

Post date November 15, 2022

The ability to deploy real-time AI in clinics and research facilities is critical to enable the next frontiers in surgery, diagnostics, and drug discovery. From robotic surgery to studying new approaches in biology, doctors and scientists need medical devices to evolve into continuous sensing systems to research and treat disease.

To realize the next generation of intelligent medical devices, a unique combination of AI, accelerated computing, and advanced visualization is needed. NVIDIA Clara Holoscan includes the Clara AGX Developer Kit and the Clara Holoscan SDK that combine to provide a powerful development environment for creating AI-enabled medical devices. To deploy these devices at the clinical edge, production hardware based on NVIDIA IGX Orin and a software platform designed for medical-grade certification are highly desirable.

NVIDIA Clara Holoscan accelerates deployment of production-quality medical applications by providing a set of OpenEmbedded build recipes and reference configurations that can be leveraged to customize and build Clara Holoscan-compatible Linux4Tegra (L4T) embedded board support packages (BSP). With the release of Clara Holoscan SDK v0.3, developers can deploy medical AI even faster using customized OpenEmbedded distributions.

Creating customized Linux distributions with OpenEmbedded

OpenEmbedded is a build framework that allows developers to create fully customized Linux distributions for embedded systems. Developers can fully customize distributions using just the software components and configuration specific to the application. In contrast, commercial Linux distributions provide full operating systems from predefined software collections that often include graphical user interfaces, package management software, GNU tools and libraries, and system configuration tools.

Customizability is particularly important for embedded deployments, because the memory, speed, safety, and security of the embedded device can be optimized while simultaneously simplifying the deployment process using a single preconfigured BSP. In the regulated medical device industry, this customizability is also important from a process overhead point of view, since it allows limiting analysis, testing, and documentation of Software of Unknown Provenance (SOUP) to only the minimal set of software components required for the essential performance of the medical device.

Comparison to HoloPack

HoloPack is the implementation of NVIDIA JetPack SDK specific to Clara Holoscan. It provides a full development environment for Clara Holoscan developer kits and includes Jetson Linux with bootloader, Linux Kernel, Ubuntu desktop environment, and a complete set of libraries for acceleration of GPU computing, multimedia, graphics, and computer vision. This is the Clara Holoscan development stack.

Using customized OpenEmbedded distributions allows you, as the developer, to include just the software components that are actually needed for your application’s deployment. The final runtime BSP can be easily optimized with respect to memory usage, speed, security, and power requirements. This is the Clara Holoscan deployment stack.

To illustrate this, the following tables compare various measurements of a HoloPack installation versus an OpenEmbedded-based Clara Holoscan build, both including the Clara Holoscan Embedded SDK available on GitHub.

Resource usage after initial boot (when idle):

Metric	Development Stack	Deployment Stack	Difference
Processes	408	198	210 (51.4% fewer processes)
Disk Used	22GB	7GB	15GB (68.1% less disk usage)
Memory Used	1,621MB	744MB	877MB (54.1% less memory usage)

RTX6000 measurements when running the tracking_replayer Clara Holoscan SDK application:

Metric	Development Stack	Deployment Stack	Difference
Power	71W	67W	4W (5.6% less power)
Temperature	50C	48C	2C (4% cooler)
GPU Usage	15%	11%	4% (26.7% less GPU usage)

Job runtime statistics (in milliseconds) as reported by the tracking_replayer Clara Holoscan SDK application:

Job	Development Stack	Deployment Stack	Difference
Visualizer	4.51	3.18	29.4%
Visualizer Format Converter	1.13	0.85	24.7%
Inference	10.69	5.73	46.3%
Inference Format Converter	1.00	0.93	7%
Replayer	31.11	30.09	3.2%
Total	48.44	40.78	15.8%
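The Difference column in the runtime table is the relative reduction from the development stack to the deployment stack. A short sketch of that arithmetic follows, using the values listed above; the dictionary and function names are illustrative.

```python
# (development stack ms, deployment stack ms) per job, from the table above.
runtimes_ms = {
    "Visualizer": (4.51, 3.18),
    "Visualizer Format Converter": (1.13, 0.85),
    "Inference": (10.69, 5.73),
    "Inference Format Converter": (1.00, 0.93),
    "Replayer": (31.11, 30.09),
}


def reduction_pct(dev, deploy):
    """Relative runtime reduction of the deployment stack, in percent."""
    return 100.0 * (dev - deploy) / dev


for job, (dev, deploy) in runtimes_ms.items():
    print(f"{job}: {reduction_pct(dev, deploy):.1f}%")

# The per-job runtimes add up to the Total row.
total_dev = sum(dev for dev, _ in runtimes_ms.values())
total_deploy = sum(deploy for _, deploy in runtimes_ms.values())
print(f"Total: {total_dev:.2f} -> {total_deploy:.2f} "
      f"({reduction_pct(total_dev, total_deploy):.1f}%)")
```

The computed percentages agree with the table to within about 0.1 percentage point. Inference accounts for most of the saving, with 4.96 ms of the 7.66 ms total reduction.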

The customized OpenEmbedded/Yocto distribution includes only the minimal set of packages that are actually needed for running the Clara Holoscan SDK application. It therefore saves disk space, memory, and CPU/GPU cycles, which results in higher overall performance when running Clara Holoscan sample applications.

Although the flexibility of having a desktop experience with HoloPack is desired during the early stages of development (easy installation of new apt packages, for example), this study shows some of the clear benefits of using the customized deployment stack using OpenEmbedded/Yocto for later stages of productization for medical devices.

Get started with NVIDIA Clara Holoscan

The Clara Holoscan OpenEmbedded/Yocto recipes are open source and kept up to date alongside the releases of the NVIDIA Clara Holoscan SDK.

The Clara Holoscan OpenEmbedded/Yocto recipes, and the BSP build in general, depend on other open-source OpenEmbedded components that include (but are not limited to):

OpenEmbedded Core
OpenEmbedded BitBake
Community-driven meta-tegra OpenEmbedded layer, responsible for most of the core Jetson/L4T BSP support leveraged by Clara Holoscan

If you are already familiar with OpenEmbedded or Yocto, check out the meta-tegra-clara-holoscan-mgx repo on GitHub. The README within that repo provides a guide and full list of requirements needed to build and flash a Clara Holoscan BSP.

NVIDIA also provides the Clara Holoscan OpenEmbedded Builder on the NVIDIA GPU Cloud (NGC) website to simplify the process of getting started with these recipes. It includes all the tools and dependencies that are needed either within the container or as part of a setup script that initializes a local build tree such that building and flashing a Clara Holoscan BSP can be done in just a few simple commands.

To build a Clara Holoscan BSP for IGX Orin Developer Kit using the default configuration, which includes the Clara Holoscan SDK and sample applications, first ensure that your Docker runtime is logged into NGC. Then run the following commands in a new directory:

$ export IMAGE=nvcr.io/nvidia/clara-holoscan/holoscan-mgx-oe-builder:v0.3.0
$ docker run --rm -v $(pwd):/workspace ${IMAGE} setup.sh ${IMAGE} $(id -u) $(id -g)
$ ./bitbake.sh core-image-x11

Note that this build will require at least 200 GB of free disk space, and a first full build will take three or more hours. Once the build is complete, the IGX Orin Developer Kit can be put into recovery mode and flashed with the following command:

$ ./flash.sh core-image-x11
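Because a first full build needs at least 200 GB of free disk space, it can be worth checking the build directory before starting. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical and are not part of the Clara Holoscan tooling, and treating 1 GB as 10^9 bytes is an assumption.

```python
import shutil

REQUIRED_GB = 200  # minimum free space quoted for a full build


def free_gb(path="."):
    """Free space on the filesystem containing `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 10**9


def has_room_for_build(path=".", required_gb=REQUIRED_GB):
    """True if `path` has at least `required_gb` GB free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    if has_room_for_build("."):
        print(f"OK: {free_gb('.'):.0f} GB free")
    else:
        print(f"Only {free_gb('.'):.0f} GB free; at least {REQUIRED_GB} GB is needed")
```

Run it from the directory that will hold the build tree, since that is the filesystem the build fills.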

One major feature of the Clara Holoscan Deployment stack is the support of both iGPU and dGPU for Developer Kits. When using the iGPU configuration, the majority of the runtime components come from the standard Tegra packages used by the meta-tegra layer, and developers can use the onboard HDMI or DisplayPort connection on the developer kit. You can check out more details by visiting meta-tegra-clara-holoscan-mgx on GitHub.

Develop custom medical AI utilizing ultra high speed frame rates

With customized OpenEmbedded distributions on Clara Holoscan SDK v0.3, it is easier than ever to deploy production-quality AI for unique medical applications at the clinical edge. The SDK provides a lightning-fast frame rate of 240 Hz for 4K video, enabling developers to combine data from more sensors for building accelerated AI pipelines.

To learn how to get started with NVIDIA Clara Holoscan, follow the instructions on the Clara Holoscan SDK page.

Misc

Scaling VASP with NVIDIA Magnum IO

Post date November 15, 2022

You could make an argument that the history of civilization and technological advancement is the history of the search and discovery of materials. Ages are named not for leaders or civilizations but for the materials that defined them: Stone Age, Bronze Age, and so on. The current digital or information age could be renamed the Silicon or Semiconductor Age and retain the same meaning.

Though silicon and other semiconductor materials may be the most significant materials driving change today, there are several other materials in research that could equally drive the next generation of changes, including any of the following:

High-temperature superconductors
Photovoltaics
Graphene batteries
Supercapacitors

Semiconductors are at the heart of building chips that enable the extensive and computationally complex search for such novel materials.

In 2011, the United States’ Materials Genome Initiative pushed for the identification of new materials using simulation. However, at that time and to an extent even today, calculating material properties from first principles can be painfully slow even on modern supercomputers.

The Vienna Ab initio Simulation Package (VASP) is one of the most popular software tools for such predictions, and it has been written to leverage acceleration technologies and to minimize the time to insight.

New material review: Hafnia

This post examines the computation of the properties of a material called hafnia or hafnium oxide (HfO₂).

On its own, hafnia is an electric insulator. It is heavily used in semiconductor manufacturing, as it can serve as a high-κ dielectric film when building dynamic random-access memory (DRAM) storage. It can also act as a gate insulator in metal–oxide–semiconductor field-effect transistors (MOSFETs). Hafnia is of high interest for nonvolatile resistive RAM, which could make booting computers a thing of the past.

While an ideal, pure HfO₂ crystal can be calculated affordably using only 12 atoms, it is nothing but a theoretical model. Such crystals have impurities in practice.

At times, a dopant must be added to yield the desired material properties beyond insulation. This doping can be done at the 1% level, which means that out of 100 eligible atoms, one atom is replaced by a different element. The minimal cell has 12 atoms, of which only four are Hf. If the dopant replaces Hf, for example, providing 100 eligible sites already takes 25 unit cells, or 300 atoms. It soon becomes apparent that such calculations easily call for hundreds of atoms.

This post demonstrates how such calculations can be parallelized efficiently over hundreds and even thousands of GPUs. Hafnia serves as an example, but the principles demonstrated here can of course be applied to similarly sized calculations just as well.

Term definitions

Speedup: A nondimensional measure of performance relative to a reference. For this post, the reference is single-node performance using 8x A100 80 GB SXM4 GPUs without NCCL enabled. Speedup is calculated by dividing the reference runtime by the elapsed runtime.
Linear scaling: The speedup curve for an application that is perfectly parallel. In Amdahl’s law terms, it is for an application that is 100% parallelized and for which the interconnect is infinitely fast. In such a situation, 2x the compute resources results in half the run time and 10x the compute resources results in one-tenth the run time. When plotting speedup against the number of compute resources, the performance curve is a line sloping up and to the right at 45 degrees. If a parallelized run outperforms this proportional relation, the slope is steeper than 45 degrees. This is called super-linear scaling.
Parallel efficiency: A nondimensional measure in percent of how close a particular application execution is to the ideal linear scaling. Parallel efficiency is calculated by dividing the achieved speedup by the linear scaling speedup for that number of compute resources. To avoid wasting compute time, most data centers have policies on minimum parallel efficiency targets (50-70%).
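
The two ratios defined above can be written down in a few lines. This is a minimal sketch; the function names and the sample runtimes are made up for illustration and are not measurements from this study.

```python
def speedup(reference_runtime: float, elapsed_runtime: float) -> float:
    """Speedup = reference runtime divided by the elapsed runtime."""
    return reference_runtime / elapsed_runtime


def parallel_efficiency(achieved_speedup: float, resource_ratio: float) -> float:
    """Efficiency in percent: achieved speedup over the linear-scaling speedup.

    resource_ratio is how many times more compute resources were used than
    in the reference run (for example, 8 for eight nodes versus one node).
    """
    return 100.0 * achieved_speedup / resource_ratio


# Hypothetical runtimes in minutes: one node versus eight nodes.
s = speedup(reference_runtime=40.0, elapsed_runtime=6.0)
print(f"speedup: {s:.2f}x, efficiency: {parallel_efficiency(s, 8):.1f}%")
```

A value above 100% from `parallel_efficiency` corresponds to the super-linear case.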

VASP use cases and differentiation

VASP is one of the most widely used applications for electronic-structure calculations and first-principles molecular dynamics. It offers state-of-the-art algorithms and methods to predict material properties like the ones discussed earlier.

The GPU acceleration is implemented using OpenACC. GPU communications can be carried out using the Magnum IO MPI libraries in NVIDIA HPC-X or NVIDIA Collective Communications Library (NCCL).

Use cases and differentiation of hybrid DFT

This section focuses on hybrid DFT. Density functional theory (DFT) is a quantum-chemical method, and hybrid DFT reaches higher-accuracy predictions by mixing exact-exchange calculations into the approximations made within DFT. This added accuracy helps to determine band gaps in closer accordance with experimental results.

Band gaps are the property that classifies materials as insulators, semiconductors, or conductors. For materials based on hafnia, this extra accuracy is crucial, but comes at an increased computational complexity.

Combining this with the need for using many atoms demonstrates the demand for scaling to many nodes on GPU-accelerated supercomputers. Fortunately, even higher accuracy methods are available in VASP. For more information about the additional features, see VASP6.

At a higher level, VASP is a quantum-chemistry application that is different from other, and possibly even more familiar, high-performance computing (HPC) computational-chemistry applications like NAMD, GROMACS, LAMMPS, and AMBER. These codes focus on molecular dynamics (MD) using simplifications to the interactions between atoms such as treating them as point charges. This makes simulations of the movement of those atoms, say because of temperature, computationally inexpensive.

VASP, on the other hand, treats the interaction between atoms on the quantum level, in that it calculates how the electrons interact with each other and can form chemical bonds. It can also derive forces and move atoms for a quantum or ab-initio-MD (AIMD) simulation. This can indeed be interesting to the scientific problem discussed in this post.

However, such a simulation would consist of repeating the hybrid-DFT calculation step many times. While subsequent steps might converge faster, the computational profile of each individual step would not change. This is why we only show a single, ionic step here.

Running single-node or multi-node

Many VASP calculations employ chemical systems that are small enough not to require execution on HPC facilities. Some users might be uncomfortable with scaling VASP across multiple nodes and instead suffer through long times to solution on a single node, in some cases long enough that a power outage or some other failure becomes probable. Others may limit their simulation sizes so that runtimes are not as onerous as they would be if better-suited system sizes were investigated.

There are multiple reasons that would drive you toward running simulations multi-node:

Simulations that would take an unacceptable amount of time to run on a single node, even though the latter might be more efficient.
Large calculations that require large amounts of memory and cannot fit on a single node require distributed parallelism. While some computational quantities must be replicated across the nodes, most of them can be decomposed. Therefore, the amount of memory required per node is reduced roughly by a factor equal to the number of nodes participating in the parallel task.
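
As a rough illustration of the memory argument, the sketch below splits a job's footprint into a replicated part and a decomposable part and searches for the smallest node count that fits. The split and all numbers are hypothetical assumptions chosen for the example; they are not VASP measurements.

```python
def memory_per_node_gb(replicated_gb: float, decomposable_gb: float, num_nodes: int) -> float:
    """Replicated data is stored on every node; the rest is divided among nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return replicated_gb + decomposable_gb / num_nodes


def min_nodes_to_fit(replicated_gb, decomposable_gb, node_capacity_gb, max_nodes=1024):
    """Smallest node count whose per-node footprint fits, or None if none does."""
    for n in range(1, max_nodes + 1):
        if memory_per_node_gb(replicated_gb, decomposable_gb, n) <= node_capacity_gb:
            return n
    return None


# Hypothetical job: 40 GB replicated, 1,800 GB decomposable,
# on nodes with 8 x 80 GB = 640 GB of GPU memory.
print(min_nodes_to_fit(40, 1800, 640))
```

If the replicated part alone exceeds the node capacity, adding nodes cannot help, which is why the function can return `None`.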

Diagram compares single node runs of VASP to multi-node, using the increasing memory required and the single-node runtime. Multi-node is required when the single node memory limit is reached. As runtimes grow beyond 1+ hours, multi-node is more desirable. — *Figure 1. When to choose single node or multi-node*

For more information about multi-node parallelism and compute efficiency, see the recent HPC for the Age of AI and Cloud Computing ebook.

NVIDIA published a study of multi-node parallelism using the dataset Si256_VJT_HSE06. In this study, NVIDIA asked the question, “For this dataset, and an HPC environment of V100 systems and InfiniBand networking, how far can we reasonably scale?”

Magnum IO communication tools for parallelism

VASP uses the NVIDIA Magnum IO libraries and technologies that optimize multi-GPU and multi-node programming to deliver scalable performance. These are part of the NVIDIA HPC SDK.

In this post, we look at two communication libraries:

Message Passing Interface (MPI): The standard for programming distributed-memory scalable systems.
NVIDIA Collective Communications Library (NCCL): Implements highly optimized multi-GPU and multi-node collective communication primitives using MPI-compatible all-gather, all-reduce, broadcast, reduce, reduce-scatter, and point-to-point routines to take advantage of all available GPUs within and across your HPC server nodes.

VASP users can choose at runtime which communication library should be used. As performance most often improves significantly when MPI is replaced with NCCL, NCCL is the default in VASP.

There are a couple of strong reasons for the observed differences when using NCCL over MPI.

With NCCL, communications are GPU-initiated and stream-aware. This eliminates the need for GPU-to-CPU synchronization, which is otherwise needed before each CPU-initiated MPI communication to ensure that all GPU operations have completed before the MPI library touches the buffers. NCCL communications can be enqueued on a CUDA stream just like a kernel and can facilitate asynchronous operation. The CPU can enqueue further operations to keep the GPU busy.

In the MPI case, the GPU is idle at least for the time that it takes the CPU to enqueue and launch the next GPU operation after the MPI communication is done. Minimizing GPU idle times contributes to higher parallel efficiencies.

With two separate CUDA streams, you can easily use one stream to do the GPU computations and the other one to communicate. Given that these streams are independent, the communication can take place in the background and potentially be hidden entirely behind the computation. Achieving the latter is a big step forward to high parallel efficiencies. This technique can be used in any program that enables a double-buffering approach.
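
To make the benefit of overlapping concrete, here is a simplified timing model. It is only a sketch: it is not VASP code and does not call NCCL or CUDA. It assumes the work is split into chunks, that each chunk's communication can start once that chunk's computation has finished, and that double buffering lets the compute stream proceed without waiting for communication.

```python
def serialized_time(compute, comm):
    """Each communication blocks the next computation (no overlap)."""
    return sum(compute) + sum(comm)


def overlapped_time(compute, comm):
    """Two independent streams: chunk i is communicated while later chunks compute."""
    compute_done = 0.0  # time at which the compute stream finishes the current chunk
    comm_done = 0.0     # time at which the communication stream becomes free
    for c, m in zip(compute, comm):
        compute_done += c
        # Communication starts when the data is ready and the stream is free.
        comm_done = max(comm_done, compute_done) + m
    return max(compute_done, comm_done)


# Hypothetical chunk times in arbitrary units.
compute = [4.0, 4.0, 4.0, 4.0]
comm = [1.0, 1.0, 1.0, 1.0]
print(serialized_time(compute, comm), overlapped_time(compute, comm))
```

In this model only the last chunk's communication remains exposed when computation dominates. When communication dominates, the total is bounded by the communication stream instead.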

Nonblocking MPI communications can expose similar benefits. However, you still must handle the synchronizations between the GPU and CPU manually with the described performance downsides.

There is another layer of complexity added as the nonblocking MPI communications must be synchronized on the CPU side as well. This requires much more elaborate code from the outset, compared to using NCCL. In addition, with MPI communications being CPU-initiated, there often is no hardware resource that automatically makes the communications truly asynchronous.

You can spawn CPU threads to ensure communications progress if your application has CPU cores to spend, but that again increases code complexity. Otherwise, the communication might only take place when the process enters MPI_Wait, which offers no advantage over using blocking calls.

Another difference to be aware of is that for MPI reductions, the data is summed up on the CPU. In the case where your single-threaded CPU memory bandwidth is lower than the network bandwidth, this can be an unexpected bottleneck as well.

NCCL, on the other hand, uses the GPU for summations and is aware of the topology. Intranode, it can use available NVLink connections and optimizes internode communication using Mellanox Ethernet, InfiniBand, or similar fabrics.

Computational modeling test case with HfO₂

A hafnia crystal is built from two elements: hafnium (Hf) and oxygen (O). In an ideal system free from dopants or vacancies, for each Hf atom, there will be two O atoms. The minimum number of atoms required to describe the structure of the infinitely extended crystal is four Hf (yellowish) and eight O (red) atoms. Figure 2 shows the structure.

A 3D diagram of a small portion of a hafnia (HfO2) crystal showing four hafnium atoms and eight oxygen atoms connected in a lattice. — *Figure 2. Visualization of the unit cell for a hafnia (HfO₂) crystal*

The box wireframe designates the so-called unit cell. It is repeated in all three dimensions of space to yield the infinitely extended crystal. The picture alludes to that by duplicating the atoms O5, O6, O7, and O8 outside of the unit cell to show their respective bonds to the Hf atoms. This cell has a dimension of 0.514 by 0.519 by 0.532 nm. This is not a perfect cuboid because one of its angles is 99.7° instead of 90°.

The minimal model only treats the 12 atoms enclosed in the box in Figure 2 explicitly. However, you can also extend the box in one or more directions of space by an integer multiple of the corresponding edge and copy the structure of atoms into the newly created space. The result is called a supercell and can help to treat effects that are inaccessible within the minimal model, like a 1% vacancy of oxygen.

Of course, treating a larger cell with more atoms is computationally more demanding. When you add one more cell, so that there are two total cells, in the direction of a while leaving b and c as is, this is called a 2x1x1 supercell with 24 atoms.

For the purposes of this study, we only considered supercells that are costly enough to justify the usage of at least a handful of supercomputer nodes:

2x2x2: 96 atoms, 512 orbitals
3x3x2: 216 atoms, 1,280 orbitals
3x3x3: 324 atoms, 1,792 orbitals
4x4x3: 576 atoms, 3,072 orbitals
4x4x4: 768 atoms, 3,840 orbitals

Keep in mind that computational effort is not directly proportional to the number of atoms or the volume of the unit cell. A rough estimate used in this case study is that it scales cubically with either.
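
The atom counts listed above follow directly from the 12-atom unit cell, and the cubic rule of thumb gives a quick feel for relative cost. The sketch below is illustrative only; the function names are made up, and the cubic exponent is the rough estimate stated in the text rather than a measured scaling law.

```python
ATOMS_PER_UNIT_CELL = 12  # four Hf and eight O atoms


def supercell_atoms(a: int, b: int, c: int) -> int:
    """Atoms in an a x b x c supercell of the 12-atom hafnia unit cell."""
    return ATOMS_PER_UNIT_CELL * a * b * c


def relative_cost(atoms: int, reference_atoms: int = 96) -> float:
    """Rough cost relative to a reference cell, assuming cubic scaling."""
    return (atoms / reference_atoms) ** 3


for dims in [(2, 2, 2), (3, 3, 2), (3, 3, 3), (4, 4, 3), (4, 4, 4)]:
    n = supercell_atoms(*dims)
    print(f"{'x'.join(map(str, dims))}: {n} atoms, ~{relative_cost(n):.0f}x the 96-atom cost")
```

Under this estimate, the 216-atom cell costs about 11 times as much as the 96-atom cell, which is roughly in line with the single-node runtimes of 40 minutes and 7.5 hours reported later in this post.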

A set of five 3D diagrams of the crystal lattice for hafnia (HfO2) crystals for 96, 216, 324, 576, 768 total atom counts representing the simulations being studied here. — *Figure 3. Visualizations of hafnium oxide supercells for atom counts: 96, 216, 324, 576, 768*

The hafnia system used here is only one example, of course. The lessons are transferable to other systems that employ similar-sized cells and hybrid DFT as well because the underlying algorithms and communication patterns do not change.

If you want to do some testing yourself with HfO₂, you can download the input files used for this study. For copyright reasons, we may not redistribute the POTCAR file. This file is the same across all supercells. As a VASP licensee, you can easily create it yourself from the supplied files by the following Linux command:

# cat PAW_PBE_54/Hf_sv/POTCAR PAW_PBE_54/O/POTCAR > POTCAR

For these scaling experiments, we enforced a constant number of employed crystal orbitals, or bands. This slightly increases the workload beyond the minimum required but has no effect on computational accuracy.

If this wasn’t done, VASP would automatically select a number that is integer-divisible by the number of GPUs and this might increase the workload for certain node counts. We chose the number of orbitals that is integer-divisible by all GPU counts employed. Also, for better computational comparability, the number of k-points is kept fixed at 8, even though larger supercells might not require this in practice.

Supercell modeling test method with VASP

All benchmarks presented in the following use VASP release 6.3.2, which was compiled using the NVIDIA HPC SDK 22.5 and CUDA 11.7.

For full reference, makefile.include is available for download. The benchmarks were run on the NVIDIA Selene supercomputer, which consists of 560 DGX A100 nodes, each of which provides eight NVIDIA A100-SXM4-80GB GPUs, eight NVIDIA ConnectX-6 HDR InfiniBand network interface cards (NICs), and two AMD EPYC 7742 CPUs.

To ensure the best performance, the processes and threads were pinned to the NUMA nodes on the CPU that offer ideal connectivity to the respective GPUs and NICs that they will use. The reverse NUMA node numbering on AMD EPYC yields the following process binding for the best hardware locality.

Node local rank	CPU NUMA node	GPU ID	NIC ID
0	3	0	mlx5_0
1	2	1	mlx5_1
2	1	2	mlx5_2
3	0	3	mlx5_3
4	7	4	mlx5_6
5	6	5	mlx5_7
6	5	6	mlx5_8
7	4	7	mlx5_9

Table 1. Compute node GPU and NIC ID mapping
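
One way to express the mapping in Table 1 in code is shown below. This is a hypothetical sketch and not the contents of the wrapper script used for the study. It assumes that `numactl` is used for CPU and memory binding, that the GPU is selected through `CUDA_VISIBLE_DEVICES`, that the NIC is selected through the UCX variable `UCX_NET_DEVICES` on port 1, and that Slurm provides the node-local rank in `SLURM_LOCALID`.

```python
import os

# Node-local rank -> (CPU NUMA node, GPU ID, NIC ID), copied from Table 1.
BINDINGS = {
    0: (3, 0, "mlx5_0"),
    1: (2, 1, "mlx5_1"),
    2: (1, 2, "mlx5_2"),
    3: (0, 3, "mlx5_3"),
    4: (7, 4, "mlx5_6"),
    5: (6, 5, "mlx5_7"),
    6: (5, 6, "mlx5_8"),
    7: (4, 7, "mlx5_9"),
}


def launch_command(local_rank: int, exe: str):
    """Return (extra environment, argv) for one rank. Illustrative only."""
    numa, gpu, nic = BINDINGS[local_rank]
    env = {
        "CUDA_VISIBLE_DEVICES": str(gpu),
        "UCX_NET_DEVICES": f"{nic}:1",  # port number 1 is an assumption
    }
    argv = ["numactl", f"--cpunodebind={numa}", f"--membind={numa}", exe]
    return env, argv


if __name__ == "__main__":
    rank = int(os.environ.get("SLURM_LOCALID", "0"))
    env, argv = launch_command(rank, os.environ.get("EXE", "/your/path/to/vasp_std"))
    print(env, " ".join(argv))
```

On a system with a different number of GPUs or NICs per node, the `BINDINGS` table would have to be rebuilt from that system's topology.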

Included in the set of downloadable files is a script called selenerun-ucx.sh. This script wraps the call to VASP. Invoke it as follows in the workload manager (for example, Slurm) job script:

# export EXE=/your/path/to/vasp_std
# srun ./selenerun-ucx.sh

The selenerun-ucx.sh file must be customized to match your environment, depending on the resource configuration available. For example, the number of GPUs or number of NICs per node may be different from Selene and the script must reflect those differences.

To keep computation time for benchmarking as low as possible, we have restricted all calculations to only one electronic step by setting NELM=1 in the INCAR files. We can do this because we are not concerned with the scientific results like the total energy, and running one electronic step suffices to project the performance of a full run. A full run took 19 iterations to converge with the 3x3x2 supercell.

Of course, each different cell setup could require a different number of iterations until convergence. To benchmark scaling behavior, you want to compare fixed numbers of iterations anyway to keep the workload comparable.

However, evaluating the performance of runs with only one electronic iteration would mislead you because the profile is lopsided. Initialization time would take a much larger share relative to the net iterations and so would the post-convergence parts like the force calculation.

Luckily, the electronic iterations all require the same effort and time. You can project the total runtime of a representative run using the following equation:

$t_{total} = t_{init} + 19 \cdot t_{iter} + t_{post}$

You can extract the time for one iteration $t_{iter}$ from the VASP internal LOOP timer, while the time spent in post-iteration steps $t_{post}$ is given by the difference between the LOOP+ and LOOP timers.

The initialization time $t_{init}$, on the other hand, is the difference between the total time reported in VASP as Elapsed time and LOOP+. There is a slight error in such a projection as the first iterations take a little longer due to instances such as one-time allocations. However, the error was checked to be less than 2%.
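
The projection can be sketched as a small helper that takes the three timer values from a one-iteration run. The function name and the sample timer values are hypothetical; only the formula and the timer relationships come from the text above.

```python
def project_total_runtime(elapsed: float, loop_plus: float, loop: float,
                          iterations: int = 19) -> float:
    """Project the runtime of a full run from a single-iteration benchmark.

    elapsed   -- total time reported by VASP as Elapsed time
    loop_plus -- value of the LOOP+ timer
    loop      -- value of the LOOP timer (one electronic iteration)
    """
    t_init = elapsed - loop_plus  # initialization
    t_iter = loop                 # one electronic iteration
    t_post = loop_plus - loop     # post-iteration steps such as forces
    return t_init + iterations * t_iter + t_post


# Hypothetical timer values in seconds from a NELM=1 run.
print(project_total_runtime(elapsed=300.0, loop_plus=220.0, loop=120.0))
```

As a sanity check, calling the helper with `iterations=1` returns the measured elapsed time of the benchmark run itself.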

Parallel efficiency results for a hybrid DFT iteration in VASP

We first reviewed the smallest dataset with 96 atoms: the 2x2x2 supercell. This dataset hardly requires a supercomputer these days. Its full run, with 19 iterations, finishes in around 40 mins on one DGX A100.

Still, with MPI, it can scale to two nodes with 93% parallel efficiency before dropping to 83% on four and even 63% on eight nodes.

On the other hand, NCCL enables nearly ideal scaling of 97% on two nodes and 90% on four nodes, and even on eight nodes it still reaches 71%. However, the biggest advantage of NCCL is clearly demonstrated at 16 nodes. You can still see a >10x relative speedup compared to 6x with MPI only.

The negative scaling beyond 64 nodes needs explanation. To run 128 nodes with 1024 GPUs, you must use 1024 orbitals as well. The other calculations used only 512, so here the workload increases. We didn’t want to include such an excessive orbital count for the lower node runs, though.

Line chart compares relative speedup to the number of compute nodes showing scalability curves for the 96 atom case. Curve #1 is with NCCL OFF with a maximum speedup of 10x at 64 nodes relative to the one node runtime. Curve #2 is with NCCL ON with a maximum speedup of 16x at 64 nodes. — *Figure 4. Scaling and performance for 96-atom case. NCCL-enabled results have been scaled relative to the single node performance with NCCL disabled*.

The next example is already a computationally challenging problem. The full calculation of the 3x3x2 supercell featuring 216 atoms takes more than 7.5 hours to complete on 8xA100 on a single node.

With more demand on computation, there is more time to conclude the communications asynchronously in the background using NCCL. VASP remains above 91% parallel efficiency up to 16 nodes and falls only slightly short of 50% on 128 nodes.

With MPI, VASP does not hide the communications effectively and does not reach 90% even on eight nodes and drops to 41% on 64 nodes already.

Figure 5 shows that the trends regarding the scaling behavior remain the same for the next bigger 3x3x3 supercell with 324 atoms, which would take a full day to solve on a single node. However, the spread between NCCL and MPI increases significantly. On 128 nodes with NCCL, you gain a 2x better relative speedup.

Line chart of relative speedup vs the number of compute nodes showing two scalability curves for the 216 and 324 atom cases. Curve #1 is for 216 atoms NCCL OFF with a maximum speedup of 30x at 128 nodes relative to the 1 node runtime. Curve #2 is with NCCL ON with a maximum speedup of 30x at 128 nodes. Curve #3 is for 324 atoms NCCL OFF with a maximum speedup of 42x at 128 nodes relative to the single-node runtime. Curve #4 is with NCCL ON with a maximum speedup of 84x at 128 nodes. — *Figure 5. Scaling and performance for the 216-atom and 324-atom cases. NCCL-enabled results have been scaled relative to the single-node performance with NCCL disabled*.

Going to an even larger, 4x4x3 supercell containing 576 atoms, you would have to wait more than 5 days for the full calculation using one DGX A100.

However, with such a demanding dataset, a new effect must be discussed: Memory capacity and parallelization options. VASP offers to distribute the workload over k-points while replicating the memory in such setups. While this is much more effective for standard-DFT runs, it also helps with performance on hybrid-DFT calculations and there is no need to leave available memory unused.

For the smaller datasets, even parallelizing over all k-points fits easily into 8xA100 GPUs with 80 GB of memory each. With the 576-atom dataset, on a single node, this is no longer the case though and we must reduce the k-point parallelism. From two nodes onwards, we could fully employ it again.

While it is indistinguishable in Figure 6, there is minor super-linear scaling in the MPI case (102% parallel efficiency) on two nodes. This is because of the necessarily reduced parallelism on one node that is lifted on two or more nodes. However, that is what you would do in practice as well.

We face a similar situation for the 4x4x4 supercell with 768 atoms on one and two nodes, but the super-linear scaling effect is even less pronounced there.

We scaled the 4x4x3 and 4x4x4 supercells to 256 nodes. This equates to 2,048 A100 GPUs. With NCCL, they achieved 67% and 75% parallel efficiency, respectively. This lets you obtain results in less than 1.5 hours for a calculation that would previously have taken almost 12 days on one node! The usage of NCCL enables an almost 3x higher relative speedup for such large calculations over MPI.

Line chart compares relative speedup to the number of compute nodes showing scalability curves for the 576 and 768 atom cases. Curve #1 is for 576 atoms NCCL OFF with a maximum speedup of 64x at 256 nodes relative to the single-node runtime. Curve #2 is with NCCL ON with a maximum speedup of 175x at 256 nodes. Curve #3 is for 768 atoms NCCL OFF with a maximum speedup of 78x at 256 nodes relative to the one-node runtime. Curve #4 is with NCCL ON with a maximum speedup of 198x at 256 nodes. — *Figure 6. Scaling and performance for the 576- and 768-atom cases. NCCL-enabled results have been scaled relative to the single node performance with NCCL disabled.*

Recommendations for using NCCL for a VASP simulation

VASP 6.3.2 calculating HfO₂ supercells ranging from 96 to 768 atoms achieves significant performance gains by using NVIDIA NCCL across many nodes when an NVIDIA GPU-accelerated HPC environment enhanced by NVIDIA InfiniBand networking is available.

A 2D diagram of # atoms on the vertical axis and number of nodes on the horizontal axis showing where NCCL makes a positive difference in scalability and where it does not. NCCL starts to make a difference for all cases larger than 4 nodes for 96 atoms to 16 nodes and 768 atoms. — *Figure 7. A general guideline for when NCCL is beneficial for a VASP simulation similar to HfO₂ running on A100 GPUs with multiple HDR InfiniBand interconnectivity*

Based on this testing, we recommend that users with access to capable HPC environments consider the following:

Run all but the smallest calculations using GPU acceleration.
Consider running larger systems of atoms using both GPUs and multiple nodes to minimize time to insight.
Launch all multi-node calculations using NCCL, as it can only improve efficiency when running large models.

The slight added overhead to initialize NCCL will be worth the tradeoff.

Summary

In conclusion, you’ve seen that scalability for hybrid DFT in VASP depends on the size of the dataset. This is somewhat expected given that the smaller the dataset is, the earlier each individual GPU will run out of computational load.

NCCL also helps to hide the required communications. Figure 8 shows the levels of parallel efficiency that you can expect for certain dataset sizes with varying node counts. For most computationally intensive datasets, VASP reaches >80% parallel efficiency on 32 nodes. For the most demanding datasets, such as those requested by some of our customers, scale-out runs at 256 nodes are possible with good efficiency.

Line chart compares parallel efficiency to the number of compute nodes showing curves for all the executed cases NCCL ON and NCCL OFF. For small atom counts like 96, efficiency drops quickly to less than 10% at 128 nodes. For large atom counts like 576 and 768 with NCCL enabled, efficiency stays well above 60% out to 256 nodes. — *Figure 8. Parallel efficiency as a function of node count (log scale)*

VASP user experience

From our experience with VASP users, running VASP on GPU-accelerated infrastructure is a positive and productive experience that enables you to consider larger and more sophisticated models for your research.

In unaccelerated scenarios, you may be running models smaller than you’d like because you expect runtimes to grow to intolerable levels. Using high-performance, low-latency I/O infrastructure with GPUs, and InfiniBand with Magnum IO acceleration technologies like NCCL, makes efficient, multi-node parallel computing practical and puts larger models within reach for investigators.

HPC system administrator benefits

HPC centers, especially commercial ones, often have policies that prohibit users from running jobs at low parallel efficiency. This prevents users who are on short deadlines or who need high turnover rates from using more computational resources at the expense of other users’ job wait times. More often than not, a simple rule of thumb applies: 50% parallel efficiency dictates the maximum number of nodes that a user might request, which in turn increases the time to solution.

We have shown here that, by using NCCL as part of NVIDIA Magnum IO, users of an accelerated HPC system can stay well within efficiency limits and scale their jobs significantly farther than possible when using MPI alone. This means that while keeping overall throughput at its highest across the HPC system, you can minimize runtime and maximize the number of simulations to get new and exciting science done.
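
The effect of such an efficiency policy can be illustrated with the parallel efficiencies quoted earlier for the 96-atom case. The helper below is a hypothetical sketch, not a tool used by any HPC center, and the 70% threshold is just one value from the 50-70% range mentioned in the term definitions.

```python
def max_nodes_within_policy(efficiency_by_nodes: dict, min_efficiency: float) -> int:
    """Largest node count that still meets the policy, scanning upward from one node."""
    allowed = 1  # a single node is the reference and always permitted
    for nodes in sorted(efficiency_by_nodes):
        if efficiency_by_nodes[nodes] < min_efficiency:
            break
        allowed = nodes
    return allowed


# Parallel efficiencies in percent for the 96-atom case, as quoted in this post.
MPI_96_ATOMS = {2: 93, 4: 83, 8: 63}
NCCL_96_ATOMS = {2: 97, 4: 90, 8: 71}

print(max_nodes_within_policy(MPI_96_ATOMS, 70))   # node limit with MPI
print(max_nodes_within_policy(NCCL_96_ATOMS, 70))  # node limit with NCCL
```

With a 70% minimum, the MPI run would be capped at four nodes while the NCCL run could use eight.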

HPC application developer advantages

As an application developer, you can benefit from the advantages observed here with VASP just as well. To get started:

Download NCCL.
Read the NCCL User Guide.
Review the popular Fast Inter-GPU Communication with NCCL for Deep Learning Training, and More (a Magnum IO session) GTC session.
Read these informative posts:
- Massively Scale Your Deep Learning Training with NCCL 2.4
- Doubling all2all Performance with NVIDIA Collective Communication Library 2.12

Misc

Building the Future of Real-Time Graphics with NVIDIA and Unreal Engine 5.1

Post date November 15, 2022

The Unreal Engine 5.1 release includes cutting-edge advancements that make it easier to incorporate realistic lighting and accelerate graphics workflows. Using the NVIDIA RTX branch of Unreal Engine (NvRTX), you can speed up hardware ray-traced and path-traced operations by up to 40%.

Unreal Engine 5.1 features Lumen, a real-time global illumination solution, which enables developers to create more dynamic scenes where indirect lighting changes on the fly. Realistic lighting is an essential component when creating scenes in games, and Lumen can provide high-quality, scalable global illumination and hardware ray-traced reflections.

Nanite, the Unreal Engine (UE) virtualized geometry system, enables film-quality art consisting of billions of polygons to be directly imported into UE, all while maintaining the highest image quality in real time.

In addition to Lumen and Nanite, Unreal Engine 5.1 advances important features that speed up development cycles like Virtual Shadow Maps, Programmable Rasterizer, Virtual Assets, and automated pipeline state object caching for DX12.

NVIDIA is accelerating this new feature set through a combination of NVIDIA RTX 4090, Shader Execution Reordering, and hardware-accelerated ray tracing cores. Thousands of developers have already experienced the benefits of Unreal Engine with NVIDIA technologies. Over the past few years, NVIDIA has delivered GPUs, libraries, and APIs to support the latest features of Unreal Engine.

Next-generation RTX lighting

Achieving the most accurate lighting in computer graphics requires replicating how light behaves in the real world. Path-traced lighting has been used in offline rendering in films to achieve physically accurate results. However, that is an expensive and time-consuming process.

Continued advancements in hardware ray-traced shadows in Unreal Engine 5.1 improve shadow quality using an algorithm that more closely matches offline path tracing. This allows you to create more realistic scenes in real time.

RTX Direct Illumination (RTXDI), available through NvRTX, allows you to take dynamic light counts from single digits into the hundreds. RTXDI uses the same algorithm for direct lighting as the offline path tracer, taking a step closer to unlimited lighting and photorealism.

The next evolution of this technology is in gaming and real-time rendering, which considerably accelerates the time in which frames are processed and rendered.

A wood and bamboo entrance lit in real time compared to the same scene created using the offline path tracer in Unreal Engine 5.1. — *Figure 1. A scene lit in real time (left) compared to the same scene created using the offline path tracer in Unreal Engine 5.1 (right)*

Shader Execution Reordering

A new technology called Shader Execution Reordering (SER) can help solve the challenge of accurately simulating light. SER provides performance gains in ray tracing operations and optimization for specific use cases. NVIDIA is accelerating real-time ray tracing and offline path tracing by leveraging SER through NvRTX.

NvRTX features SER integration to support optimization of many of its ray tracing paths. Developers will see additional frame rate optimization on 40 series cards with up to 40% speed increases in ray tracing operations, and zero impact on quality and content authoring. This improves the efficiency of complex ray tracing calculations, and provides greater gains in scenes that take advantage of ray tracing benefits.

Offline path tracing, which is arguably the most complex tracing operation, will see the largest benefit from SER in Unreal Engine 5.1, with speed improvements of 40% or more. Hardware ray-traced reflections and translucency, which have complex interactions with materials and lighting, will also see benefits.

For more information about SER in Unreal Engine 5.1, see the Shader Execution Reordering Whitepaper and Improve Shader Performance and In-Game Frame Rates with Shader Execution Reordering.

Summary

Epic Games and NVIDIA are leading the way into the next generation of rendering, moving the industry toward the future of graphics. With each Unreal Engine release bringing significant improvements, developers can expect even more groundbreaking advancements in this space.

Learn more about NVIDIA technologies and Unreal Engine.

Misc

Attention, Sports Fans! WSC Sports’ Amos Berkovich on How AI Keeps the Highlights Coming

Post date November 15, 2022

It doesn’t matter if you love hockey, basketball or soccer. Thanks to the internet, there’s never been a better time to be a sports fan. But editing together so many social media clips, long-form YouTube highlights and other videos from global sporting events is no easy feat. So how are all of these craveable video …

The post Attention, Sports Fans! WSC Sports’ Amos Berkovich on How AI Keeps the Highlights Coming appeared first on NVIDIA Blog.
