Ushering In a New Era of HPC and Supercomputing Performance with DPUs

Supercomputers are used to model and simulate the most complex processes in scientific computing, often for insight into new discoveries that otherwise would be impractical or impossible to demonstrate physically.

The NVIDIA BlueField data processing unit (DPU) is transforming high-performance computing (HPC) resources into more efficient systems, while accelerating problem solving across a breadth of scientific research, from mathematical modeling and molecular dynamics to weather forecasting, climate research, and even renewable energy.

Diagram showing key areas where BlueField is making a positive impact: in-network computing, zero-trust security, cloud-native supercomputing, and computational storage.
Figure 1. Areas of innovation for NVIDIA BlueField DPU

BlueField has already made a marked impact in the areas of cloud networking, security, telecommunications, and edge computing. In addition, there are several areas across high-performance computing where it is sparking innovations for application performance and system efficiency.

NVIDIA BlueField-3 provides powerful computing based on multiple Arm AArch64 cores, a multithreaded datapath accelerator, integrated NVIDIA ConnectX-7 400Gb/s networking, and a broad range of programmable acceleration engines in the I/O path. It’s equipped with dual DDR 6500MT/s DRAM controllers and comes standard with 64 GB onboard memory. BlueField-3 is the third-generation data center infrastructure-on-a-chip that enables incredibly efficient and powerful software-defined, hardware-accelerated infrastructures from cloud to core data center to edge.

So, what does all this mean for high-performance computing?

Boosting HPC application performance and scalability

HPC is all about increasing performance and scalability. For nearly two decades, InfiniBand networking has been the proven leader in terms of performance and application scalability for several reasons.

From a high-level view, InfiniBand is simply the most efficient way to move data: it uses direct data placement. There is no need for the CPU or operating system to be involved, and no need to make multiple copies of the data as it travels from the network interface to the application that needs it.
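To make this concrete, here is a minimal sketch of how direct data placement is exposed to applications through standard MPI one-sided communication, which maps to RDMA writes on InfiniBand. The code is plain MPI-3 and is intended to be run with two ranks (for example, mpirun -np 2).

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes a window of memory that remote ranks can
     * write into directly. On InfiniBand, this maps to an RDMA
     * write: no receive-side CPU involvement, no extra copies. */
    double buf = 0.0;
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(buf), sizeof(buf), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double value = 42.0;
        /* Place data directly into rank 1's window. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received %.1f via direct data placement\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```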

If InfiniBand is already so efficient, what benefit would BlueField provide?

One of the key challenges that InfiniBand has been addressing for years is moving network communication overhead away from the CPU, enabling it to spend its time focusing on what it does best: application computation and branching code.

The CPU in today’s mainstream servers is overly general-purpose, sharing its compute cycles, time, and resources across hundreds or thousands of processes that have little to nothing to do with actual computing.

BlueField is bringing unprecedented innovation and efficiency to supercomputing by offloading, accelerating, and isolating a broad range of advanced networking, storage, and security services.

Why the era of AI ushered in the need for the BlueField DPU

The field of artificial intelligence research was founded as an academic discipline in 1956. Even a decade before that, scientists began to discuss the possibility of creating an artificial brain. It was much later that the concepts became reality, with more modern computer hardware and software.

In 2006, NVIDIA introduced CUDA, the industry’s first C-compiler development environment for the GPU, solving complex computing problems up to 100x faster than traditional approaches. Today, artificial intelligence is prolific, driving nearly every area of scientific research, changing our lives, and shaping the industrial landscape.

Similarly, the first proposals for nonblocking collective operations appeared in mid-2006. The proposed nonblocking interfaces for the collective group communication functions of the Message Passing Interface (MPI) were certainly promising in theory. However, they were not implemented across many applications, perhaps because, until the introduction of the DPU, their full benefits could not be realized.

Today, with BlueField-3, the technology has arrived, providing the fundamental elements needed for innovation, performance, and efficiency. There is renewed interest in nonblocking collective operations to increase application performance and scalability and to counter the effects of operating system jitter.

There are also several areas across scientific computing where early examples demonstrate how BlueField can be used to transform HPC into highly efficient and sustainable computing.

Saving CPU cycles with in-network computing

NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology improves the performance of MPI operations by offloading many blocking collective operations from the CPU to the switch network, eliminating the need to send data multiple times between endpoints. This innovative approach decreases the amount of data traversing the network as aggregation nodes are reached and dramatically reduces the time spent in MPI operations.
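Because SHARP is applied by the MPI library itself, application code does not change. The sketch below shows a completely standard MPI_Allreduce; assuming an InfiniBand fabric with SHARP enabled in the underlying MPI stack (for example, through NVIDIA HPC-X), the summation is aggregated inside the switch network rather than on the host CPUs.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* A standard blocking collective: every rank contributes one
     * value and receives the global sum. With SHARP enabled in the
     * MPI stack, the reduction happens in the switch network. */
    double local = (double)rank;
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d ranks = %.1f\n", size, global);

    MPI_Finalize();
    return 0;
}
```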

BlueField extends in-network computing further by using its Arm cores to implement nonblocking collective operations. This enables the host CPU to perform computation with peak overlap.

Figure 2 shows an example of this using the MVAPICH2-DPU library, which is being optimized to take full advantage of BlueField. It demonstrates the ability to achieve peak overlap between computation on the host and MPI_Ialltoall communication.

Diagram showing 100% overlap of communication and computation across message sizes ranging from 1K to 512K bytes when using MVAPICH2-DPU library.
Figure 2. Overlap of communication and computation using NVIDIA BlueField technology with nonblocking alltoall
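The pattern that Figure 2 measures looks roughly like the following sketch: post the nonblocking alltoall, perform independent computation on the host while a DPU-aware MPI library (such as MVAPICH2-DPU) progresses the exchange on the BlueField Arm cores, and then wait for completion. The compute_independent_work function here is a hypothetical placeholder for application work.

```c
#include <mpi.h>
#include <stdlib.h>

/* Hypothetical placeholder for application work that does not
 * depend on the exchanged data and can proceed while the
 * communication is in flight. */
static void compute_independent_work(void) { /* ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1024;               /* elements per peer */
    double *sendbuf = malloc((size_t)size * count * sizeof(double));
    double *recvbuf = malloc((size_t)size * count * sizeof(double));
    for (int i = 0; i < size * count; i++)
        sendbuf[i] = (double)i;

    /* Post the nonblocking alltoall and return immediately. With a
     * DPU-aware MPI such as MVAPICH2-DPU, progression of the
     * exchange is offloaded to the BlueField Arm cores. */
    MPI_Request req;
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, MPI_COMM_WORLD, &req);

    /* The host CPU computes while the DPU progresses communication. */
    compute_independent_work();

    /* Complete the collective before touching recvbuf. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```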

Computational storage for HPC workloads

Computational storage, or in-storage computing, brings HPC capabilities to traditional storage devices. In-storage computing enables you to perform selected computing tasks within or next to a storage device, offloading host processing and reducing data movement. BlueField provides the ability to combine in-storage and in-networking computing on a single card.

BlueField enables storage software stacks to be offloaded from compute nodes while also acting as a fabric-attached NVMe controller capable of accelerating critical storage functions, such as compression, checksum calculation, and parity generation, all of which are staples of parallel file systems.

The entire storage system stack is transparently offloaded within the Linux kernel while enabling simple NVIDIA DOCA implementations of standard storage functions on the NVMe target side.

The next-generation open storage architecture offers a new paradigm for accelerating, isolating, and securing high-performance storage systems. The system employs hardware and software co-design, making the DPU incredibly efficient and transparent to the user.

Acceleration of the file system means increasing the performance of critical functions within the storage system, with storage system performance being a key enabler of deep-learning-based scientific inquiry.

The ability to fully offload both the storage client and server onto DPUs leads to previously unrealizable levels of security and performance isolation. Critical data plane and control plane functions are moved to a separate domain on the DPU. This relieves the server CPU from the work and protects the functions in case the CPU or its software are compromised.

NVIDIA DOCA software framework

The NVIDIA DOCA SDK is the key to unlocking the potential of BlueField. Together, NVIDIA DOCA and BlueField enable the development of applications that deliver breakthrough networking, security, storage, and application performance with a comprehensive, open development platform.

NVIDIA DOCA supports a range of operating systems and distributions and includes drivers, libraries, tools, documentation, and example applications. The upcoming NVIDIA DOCA 1.5 and 2.0 releases introduce a broad range of networking, storage, and security capabilities and enhancements that deliver breakthrough performance and advanced programmability for HPC developers:

  • A new communication channel library
  • Fast access to host memory for UCX accelerations
  • Storage emulation (SNAP) including storage encryption
  • New NVIDIA DOCA services including UCC offload service and telemetry service
  • NVIDIA DOCA security SDK

Transforming HPC today and tomorrow

There are many areas of innovation already on the horizon where BlueField, NVIDIA DOCA, and the community will continue to transform HPC.

Some ideas are already past the whiteboard, such as enhanced performance isolation at a data center scale or enhancing job schedulers for more intelligent job placement.

Because scientific applications are often highly synchronized, system noise can have a much greater impact on performance on a large-scale HPC system. Reducing system noise caused by other processes, such as storage, is critical.

Telemetry information is powerful. It is not just about collecting information on routers, switches, and network traffic; it is also possible to gather and share information through workload and I/O characterization.

AI frameworks can precisely tune the performance isolation algorithms within the NVIDIA Quantum-2 InfiniBand platform. Multi-application environments sharing common data center resources, such as the network and storage, are then ensured the best possible performance, as if each application were running on bare metal as a single instance.

BlueField is perfectly positioned to address the challenges presented by large-scale computing.
