Categories
Misc

Powering NVIDIA-Certified Enterprise Systems with Arm CPUs

Organizations are rapidly becoming more advanced in the use of AI, and many are looking to leverage the latest technologies to maximize workload performance and efficiency. One of the most prevalent trends today is the use of CPUs based on Arm architecture to build data center servers. 

To ensure that these new systems are enterprise-ready and optimally configured, NVIDIA has approved the first NVIDIA-Certified Systems with Arm CPUs and NVIDIA GPUs. This post presents the benefits of NVIDIA-Certified Arm systems, and what customers should expect to see in the near future.

Using Arm architecture for HPC

Arm-based systems are common for edge applications. They are already widely used by large-scale cloud service providers, and are starting to become more popular for data center applications. According to Gartner®, 12% of new servers for high-performance computing (HPC) will be Arm-based by 2025.1 

Systems based on Arm architecture can run many cores with high energy efficiency while providing high memory bandwidth and low latency. In fact, recent results for the MLPerf benchmarks show Arm systems delivering almost the same inference performance as x86-based systems, with one test showing the Arm-based server outperforming a similar x86 system.

NVIDIA's certification of Arm-based systems is the culmination of a process that started in 2019, when NVIDIA ported the CUDA-X libraries to Arm. This paved the way for NVIDIA partners to start building energy-efficient, AI-enabled systems. NVIDIA also partnered with GIGABYTE in 2021 to develop and offer the Arm HPC Developer Kit.

Now, NVIDIA Certification will help businesses choose the best enterprise-grade systems.

NVIDIA-Certified Arm systems

NVIDIA-Certified Systems offer NVIDIA GPUs and NVIDIA high-speed, secure network adapters from leading NVIDIA partners in configurations validated for optimum performance, manageability, and scale. Announced at the beginning of 2021, the program gives customers and partners confidence to choose enterprise-grade hardware solutions to power their accelerated computing workloads—from the desktop to the data center and edge.

More than 200 certified systems are now available—covering data center, desktop, and edge—from over 30 partners. NVIDIA-Certified Systems deliver excellent performance on a range of modern accelerated computing workloads, including AI and data science, 3D computing and visualization, and HPC.

The certification also validates key enterprise capabilities, including management, security, and scalability, and ensures that certified systems can take full advantage of powerful enterprise software.

GIGABYTE: The first Arm-ready certified system

The first NVIDIA-Certified Arm system is the GIGABYTE G242-P33, which features the Neoverse-based Ampere Altra processor and up to four NVIDIA A100 Tensor Core GPUs. GIGABYTE has been part of the NVIDIA-Certified Systems program since its inception, and now offers more than 15 NVIDIA-Certified Systems. 

“Qualifying Arm-based servers for NVIDIA accelerators continues to be one of GIGABYTE’s top priorities, and with NVIDIA-Certified Systems we will take the performance validation a step further to not only support the new NVIDIA H100 but also to include NVIDIA BlueField-2 DPU and InfiniBand products,” said Etay Lee, CEO of GIGABYTE. 

“Customers want an Arm-ready solution that comes with a wealth of NVIDIA resources and support to achieve faster insights,” Lee added. “That is what our Ampere Altra servers have delivered, starting with our server for the NVIDIA Arm HPC Developer Kit.”

As the Arm architecture becomes more widely adopted in data centers, it will be important to choose systems that are optimally configured. This is particularly true for Arm systems equipped with GPUs and high-speed networking, since this architecture is new to many enterprises.

Customers might not have the expertise to design such a system properly, but NVIDIA-Certified Systems give them an easy way to make the best choices. To find Arm-based certified systems, see the Qualified Systems Catalog. The catalog will grow as more systems are certified.

1Gartner, “Forecast Analysis: Arm-Based Servers, Worldwide,” G00755363, November 2021.

GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.

Categories
Misc

Finding Out Where Your Application and Network Intersect

Modern data centers can run thousands of services and applications. When an issue occurs, as a network administrator, you are guilty by default. You have to prove your innocence on a daily basis, as it is easy to blame the network. It is an unfair world.

Correlating application performance issues to the network is hard. You can start by checking basic connectivity with simple pings or traceroutes, consulting your SNMP-based monitoring tools or sniffers, or even reading device counters to look for drops. In the meantime, users suffer from application slowness, poor performance, or even unavailability.

Unfortunately, all these classic network troubleshooting methods are time-consuming and don’t guarantee success, as it is sometimes nearly impossible to pinpoint problems using them.

NetQ to the rescue

To facilitate network troubleshooting, NVIDIA developed NetQ—a scalable, modern network operations toolset that provides network visibility in real time.

The NetQ team recently introduced a flow analysis tool that provides further visibility enhancements. Flow analysis allows network administrators to instantly correlate service traffic flows with the paths they take through the fabric, dramatically reducing the mean time to innocence (MTTI) or even confirming that there is no network issue at all.

Flow analysis enables you to discover and visualize all the paths that a specific application's traffic flow takes between endpoints in the fabric. It also monitors fabric-wide latency and buffer utilization statistics. With EVPN and multi-tenancy becoming the standard solution in most modern data centers, the flow analysis tool was designed to sample TCP or UDP data on overlay and underlay networks within different VRFs.

Flow analysis becomes even more powerful when used with What Just Happened (WJH) ASIC telemetry. While flows are being analyzed, flow-related WJH events from all switches in traffic paths are presented to help you discover if there were drops that caused the service issue. These two features working together maximize the probability of pinpointing the actual problem affecting an application.   

Figure 1. The NetQ flow analysis dashboard, showing latency results and a flow graph

By the numbers

Flow analysis is supported on NVIDIA Spectrum 2 and later switches running Cumulus Linux 5.0 or later. It can also provide partial-path discovery for brownfield deployments with unsupported switches or switches running older versions of Cumulus Linux or SONiC.

Flow analysis samples traffic based on the packet's four- or five-tuple, including VXLAN inner and outer headers. The sampling lifetime is limited to 10, 15, 20, or 30 minutes. You can decide whether to run the analysis immediately when it is created or schedule it for a later time.

The sample rate granularity is also configurable: low (1 per 10,000), medium (1 per 1,000), high (1 per 100), or all packets (1 per 1). The higher the sampling rate, the more accurate the analyzed data. However, a higher sampling rate also results in higher CPU utilization, so I recommend setting lower sampling rates for heavy traffic flows.

Try it yourself in NVIDIA Air

NVIDIA Air is a tool for creating data center digital twins. With Air, you can build your own Cumulus Linux virtual data center, test it, validate it with NetQ, explore features, and learn some best practices. It is entirely free to use!

Try out flow analysis by spinning up the prebuilt NVIDIA Air Infrastructure Simulation Platform demo in the Air Marketplace. Follow the guided tour and see the significant benefits that flow analysis with NetQ can bring to your organization.

Categories
Misc

Video Virtuoso Sabour Amirazodi Shares AI-Powered Editing Tips This Week ‘In the NVIDIA Studio’

NVIDIA artist Sabour Amirazodi demonstrates his video editing workflows featuring AI this week in a special edition of In the NVIDIA Studio.

Categories
Offsites

Quantization for Fast and Environmentally Sustainable Reinforcement Learning

Deep reinforcement learning (RL) continues to make great strides in solving real-world sequential decision-making problems such as balloon navigation, nuclear physics, robotics, and games. Despite its promise, one of its limiting factors is long training times. The current approach to speeding up RL training on complex and difficult tasks leverages distributed training that scales to hundreds or even thousands of computing nodes, but it still requires significant hardware resources, which makes RL training expensive and increases its environmental impact. However, recent work [1, 2] indicates that performance optimizations on existing hardware can reduce the carbon footprint (i.e., total greenhouse gas emissions) of training and inference.

RL can also benefit from similar system optimization techniques that can reduce training time, improve hardware utilization, and reduce carbon dioxide (CO2) emissions. One such technique is quantization, a process that converts full-precision floating point (FP32) numbers to lower-precision (int8) numbers and then performs computation using those lower-precision numbers. Quantization can reduce memory storage cost and bandwidth, enabling faster and more energy-efficient computation. Quantization has been successfully applied to supervised learning to enable edge deployments of machine learning (ML) models and achieve faster training. However, there remains an opportunity to apply quantization to RL training.
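
As a minimal sketch (using plain NumPy, not the QuaRL implementation), uniform quantization maps FP32 values onto the int8 range with a single scale factor, and dequantization recovers an approximation of the original values:

import numpy as np

def quantize_int8(x_fp32):
    # Map FP32 values onto the int8 range [-127, 127] with one shared scale.
    scale = max(np.max(np.abs(x_fp32)) / 127.0, 1e-12)
    return np.round(x_fp32 / scale).astype(np.int8), scale

def dequantize(x_int8, scale):
    # Recover an FP32 approximation of the original values.
    return x_int8.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
print(np.max(np.abs(weights - dequantize(q, scale))))  # small quantization error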

To that end, we present “QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning”, published in Transactions on Machine Learning Research, which introduces a new paradigm called ActorQ that applies quantization to speed up RL training by 1.5-5.4x while maintaining performance. Additionally, we demonstrate that, compared to training in full precision, the carbon footprint is also significantly reduced, by a factor of 1.9-3.8x.

Applying Quantization to RL Training
In traditional RL training, a learner policy is applied to an actor, which uses the policy to explore the environment and collect data samples. The samples collected by the actor are then used by the learner to continuously refine the initial policy. Periodically, the policy trained on the learner side is used to update the actor’s policy. To apply quantization to RL training, we develop the ActorQ paradigm. ActorQ performs the same sequence described above, with one key difference: the policy update sent from the learner to the actors is quantized, and the actors explore the environment using the int8 quantized policy to collect samples.

Applying quantization to RL training in this fashion has two key benefits. First, it reduces the memory footprint of the policy. For the same peak bandwidth, less data is transferred between learners and actors, which reduces the communication cost for policy updates from learners to actors. Second, the actors perform inference on the quantized policy to generate actions for a given environment state. The quantized inference process is much faster when compared to performing inference in full precision.

An overview of traditional RL training (left) and ActorQ RL training (right).

In ActorQ, we use the ACME distributed RL framework. The quantizer block performs uniform quantization that converts the FP32 policy to int8. The actor performs inference using optimized int8 computations. Though we use uniform quantization when designing the quantizer block, we believe that other quantization techniques can replace uniform quantization and produce similar results. The samples collected by the actors are used by the learner to train a neural network policy. Periodically the learned policy is quantized by the quantizer block and broadcasted to the actors.
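
The following is a high-level sketch of this loop with hypothetical helper and method names (the actual system is built on ACME's distributed actors and learner, not on this toy structure):

def actorq_training(learner, actors, replay_buffer, num_steps, sync_every=100):
    for step in range(num_steps):
        # The learner refines the full-precision (FP32) policy from replayed samples.
        learner.update(replay_buffer.sample())

        # Periodically quantize the FP32 policy to int8 and broadcast it to the actors.
        if step % sync_every == 0:
            int8_policy = quantize_int8_policy(learner.fp32_policy())  # uniform quantization
            for actor in actors:
                actor.set_policy(int8_policy)

        # Actors explore with the quantized policy; int8 inference is faster and the
        # smaller policy reduces learner-to-actor communication.
        for actor in actors:
            replay_buffer.add(actor.collect_experience())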

Quantization Improves RL Training Time and Performance
We evaluate ActorQ in a range of environments, including the Deepmind Control Suite and the OpenAI Gym. We demonstrate the speed-up and improved performance of D4PG and DQN. We chose D4PG as it was the best learning algorithm in ACME for Deepmind Control Suite tasks, and DQN is a widely used and standard RL algorithm.

We observe a significant speedup (between 1.5x and 5.41x) in training RL policies. More importantly, performance is maintained even when actors perform int8 quantized inference. The figures below demonstrate this for the D4PG and DQN agents for Deepmind Control Suite and OpenAI Gym tasks.

A comparison of RL training using the FP32 policy (q=32) and the quantized int8 policy (q=8) for D4PG agents on various Deepmind Control Suite tasks. Quantization achieves speed-ups of 1.5x to 3.06x.
A comparison of RL training using the FP32 policy (q=32) and the quantized int8 policy (q=8) for DQN agents in the OpenAI Gym environment. Quantization achieves a speed-up of 2.2x to 5.41x.

Quantization Reduces Carbon Emission
Applying quantization in RL using ActorQ improves training time without affecting performance. The direct consequence of using the hardware more efficiently is a smaller carbon footprint. We measure the carbon footprint improvement as the ratio of the carbon emissions when training with the FP32 policy to the carbon emissions when training with the int8 policy.
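
In other words, the reported reduction factor is simply the ratio of the two measurements; for example, with made-up placeholder numbers:

co2_fp32_kg = 10.0  # hypothetical kg of CO2 emitted when training with the FP32 policy
co2_int8_kg = 3.4   # hypothetical kg of CO2 emitted when training with the int8 policy
print(f"{co2_fp32_kg / co2_int8_kg:.1f}x lower carbon emissions")  # prints "2.9x lower carbon emissions"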

In order to measure the carbon emission for the RL training experiment, we use the experiment-impact-tracker proposed in prior work. We instrument the ActorQ system with carbon monitor APIs to measure the energy and carbon emissions for each training experiment.

Compared to the carbon emission when running in full precision (FP32), we observe that the quantization of policies reduces the carbon emissions anywhere from 1.9x to 3.76x, depending on the task. As RL systems are scaled to run on thousands of distributed hardware cores and accelerators, we believe that the absolute carbon reduction (measured in kilograms of CO2) can be quite significant.

Carbon emission comparison between training with an FP32 policy and with an int8 policy. The x-axis is normalized to the carbon emissions of the FP32 policy. As shown by the red bars greater than 1, ActorQ reduces carbon emissions.

Conclusion and Future Directions
We introduce ActorQ, a novel paradigm that applies quantization to RL training and achieves speed-up improvements of 1.5-5.4x while maintaining performance. Additionally, we demonstrate that ActorQ can reduce RL training’s carbon footprint by a factor of 1.9-3.8x compared to training in full-precision without quantization.

ActorQ demonstrates that quantization can be effectively applied to many aspects of RL, from obtaining high-quality and efficient quantized policies to reducing training times and carbon emissions. As RL continues to make great strides in solving real-world problems, we believe that making RL training sustainable will be critical for adoption. As we scale RL training to thousands of cores and GPUs, even a 50% improvement (as we have experimentally demonstrated) will generate significant savings in absolute dollar cost, energy, and carbon emissions. Our work is the first step toward applying quantization to RL training to achieve efficient and environmentally sustainable training.

While our design of the quantizer in ActorQ relied on simple uniform quantization, we believe that other forms of quantization, compression, and sparsity can be applied (e.g., distillation and sparsification). We hope that future work will consider applying more aggressive quantization and compression methods, which may yield additional benefits to the performance and accuracy tradeoff obtained by the trained RL policies.

Acknowledgments
We would like to thank our co-authors Max Lam, Sharad Chitlangia, Zishen Wan, and Vijay Janapa Reddi (Harvard University), and Gabriel Barth-Maron (DeepMind), for their contribution to this work. We also thank the Google Cloud team for providing research credits to seed this work.

Categories
Misc

Upcoming Event: JetPack 5.0.2 Walkthrough for Jetson Orin-Based Modules

Join us on October 4 to explore new features in JetPack 5.0.2. Learn how to develop for any Jetson Orin module using emulation support on the Jetson AGX Orin Developer Kit.

Categories
Misc

Just Released: New Updates to NVIDIA Riva

Build better GPU-accelerated Speech AI applications with the latest NVIDIA Riva updates, including enterprise support.

Categories
Misc

New Course: Introduction to Physics-Informed Machine Learning with Modulus

Learn the basics of physics-informed deep learning and how to use NVIDIA Modulus, the physics machine learning platform, in this self-paced online course.

Categories
Misc

World-Class: NVIDIA Research Builds AI Model to Populate Virtual Worlds With 3D Objects, Characters

The massive virtual worlds created by growing numbers of companies and creators could be more easily populated with a diverse array of 3D buildings, vehicles, characters and more — thanks to a new AI model from NVIDIA Research. Trained using only 2D images, NVIDIA GET3D generates 3D shapes with high-fidelity textures and complex geometric details.

Categories
Offsites

TensorStore for High-Performance, Scalable Array Storage

Many exciting contemporary applications of computer science and machine learning (ML) manipulate multidimensional datasets that span a single large coordinate system, for example, weather modeling from atmospheric measurements over a spatial grid or medical imaging predictions from multi-channel image intensity values in a 2d or 3d scan. In these settings, even a single dataset may require terabytes or petabytes of data storage. Such datasets are also challenging to work with as users may read and write data at irregular intervals and varying scales, and are often interested in performing analyses using numerous machines working in parallel.

Today we are introducing TensorStore, an open-source C++ and Python software library designed for storage and manipulation of n-dimensional data that:

TensorStore has already been used to solve key engineering challenges in scientific computing (e.g., management and processing of large datasets in neuroscience, such as peta-scale 3d electron microscopy data and “4d” videos of neuronal activity). TensorStore has also been used in the creation of large-scale machine learning models such as PaLM by addressing the problem of managing model parameters (checkpoints) during distributed training.

Familiar API for Data Access and Manipulation
TensorStore provides a simple Python API for loading and manipulating large array data. In the following example, we create a TensorStore object that represents a 56 trillion voxel 3d image of a fly brain and access a small 100×100 patch of the data as a NumPy array:

>>> import tensorstore as ts
>>> import numpy as np

# Create a TensorStore object to work with fly brain data.
>>> dataset = ts.open({
...     'driver': 'neuroglancer_precomputed',
...     'kvstore': 'gs://neuroglancer-janelia-flyem-hemibrain/v1.1/segmentation/',
... }).result()

# Create a 3-d view (remove singleton 'channel' dimension):
>>> dataset_3d = dataset[ts.d['channel'][0]]
>>> dataset_3d.domain
{ "x": [0, 34432), "y": [0, 39552), "z": [0, 41408) }

# Convert a 100x100x1 slice of the data to a numpy ndarray
>>> slice = np.array(dataset_3d[15000:15100, 15000:15100, 20000])

Crucially, no actual data is accessed or stored in memory until the specific 100×100 slice is requested; hence arbitrarily large underlying datasets can be loaded and manipulated without having to store the entire dataset in memory, using indexing and manipulation syntax largely identical to standard NumPy operations. TensorStore also provides extensive support for advanced indexing features, including transforms, alignment, broadcasting, and virtual views (data type conversion, downsampling, lazily on-the-fly generated arrays).
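
As a small illustrative sketch that builds on the dataset_3d object from the example above (not an excerpt from the TensorStore documentation), lazy views can be composed freely before any data is read:

>>> # Fix the 'z' dimension by label to obtain a lazy 2-d view of one section.
>>> section = dataset_3d[ts.d['z'][20000]]

>>> # Compose further NumPy-style slicing on the view; still no I/O has occurred.
>>> patch_view = section[15000:15100, 15000:15100]

>>> # Only this call reads the selected 100x100 region from storage.
>>> patch = np.array(patch_view)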

The following example demonstrates how TensorStore can be used to create a zarr array, and how its asynchronous API enables higher throughput:

>>> import tensorstore as ts
>>> import numpy as np

>>> # Create a zarr array on the local filesystem
>>> dataset = ts.open({
...         'driver': 'zarr',
...         'kvstore': 'file:///tmp/my_dataset/',
...     },
...     dtype=ts.uint32,
...     chunk_layout=ts.ChunkLayout(chunk_shape=[256, 256, 1]),
...     create=True,
...     shape=[5000, 6000, 7000]).result()

>>> # Create two numpy arrays with example data to write.
>>> a = np.arange(100*200*300, dtype=np.uint32).reshape((100, 200, 300))
>>> b = np.arange(200*300*400, dtype=np.uint32).reshape((200, 300, 400))

>>> # Initiate two asynchronous writes, to be performed concurrently.
>>> future_a = dataset[1000:1100, 2000:2200, 3000:3300].write(a)
>>> future_b = dataset[3000:3200, 4000:4300, 5000:5400].write(b)

>>> # Wait for the asynchronous writes to complete
>>> future_a.result()
>>> future_b.result()

Safe and Performant Scaling
Processing and analyzing large numerical datasets requires significant computational resources. This is typically achieved through parallelization across numerous CPU or accelerator cores spread across many machines. Therefore a fundamental goal of TensorStore has been to enable parallel processing of individual datasets that is both safe (i.e., avoids corruption or inconsistencies arising from parallel access patterns) and high performance (i.e., reading and writing to TensorStore is not a bottleneck during computation). In fact, in a test within Google’s datacenters, we found nearly linear scaling of read and write performance as the number of CPUs was increased:

Read and write performance for a TensorStore dataset in zarr format residing on Google Cloud Storage (GCS) accessed concurrently using a variable number of single-core compute tasks in Google data centers. Both read and write performance scales nearly linearly with the number of compute tasks.

Performance is achieved by implementing core operations in C++, extensive use of multithreading for operations such as encoding/decoding and network I/O, and partitioning large datasets into much smaller units through chunking to enable efficiently reading and writing subsets of the entire dataset. TensorStore also provides configurable in-memory caching (which reduces slower storage system interactions for frequently accessed data) and an asynchronous API that enables a read or write operation to continue in the background while a program completes other work.
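
As another brief sketch using the dataset_3d object from the earlier example, the asynchronous read API returns a future immediately, so a program can continue with other work while the data is fetched in the background:

>>> # Start an asynchronous read; this returns a future without blocking.
>>> read_future = dataset_3d[15000:15100, 15000:15100, 20000].read()

>>> # ...do other work while the read proceeds in the background...

>>> # Block only when the result is actually needed.
>>> patch = read_future.result()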

Safety of parallel operations when many machines are accessing the same dataset is achieved through the use of optimistic concurrency, which maintains compatibility with diverse underlying storage layers (including Cloud storage platforms, such as GCS, as well as local filesystems) without significantly impacting performance. TensorStore also provides strong ACID guarantees for all individual operations executing within a single runtime.

To make distributed computing with TensorStore compatible with many existing data processing workflows, we have also integrated TensorStore with parallel computing libraries such as Apache Beam (example code) and Dask (example code).

Use Case: Language Models
An exciting recent development in ML is the emergence of more advanced language models such as PaLM. These neural networks contain hundreds of billions of parameters and exhibit some surprising capabilities in natural language understanding and generation. These models also push the limits of computational infrastructure; in particular, training a language model such as PaLM requires thousands of TPUs working in parallel.

One challenge that arises during this training process is efficiently reading and writing the model parameters. Training is distributed across many separate machines, but parameters must be regularly saved to a single object (“checkpoint”) on a permanent storage system without slowing down the overall training process. Individual training jobs must also be able to read just the specific set of parameters they are concerned with in order to avoid the overhead that would be required to load the entire set of model parameters (which could be hundreds of gigabytes).

TensorStore has already been used to address these challenges. It has been applied to manage checkpoints associated with large-scale (“multipod”) models trained with JAX (code example) and has been integrated with frameworks such as T5X (code example) and Pathways. Model parallelism is used to partition the full set of parameters, which can occupy more than a terabyte of memory, over hundreds of TPUs. Checkpoints are stored in zarr format using TensorStore, with a chunk structure chosen to allow the partition for each TPU to be read and written independently in parallel.

When saving a checkpoint, each model parameter is written using TensorStore in zarr format using a chunk grid that further subdivides the grid used to partition the parameter over TPUs. The host machines write in parallel the zarr chunks for each of the partitions assigned to TPUs attached to that host. Using TensorStore’s asynchronous API, training proceeds even while the data is still being written to persistent storage. When resuming from a checkpoint, each host reads only the chunks that make up the partitions assigned to that host.

Use Case: 3D Brain Mapping
The field of synapse-resolution connectomics aims to map the wiring of animal and human brains at the detailed level of individual synaptic connections. This requires imaging the brain at extremely high resolution (nanometers) over fields of view of up to millimeters or more, which yields datasets that can span petabytes in size. In the future these datasets may extend to exabytes as scientists contemplate mapping entire mouse or primate brains. However, even current datasets pose significant challenges related to storage, manipulation, and processing; in particular, even a single brain sample may require millions of gigabytes with a coordinate system (pixel space) of hundreds of thousands of pixels in each dimension.

We have used TensorStore to solve computational challenges associated with large-scale connectomic datasets. Specifically, TensorStore has managed some of the largest and most widely accessed connectomic datasets, with Google Cloud Storage as the underlying object storage system. For example, it has been applied to the human cortex “h01” dataset, which is a 3d nanometer-resolution image of human brain tissue. The raw imaging data is 1.4 petabytes (roughly 500,000 × 350,000 × 5,000 pixels) and is further associated with additional content such as 3d segmentations and annotations that reside in the same coordinate system. The raw data is subdivided into individual chunks 128x128x16 pixels large and stored in the “Neuroglancer precomputed” format, which is optimized for web-based interactive viewing and can be easily manipulated from TensorStore.

A fly brain reconstruction for which the underlying data can be easily accessed and manipulated using TensorStore.

Getting Started
To get started using the TensorStore Python API, you can install the tensorstore PyPI package using:

pip install tensorstore

Refer to the tutorials and API documentation for usage details. For other installation options and for using the C++ API, refer to installation instructions.

Acknowledgements
Thanks to Tim Blakely, Viren Jain, Yash Katariya, Jan-Matthis Luckmann, Michał Januszewski, Peter Li, Adam Roberts, Brain Williams, and Hector Yee from Google Research, and Davis Bennet, Stuart Berg, Eric Perlman, Stephen Plaza, and Juan Nunez-Iglesias from the broader scientific community for valuable feedback on the design, early testing and debugging.

Categories
Misc

Continental and AEye Join NVIDIA DRIVE Sim Sensor Ecosystem, Providing Rich Capabilities for AV Development

Autonomous vehicle sensors require the same rigorous testing and validation as the car itself, and one simulation platform is up to the task. Global tier-1 supplier Continental and software-defined lidar maker AEye announced this week at NVIDIA GTC that they will migrate their intelligent lidar sensor model into NVIDIA DRIVE Sim.
