Categories
Misc

Top Financial Services Sessions at NVIDIA GTC 2022

Discover how Deutsche Bank, U.S. Bank, Capital One, and other firms are using #AI technologies to optimize customer experience in financial services through…

Discover how Deutsche Bank, U.S. Bank, Capital One, and other firms are using #AI technologies to optimize customer experience in financial services through recommender systems, NLP, and more.

Categories
Misc

Calculating and Synchronizing Time with the Precision Timing Protocol on the NVIDIA Spectrum Switch

PTP uses an algorithm and method for synchronizing clocks on various devices across packet-based networks to provide submicrosecond accuracy. NVIDIA Spectrum…

PTP uses an algorithm and method for synchronizing clocks on various devices across packet-based networks to provide submicrosecond accuracy. NVIDIA Spectrum supports PTP in both one-step and two-step modes and can serve either as a boundary or a transparent clock.

Here’s how the switch calculates and synchronizes time in one-step mode when acting as a transparent clock. Later in this post, I review overall PTP accuracy.

Calculating and synchronizing time in one-step mode

In one-step mode when acting as a transparent clock, the switch must calculate the residence time of a PTP packet in real time. It does this by comparing the time of the packet’s arrival (t1) with the time of the packet’s egress (t2). The switch then changes the correction field of the packet accordingly.

To perform this calculation, the switch uses several hardware features:

  • A synchronized clock across the ASIC
  • An accurate timestamp as the packet enters the switch
  • A calculation of the time at which the packet will egress the switch

A synchronized clock across the ASIC

Because t1 at ingress and t2 at egress are on two different switch ports, time synchronization between different parts of the ASIC must be of high resolution to maintain an accurate comparison.

Having synchronized timestamps between different hardware units that sometimes work in different frequencies is challenging. The Spectrum family of ASICs can maintain synchronization errors smaller than 4 nanoseconds.

An accurate timestamp as the packet enters the switch

To achieve accurate one-step PTP, the switch must record the exact time at which it receives the packet.

As the switch receives bits from the line, it must assemble them, then parse and recognize the packet as PTP. This process takes time and must be considered so that there is no difference between the timestamp on the packet and the actual time at which the bits enter the switch.

To solve this challenge, the switch includes a designated hardware counter that calculates the number of bits between the line and the packet assembly. This counter can be translated to latency according to the protocol, then subtracted from the t1 timestamp to find the exact arrival time of the packet.

A calculation of the time at which the packet will egress the switch

Calculating the time at which the packet will egress the switch in advance is also a challenge. This is because the latency is typically affected by queuing and other parameters that are not accessible when the switch calculates the timestamp.

To solve this challenge, the switch schedules a future time for the packet to egress, then timestamps the packet according to this time. The PTP packet must then wait until the exact time to egress.

Schematic drawing of the PTP packet modification.
Figure 1. PTP packet modification in the switch

PTP scale

Other vendors use the software to match a PTP packet and its timestamp. The NVIDIA Spectrum-2 and later ASICs take a different approach. They handle PTP flows completely by hardware; nothing is required from software. There are many advantages to this implementation.

The Spectrum approach scales better for PTP flows and there is no burden on the switch’s limited compute resources. The scale is only limited by the CPU host capabilities when acting as a boundary clock. For a transparent clock, where no software is involved, there is technically no limit on the scale.

Software processing is serial and slower than hardware. Therefore, a PTP packet resides on the switch much longer if it requires software intervention. This process increases the delay between the primary and the follower entities in the network and can indirectly damage the synchronization process that assumes constant traversing time from point to point in the network.

PTP accuracy

The overall PTP accuracy of the NVIDIA Spectrum switch is around 10 nanoseconds. This accuracy is maintained for all speeds and FEC configuration.

The following graph demonstrates PTP accuracy on the Spectrum-3 switch.

Schematic drawing of IXIA serving as primary and follower connected to the Spectrum-3 switch.
Figure 2. Setup used to measure PTP accuracy
Graph of offset between primary and follower port of IXIA.
Figure 3. Offset from primary in nanoseconds

These results are taken from a one-hour test at a speed of 50 Gbps in which IXIA serves as a leader clock connected to an NVIDIA Spectrum-3 switch. The switch acts as a boundary clock. Another IXIA port serves as a follower and measures the time offset compared to the primary for each packet.

For more information, see the following resources: 

Categories
Misc

Top Omniverse Sessions for Developers at GTC 2022

Learn how to develop and distribute custom applications for the metaverse with the NVIDIA Omniverse platform at GTC.

Learn how to develop and distribute custom applications for the metaverse with the NVIDIA Omniverse platform at GTC.

Categories
Misc

NVIDIA Hopper Sweeps AI Inference Benchmarks in MLPerf Debut

In their debut on the MLPerf industry-standard AI benchmarks, NVIDIA H100 Tensor Core GPUs set world records in inference on all workloads, delivering up to 4.5x more performance than previous-generation GPUs. The results demonstrate that Hopper is the premium choice for users who demand utmost performance on advanced AI models. Additionally, NVIDIA A100 Tensor Core Read article >

The post NVIDIA Hopper Sweeps AI Inference Benchmarks in MLPerf Debut appeared first on NVIDIA Blog.

Categories
Misc

Upcoming Event: Recommender Systems Sessions at GTC 2022

Learn about transformer-powered personalized online advertising, cross-framework model evaluation, the NVIDIA Merlin ecosystem, and more with these featured GTC…

Learn about transformer-powered personalized online advertising, cross-framework model evaluation, the NVIDIA Merlin ecosystem, and more with these featured GTC 2022 sessions.

Categories
Misc

Full-Stack Innovation Fuels Highest MLPerf Inference 2.1 Results for NVIDIA

Today’s AI-powered applications are enabling richer experiences, fueled by both larger and more complex AI models as well as the application of many models in…

Today’s AI-powered applications are enabling richer experiences, fueled by both larger and more complex AI models as well as the application of many models in a pipeline. To meet the increasing demands of AI-infused applications, an AI platform must not only deliver high performance but also be versatile enough to deliver that performance across a diverse range of AI models. To maximize infrastructure utilization and optimize CapEx, the ability to run the entire AI workflow on the same infrastructure is critical: from data prep and model training to deployed inference.

MLPerf benchmarks have emerged as industry-standard, peer-reviewed measures of deep learning performance, covering AI training, AI inference, and high-performance computing (HPC). MLPerf Inference 2.1, the latest iteration of the MLPerf Inference benchmark suite, covers a breadth of common AI use cases including recommenders, natural language processing, speech recognition, medical imaging, image classification, and object detection.

In this round, NVIDIA made its first MLPerf submissions on the latest NVIDIA H100 Tensor Core GPU based on the breakthrough NVIDIA Hopper Architecture.

  • H100 set new per-accelerator records on all data center tests, demonstrating up to 4.5x higher inference performance compared to the NVIDIA A100 Tensor Core GPU.
  • A100 continued to demonstrate excellent performance across the full suite of MLPerf Inference 2.1 tests for both the data center and edge inference scenarios.

NVIDIA Jetson AGX Orin, built for edge AI and robotics applications, also delivered up to a 50% improvement in performance-per-watt following its debut in the prior round of MLPerf Inference, and ran all edge workloads and scenarios.

Delivering these performance results required deep software and hardware co-optimization. In this post, we discuss the results and then dive into some of the key software optimizations.

NVIDIA H100 Tensor Core technology

On a per-streaming multiprocessor (SM) basis, the H100 Tensor Cores provide twice the matrix multiply-accumulate (MMA) throughput clock-for-clock of the A100 SMs when using the same data types and four times the throughput when comparing FP16 on an A100 SM to FP8 on an H100 SM. New kernels had to be developed in order to leverage several of the H100’s new capabilities and to take advantage of these dramatically faster Tensor Cores.

The H100 Tensor Cores process data so rapidly that it can be challenging to keep them both fed with enough input data and to post-process their output data. Kernels must create an efficient pipeline such that data loading, Tensor Core processing, post-processing, and storage all happen simultaneously and efficiently.

The new H100 asynchronous transaction barriers are instrumental to the efficiency of these pipelines. The asynchronous barriers allow producer threads to run ahead after signaling data availability. In the case of data loading threads, this provides significant improvement in kernels’ ability to hide memory system latencies and ensure a steady stream of input data is available for the Tensor Cores. The asynchronous transaction barriers also provide an efficient mechanism for consumer threads to wait on resource availability so that they don’t waste SM resources in spin loops.

The Tensor Memory Accelerator (TMA) further turbocharges these kernels. The TMA was designed to natively integrate into asynchronous pipelines, and provides for the asynchronous transfer of multi-dimensional tensors from global memory into the SM’s shared memory.

The Tensor Cores are so fast that operations like address calculation can become a performance bottleneck; the TMA offloads this work so that the kernels can focus on running the math and post-processing as quickly as possible.

Finally, the new kernels employ H100 thread block clusters to exploit locality at the GPU processing cluster (GPC). The thread blocks within each thread block cluster collaborate to load data more efficiently and provide higher input bandwidth to the Tensor Cores.

NVIDIA H100 Tensor Core GPU performance results

Starting with the Data Center category, the NVIDIA H100 Tensor Core GPU delivered the highest per-accelerator performance on every workload across both the Server and Offline scenarios, delivering up to 4.5x more performance in the Offline scenario and up to 3.9x more performance in the Server scenario than the A100 Tensor Core GPU.

Left bar chart shows the H100 delivering up to 3.9x more performance than the A100 in the Server scenario. Right chart shows H100 delivering up to 4.5x more performance than A100 in the Offline scenario.
Figure 1. H100 delivers up to 4.5x more performance than A100 in the MLPerf Inference 2.1 Data Center category

Compared to a CPU-only submission, the H100 Tensor Core GPU provides up to 36x higher performance.

Thanks to full-stack improvements, NVIDIA Jetson AGX Orin turned in large improvements in energy efficiency compared to the last round, delivering up to a 50% efficiency improvement.

Left bar chart shows the queries per watt improvements in the Offline scenario. Right chart shows the energy per stream improvements in the Single Stream and Multi Stream scenarios in the AGX Jetson Orin MLPerf Inference 2.1 submission compared to the MLPerf Inference 2.0 submission.
Figure 2. Efficiency improvements in the NVIDIA Jetson AGX Orin MLPerf Inference 2.1 compared to the prior submission

Here’s a closer look at the software optimizations that made these results possible.

High-performance BERT inference using FP8

Diagram shows the steps to perform FP8 inference on BERT.
Figure 3. FP8 Inference on BERT using E4M3 offers increased stability for the forward pass

The NVIDIA Hopper Architecture incorporates new fourth-generation Tensor Cores with support for two new FP8 data types: E4M3 and E5M2. These new data types increase Tensor Core throughput by 2x and reduce memory requirements by 2x compared to 16-bit floating-point.

The E4M3 offers an additional mantissa bit, which leads to increased stability in the first step of the calculation process, known as the forward pass. The additional exponent bit of E5M2 is more helpful for preventing overflow/underflow during the backward pass. For our BERT FP8 submission, we used E4M3.

Our experiments on NLP models like BERT showed that when quantizing the model from a higher precision (FP32) to a lower precision (such as FP8 or INT8), the drop in accuracy observed with FP8 is lower than that of INT8.

Although we can use quantization aware training (QAT) to recover some of the model accuracy with INT8, the accuracy of INT8 under post training quantization (PTQ) remains a challenge. This is where FP8 is beneficial: It can provide 99.9% accuracy of the FP32 model under PTQ without the additional cost and effort required to run QAT. As a result, FP8 can be used for the 99.9% high accuracy category of MLPerf where previously FP16 was required. In essence, FP8 delivered the performance of INT8 with the accuracy of FP16 for this workload.

In the NVIDIA BERT submission, all fully connected and matrix multiply layers in the encoder used FP8 precision. The implementation of these layers used cuBLASLt to perform the FP8 GEMMs on the H100 Tensor Cores.

Key BERT optimizations were extended to support FP8, including the following:

  • Removing padding: Inputs to BERT have variable sequence lengths, and are padded to a maximum sequence length. We strip the padding to avoid wasting compute on padding, and then reconstruct the padded sequences for the final output to match the input shape.
  • Fused multi-head attention: This is a fusion of four operations: transpose Q/K, Q*K, softmax, and QK*V to compute the attention. Fused multi-head attention enhances memory efficiency, skipping the padding to prevent useless computing. Fused multi-head attention provides roughly a 2x end-to-end speedup.
  • Activation fusion: We fuse the GEMM with more operations, including bias and activation functions (GeLU). This fusion also helps enhance memory efficiency by removing extra memory transfers.

RetinaNet for object detection

In MLPerf Inference 2.1, a new one-stage object detection model named RetinaNet was added. This replaced the ssd-resnet34 and ssd-mobilenet workloads of MLPerf Inference 2.0. This updated model architecture and its new inference dataset bring new challenges in delivering fast, accurate, and power-efficient inference.

NVIDIA submitted results for RetinaNet across all the platforms demonstrating the width of our software support.

RetinaNet is trained and inferred using the Open Images dataset, which contains an order of magnitude more object categories and object notations than the COCO dataset used earlier. For RetinaNet, 264 unique classes are selected for training and inference tasks. This is significantly more than the 81 classes used for ssd-resnet34.

Photo from the OpenImage Dataset shows brightly colored bounding boxes around objects and people in a theater.
Figure 4. The OpenImage Dataset used for RetinaNet training and inference includes highly detailed object annotations

Although RetinaNet is also a single-shot object detection model, it has several key differences compared to ssd-resnet34:

  • RetinaNet uses Feature Pyramid Network (FPN) as its backbone on top of a feedforward ResNeXt architecture. ResNeXt uses group convolution in its computation blocks, and has different math characteristics from that of ResNet34.
  • For every image, 120,087 boxes and 264 unique class scores per box are fed into the non-maximum suppression (NMS) layer, and the top 1,000 scoring boxes are selected for outputs. In ssd-resnet34, these numbers were 25x lesser: 15,130 boxes, 81 classes per box, and 200 topK.
The RetinaNet model architecture illustrated with ResNeXt feed-forward network, Feature Pyramid Network (FPN) and Class/Box subnet (K=264, A=9)
Figure 5. MLPerf Inf 2.1 RetinaNet model architecture

NVIDIA used TensorRT as the backend for RetinaNet. TensorRT significantly accelerates inference throughput by automatically optimizing both the graph execution and layer execution:

  • TensorRT provides full support to execute the model inference in mixed FP32/INT8 precision, with minimal accuracy loss compared to FP16 and FP32 precision.
  • TensorRT automatically selects optimized kernels for group convolutions across all 16 ResNeXt blocks.
  • TensorRT provides fusion patterns for convolution, activation, and (optional) pooling layers, which optimize the memory movement for faster inference by merging the layer weights and reducing the number of operations.
  • For the post-processing NMS layer, NVIDIA leverages EfficientNMS, which is an open-sourced high-performance CUDA kernel specialized for NMS tasks, provided as a TensorRT plugin.

NVIDIA Jetson AGX Orin optimizations

NVIDIA Jetson AGX Orin is the latest NVIDIA platform for edge AI and robotics applications. In this round of MLPerf Inference, Jetson AGX Orin demonstrated excellent performance and energy efficiency improvements across the breadth of MLPerf Inference 2.1 edge workloads. Improvements included a 45% reduction in ResNet-50 multi-stream latency and a 17% boost in BERT offline throughput  over the previous round (v2.0). In the power submission, Orin achieved up to 52% power reduction and 48% perf-per-watt improvement on selected benchmarks.  The submissions used the 22.08 Jetson CUDA-X AI Developer Preview software, which includes an optimized NVIDIA Jetson Linux (L4T) image, TensorRT 8.5.0, CUDA 11.4.14, and cuDNN 8.5.0, allowing customers to easily benefit from these same improvements. RetinaNet is fully supported and performant on Jetson AGX Orin with this software stack. This demonstrates the ability of the NVIDIA platform and software to support performant DL inference out-of-the-box.

NVIDIA Orin performance improvements

The significant improvement in MLPerf-Inference v2.1 came from both the general performance boost enabled by the system image and TensorRT 8.5 in 22.08 Jetson CUDA-X AI Developer Preview. The optimized Jetson L4T image provides users access to MaxN power mode, which boosts the frequencies of both the GPU and the DLA units. Meanwhile, this image has the option to use an enlarged page size of 64K that can reduce TLB cache misses when running certain inference workloads. Furthermore, the 3.10.1 DLA compiler natively included in the image incorporates a series of optimization features, which increases the performance of workloads running on the Orin DLA by up to 53%.

TensorRT 8.5 includes two new optimizations that improve inference performance. The first is native support for cuDLA which removes the imposition of inserting copy nodes between DLA nodes and GPU nodes. We observed approximately 1.8% DLA engine end-to-end improvements switching from NVMedia to cuDLA. The second is the addition of optimized kernels for small channel * filter size convolutions fused with a beta=1 residual connection. This improved BERT performance by 17% and ResNet50 by 5% on the GPU in Orin.

NVIDIA Orin energy efficiency improvements

The NVIDIA Orin power submission benefited from all the above performance improvements and also focused on further power reduction. Using the updated L4T image for Orin the power consumption is reduced by fine tuning the CPU, GPU, and DLA frequencies per benchmark to achieve the optimum perf-per-watt. This image also enables new platform power saving features like regulator auto phase shedding and low-power states in low-load conditions. The flexibility of USB-C support in Orin was leveraged to consolidate all I/O through Ethernet-over-USB communication. System power was further reduced by disabling I/O subsystems like Ethernet, WiFi, and DP that are not essential for inference and also by using off-the-shelf higher efficiency GaN power adapters.

These platform and software optimizations reduced system power consumption by up to 52% and improved perf-per-watt by up to 48% over our previous submission in 2.0.

3D U-Net performance improvement

In MLPerf Inference v2.0, the 3D U-Net medical imaging workload switched to the KITS19 dataset, which increased image sizes by up to 8x and raised the amount of compute processing required for a given sample up to 18x due to the sliding window inference. For more information about the NVIDIA MLPerf Inference v2.0 submission, see Getting the Best Performance on MLPerf Inference 2.0.

For MLPerf Inference 2.1, we further improved the performance of the first convolution layer with the TensorRT IPluginV2DDynamicExt plugin.

KiTS19 images are single-channel tensors, and this challenges the performance of the very first 3D convolution in 3D U-Net. In 3D convolution, this channel dimension typically contributes to GEMM’s K dimension. This is specially relevant because the overall performance on 3D U-Net is dominated by the first two and last two 3D-Convolutions. In MLPerf Inference v2.0, these four convolutions contributed to roughly 38% of the entire network run-time; the very first layer responsible for 8%. A non-trivial factor that explains that is the need to use zero-padding to accommodate for the NC/32DHW32 vectorized format layout in where the tensor cores can be utilized most efficiently.

With our updated plugin, we use a INT8 Linear format to leverage efficient computation on this single-channel limited 3D shape input. The advantages of this are twofold:

  • Higher effective use of flops: by not performing unneeded computations
  • PCIe transfer B/W savings: avoids the overhead of either moving zero padded input tensor between host and GPU memory or zero-padding on GPU before sending the input tensor to TensorRT

This optimization improved the first layer performance by 2.7x. Additionally, the slicing kernel no longer needs to deal with zero-padding, and therefore its performance also improved by 2x. As a net result, 3D-UNet’s end-to-end performance improved by 5% in MLPerf Inference 2.1.

Breaking performance records across workloads

In MLPerf Inference 2.1, the first NVIDIA H100 submission set new per-accelerator performance records on all workloads in the data center scenario, and delivered up to 4.5x higher performance than the A100 and 36x higher performance than leading CPUs. This generational performance uplift was possible due to both the many breakthroughs of the NVIDIA Hopper Architecture as well as immense software optimizations that take advantage of those capabilities.

NVIDIA Jetson AGX Orin saw an up to 50% boost in energy efficiency in just one round and it continues to deliver overall inference performance leadership for edge AI and robotics applications.

This latest round of MLPerf Inference showcases the leading performance and versatility of the NVIDIA AI platform for the full breadth of AI workloads and scenarios. With the H100 Tensor Core GPU, we are supercharging the NVIDIA AI platform for the most advanced models and providing users with new levels of performance and capabilities for the most demanding workloads.

For more information, see NVIDIA Hopper Architecture In-Depth.

Categories
Misc

GeForce NOW Supports Over 1,400 Games Streaming Instantly

This GFN Thursday marks a milestone: With the addition of six new titles this week, more than 1,400 games are now available to stream from the GeForce NOW library. Plus, GeForce NOW members streaming to supported Smart TVs from Samsung and LG can get into their games faster with an improved user interface. Your Games, Read article >

The post GeForce NOW Supports Over 1,400 Games Streaming Instantly appeared first on NVIDIA Blog.

Categories
Misc

Explainer: What Is a QPU?

A QPU, aka a quantum processor, is the brain of a quantum computer that uses the behavior of particles like electrons or photons to make certain kinds of…

A QPU, aka a quantum processor, is the brain of a quantum computer that uses the behavior of particles like electrons or photons to make certain kinds of calculations much faster than processors in today’s computers.

Categories
Misc

Upcoming Event: Deep Learning Sessions at GTC 2022

Join our deep learning sessions at GTC 2022 to learn about real-world use cases, new tools, and best practices for training and inference.

Join our deep learning sessions at GTC 2022 to learn about real-world use cases, new tools, and best practices for training and inference.

Categories
Misc

Upcoming Event: Manufacturing Workshops at GTC 2022

Join us for manufacturing sessions at GTC 2022, including an expert-led workshop on Computer Vision for Industrial Inspection.

Join us for manufacturing sessions at GTC 2022, including an expert-led workshop on Computer Vision for Industrial Inspection.