DataBloom - Part 241

ReAct: Synergizing Reasoning and Acting in Language Models

Post date November 13, 2022

Posted by Shunyu Yao, Student Researcher, and Yuan Cao, Research Scientist, Google Research, Brain Team

Recent advances have expanded the applicability of language models (LM) to downstream tasks. On one hand, existing language models that are properly prompted, via chain-of-thought, demonstrate emergent capabilities that carry out self-conditioned reasoning traces to derive answers from questions, excelling at various arithmetic, commonsense, and symbolic reasoning tasks. However, with chain-of-thought prompting, a model is not grounded in the external world and uses its own internal representations to generate reasoning traces, limiting its ability to reactively explore and reason or update its knowledge. On the other hand, recent work uses pre-trained language models for planning and acting in various interactive environments (e.g., text games, web navigation, embodied tasks, robotics), with a focus on mapping text contexts to text actions via the language model’s internal knowledge. However, they do not reason abstractly about high-level goals or maintain a working memory to support acting over long horizons.

In “ReAct: Synergizing Reasoning and Acting in Language Models”, we propose a general paradigm that combines reasoning and acting advances to enable language models to solve various language reasoning and decision making tasks. We demonstrate that the Reason+Act (ReAct) paradigm systematically outperforms reasoning-only and acting-only paradigms, both when prompting bigger language models and when fine-tuning smaller language models. The tight integration of reasoning and acting also presents human-aligned task-solving trajectories that improve interpretability, diagnosability, and controllability.

Model Overview

ReAct enables language models to generate both verbal reasoning traces and text actions in an interleaved manner. While actions lead to observation feedback from an external environment (“Env” in the figure below), reasoning traces do not affect the external environment. Instead, they affect the internal state of the model by reasoning over the context and updating it with useful information to support future reasoning and acting.

Previous methods prompt language models (LM) to either generate self-conditioned reasoning traces or task-specific actions. We propose ReAct, a new paradigm that combines reasoning and acting advances in language models.

ReAct Prompting

We focus on the setup where a frozen language model, PaLM-540B, is prompted with few-shot in-context examples to generate both domain-specific actions (e.g., “search” in question answering, and “go to” in room navigation), and free-form language reasoning traces (e.g., “Now I need to find a cup, and put it on the table”) for task solving.

For tasks where reasoning is of primary importance, we alternate the generation of reasoning traces and actions so that the task-solving trajectory consists of multiple reasoning-action-observation steps. In contrast, for decision making tasks that potentially involve a large number of actions, reasoning traces only need to appear sparsely in the most relevant positions of a trajectory, so we write prompts with sparse reasoning and let the language model decide the asynchronous occurrence of reasoning traces and actions for itself.

As shown below, there are various types of useful reasoning traces, e.g., decomposing task goals to create action plans, injecting commonsense knowledge relevant to task solving, extracting important parts from observations, tracking task progress while maintaining plan execution, handling exceptions by adjusting action plans, and so on.

The synergy between reasoning and acting allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with the external environments (e.g., Wikipedia) to incorporate additional information into reasoning (act to reason).
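The interleaving of thoughts, actions, and observations described above can be sketched as a short loop. This is only an illustration under stated assumptions, not the authors' implementation: `react_loop`, `make_scripted_lm`, and `toy_env` are hypothetical names, the `search[...]` and `finish[...]` action strings are an assumed format, and the "language model" is a scripted stub that replays fixed outputs.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def react_loop(lm: Callable[[str], str],
               env: Callable[[str], str],
               question: str,
               max_steps: int = 5) -> Tuple[str, List[str]]:
    """Alternate thought -> action -> observation until the LM emits finish[...]."""
    context = [f"Question: {question}"]
    for step in range(1, max_steps + 1):
        # Reasoning trace: it only extends the context and never touches the environment.
        thought = lm("\n".join(context) + f"\nThought {step}:")
        context.append(f"Thought {step}: {thought}")
        # Action: sent to the environment, which answers with an observation.
        action = lm("\n".join(context) + f"\nAction {step}:")
        context.append(f"Action {step}: {action}")
        if action.startswith("finish[") and action.endswith("]"):
            return action[len("finish["):-1], context
        context.append(f"Observation {step}: {env(action)}")
    return "", context  # step budget exhausted without a final answer


def make_scripted_lm(outputs: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a prompted LM: replays fixed outputs in order."""
    it = iter(outputs)
    return lambda prompt: next(it)


def toy_env(action: str) -> str:
    """Hypothetical stand-in for an external knowledge source."""
    pages: Dict[str, str] = {"search[toy topic]": "The toy topic has the value 42."}
    return pages.get(action, "No results.")


lm = make_scripted_lm([
    "I need to look up the toy topic.",   # thought 1
    "search[toy topic]",                  # action 1
    "The observation gives the value.",   # thought 2
    "finish[42]",                         # action 2
])
answer, trace = react_loop(lm, toy_env, "What is the value of the toy topic?")
```

In this sketch the thoughts change only the text the model conditions on next, while actions are the only calls that reach `toy_env`, mirroring the distinction drawn in the Model Overview.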

ReAct Fine-tuning

We also explore fine-tuning smaller language models using ReAct-format trajectories. To reduce the need for large-scale human annotation, we use the ReAct prompted PaLM-540B model to generate trajectories, and use trajectories with task success to fine-tune smaller language models (PaLM-8/62B).
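The data-selection step described above amounts to keeping only the trajectories that solved their task. A minimal sketch follows, assuming a hypothetical `Trajectory` record and an input/target dictionary format that are not taken from the paper:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    prompt: str    # task input, e.g. a question
    trace: str     # generated thought/action/observation text
    success: bool  # whether the task was solved


def select_finetuning_examples(trajectories: List[Trajectory]) -> List[Dict[str, str]]:
    """Keep only successful trajectories and map them to input/target pairs."""
    return [
        {"input": t.prompt, "target": t.trace}
        for t in trajectories
        if t.success
    ]


generated = [
    Trajectory("Q1", "Thought 1: ... Action 1: finish[a]", success=True),
    Trajectory("Q2", "Thought 1: ... Action 1: finish[b]", success=False),
]
examples = select_finetuning_examples(generated)
```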

Comparison of four prompting methods, (a) Standard, (b) Chain of thought (CoT, Reason Only), (c) Act-only, and (d) ReAct, solving a HotpotQA question. In-context examples are omitted, and only the task trajectory is shown. ReAct is able to retrieve information to support reasoning, while also using reasoning to target what to retrieve next, demonstrating a synergy of reasoning and acting.

Results

We conduct empirical evaluations of ReAct and state-of-the-art baselines across four different benchmarks: question answering (HotPotQA), fact verification (Fever), text-based game (ALFWorld), and web page navigation (WebShop). For HotPotQA and Fever, with access to a Wikipedia API with which the model can interact, ReAct outperforms vanilla action generation models while being competitive with chain of thought reasoning (CoT) performance. The approach with the best results is a combination of ReAct and CoT that uses both internal knowledge and externally obtained information during reasoning.

Method	HotpotQA (exact match, 6-shot)	FEVER (accuracy, 3-shot)
Standard	28.7	57.1
Reason-only (CoT)	29.4	56.3
Act-only	25.7	58.9
ReAct	27.4	60.9
Best ReAct + CoT Method	35.1	64.6
Supervised SoTA	67.5 (using ~140k samples)	89.5 (using ~90k samples)

PaLM-540B prompting results on HotpotQA and Fever.

On ALFWorld and WebShop, ReAct with both one-shot and two-shot prompting outperforms imitation and reinforcement learning methods trained with ~10⁵ task instances, with an absolute improvement of 34% and 10% in success rates, respectively, over existing baselines.

Method	AlfWorld (2-shot)	WebShop (1-shot)
Act-only	45	30.1
ReAct	71	40
Imitation Learning Baselines	37 (using ~100k samples)	29.1 (using ~90k samples)

PaLM-540B prompting task success rate results on AlfWorld and WebShop.

Scaling results for prompting and fine-tuning on HotPotQA with ReAct and different baselines. ReAct consistently achieves best fine-tuning performances.

A comparison of the ReAct (top) and CoT (bottom) reasoning trajectories on an example from Fever (observation for ReAct is omitted to reduce space). In this case ReAct provided the right answer, and it can be seen that the reasoning trajectory of ReAct is more grounded on facts and knowledge, in contrast to CoT’s hallucination behavior.

We also explore human-in-the-loop interactions with ReAct by allowing a human inspector to edit ReAct’s reasoning traces. We demonstrate that by simply replacing a hallucinating sentence with inspector hints, ReAct can change its behavior to align with inspector edits and successfully complete a task. Solving tasks becomes significantly easier when using ReAct as it only requires the manual editing of a few thoughts, which enables new forms of human-machine collaboration.

A human-in-the-loop behavior correction example with ReAct on AlfWorld. (a) ReAct trajectory fails due to a hallucinating reasoning trace (Act 17). (b) A human inspector edits two reasoning traces (Act 17, 23), ReAct then produces desirable reasoning traces and actions to complete the task.

Conclusion

We present ReAct, a simple yet effective method for synergizing reasoning and acting in language models. Through various experiments that focus on multi-hop question-answering, fact checking, and interactive decision-making tasks, we show that ReAct leads to superior performance with interpretable decision traces.

ReAct demonstrates the feasibility of jointly modeling thought, actions and feedback from the environment within a language model, making it a versatile agent that is capable of solving tasks that require interactions with the environment. We plan to further extend this line of research and leverage the strong potential of the language model for tackling broader embodied tasks, via approaches like massive multitask training and coupling ReAct with equally strong reward models.

Acknowledgements

We would like to thank Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran and Karthik Narasimhan for their great contribution in this work. We would also like to thank Google’s Brain team and the Princeton NLP Group for their joint support and feedback, including project scoping, advising and insightful discussions.


Multi-layered Mapping of Brain Tissue via Segmentation Guided Contrastive Learning

Post date November 13, 2022

Posted by Peter H. Li, Research Scientist, and Sven Dorkenwald, Student Researcher, Connectomics at Google

Mapping the wiring and firing activity of the human brain is fundamental to deciphering how we think — how we sense the world, learn, decide, remember, and create — as well as what issues can arise in brain disease or dysfunction. Recent efforts have delivered publicly available brain maps (high-resolution 3D mapping of brain cells and their connectivities) at unprecedented quality and scale, such as H01, a 1.4 petabyte nanometer-scale digital reconstruction of a sample of human brain tissue from Harvard / Google, and the cubic millimeter mouse cortex dataset from our colleagues at the MICrONS consortium.

To interpret brain maps at this scale requires multiple layers of analysis, including the identification of synaptic connections, cellular subcompartments, and cell types. Machine learning and computer vision technology have played a central role in enabling these analyses, but deploying such systems is still a laborious process, requiring hours of manual ground truth labeling by expert annotators and significant computational resources. Moreover, some important tasks, such as identifying the cell type from only a small fragment of axon or dendrite, can be challenging even for human experts, and have not yet been effectively automated.

Today, in “Multi-Layered Maps of Neuropil with Segmentation-Guided Contrastive Learning”, we are announcing Segmentation-Guided Contrastive Learning of Representations (SegCLR), a method for training rich, generic representations of cellular morphology (the cell’s shape) and ultrastructure (the cell’s internal structure) without laborious manual effort. SegCLR produces compact vector representations (i.e., embeddings) that are applicable across diverse downstream tasks (e.g., local classification of cellular subcompartments, unsupervised clustering), and are even able to identify cell types from only small fragments of a cell. We trained SegCLR on both the H01 human cortex dataset and the MICrONS mouse cortex dataset, and we are releasing the resulting embedding vectors, about 8 billion in total, for researchers to explore.

From brain cells segmented out of a 3D block of tissue, SegCLR embeddings capture cellular morphology and ultrastructure and can be used to distinguish cellular subcompartments (e.g., dendritic spine versus dendrite shaft) or cell types (e.g., pyramidal versus microglia cell).

Representing Cellular Morphology and Ultrastructure

SegCLR builds on recent advances in self-supervised contrastive learning. We use a standard deep network architecture to encode inputs comprising local 3D blocks of electron microscopy data (about 4 micrometers on a side) into 64-dimensional embedding vectors. The network is trained via a contrastive loss to map semantically related inputs to similar coordinates in the embedding space. This is close to the popular SimCLR setup, except that we also require an instance segmentation of the volume (tracing out individual cells and cell fragments), which we use in two important ways.

First, the input 3D electron microscopy data are explicitly masked by the segmentation, forcing the network to focus only on the central cell within each block. Second, we leverage the segmentation to automatically define which inputs are semantically related: positive pairs for the contrastive loss are drawn from nearby locations on the same segmented cell and trained to have similar representations, while inputs drawn from different cells are trained to have dissimilar representations. Importantly, publicly available automated segmentations of the human and mouse datasets were sufficiently accurate to train SegCLR without requiring laborious review and correction by human experts.

SegCLR is trained to represent rich cellular features without manual labeling. Top: The SegCLR architecture maps local masked 3D views of electron microscopy data to embedding vectors. Only the microscopy volume and a draft automated instance segmentation are required. Bottom: The segmentation is also used to define positive versus negative example pairs, whose representations are pushed closer together (positives, blue arrows) or further apart (negatives, red arrows) during training.
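How the segmentation defines positive and negative pairs can be sketched in a few lines. This is an illustration only: `label_pairs` is a hypothetical helper, the distance threshold is an assumed parameter (the text says only "nearby locations"), and ignoring same-cell pairs that are far apart is a choice made for this sketch.

```python
import math
from itertools import combinations
from typing import List, Tuple

# (segment_id, xyz position in micrometers); the layout is an assumption for this sketch.
Point = Tuple[int, Tuple[float, float, float]]


def label_pairs(points: List[Point],
                max_positive_distance: float) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Split index pairs into positives (same cell, nearby) and negatives (different cells)."""
    positives, negatives = [], []
    for (i, (seg_a, pos_a)), (j, (seg_b, pos_b)) in combinations(enumerate(points), 2):
        if seg_a != seg_b:
            negatives.append((i, j))      # different cells: push representations apart
        elif math.dist(pos_a, pos_b) <= max_positive_distance:
            positives.append((i, j))      # same cell and nearby: pull representations together
        # same cell but far apart: left unlabeled in this sketch
    return positives, negatives


points = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.0, 0.0, 0.0)),
    (1, (50.0, 0.0, 0.0)),
    (2, (0.0, 1.0, 0.0)),
]
positives, negatives = label_pairs(points, max_positive_distance=5.0)
```

No manual labels appear anywhere in the pairing logic; only segment IDs and locations are needed, which is what lets a draft automated segmentation supervise the contrastive loss.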

Reducing Annotation Training Requirements by Three Orders of Magnitude

SegCLR embeddings can be used in diverse downstream settings, whether supervised (e.g., training classifiers) or unsupervised (e.g., clustering or content-based image retrieval). In the supervised setting, embeddings simplify the training of classifiers, and can greatly reduce ground truth labeling requirements. For example, we found that for identifying cellular subcompartments (axon, dendrite, soma, etc.) a simple linear classifier trained on top of SegCLR embeddings outperformed a fully supervised deep network trained on the same task, while using only about one thousand labeled examples instead of millions.

We assessed the classification performance for axon, dendrite, soma, and astrocyte subcompartments in the human cortex dataset via mean F1-Score, while varying the number of training examples used. Linear classifiers trained on top of SegCLR embeddings matched or exceeded the performance of a fully supervised deep classifier (horizontal line), while using a fraction of the training data.

Distinguishing Cell Types, Even from Small Fragments

Distinguishing different cell types is an important step towards understanding how brain circuits develop and function in health and disease. Human experts can learn to identify some cortical cell types based on morphological features, but manual cell typing is laborious and ambiguous cases are common. Cell typing also becomes more difficult when only small fragments of cells are available, which is common for many cells in current connectomic reconstructions.

Human experts manually labeled cell types for a small number of proofread cells in each dataset. In the mouse cortex dataset, experts labeled six neuron types (top) and four glia types (not shown). In the human cortex dataset, experts labeled two neuron types (not shown) and four glia types (bottom). (Rows not to scale with each other.)

We found that SegCLR accurately infers human and mouse cell types, even for small fragments. Prior to classification, we collected and averaged embeddings within each cell over a set aggregation distance, defined as the radius from a central point. We found that human cortical cell types can be identified with high accuracy for aggregation radii as small as 10 micrometers, even for types that experts find difficult to distinguish, such as microglia (MGC) versus oligodendrocyte precursor cells (OPC).
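The aggregation step can be sketched as a mean over the embeddings that fall within a given radius of a central point. This is a simplified illustration: `aggregate_embedding` is a hypothetical helper, and straight-line (Euclidean) distance is an assumption, since the text does not say how distance within a cell is measured.

```python
import math
from typing import List, Sequence, Tuple

# (xyz location in micrometers, embedding vector); the layout is an assumption for this sketch.
Sample = Tuple[Sequence[float], Sequence[float]]


def aggregate_embedding(center: Sequence[float],
                        samples: List[Sample],
                        radius: float) -> List[float]:
    """Average the embeddings of one cell whose locations lie within `radius` of `center`."""
    nearby = [emb for pos, emb in samples if math.dist(pos, center) <= radius]
    if not nearby:
        raise ValueError("no embeddings within the aggregation radius")
    dim = len(nearby[0])
    return [sum(emb[k] for emb in nearby) / len(nearby) for k in range(dim)]


# Toy 2-dimensional embeddings for readability; the real vectors are 64-dimensional.
cell_samples = [
    ((0.0, 0.0, 0.0), [1.0, 3.0]),
    ((3.0, 4.0, 0.0), [3.0, 5.0]),     # 5 micrometers from the center
    ((100.0, 0.0, 0.0), [9.0, 9.0]),   # outside the radius, ignored
]
mean_embedding = aggregate_embedding((0.0, 0.0, 0.0), cell_samples, radius=10.0)
```

The averaged vector would then be passed to the cell type classifier in place of a single embedding.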

SegCLR can classify cell types, even from small fragments. Left: Classification performance over six human cortex cell types for shallow ResNet models trained on SegCLR embeddings for different sized cell fragments. Aggregation radius zero corresponds to very small fragments with only a single embedding. Cell type performance reaches high accuracy (0.938 mean F1-Score) for fragments with aggregation radii of only 10 micrometers (boxed point). Right: Class-wise confusion matrix at 10 micrometers aggregation radius. Darker shading along the diagonal indicates that predicted cell types agree with expert labels in most cases. AC: astrocyte; MGC: microglia cell; OGC: oligodendrocyte cell; OPC: oligodendrocyte precursor cell; E: excitatory neuron; I: inhibitory neuron.

In the mouse cortex, ten cell types could be distinguished with high accuracy at aggregation radii of 25 micrometers.

Left: Classification performance over the ten mouse cortex cell types reaches 0.832 mean F1-Score for fragments with aggregation radius 25 micrometers (boxed point). Right: The class-wise confusion matrix at 25 micrometers aggregation radius. Boxes indicate broad groups (glia, excitatory neurons, and inhibitory interneurons). P: pyramidal cell; THLC: thalamocortical axon; BC: basket cell; BPC: bipolar cell; MC: Martinotti cell; NGC: neurogliaform cell.

In additional cell type applications, we used unsupervised clustering of SegCLR embeddings to reveal further neuronal subtypes, and demonstrated how uncertainty estimation can be used to restrict classification to high confidence subsets of the dataset, e.g., when only a few cell types have expert labels.

Revealing Patterns of Brain Connectivity

Finally, we showed how SegCLR can be used for automated analysis of brain connectivity by cell typing the synaptic partners of reconstructed cells throughout the mouse cortex dataset. Knowing the connectivity patterns between specific cell types is fundamental to interpreting large-scale connectomic reconstructions of brain wiring, but this typically requires manual tracing to identify partner cell types. Using SegCLR, we replicated brain connectivity findings that previously relied on intensive manual tracing, while extending their scale in terms of the number of synapses, cell types, and brain areas analyzed. (See the paper for further details.)

SegCLR automated analysis of brain connectivity. Top: An example mouse pyramidal cell, with synapse locations color-coded according to whether the synaptic partner was classified as inhibitory (blue), excitatory (red), or unknown (black). Inset shows higher detail of the soma and proximal dendrites. Bottom: We counted how many upstream synaptic partners were classified as thalamocortical axons, which bring input from sensory systems to the cortex. We found that thalamic input arrives primarily at cortical layer L4, the canonical cortical input layer, and preferentially targets primary visual area V1, rather than higher visual areas (HVA).

What’s Next?

SegCLR captures rich cellular features and can greatly simplify downstream analyses compared to working directly with raw image and segmentation data. We are excited to see what the community can discover using the ~8 billion embeddings we are releasing for the human and mouse cortical datasets (example access code; browsable human and mouse views in Neuroglancer). By reducing complex microscopy data to rich and compact embedding representations, SegCLR opens many novel avenues for biological insight, and may serve as a link to complementary modalities for high-dimensional characterization at the cellular and subcellular levels, such as spatially-resolved transcriptomics.


Characterizing Emergent Phenomena in Large Language Models

Post date November 13, 2022

Posted by Jason Wei and Yi Tay, Research Scientists, Google Research, Brain Team

The field of natural language processing (NLP) has been revolutionized by language models trained on large amounts of text data. Scaling up the size of language models often leads to improved performance and sample efficiency on a range of downstream NLP tasks. In many cases, the performance of a large language model can be predicted by extrapolating the performance trend of smaller models. For instance, the effect of scale on language model perplexity has been empirically shown to span more than seven orders of magnitude.

On the other hand, performance for certain other tasks does not improve in a predictable fashion. For example, the GPT-3 paper showed that the ability of language models to perform multi-digit addition has a flat scaling curve (approximately random performance) for models from 100M to 13B parameters, at which point the performance jumped substantially. Given the growing use of language models in NLP research and applications, it is important to better understand abilities such as these that can arise unexpectedly.

In “Emergent Abilities of Large Language Models,” recently published in the Transactions on Machine Learning Research (TMLR), we discuss the phenomenon of emergent abilities, which we define as abilities that are not present in small models but are present in larger models. More specifically, we study emergence by analyzing the performance of language models as a function of language model scale, as measured by total floating point operations (FLOPs), or how much compute was used to train the language model. However, we also explore emergence as a function of other variables, such as dataset size or number of model parameters (see the paper for full details). Overall, we present dozens of examples of emergent abilities that result from scaling up language models. The existence of such emergent abilities raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.

Emergent Prompted Tasks

First we discuss emergent abilities that may arise in prompted tasks. In such tasks, a pre-trained language model is given a prompt for a task framed as next word prediction, and it performs the task by completing the response. Without any further fine-tuning, language models can often perform tasks that were not seen during training.

Example of few-shot prompting on movie review sentiment classification. The model is given one example of a task (classifying a movie review as positive or negative) and then performs the task on an unseen example.

We call a prompted task emergent when it unpredictably surges from random performance to above-random at a specific scale threshold. Below we show three examples of prompted tasks with emergent performance: multi-step arithmetic, taking college-level exams, and identifying the intended meaning of a word. In each case, language models perform poorly with very little dependence on model size up to a threshold at which point their performance suddenly begins to excel.

The ability to perform multi-step arithmetic (left), succeed on college-level exams (middle), and identify the intended meaning of a word in context (right) all emerge only for models of sufficiently large scale. The models shown include LaMDA, GPT-3, Gopher, Chinchilla, and PaLM.

Performance on these tasks only becomes non-random for models of sufficient scale — for instance, above 10²² training FLOPs for the arithmetic and multi-task NLU tasks, and above 10²⁴ training FLOPs for the word in context tasks. Note that although the scale at which emergence occurs can be different for different tasks and models, no model showed smooth improvement in behavior on any of these tasks. Dozens of other emergent prompted tasks are listed in our paper.
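One way to make the "random until a threshold" pattern concrete is to look for the smallest scale at which accuracy clearly exceeds the random baseline. The sketch below is an assumption-laden illustration, not the paper's methodology: `emergence_threshold` and its `margin` parameter are hypothetical, and the curve values are invented toy numbers, not reported results.

```python
from typing import List, Optional, Tuple


def emergence_threshold(curve: List[Tuple[float, float]],
                        random_baseline: float,
                        margin: float = 0.05) -> Optional[float]:
    """Return the smallest training-FLOPs value whose accuracy beats random by `margin`."""
    for flops, accuracy in sorted(curve):
        if accuracy > random_baseline + margin:
            return flops
    return None  # performance never rises clearly above random


# Invented toy curve for a 4-way multiple-choice task (random baseline 0.25).
toy_curve = [(1e20, 0.25), (1e21, 0.26), (1e22, 0.24), (1e23, 0.40), (1e24, 0.55)]
threshold = emergence_threshold(toy_curve, random_baseline=0.25)
```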

Emergent Prompting Strategies

The second class of emergent abilities encompasses prompting strategies that augment the capabilities of language models. Prompting strategies are broad paradigms for prompting that can be applied to a range of different tasks. They are considered emergent when they fail for small models and can only be used by a sufficiently-large model.

One example of an emergent prompting strategy is called “chain-of-thought prompting”, for which the model is prompted to generate a series of intermediate steps before giving the final answer. Chain-of-thought prompting enables language models to perform tasks requiring complex reasoning, such as a multi-step math word problem. Notably, models acquire the ability to do chain-of-thought reasoning without being explicitly trained to do so. An example of chain-of-thought prompting is shown in the figure below.

Chain of thought prompting enables sufficiently large models to solve multi-step reasoning problems.
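A chain-of-thought prompt differs from a standard few-shot prompt only in that each exemplar answer spells out intermediate steps before the final answer. A minimal sketch of assembling such a prompt follows; `build_cot_prompt`, the `Q:`/`A:` layout, and the exemplar are illustrative assumptions rather than the exact prompts used in the paper.

```python
from typing import List, Tuple


def build_cot_prompt(exemplars: List[Tuple[str, str, str]], question: str) -> str:
    """Assemble a few-shot prompt whose exemplar answers include the reasoning steps."""
    parts = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    # The new question ends with an open answer slot for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


exemplars = [(
    "A box holds 3 red pens and 4 blue pens. How many pens are in the box?",
    "There are 3 red pens and 4 blue pens, so 3 + 4 = 7.",
    "7",
)]
prompt = build_cot_prompt(exemplars, "A shelf has 5 books and 6 more are added. How many books are there?")
```

Dropping the `reasoning` string from each exemplar would turn this into a standard prompt, which is the comparison made in the results that follow.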

The empirical results of chain-of-thought prompting are shown below. For smaller models, applying chain-of-thought prompting does not outperform standard prompting, for example, when applied to GSM8K, a challenging benchmark of math word problems. However, for large models (10²⁴ FLOPs), chain-of-thought prompting substantially improves performance in our tests, reaching a 57% solve rate on GSM8K.

Chain-of-thought prompting is an emergent ability — it fails to improve performance for small language models, but substantially improves performance for large models. Here we illustrate the difference between standard and chain-of-thought prompting at different scales for two language models, LaMDA and PaLM.

Implications of Emergent Abilities

The existence of emergent abilities has a range of implications. For example, because emergent few-shot prompted abilities and strategies are not explicitly encoded in pre-training, researchers may not know the full scope of few-shot prompted abilities of current language models. Moreover, the emergence of new abilities as a function of model scale raises the question of whether further scaling will potentially endow even larger models with new emergent abilities.

Identifying emergent abilities in large language models is a first step in understanding such phenomena and their potential impact on future model capabilities. Why does scaling unlock emergent abilities? Because computational resources are expensive, can emergent abilities be unlocked via other methods without increased scaling (e.g., better model architectures or training techniques)? Will new real-world applications of language models become unlocked when certain abilities emerge? Analyzing and understanding the behaviors of language models, including emergent behaviors that arise from scaling, is an important research question as the field of NLP continues to grow.

Acknowledgements

It was an honor and privilege to work with Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus.


NVIDIA Grace Hopper Superchip Architecture In-Depth

Post date November 10, 2022


The NVIDIA Grace Hopper Superchip Architecture is the first true heterogeneous accelerated platform for high-performance computing (HPC) and AI workloads. It accelerates applications with the strengths of both GPUs and CPUs while providing the simplest and most productive distributed heterogeneous programming model to date. Scientists and engineers can focus on solving the world’s most important problems.

Figure 1. End-user application performance simulations of Grace Hopper vs x86+Hopper (Source: NVIDIA Grace Hopper Architecture whitepaper). The bar chart shows simulated speedups delivered by Grace Hopper over x86 + Hopper platforms for ML training, databases, and HPC applications. For ML training: up to 4x for Natural Language Processing (NLP), 3.5x for Deep Learning Recommender Models (DLRM), and 1.9x for Graph Neural Networks (GNN). For databases: up to 4.4x for Database Hash Join. For HPC applications: up to 3.6x for ABINIT, 1.75x for OpenFOAM, and 1.3x for multi-node GROMACS.

In this post, you learn all about the Grace Hopper Superchip, and we highlight the performance breakthroughs that NVIDIA Grace Hopper delivers. For more information about the speedups that Grace Hopper achieves over the most powerful PCIe-based accelerated platforms using NVIDIA Hopper H100 GPUs, see the NVIDIA Grace Hopper Superchip Architecture whitepaper.

Performance and productivity for strong-scaling HPC and giant AI workloads

The NVIDIA Grace Hopper Superchip architecture brings together the groundbreaking performance of the NVIDIA Hopper GPU with the versatility of the NVIDIA Grace CPU, connected by a high-bandwidth, memory-coherent NVIDIA NVLink Chip-2-Chip (C2C) interconnect in a single superchip, and adds support for the new NVIDIA NVLink Switch System.

Figure 2. NVIDIA Grace Hopper Superchip logical overview. The diagram shows the LPDDR5X, HBM3, NVLink, and I/O bandwidths as well as memory capacities. Hopper has up to 96 GB HBM3 at up to 3000 GB/s bandwidth. Grace has up to 512 GB LPDDR5X at up to 546 GB/s bandwidth. Grace and Hopper are connected with NVLink C2C at up to 900 GB/s bandwidth. The Grace Hopper Superchip has up to 64 PCIe Gen 5 lanes delivering up to 512 GB/s bandwidth and up to 18x NVLink 4 lanes delivering up to 900 GB/s to the NVLink Switch network.

NVIDIA NVLink-C2C is an NVIDIA memory coherent, high-bandwidth, and low-latency superchip interconnect. It is the heart of the Grace Hopper Superchip and delivers up to 900 GB/s total bandwidth. This is 7x higher bandwidth than x16 PCIe Gen5 lanes commonly used in accelerated systems.
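The 7x figure can be checked with the numbers quoted in this post. The sketch below assumes that PCIe bandwidth scales linearly with lane count and derives the per-lane value from the "64 PCIe Gen 5 lanes delivering up to 512 GB/s" figure in the logical overview; the variable names are illustrative.

```python
# Figures quoted in this post.
PCIE_GEN5_LANES = 64
PCIE_GEN5_TOTAL_GB_S = 512.0
NVLINK_C2C_TOTAL_GB_S = 900.0

# Assumption: bandwidth scales linearly with the number of lanes.
per_lane_gb_s = PCIE_GEN5_TOTAL_GB_S / PCIE_GEN5_LANES   # 8.0 GB/s per lane
pcie_x16_gb_s = 16 * per_lane_gb_s                        # 128.0 GB/s for an x16 link

ratio = NVLINK_C2C_TOTAL_GB_S / pcie_x16_gb_s             # about 7.03
print(f"NVLink-C2C vs x16 PCIe Gen5: {ratio:.2f}x")
```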

NVLink-C2C memory coherency increases developer productivity and performance and enables GPUs to access large amounts of memory. CPU and GPU threads can now concurrently and transparently access both CPU- and GPU-resident memory, enabling you to focus on algorithms instead of explicit memory management.

Memory coherency enables you to transfer only the data you need, and not migrate entire pages to and from the GPU. It also enables lightweight synchronization primitives across GPU and CPU threads by enabling native atomic operations from both the CPU and GPU. NVLink-C2C with Address Translation Services (ATS) leverages the NVIDIA Hopper Direct Memory Access (DMA) copy engines for accelerating bulk transfers of pageable memory across host and device.

NVLink-C2C enables applications to oversubscribe the GPU’s memory and directly utilize NVIDIA Grace CPU’s memory at high bandwidth. With up to 512 GB of LPDDR5X CPU memory per Grace Hopper Superchip, the GPU has direct high-bandwidth access to 4x more memory than what is available with HBM. Combined with the NVIDIA NVLink Switch System, all GPU threads running on up to 256 NVLink-connected GPUs can now access up to 150 TB of memory at high bandwidth. Fourth-generation NVLink enables accessing peer memory using direct loads, stores, and atomic operations, enabling accelerated applications to solve larger problems more easily than ever.

Together with NVIDIA networking technologies, Grace Hopper Superchips provide the recipe for the next generation of HPC supercomputers and AI factories. Customers can take on larger datasets, more complex models, and new workloads, solving them more quickly than before.

The main innovations of the NVIDIA Grace Hopper Superchip are as follows:

NVIDIA Grace CPU:
- Up to 72x Arm Neoverse V2 cores with Armv9.0-A ISA and 4×128-bit SIMD units per core.
- Up to 117 MB of L3 Cache.
- Up to 512 GB of LPDDR5X memory delivering up to 546 GB/s of memory bandwidth.
- Up to 64x PCIe Gen5 lanes.
- NVIDIA Scalable Coherency Fabric (SCF) mesh and distributed cache with up to 3.2 TB/s memory bandwidth.
- High developer productivity with a single CPU NUMA node.
NVIDIA Hopper GPU:
- Up to 144 SMs with fourth-generation Tensor Cores, Transformer Engine, DPX, and 3x higher FP32 and FP64 throughput compared to the NVIDIA A100 GPU.
- Up to 96 GB of HBM3 memory delivering up to 3000 GB/s.
- 60 MB L2 Cache.
- NVLink 4 and PCIe 5.
NVIDIA NVLink-C2C:
- Hardware-coherent interconnect between the Grace CPU and Hopper GPU.
- Up to 900 GB/s total bandwidth, 450 GB/s/dir.
- The Extended GPU Memory feature enables the Hopper GPU to address all CPU memory as GPU memory. Each Hopper GPU can address up to 608 GB of memory within a superchip.
NVIDIA NVLink Switch System:
- Connects up to 256x NVIDIA Grace Hopper Superchips using NVLink 4.
- Each NVLink-connected Hopper GPU can address all HBM3 and LPDDR5X memory of all superchips in the network, for up to 150 TB of GPU addressable memory.
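The capacity figures in the list above follow from simple arithmetic. Here is a minimal Python sketch, assuming the quoted GB values are simply summed and that the "150 TB" figure is a rounded value with 1 TB taken as 1,024 GB. The variable names are illustrative only.

```python
# Per-superchip capacities quoted in the list above (GB).
LPDDR5X_GB = 512       # Grace CPU memory
HBM3_GB = 96           # Hopper GPU memory
MAX_SUPERCHIPS = 256   # NVLink Switch System limit

per_superchip_gb = LPDDR5X_GB + HBM3_GB          # memory one GPU can address locally
network_gb = per_superchip_gb * MAX_SUPERCHIPS   # memory addressable across the network

# Assumption: 1 TB = 1,024 GB when converting the network total.
network_tb = network_gb / 1024

print(per_superchip_gb, network_gb, network_tb)
```

Under these assumptions the per-superchip total matches the 608 GB quoted above, and the network total lands slightly above the "up to 150 TB" figure.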

Programming model for performance, portability, and productivity

Traditional heterogeneous platforms with PCIe-connected accelerators require users to follow a complex programming model that involves manually managing device memory allocations and data transfer to and from the host.

The NVIDIA Grace Hopper Superchip platform is heterogeneous and easy to program, and NVIDIA is committed to making it accessible to all developers and applications, independent of the programming language of choice.

Both the Grace Hopper Superchip and the platform are built to enable you to pick the right language for the task at hand, and the NVIDIA CUDA LLVM Compiler APIs enable you to bring your preferred programming language to the CUDA platform with the same level of code-generation quality and optimizations as NVIDIA compilers and tools.

The languages provided by NVIDIA for the CUDA platform (Figure 3) include accelerated standard languages like ISO C++, ISO Fortran, and Python. The platform also supports directive-based programming models like OpenACC and OpenMP, as well as CUDA C++ and CUDA Fortran. The NVIDIA HPC SDK supports all these approaches, along with a rich set of accelerated libraries and tools for profiling and debugging.

The three pillars of the NVIDIA Grace Hopper Superchip programming models are built on top of accelerated libraries, frameworks, and SDKs. The first pillar is Accelerated Standard Languages and includes programming languages like ISO C++, ISO Fortran, and Python. The second pillar is Incremental Portable Optimization and includes OpenMP and OpenACC directives. Finally, maximum performance can be obtained by specializing applications for the CUDA platform using CUDA C++ or CUDA Fortran. — *Figure 3.* *NVIDIA Grace Hopper Superchip programming models*

NVIDIA is a member of the ISO C++ and ISO Fortran programming-language communities, which have enabled ISO C++ and ISO Fortran standard-compliant applications to run on both NVIDIA CPUs and NVIDIA GPUs without any language extensions. For more information about running ISO-conforming applications on GPUs, see Multi-GPU Programming with Standard Parallel C++ and Using Fortran Standard Parallel Programming For GPU Acceleration.

This technology relies heavily on the hardware-accelerated memory coherency provided by NVIDIA NVLink-C2C and NVIDIA Unified Virtual Memory. As shown in Figure 4, in traditional PCIe-connected x86+Hopper systems without ATS, the CPU and the GPU have independent per-process page tables, and system-allocated memory is not directly accessible from the GPU. When a program allocates memory with the system allocator but the page entry is not available in the GPU’s page table, then accessing the memory from a GPU thread fails.

Diagram shows that, on noncoherent platforms with disjoint page tables, when CPU or GPU threads attempt to access a page that is not available in their own separate page tables, the access faults. — *Figure 4. NVIDIA Hopper System with disjoint page tables*

In NVIDIA Grace Hopper Superchip-based systems, ATS enables the CPU and GPU to share a single per-process page table, enabling all CPU and GPU threads to access all system-allocated memory, which can reside on physical CPU or GPU memory. The CPU heap, CPU thread stack, global variables, memory-mapped files, and interprocess memory are accessible to all CPU and GPU threads.

Diagram shows that, on NVIDIA Grace Hopper Superchip systems with ATS, the CPU and GPU can both access the system page table. This enables GPU and CPU threads to access all system-allocated memory, independently of where it resides. — *Figure 5. Address Translation Services in an NVIDIA Grace Hopper Superchip system*

NVIDIA NVLink-C2C hardware coherency enables the Grace CPU to cache GPU memory at cache-line granularity, and enables the GPU and CPU to access each other’s memory without page migrations.

NVLink-C2C also accelerates all atomic operations supported by the CPU and GPU on system-allocated memory. Scoped atomic operations are fully supported and enable fine-grained and scalable synchronization across all threads in the system.

The runtime backs system-allocated memory with physical memory on first touch, either on LPDDR5X or HBM3, depending on whether a CPU or a GPU thread accesses it first. From an OS perspective, the Grace CPU and Hopper GPU are just two separate NUMA nodes. System-allocated memory is migratable so the runtime can change its physical memory backing to improve application performance or deal with memory pressure.
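The first-touch policy described above can be pictured with a toy model. The sketch below is purely illustrative: `FirstTouchMemory` and its methods are hypothetical names, not part of any NVIDIA runtime or API. It only models the rule that the first device to access a page decides where the page is physically backed, and that the backing can later be changed.

```python
class FirstTouchMemory:
    """Toy model: a page is backed by the memory of whichever device touches it first."""

    BACKING = {"cpu": "LPDDR5X", "gpu": "HBM3"}

    def __init__(self):
        self.placement = {}  # page number -> physical memory backing it

    def touch(self, page, device):
        # Only the first access decides placement; later accesses reuse it.
        if page not in self.placement:
            self.placement[page] = self.BACKING[device]
        return self.placement[page]

    def migrate(self, page, device):
        # Models the runtime changing the backing of an already-placed page.
        self.placement[page] = self.BACKING[device]


mem = FirstTouchMemory()
mem.touch(0, "cpu")    # page 0 lands in LPDDR5X
mem.touch(1, "gpu")    # page 1 lands in HBM3
mem.touch(0, "gpu")    # a later GPU access does not move page 0
mem.migrate(0, "gpu")  # hypothetical runtime decision to move page 0 to HBM3
print(mem.placement)
```

The real runtime makes migration decisions based on access patterns and memory pressure, which this model does not attempt to capture.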

For PCIe-based platforms such as x86 or Arm, you can use the same Unified Memory programming model as the NVIDIA Grace Hopper model. That is possible through the Heterogeneous Memory Management (HMM) feature, which is a combination of Linux kernel features and NVIDIA driver features that use software to emulate memory coherence between CPUs and GPUs.

On NVIDIA Grace Hopper, these applications transparently benefit from the higher bandwidth, lower latency, higher atomic throughput, and hardware acceleration for memory coherency provided by NVLink-C2C, without any software changes.

Superchip architectural features

Here’s a look at the main innovations of the NVIDIA Grace Hopper architecture:

- NVIDIA Grace CPU
- NVIDIA Hopper GPU
- NVLink-C2C
- NVLink Switch System
- Extended GPU memory

NVIDIA Grace CPU

As the parallel compute capabilities of GPUs continue to triple every generation, a fast and efficient CPU is critical to prevent the serial and CPU-only fractions of modern workloads from dominating performance.

NVIDIA Grace CPU is the first NVIDIA data center CPU, and it is built from the ground up to create HPC and AI superchips. Grace provides up to 72 Arm Neoverse V2 CPU cores with the Armv9.0-A ISA, and 4×128-bit wide SIMD units per core with support for Arm’s Scalable Vector Extensions 2 (SVE2) SIMD instruction set.

NVIDIA Grace delivers leading per-thread performance while providing higher energy efficiency than traditional CPUs. The 72 CPU cores deliver up to a 370 (estimated) score on SPECrate 2017_int_base, ensuring high performance to satisfy the demands of both HPC and AI heterogeneous workloads.

Bar chart shows simulations of the speedups and energy savings delivered by the Grace CPUs in NVIDIA Grace Hopper Superchips over AMD Milan 7763. OpenFOAM HPC Motorbike Large benchmark is 2.5x faster and uses 3.5x less energy. NEMO GYRE_PISCES, scaling factor nn_GYRE=25, benchmark is 1.6x faster and uses 2.2x less energy. BWA Whole Human Genome (HG002 30x) benchmark is 1.5x faster and uses 2.0x less energy. — *Figure 6. End-user application performance and energy saving simulations of the NVIDIA Grace CPU in the Grace Hopper Superchip vs AMD Milan 7763 shows that the Grace CPU is up to 2.5x faster while using 4x less energy*

Modern GPU workloads in machine learning and data science need access to huge amounts of memory. Typically, these workloads would have to use multiple GPUs to store the dataset in HBM memory.

The NVIDIA Grace CPU provides up to 512 GB of LPDDR5X memory, which delivers the optimal balance between memory capacity, energy efficiency, and performance. It supplies up to 546 GB/s of LPDDR5X memory bandwidth, which NVLink-C2C makes accessible to the GPU at 900 GB/s total bandwidth.

A single NVIDIA Grace Hopper Superchip provides the Hopper GPU with a total of 608 GB of fast-accessible memory, almost the total amount of slow memory available in a DGX-A100-80, an eight-GPU system of the previous generation.

This is made possible by the NVIDIA SCF shown in Figure 7, a mesh fabric and distributed cache that provides up to 3.2 TB/s of total bisection bandwidth to realize the full performance of CPU cores, memory, system I/Os, and NVLink-C2C. The CPU cores and SCF Cache partitions (SCCs) are distributed throughout the mesh, while Cache Switch Nodes (CSNs) route data through the fabric and serve as interfaces between the CPU cores, cache memory, and the rest of the system.

Diagram shows how two cores and two groups of SCC are connected to the CSNs that route data traffic to LPDDR5X, PCIe, and NVLink. — *Figure 7. SCF logical overview*

NVIDIA Hopper GPU

The NVIDIA Hopper GPU is the ninth-generation NVIDIA data center GPU. It is designed to deliver orders-of-magnitude improvements for large-scale AI and HPC applications compared to previous NVIDIA Ampere GPU generations. The Hopper GPU also features multiple innovations:

- New fourth-generation Tensor Cores perform faster matrix computations than ever before on an even broader array of AI and HPC tasks.
- A new transformer engine enables H100 to deliver up to 9x faster AI training and up to 30x faster AI inference speedups on large language models compared to the prior generation NVIDIA A100 GPU.
- Improved features for spatial and temporal data locality and asynchronous execution enable applications to always keep all units busy and maximize power efficiency.
- Secure Multi-Instance GPU (MIG) partitions the GPU into isolated, right-sized instances to maximize quality of service (QoS) for smaller workloads.

First chart shows performance improvements on HPC applications (climate modeling, genomics, Lattice QCD, or 3D FFT). Second chart shows AI inference latency improvements for Megatron Turing NLG 530B. Third chart shows AI training covering Mask R-CNN, GPT-3 with 16B and 175B parameters, DLRM with 14-TB Embedding tables, and MoE Switch-XXL with 395B parameters. — *Figure 8. NVIDIA Hopper GPU enables next-generation AI and HPC breakthroughs*

NVIDIA Hopper is the first truly asynchronous GPU. Its Tensor Memory Accelerator (TMA) and asynchronous transaction barrier enable threads to overlap and pipeline independent data movement and data processing, enabling applications to fully utilize all units.

New spatial and temporal locality features like thread block clusters, distributed shared memory, and thread block reconfiguration provide applications with fast access to larger amounts of shared memory and tools. This enables applications to better reuse data while it’s on-chip, further improving application performance.

On the left, the diagram shows a pipeline that overlaps independent data processing and data movement between producer and consumer threads, which enables keeping all units fully utilized. On the right, the bar chart shows the impact of leveraging Hopper’s new spatial and temporal locality features on application performance. — Figure 9. NVIDIA Hopper GPU asynchronous execution enables overlapping independent data-movement with computation (left). New spatial and temporal locality features improve application performance (right).

For more information, see NVIDIA H100 Tensor Core Architecture Overview and NVIDIA Hopper Architecture In-Depth.

NVLink-C2C: A high-bandwidth, chip-to-chip interconnect for superchips

NVIDIA Grace Hopper fuses an NVIDIA Grace CPU and NVIDIA Hopper GPU into a single superchip through the NVIDIA NVLink-C2C, a 900 GB/s chip-to-chip coherent interconnect that enables programming the Grace Hopper Superchip with a unified programming model.

The NVLink Chip-2-Chip (C2C) interconnect provides a high-bandwidth direct connection between a Grace CPU and a Hopper GPU to create the Grace Hopper Superchip, which is designed for drop-in acceleration of AI and HPC applications.

With 900 GB/s of bidirectional bandwidth, NVLink-C2C provides 7x the bandwidth of x16 PCIe Gen 5 links at lower latency. NVLink-C2C also uses only 1.3 picojoules per bit transferred, which makes it more than 5x as energy-efficient as PCIe Gen 5.
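As a rough check of these figures, the sketch below compares NVLink-C2C against an x16 PCIe Gen 5 link and estimates the interconnect power implied by 1.3 pJ/bit at full rate. It assumes the PCIe 5.0 signaling rate of 32 GT/s per lane per direction and ignores encoding and protocol overhead, so the numbers are approximations, not measured values.

```python
NVLINK_C2C_GBPS = 900        # GB/s, bidirectional (from the post)
PCIE_GEN5_GT_PER_LANE = 32   # GT/s per lane, per direction (PCIe 5.0 signaling rate)
LANES = 16

# Ignoring encoding overhead, 1 GT/s carries about 1 Gb/s; divide by 8 for GB/s.
pcie_per_dir = PCIE_GEN5_GT_PER_LANE * LANES / 8   # GB/s in one direction
pcie_bidir = 2 * pcie_per_dir                       # GB/s in both directions
ratio = NVLINK_C2C_GBPS / pcie_bidir

# Interconnect power at full rate: energy per bit times bits per second.
PJ_PER_BIT = 1.3
watts = PJ_PER_BIT * 1e-12 * NVLINK_C2C_GBPS * 1e9 * 8

print(f"PCIe x16 Gen 5: {pcie_bidir:.0f} GB/s, ratio: {ratio:.2f}x, C2C power: {watts:.2f} W")
```

Under these assumptions the ratio comes out at roughly 7x, consistent with the claim above, and moving data at the full 900 GB/s costs on the order of 10 W.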

Furthermore, NVLink-C2C is a coherent memory interconnect with native hardware support for system-wide atomic operations. This improves the performance of memory accesses to non-local memory, such as CPU and GPU threads accessing memory resident in the other device. Hardware coherency also improves the performance of synchronization primitives, reducing the time the GPU or CPU wait on each other and increasing total system utilization.

Finally, hardware coherency also simplifies the development of heterogeneous computing applications using popular programming languages and frameworks. For more information, see the NVIDIA Grace Hopper Programming Model section.

NVLink Switch System

The NVIDIA NVLink Switch System combines fourth-generation NVIDIA NVLink technology with the new third-generation NVIDIA NVSwitch. A single level of the NVSwitch connects up to eight Grace Hopper Superchips, and a second level in a fat-tree topology enables networking up to 256 Grace Hopper Superchips with NVLink. A Grace Hopper Superchip pair exchanges data at up to 900 GB/s.

With up to 256 Grace Hopper Superchips, the network delivers up to 115.2 TB/s all-to-all bandwidth. This is 9x the all-to-all bandwidth of the NVIDIA InfiniBand NDR400.
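The 115.2 TB/s and 9x figures can be reproduced with a short calculation. This sketch assumes that all-to-all bandwidth is the per-direction injection bandwidth of each superchip multiplied by the number of superchips, and that the InfiniBand comparison uses one NDR400 port (400 Gb/s per direction) per superchip. Both are assumptions made for illustration.

```python
SUPERCHIPS = 256
NVLINK_PER_DIR_GBPS = 450   # GB/s per direction per superchip (900 GB/s total)
NDR400_GBITPS = 400         # Gb/s per direction per InfiniBand NDR400 port

nvlink_all_to_all_tbps = SUPERCHIPS * NVLINK_PER_DIR_GBPS / 1000   # TB/s
ndr_per_dir_gbps = NDR400_GBITPS / 8                                # GB/s per direction
ndr_all_to_all_tbps = SUPERCHIPS * ndr_per_dir_gbps / 1000          # TB/s
speedup = nvlink_all_to_all_tbps / ndr_all_to_all_tbps

print(nvlink_all_to_all_tbps, ndr_all_to_all_tbps, round(speedup, 2))
```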

Logical overview of the NVLink 4 NVSwitch chip including the PHY Lanes, PORT logic and SHARP accelerators, and cross-bar. — *Figure 10. Logical overview of NVIDIA NVLink 4 NVSwitch*

The fourth-generation NVIDIA NVLink technology enables GPU threads to address up to 150 TB of memory provided by all superchips in the NVLink network using normal memory operations, atomic operations, and bulk transfers. Communication libraries like MPI, NCCL, or NVSHMEM transparently leverage the NVLink Switch System when available.

Extended GPU memory

The NVIDIA Grace Hopper Superchip is designed to accelerate applications with exceptionally large memory footprints, larger than the capacity of the HBM3 and LPDDR5X memory of a single superchip. For more information, see the NVIDIA Grace Hopper Accelerated Applications section.

The Extended GPU Memory (EGM) feature over the high-bandwidth NVLink-C2C enables GPUs to access all the system memory efficiently. EGM provides up to 150 TB of system memory in a multi-node NVSwitch-connected system. With EGM, physical memory can be allocated to be accessible from any GPU thread in the multi-node system. All GPUs can access EGM at the minimum of GPU-GPU NVLink or NVLink-C2C speed.

Memory accesses within a Grace Hopper Superchip configuration go through the local high-bandwidth NVLink-C2C at 900 GB/s total. Remote memory accesses are performed through GPU NVLink and, depending on the memory being accessed, also NVLink-C2C (Figure 11). With EGM, GPU threads can now access all memory resources available over the NVSwitch fabric, both LPDDR5X and HBM3, at 450 GB/s.

Diagram shows the access paths taken by memory accesses from a Hopper GPU to its local LPDDR5X (path via C2C), to a peer GPU HBM3 (path via NVLink), and to a peer CPU LPDDR5X (path via NVLink to peer GPU, then C2C to LPDDR5X). — *Figure 11. Memory accesses across NVLink-connected Grace Hopper Superchips*
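The access paths in Figure 11 and the "minimum of GPU-GPU NVLink or NVLink-C2C speed" rule can be expressed as a small lookup. This is an illustrative sketch only: the names `PATHS`, `LINK_GBPS`, and `path_bandwidth` are hypothetical, and the per-direction link speeds are taken from the figures quoted in this post.

```python
NVLINK_GBPS = 450   # GPU-to-GPU NVLink, per direction
C2C_GBPS = 450      # NVLink-C2C, per direction (900 GB/s total)

LINK_GBPS = {"nvlink": NVLINK_GBPS, "c2c": C2C_GBPS}

# Links traversed from a Hopper GPU to each kind of memory, per Figure 11.
PATHS = {
    "local_lpddr5x": ["c2c"],
    "peer_hbm3": ["nvlink"],
    "peer_lpddr5x": ["nvlink", "c2c"],
}

def path_bandwidth(target):
    """Return the bottleneck (slowest) link bandwidth in GB/s on the path to target."""
    return min(LINK_GBPS[link] for link in PATHS[target])

for target in PATHS:
    print(target, PATHS[target], path_bandwidth(target))
```

Because both link types run at the same per-direction speed in this model, every remote path is bounded by 450 GB/s, which matches the figure given for EGM accesses over the NVSwitch fabric.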

NVIDIA HGX Grace Hopper

NVIDIA HGX Grace Hopper has a single Grace Hopper Superchip per node, paired with BlueField-3 NICs or OEM-defined I/O and optionally an NVLink Switch System. It can be air- or liquid-cooled and has up to 1,000 W TDP.

NVIDIA HGX Grace Hopper with InfiniBand

NVIDIA HGX Grace Hopper with InfiniBand (Figure 13) is ideal for the scale-out of traditional machine learning (ML) and HPC workloads that are not bottlenecked by the network communication overheads of InfiniBand, which is one of the fastest interconnects available.

Each node contains one Grace Hopper Superchip and one or more PCIe devices like NVMe solid-state drives and BlueField-3 DPUs, NVIDIA ConnectX-7 NICs, or OEM-defined I/O. With 16x PCIe Gen 5 lanes, an NDR400 InfiniBand NIC provides up to 100 GB/s of total bandwidth across the superchips. Combined with NVIDIA BlueField-3 DPUs, this platform is easy to manage and deploy and uses a traditional HPC and AI cluster networking architecture.

Diagram shows an NVIDIA HGX Grace Hopper Superchip with InfiniBand networking system. There is hardware coherency within each Grace Hopper Superchip. Each Superchip is connected with a BlueField-3 DPU through PCIe, which are then connected at 100 GB/s total bandwidth with NVIDIA Quantum-2 InfiniBand NDR400 Switches. — *Figure 13. NVIDIA HGX Grace Hopper with InfiniBand for scale-out ML and HPC workloads*

NVIDIA HGX Grace Hopper with NVLink Switch

NVIDIA HGX Grace Hopper with NVLink Switch is ideal for strong-scaling giant machine learning and HPC workloads. It enables all GPU threads in the NVLink-connected domain to address up to 150 TB of memory at up to 900 GB/s total bandwidth per superchip in a 256-GPU NVLink-connected system. The simple programming model uses pointer load, store, and atomic operations. Its 450 GB/s all-reduce bandwidth and up to 115.2 TB/s bisection bandwidth make this platform ideal for strong-scaling the world’s largest and most challenging AI training and HPC workloads.

NVLink-connected domains are networked with NVIDIA InfiniBand networking, for example, NVIDIA ConnectX-7 NICs or NVIDIA BlueField-3 data processing units (DPUs) paired with NVIDIA Quantum-2 NDR switches or OEM-defined I/O solutions.

Diagram shows an NVIDIA HGX Grace Hopper with NVLink Switch System. There is hardware coherency within each Grace Hopper Superchip. Each Grace Hopper Superchip within a cluster of up to 256 Grace Hopper Superchips is connected with each other via the NVLink Switch System. Each Superchip is also connected with a BlueField-3 DPU through PCIe, which are then connected at 100 GB/s total bandwidth with NVIDIA Quantum-2 InfiniBand NDR400 Switches. — *Figure 14. NVIDIA HGX Grace Hopper with NVLink Switch System for strong-scaling giant ML and HPC workloads*

Delivering performance breakthroughs

The NVIDIA Grace Hopper Superchip Architecture whitepaper expands on the details covered in this post. It walks you through how Grace Hopper delivers the performance breakthroughs shown in Figure 1 over what are currently the most powerful PCIe-based accelerated platforms, powered by NVIDIA Hopper H100 PCIe GPUs.

Do you have any applications that would be perfect for the NVIDIA Grace Hopper Superchip? Let us know in the comments!

Acknowledgments

We would like to thank Jack Choquette, Ronny Krashinsky, John Hubbard, Mark Hummel, Greg Palmer, Ryan Wells, Alex Ishii, Jonah Alben, and the many NVIDIA architects and engineers who contributed to this post.

Misc

TIME Magazine Names NVIDIA Instant NeRF a Best Invention of 2022

Post date November 10, 2022

TIME Magazine named NVIDIA Instant NeRF, a technology capable of transforming 2D images into 3D scenes, one of the Best Inventions of 2022.

“Before NVIDIA Instant NeRF, creating 3D scenes required specialized equipment, expertise, and lots of time and money. Now it just takes a few photos and a few minutes,” TIME writes in their release.

The 3D rendering tool was introduced at SIGGRAPH 2022, the world’s largest conference for computer graphics and interactive techniques.

At SIGGRAPH, NVIDIA researchers Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller submitted their paper, Instant Neural Graphics Primitives with a Multiresolution Hash Encoding. The innovative research quickly gained popularity, winning the SIGGRAPH 2022 Technical Papers Award.

With tens of thousands of downloads and more than 9,700 stars on the Instant NeRF GitHub page, the tool has been embraced by developers and 3D content creators for its ability to create stunning 3D scenes.

What is a NeRF?

Neural Radiance Fields (NeRF) are neural networks capable of generating 3D images or scenes from a set of 2D images. Using spatial location and volumetric rendering, the model uses the camera pose from the images to render the 3D space of the scene.
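The volumetric rendering step can be sketched with the compositing rule used in the original NeRF formulation: each sample along a camera ray contributes its color, weighted by its own opacity and by the transmittance accumulated in front of it. The sketch below is pure Python, with hand-picked densities and colors standing in for the network's outputs. It illustrates the general technique and is not Instant NeRF's implementation; the function name `composite_ray` is hypothetical.

```python
import math

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one ray, front to back.

    densities: volume density (sigma) at each sample
    colors:    (r, g, b) tuple at each sample
    deltas:    distance covered by each sample along the ray
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha           # contribution of this sample
        for ch in range(3):
            pixel[ch] += weight * color[ch]
        transmittance *= 1.0 - alpha             # light remaining behind this sample
    return pixel, transmittance


# Two samples: a faint red one in front of a denser green one.
pixel, remaining = composite_ray(
    densities=[0.5, 2.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0],
)
print(pixel, remaining)
```

In a real NeRF, the densities and colors come from querying the trained network at points sampled along each ray, and the ray itself is derived from the camera pose.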

NeRFs are computationally intensive and historically required many hours for rendering. However, Instant NeRFs give users the power to render an image or scene quickly and accurately with a small number of images. You can generate a scene in seconds and the longer the model trains the more details of a scene are rendered.

Video 1. Amalfi Coast created with Instant NeRF. Credit: Jonathan Stephens

Exploring NeRFs

To support developers adopting Instant NeRF, NVIDIA hosted an Instant NeRF Sweepstakes over the summer. The event encouraged contestants to explore their creative abilities using Instant NeRF for the chance to win a GeForce RTX 3090. The sweepstakes reached over 2.7 million people on Twitter.

Since the code release, a large community of creators ranging from AI researchers to photographers have demonstrated their Instant NGP skills by making their own NeRFs.

“NeRF is a way to freeze a moment in time that is more immersive than a photograph or a video. It’s a way to recreate a moment – the whole moment. NeRF is the natural extension of photogrammetry and the evolution of modern photography,” said Michael Rubloff of Franc Lucent, an early explorer in NeRF technology.

Video 2. NeRF of Zeus. Credit: Hugues Bruyère

“Museums are great exploration fields for interactive and immersive experiences. Volumetric scenes like this one can also directly contribute to historical heritage conservation, with a modern and immersive twist,” said Hugues Bruyère, partner and chief of innovation at the Montreal-based creative studio Dpt.

The community has taken NeRFs to another level by expressing themselves and their art through this new form of photography. Some have even found ways to make NeRFs impactful in how they work.

Video 3. A NeRF scene of a woman meditating. Credit: Franc Lucent

Make your own NeRFs

This remarkable technology is available for anyone to try out!

Check out this Getting Started with Instant NeRF post on how to set up the code and create your first NeRF. Or skip right to the code and try it out for yourself by visiting the NV Labs Instant NeRF GitHub.

You can also see how various artists and professionals have used Instant NeRF in their projects.

Video 4. Using NeRF to scan mirror surfaces. Credit: Karen X. Cheng

Featured image: The Purdue Engineering Fountain. Credit: Jonathan Stephens

Misc

Give the Gift of Gaming With GeForce NOW Gift Cards

Post date November 10, 2022

The holiday season is approaching, and GeForce NOW has everyone covered. This GFN Thursday brings an easy way to give the gift of gaming with GeForce NOW gift cards, for yourself or for a gamer in your life. Plus, stream 10 new games from the cloud this week, including the first story downloadable content (DLC) Read article >

The post Give the Gift of Gaming With GeForce NOW Gift Cards appeared first on NVIDIA Blog.

Misc

Accelerate Enterprise Apps with Microsoft Azure Stack HCI and NVIDIA BlueField DPUs

Post date November 10, 2022

As enterprises continue to shift workloads to the cloud, some applications need to remain on-premises to maximize latency performance and meet security, data sovereignty, and compliance policies. Microsoft Azure Stack HCI is a hyperconverged infrastructure (HCI) stack delivered as an Azure service. Providing built-in security and manageability, Azure Stack HCI is ideally positioned to run production workloads and cloud-native apps in core and edge data centers. 

NVIDIA BlueField data processing unit (DPU) is an accelerated data center infrastructure platform that unleashes application performance and system efficiency. BlueField DPUs help cloud-minded enterprises overcome performance and scalability bottlenecks in modern IT environments. This is achieved by offloading, accelerating, and isolating software-defined infrastructure workloads.

Marking a major leap forward in performance and productivity, Microsoft has demonstrated a prototype of the Azure Stack HCI platform accelerated on NVIDIA BlueField-2 DPUs, delivering 12x CPU resource efficiency and 60% higher throughput. The DPU-accelerated HCI platform enables significant total cost of ownership (TCO) savings by requiring fewer servers and lower power and space to operate a given workload.

Performance and efficiency gains

Azure Stack HCI is a software-defined platform that runs the Azure networking stack for connecting virtual machines (VMs) and containers together. Software-defined networks deliver rich functionality and great flexibility and enable enterprises to easily scale from a single on-premises data center to hybrid and multi-cloud environments.

Despite the many benefits of software-defined networks, those that run exclusively on CPUs are resource constrained. They’re known for stealing away expensive cores that would otherwise be used for running business applications, taking a toll on performance and scalability. In addition, software-defined network (SDN) technologies have had a longstanding conflict with hardware-accelerated networking (namely SR-IOV). This forces cloud architects to prioritize one over the other, often at the cost of poor application performance or higher TCO.

NVIDIA BlueField DPUs are designed to deliver the best of both worlds: the functionality and agility of SDNs with the performance and efficiency of hardware-accelerated networking (SR-IOV). BlueField offloads the entire SDN workload from the host CPU, freeing up CPU cores for line-of-business applications and creating major TCO savings.

Running in the BlueField Arm processor, the SDN pipeline is mapped to the BlueField-accelerated programmable pipeline to increase throughput and packet processing performance and reduce latency. By offloading and accelerating functions of the Azure Stack HCI SDN on NVIDIA BlueField-2 DPUs, the Microsoft team is able to achieve massive performance and efficiency gains (Figure 1).

Diagrams showing network throughput and CPU core savings. x86 CPU delivered 60 Gbps throughput at 8 CPU cores compared to 96 Gbps at zero CPU utilization delivered with BlueField-2. — *Figure 1. Accelerating the Azure Stack HCI on NVIDIA BlueField-2 DPUs results in network throughput gains (left) and CPU core savings (right)*

The test results indicate that BlueField-2 delivers line-rate, software-defined networking at 96 Gb/s with practically zero CPU utilization, compared to only 60 Gb/s achieved using 8 CPU cores on the x86 host. Those 8 cores are now freed up to run business workloads with 60% faster networking.

To show an apples-to-apples comparison, and because the x86 CPU was not able to reach ~100 Gb/s in the test, the number of CPU cores that would have been required to support 96 Gb/s has been extrapolated. Figure 1 shows that 12 CPU cores would have been required to support 96 Gb/s, which means BlueField-2 saves 12 CPU cores to support the same throughput.
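The throughput gain and the core extrapolation can be checked with a few lines of arithmetic. The sketch below assumes throughput scales linearly with the number of CPU cores, which is a simplification and not necessarily the method used for Figure 1. Under that assumption the estimate lands close to, but not exactly at, the 12 cores reported.

```python
CPU_GBPS, CPU_CORES = 60, 8   # measured on the x86 host
DPU_GBPS = 96                 # measured with BlueField-2 offload

throughput_gain = DPU_GBPS / CPU_GBPS - 1        # fractional gain over the CPU path
per_core_gbps = CPU_GBPS / CPU_CORES             # Gb/s handled by each CPU core
cores_needed_linear = DPU_GBPS / per_core_gbps   # cores for 96 Gb/s if scaling were linear

print(f"gain: {throughput_gain:.0%}, per core: {per_core_gbps} Gb/s, "
      f"linear estimate: {cores_needed_linear:.1f} cores")
```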

These savings enable enterprises to design, deploy, and operate fewer servers to deliver the same business outcomes, or alternatively, to achieve better outcomes with the same number of servers.

From a functional standpoint, the Microsoft Azure Stack HCI platform supports a wide range of SDN policies: VXLAN network encapsulation and decapsulation (encap/decap), access-control list (ACL), quality-of-service (QoS), IP-in-IP encapsulation, network address translation (NAT), IPsec encryption/decryption, and more. The traditional configuration of accelerated networking (SR-IOV) for a VM would require bypassing this rich set of policies for the benefit of achieving higher performance.

The Azure Stack HCI solution prototype, now offloaded and accelerated on the BlueField DPU, delivers advanced SDN policies without sacrificing performance or efficiency. See Evolving Networking with a DPU-Powered Edge for more details.

Summary

While shifting workloads to the cloud, enterprises continue to prioritize on-premises and edge data centers to meet stringent performance and security requirements. Next-generation applications including machine learning and artificial intelligence increasingly rely on accelerated networking to realize their full potential. The Microsoft Azure Stack HCI on NVIDIA BlueField DPUs prototype delivers on the promise of hybrid cloud and accelerated networking for businesses at every scale.

Learn more about NVIDIA BlueField DPUs.

Misc

NVIDIA AI Turbocharges Industrial Research, Scientific Discovery in the Cloud on Rescale HPC-as-a-Service Platform

Post date November 9, 2022

Just like many businesses, the world of industrial scientific computing has a data problem. Solving seemingly intractable challenges — from developing new energy sources and creating new modes of transportation, to addressing mission-critical issues such as driving operational efficiencies and improving customer support — requires massive amounts of high performance computing. Instead of having to Read article >

The post NVIDIA AI Turbocharges Industrial Research, Scientific Discovery in the Cloud on Rescale HPC-as-a-Service Platform appeared first on NVIDIA Blog.

Misc

What Is Denoising?

Anyone who’s taken a photo with a digital camera is likely familiar with a “noisy” image: discolored spots that make the photo lose clarity and sharpness. Many photographers have tips and tricks to reduce noise in images, including fixing the settings on the camera lens or taking photos in different lighting. But it isn’t just Read article >

The post What Is Denoising? appeared first on NVIDIA Blog.

Misc

NVIDIA Hopper, Ampere GPUs Sweep Benchmarks in AI Training

Post date November 9, 2022

Two months after their debut sweeping MLPerf inference benchmarks, NVIDIA H100 Tensor Core GPUs set world records across enterprise AI workloads in the industry group’s latest tests of AI training. Together, the results show H100 is the best choice for users who demand utmost performance when creating and deploying advanced AI models. MLPerf is the Read article >

The post NVIDIA Hopper, Ampere GPUs Sweep Benchmarks in AI Training appeared first on NVIDIA Blog.