Whether focused on tiny atoms or the immensity of outer space, supercomputing workloads benefit from the flexibility that the largest systems provide scientists and researchers. To meet the needs of organizations with such large AI and high performance computing (HPC) workloads, Dell Technologies today unveiled the Dell PowerEdge XE9680 system, its first system with …
A finalist for the Gordon Bell special prize for high performance computing-based COVID-19 research has taught large language models (LLMs) a new lingo, gene sequences, that can unlock insights in genomics, epidemiology and protein engineering. Published in October, the groundbreaking work is a collaboration by more than two dozen academic and commercial researchers …
SC22 — NVIDIA today announced broad adoption of its next-generation H100 Tensor Core GPUs and Quantum-2 InfiniBand, including new offerings on Microsoft Azure cloud and 50+ new partner systems for accelerating scientific discovery.
Posted by Jacky Liang, Research Intern, and Andy Zeng, Research Scientist, Robotics at Google
A common approach used to control robots is to program them with code to detect objects, sequencing commands to move actuators, and feedback loops to specify how the robot should perform a task. While these programs can be expressive, re-programming policies for each new task can be time consuming, and requires domain expertise.
What if when given instructions from people, robots could autonomously write their own code to interact with the world? It turns out that the latest generation of language models, such as PaLM, are capable of complex reasoning and have also been trained on millions of lines of code. Given natural language instructions, current language models are highly proficient at writing not only generic code but, as we’ve discovered, code that can control robot actions as well. When provided with several example instructions (formatted as comments) paired with corresponding code (via in-context learning), language models can take in new instructions and autonomously generate new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime. More broadly, this suggests an alternative approach to using machine learning for robots that (i) pursues generalization through modularity and (ii) leverages the abundance of open-source code and data available on the Internet.
Given code for an example task (left), language models can re-compose API calls to assemble new robot behaviors for new tasks (right) that use the same functions but in different ways.
To explore this possibility, we developed Code as Policies (CaP), a robot-centric formulation of language model-generated programs executed on physical systems. CaP extends our prior work, PaLM-SayCan, by enabling language models to complete even more complex robotic tasks with the full expression of general-purpose Python code. With CaP, we propose using language models to directly write robot code through few-shot prompting. Our experiments demonstrate that outputting code led to improved generalization and task performance over directly learning robot tasks and outputting natural language actions. CaP allows a single system to perform a variety of complex and varied robotic tasks without task-specific training.
A Different Way to Think about Robot Generalization
To generate code for a new task given natural language instructions, CaP uses a code-writing language model that, when prompted with hints (i.e., import statements that inform which APIs are available) and examples (instruction-to-code pairs that present few-shot “demonstrations” of how instructions should be converted into code), writes new code for new instructions. Central to this approach is hierarchical code generation, which prompts language models to recursively define new functions, accumulate their own libraries over time, and self-architect a dynamic codebase. Hierarchical code generation improves the state of the art on both robotics and standard code-gen benchmarks in natural language processing (NLP) subfields, with 39.8% pass@1 on HumanEval, a benchmark of hand-written coding problems used to measure the functional correctness of synthesized programs.
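The prompting scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the CaP implementation: the names `build_prompt`, `undefined_calls`, `generate_hierarchically`, the `robot_api` import hint, the `put_first_on_second` primitive, and the stub `toy_lm` are all hypothetical, and a real system would query a code-writing language model where the stub is used.

```python
import ast
import builtins


def build_prompt(hints, examples, instruction):
    """Assemble a few-shot prompt: hints, comment/code example pairs, new instruction."""
    parts = list(hints)
    for text, code in examples:
        parts.append(f"# {text}\n{code}")
    parts.append(f"# {instruction}")
    return "\n".join(parts)


def undefined_calls(code, known):
    """Return names of functions that `code` calls but that are not yet available."""
    tree = ast.parse(code)
    called = {node.func.id for node in ast.walk(tree)
              if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)}
    defined = {node.name for node in ast.walk(tree)
               if isinstance(node, ast.FunctionDef)}
    return sorted(called - defined - set(known) - set(dir(builtins)))


def generate_hierarchically(instruction, lm, known, hints, examples):
    """Generate code, then recursively ask the model to define any missing helpers."""
    code = lm(build_prompt(hints, examples, instruction))
    helpers = []
    for name in undefined_calls(code, known):
        known.add(name)  # mark as requested so it is only defined once
        helpers.append(generate_hierarchically(
            f"define function: {name}", lm, known, hints, examples))
    return "\n".join(helpers + [code])


def toy_lm(prompt):
    """Hypothetical stand-in for a language model; keyed on the last prompt line."""
    if prompt.splitlines()[-1] == "# define function: stack":
        return "def stack(a, b):\n    put_first_on_second(a, b)"
    return "stack('red block', 'blue block')"


hints = ["from robot_api import put_first_on_second"]  # hypothetical API hint
examples = [("put the red block on the blue block",
             "put_first_on_second('red block', 'blue block')")]
program = generate_hierarchically(
    "stack the red block on the blue block",
    toy_lm, {"put_first_on_second"}, hints, examples)
print(program)
```

The design point is that the first completion may call a function that does not exist yet (`stack` here); the generator detects that from the syntax tree and issues a follow-up request to define it, so helper definitions accumulate ahead of the code that uses them.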
Code-writing language models can express a variety of arithmetic operations and feedback loops grounded in language. Pythonic language model programs can use classic logic structures, e.g., sequences, selection (if/else), and loops (for/while), to assemble new behaviors at runtime. They can also use third-party libraries to interpolate points (NumPy), analyze and generate shapes (Shapely) for spatial-geometric reasoning, etc. These models not only generalize to new instructions, but they can also translate precise values (e.g., velocities) to ambiguous descriptions (“faster” and “to the left”) depending on the context to elicit behavioral commonsense.
Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.
CaP generalizes at a specific layer in the robot: interpreting natural language instructions, processing perception outputs (e.g., from off-the-shelf object detectors), and then parameterizing control primitives. This fits into systems with factorized perception and control, and imparts a degree of generalization (acquired from pre-trained language models) without the magnitude of data collection needed for end-to-end robot learning. CaP also inherits language model capabilities that are unrelated to code writing, such as supporting instructions with non-English languages and emojis.
CaP inherits the capabilities of language models, such as multilingual and emoji support.
By characterizing the types of generalization encountered in code generation problems, we can also study how hierarchical code generation improves generalization. For example, “systematicity” evaluates the ability to recombine known parts to form new sequences, “substitutivity” evaluates robustness to synonymous code snippets, while “productivity” evaluates the ability to write policy code longer than those seen in the examples (e.g., for new long horizon tasks that may require defining and nesting new functions). Our paper presents a new open-source benchmark to evaluate language models on a set of robotics-related code generation problems. Using this benchmark, we find that, in general, bigger models perform better across most metrics, and that hierarchical code generation improves “productivity” generalization the most.
Performance on our RoboCodeGen Benchmark across different generalization types. The larger model (Davinci) performs better than the smaller model (Cushman), with hierarchical code generation improving productivity the most.
We’re also excited about the potential for code-writing models to express cross-embodied plans for robots with different morphologies that perform the same task differently depending on the available APIs (perception action spaces), which is an important aspect of any robotics foundation model.
Language model code-generation exhibits cross-embodiment capabilities, completing the same task in different ways depending on the available APIs (that define perception action spaces).
Limitations
Code as policies today are restricted by the scope of (i) what the perception APIs can describe (e.g., few visual-language models to date can describe whether a trajectory is “bumpy” or “more C-shaped”), and (ii) which control primitives are available. Only a handful of named primitive parameters can be adjusted without over-saturating the prompts. Our approach also assumes all given instructions are feasible, and we cannot tell if generated code will be useful a priori. CaPs also struggle to interpret instructions that are significantly more complex or operate at a different abstraction level than the few-shot examples provided to the language model prompts. Thus, for example, in the tabletop domain, it would be difficult for our specific instantiation of CaPs to “build a house with the blocks” since there are no examples of building complex 3D structures. These limitations point to avenues for future work, including extending visual language models to describe low-level robot behaviors (e.g., trajectories) or combining CaPs with exploration algorithms that can autonomously add to the set of control primitives.
Open-Source Release
We have released the code needed to reproduce our experiments and an interactive simulated robot demo on the project website, which also contains additional real-world demos with videos and generated code.
Conclusion
Code as policies is a step towards robots that can modify their behaviors and expand their capabilities accordingly. This can be enabling, but the flexibility also raises potential risks since synthesized programs (unless manually checked per runtime) may result in unintended behaviors with physical hardware. We can mitigate these risks with built-in safety checks that bound the control primitives that the system can access, but more work is needed to ensure new combinations of known primitives are equally safe. We welcome broad discussion on how to minimize these risks while maximizing the potential positive impacts towards more general-purpose robots.
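One way to picture a built-in safety check that bounds a control primitive is a wrapper that clamps speed and rejects targets outside a fixed workspace. The sketch below is only an illustration of that idea under assumed names (`make_bounded_primitive`, `move_to`, the workspace limits); it does not describe the checks used in the released system.

```python
def make_bounded_primitive(primitive, max_speed, workspace):
    """Wrap a motion primitive so generated code cannot exceed fixed limits."""
    (x_min, x_max), (y_min, y_max) = workspace

    def bounded(x, y, speed):
        # Refuse targets outside the allowed rectangle instead of moving there.
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            raise ValueError(f"target ({x}, {y}) is outside the allowed workspace")
        # Clamp the requested speed into [0, max_speed].
        safe_speed = max(0.0, min(speed, max_speed))
        return primitive(x, y, safe_speed)

    return bounded


commands = []


def move_to(x, y, speed):
    """Hypothetical low-level primitive; here it only records the command."""
    commands.append((x, y, speed))


safe_move_to = make_bounded_primitive(
    move_to, max_speed=0.2, workspace=((0.0, 1.0), (0.0, 1.0)))
safe_move_to(0.5, 0.5, 5.0)  # recorded with the speed clamped to 0.2
```

Generated programs would then be given `safe_move_to` rather than `move_to`. As the text notes, bounding each primitive does not by itself guarantee that new combinations of primitives are safe.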
Acknowledgements
This research was done by Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, and Andy Zeng. Special thanks to Vikas Sindhwani and Vincent Vanhoucke for helpful feedback on writing, and to Chad Boodoo for operations and hardware support. An early preprint is available on arXiv.
Posted by Rishabh Agarwal, Senior Research Scientist, and Max Schwarzer, Student Researcher, Google Research, Brain Team
Reinforcement learning (RL) is an area of machine learning that focuses on training intelligent agents using related experiences so they can learn to solve decision making tasks, such as playing video games, flying stratospheric balloons, and designing hardware chips. Due to the generality of RL, the prevalent trend in RL research is to develop agents that can efficiently learn tabula rasa, that is, from scratch without using previously learned knowledge about the problem. However, in practice, tabula rasa RL systems are typically the exception rather than the norm for solving large-scale RL problems. Large-scale RL systems, such as OpenAI Five, which achieves human-level performance on Dota 2, undergo multiple design changes (e.g., algorithmic or architectural changes) during their developmental cycle. This modification process can last months and necessitates incorporating such changes without re-training from scratch, which would be prohibitively expensive.
Furthermore, the inefficiency of tabula rasa RL research can exclude many researchers from tackling computationally-demanding problems. For example, the quintessential benchmark of training a deep RL agent on 50+ Atari 2600 games in ALE for 200M frames (the standard protocol) requires 1,000+ GPU days. As deep RL moves towards more complex and challenging problems, the computational barrier to entry in RL research will likely become even higher.
To address the inefficiencies of tabula rasa RL, we present “Reincarnating Reinforcement Learning: Reusing Prior Computation To Accelerate Progress” at NeurIPS 2022. Here, we propose an alternative approach to RL research, where prior computational work, such as learned models, policies, logged data, etc., is reused or transferred between design iterations of an RL agent or from one agent to another. While some sub-areas of RL leverage prior computation, most RL agents are still largely trained from scratch. Until now, there has been no broader effort to leverage prior computational work for the training workflow in RL research. We have also released our code and trained agents to enable researchers to build on this work.
Tabula rasa RL vs. Reincarnating RL (RRL). While tabula rasa RL focuses on learning from scratch, RRL is based on the premise of reusing prior computational work (e.g., prior learned agents) when training new agents or improving existing agents, even in the same environment. In RRL, new agents need not be trained from scratch, except for initial forays into new problems.
Why Reincarnating RL?
Reincarnating RL (RRL) is a more compute and sample-efficient workflow than training from scratch. RRL can democratize research by allowing the broader community to tackle complex RL problems without requiring excessive computational resources. Furthermore, RRL can enable a benchmarking paradigm where researchers continually improve and update existing trained agents, especially on problems where improving performance has real-world impact, such as balloon navigation or chip design. Finally, real-world RL use cases will likely be in scenarios where prior computational work is available (e.g., existing deployed RL policies).
RRL as an alternative research workflow. Imagine a researcher who has trained an agent A1 for some time, but now wants to experiment with better architectures or algorithms. While the tabula rasa workflow requires retraining another agent from scratch, RRL provides the more viable option of transferring the existing agent A1 to another agent and training this agent further, or simply fine-tuning A1.
While there have been some ad hoc large-scale reincarnation efforts with limited applicability, e.g., model surgery in Dota2, policy distillation in Rubik’s cube, PBT in AlphaStar, RL fine-tuning a behavior-cloned policy in AlphaGo / Minecraft, RRL has not been studied as a research problem in its own right. To this end, we argue for developing general-purpose RRL approaches as opposed to prior ad-hoc solutions.
Case Study: Policy to Value Reincarnating RL
Different RRL problems can be instantiated depending on the kind of prior computational work provided. As a step towards developing broadly applicable RRL approaches, we present a case study on the setting of Policy to Value reincarnating RL (PVRL) for efficiently transferring an existing sub-optimal policy (teacher) to a standalone value-based RL agent (student). While a policy directly maps a given environment state (e.g., a game screen in Atari) to an action, value-based agents estimate the effectiveness of an action at a given state in terms of achievable future rewards, which allows them to learn from previously collected data.
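To make the last point concrete, here is a small tabular sketch of a value-based learner that improves purely from previously collected transitions. It is an assumption-laden toy (the two-state log, `q_learning_from_log`, and `greedy_action` are invented for illustration) and is far simpler than the deep Q-network agents discussed in this post.

```python
def q_learning_from_log(transitions, n_actions, lr=0.5, gamma=0.9, sweeps=50):
    """Estimate action values by replaying logged (s, a, r, s', done) tuples."""
    q = {}
    for _ in range(sweeps):
        for state, action, reward, next_state, done in transitions:
            row = q.setdefault(state, [0.0] * n_actions)
            if done:
                next_value = 0.0
            else:
                next_value = max(q.setdefault(next_state, [0.0] * n_actions))
            # Move the estimate toward reward + discounted best next value.
            row[action] += lr * (reward + gamma * next_value - row[action])
    return q


def greedy_action(q, state):
    """Act by picking the action with the highest estimated value."""
    row = q[state]
    return row.index(max(row))


# Hypothetical logged data: from s0, action 0 leads to s1 and action 1 ends the
# episode with a small reward; from s1, action 0 ends with reward 1.
logged = [
    ("s0", 0, 0.0, "s1", False),
    ("s0", 1, 0.1, "end", True),
    ("s1", 0, 1.0, "end", True),
    ("s1", 1, 0.0, "end", True),
]
q_values = q_learning_from_log(logged, n_actions=2)
print(greedy_action(q_values, "s0"))
```

No new environment interaction is needed here: the value estimates are fit entirely from the log, which is the property that makes value-based students a natural target for reusing a teacher's data.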
For a PVRL algorithm to be broadly useful, it should satisfy the following requirements:
Teacher Agnostic: The student shouldn’t be constrained by the existing teacher policy’s architecture or training algorithm.
Weaning off the teacher: It is undesirable to maintain dependency on past suboptimal teachers for successive reincarnations.
Compute / Sample Efficient: Reincarnation is only useful if it is cheaper than training from scratch.
Given the PVRL algorithm requirements, we evaluate whether existing approaches, designed with closely related goals, will suffice. We find that such approaches either result in small improvements over tabula rasa RL or degrade in performance when weaning off the teacher.
To address these limitations, we introduce a simple method, QDagger, in which the agent distills knowledge from the suboptimal teacher via an imitation algorithm while simultaneously using its environment interactions for RL. We start with a deep Q-network (DQN) agent trained for 400M environment frames (a week of single-GPU training) and use it as the teacher for reincarnating student agents trained on only 10M frames (a few hours of training), where the teacher is weaned off over the first 6M frames. For benchmark evaluation, we report the interquartile mean (IQM) metric from the RLiable library. As shown below for the PVRL setting on Atari games, we find that the QDagger RRL method outperforms prior approaches.
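The idea of combining an RL loss with a distillation term that fades out can be sketched as follows. This is a hedged illustration, not the QDagger implementation: the exact loss form, the softmax over student Q-values, the linear decay schedule in `distill_weight`, and all function names are assumptions made for the example; the paper should be consulted for the actual objective and weaning rule.

```python
import math


def softmax(values, temperature=1.0):
    """Turn Q-values into a probability distribution over actions."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def distill_weight(frames_seen, wean_frames):
    """Assumed schedule: decay the distillation weight linearly from 1 to 0."""
    return max(0.0, 1.0 - frames_seen / wean_frames)


def qdagger_style_loss(student_q, teacher_probs, action, td_target,
                       frames_seen, wean_frames, temperature=1.0):
    """TD loss plus a teacher-imitation term whose weight fades over training."""
    td_loss = 0.5 * (student_q[action] - td_target) ** 2
    student_probs = softmax(student_q, temperature)
    # Cross-entropy between the teacher policy and the student's softmax policy.
    distill = -sum(p * math.log(s) for p, s in zip(teacher_probs, student_probs))
    return td_loss + distill_weight(frames_seen, wean_frames) * distill


student_q = [1.0, 2.0]
teacher_probs = [0.1, 0.9]
early = qdagger_style_loss(student_q, teacher_probs, action=1, td_target=2.5,
                           frames_seen=0, wean_frames=6_000_000)
late = qdagger_style_loss(student_q, teacher_probs, action=1, td_target=2.5,
                          frames_seen=6_000_000, wean_frames=6_000_000)
print(early, late)
```

Once the weight reaches zero, the loss reduces to the plain TD term, so the student no longer depends on the teacher, which matches the "weaning off" requirement listed earlier.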
Benchmarking PVRL algorithms on Atari, with teacher-normalized scores aggregated across 10 games. Tabula rasa DQN (–·–) obtains a normalized score of 0.4. Standard baseline approaches include kickstarting, JSRL, rehearsal, offline RL pre-training and DQfD. Among all methods, only QDagger surpasses teacher performance within 10 million frames and outperforms the teacher in 75% of the games.
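For readers unfamiliar with the interquartile mean, the following sketch shows the basic computation: discard the lowest and highest quarter of the scores and average the rest. The function name and the sample scores are made up for illustration, and this is not the RLiable API, which also handles aggregation across runs and confidence intervals.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (drops the bottom and top 25%)."""
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical normalized scores; the outlier 100 does not dominate the result.
example_scores = [0, 1, 2, 3, 4, 5, 6, 100]
print(interquartile_mean(example_scores))
```

Compared with a plain mean, this aggregate is less sensitive to a single game with an extreme score, and compared with a median it uses more of the data.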
Reincarnating RL in Practice
We further examine the RRL approach on the Arcade Learning Environment, a widely used deep RL benchmark. First, we take a Nature DQN agent that uses the RMSProp optimizer and fine-tune it with the Adam optimizer to create a DQN (Adam) agent. While it is possible to train a DQN (Adam) agent from scratch, we demonstrate that fine-tuning Nature DQN with the Adam optimizer matches the from-scratch performance using 40x less data and compute.
Reincarnating DQN (Adam) via Fine-Tuning. The vertical separator corresponds to loading network weights and replay data for fine-tuning. Left: Tabula rasa Nature DQN nearly converges in performance after 200M environment frames. Right: Fine-tuning this Nature DQN agent using a reduced learning rate with the Adam optimizer for 20 million frames obtains similar results to DQN (Adam) trained from scratch for 400M frames.
Given the DQN (Adam) agent as a starting point, fine-tuning is restricted to the 3-layer convolutional architecture. So, we consider a more general reincarnation approach that leverages recent architectural and algorithmic advances without training from scratch. Specifically, we use QDagger to reincarnate another RL agent that uses a more advanced RL algorithm (Rainbow) and a better neural network architecture (Impala-CNN ResNet) from the fine-tuned DQN (Adam) agent.
Reincarnating a different architecture / algorithm via QDagger. The vertical separator is the point at which we apply offline pre-training using QDagger for reincarnation. Left: Fine-tuning DQN with Adam. Right: Comparison of a tabula rasa Impala-CNN Rainbow agent (sky blue) to an Impala-CNN Rainbow agent (pink) trained using QDagger RRL from the fine-tuned DQN (Adam). The reincarnated Impala-CNN Rainbow agent consistently outperforms its scratch counterpart. Note that further fine-tuning DQN (Adam) results in diminishing returns (yellow).
Overall, these results indicate that past research could have been accelerated by incorporating an RRL approach to designing agents, instead of re-training agents from scratch. Our paper also contains results on the Balloon Learning Environment, where we demonstrate that RRL allows us to make progress on the problem of navigating stratospheric balloons using only a few hours of TPU-compute by reusing a distributed RL agent trained on TPUs for more than a month.
Discussion
Fairly comparing reincarnation approaches involves using the exact same computational work and workflow. Furthermore, the research findings in RRL that broadly generalize would be about how effective an algorithm is given access to existing computational work, e.g., we successfully applied QDagger developed using Atari for reincarnation on Balloon Learning Environment. As such, we speculate that research in reincarnating RL can branch out in two directions:
Standardized benchmarks with open-sourced computational work: Akin to NLP and vision, where typically a small set of pre-trained models are common, research in RRL may also converge to a small set of open-sourced computational work (e.g., pre-trained teacher policies) on a given benchmark.
Real-world domains: Since obtaining higher performance has real-world impact in some domains, it incentivizes the community to reuse state-of-the-art agents and try to improve their performance.
See our paper for a broader discussion on scientific comparisons, generalizability and reproducibility in RRL. Overall, we hope that this work motivates researchers to release computational work (e.g., model checkpoints) on which others could directly build. In this regard, we have open-sourced our code and trained agents with their final replay buffers. We believe that reincarnating RL can substantially accelerate research progress by building on prior computational work, as opposed to always starting from scratch.
Acknowledgements
This work was done in collaboration with Pablo Samuel Castro, Aaron Courville and Marc Bellemare. We’d like to thank Tom Small for the animated figure used in this post. We are also grateful for feedback by the anonymous NeurIPS reviewers and several members of the Google Research team, DeepMind and Mila.
Posted by Noah Snavely and Zhengqi Li, Research Scientists, Google Research
We live in a world of great natural beauty — of majestic mountains, dramatic seascapes, and serene forests. Imagine seeing this beauty as a bird does, flying past richly detailed, three-dimensional landscapes. Can computers learn to synthesize this kind of visual experience? Such a capability would allow for new kinds of content for games and virtual reality experiences: for instance, relaxing within an immersive flythrough of an infinite nature scene. But existing methods that synthesize new views from images tend to allow for only limited camera motion.
In a research effort we call Infinite Nature, we show that computers can learn to generate such rich 3D experiences simply by viewing nature videos and photographs. Our latest work on this theme, InfiniteNature-Zero (presented at ECCV 2022) can produce high-resolution, high-quality flythroughs starting from a single seed image, using a system trained only on still photographs, a breakthrough capability not seen before. We call the underlying research problem perpetual view generation: given a single input view of a scene, how can we synthesize a photorealistic set of output views corresponding to an arbitrarily long, user-controlled 3D path through that scene? Perpetual view generation is very challenging because the system must generate new content on the other side of large landmarks (e.g., mountains), and render that new content with high realism and in high resolution.
Example flythrough generated with InfiniteNature-Zero. It takes a single input image of a natural scene and synthesizes a long camera path flying into that scene, generating new scene content as it goes.
Background: Learning 3D Flythroughs from Videos
To establish the basics of how such a system could work, we’ll describe our first version, “Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image” (presented at ICCV 2021). In that work we explored a “learn from video” approach, where we collected a set of online videos captured from drones flying along coastlines, with the idea that we could learn to synthesize new flythroughs that resemble these real videos. This set of online videos is called the Aerial Coastline Imagery Dataset (ACID). In order to learn how to synthesize scenes that respond dynamically to any desired 3D camera path, however, we couldn’t simply treat these videos as raw collections of pixels; we also had to compute their underlying 3D geometry, including the camera position at each frame.
The basic idea is that we learn to generate flythroughs step-by-step. Given a starting view, like the first image in the figure below, we first compute a depth map using single-image depth prediction methods. We then use that depth map to render the image forward to a new camera viewpoint, shown in the middle, resulting in a new image and depth map from that new viewpoint.
However, this intermediate image has some problems — it has holes where we can see behind objects into regions that weren’t visible in the starting image. It is also blurry, because we are now closer to objects, but are stretching the pixels from the previous frame to render these now-larger objects.
To handle these problems, we learn a neural image refinement network that takes this low-quality intermediate image and outputs a complete, high-quality image and corresponding depth map. These steps can then be repeated, with this synthesized image as the new starting point. Because we refine both the image and the depth map, this process can be iterated as many times as desired — the system automatically learns to generate new scenery, like mountains, islands, and oceans, as the camera moves further into the scene.
Our Infinite Nature methods take an input view and its corresponding depth map (left). Using this depth map, the system renders the input image to a new desired viewpoint (center). This intermediate image has problems, such as missing pixels revealed behind foreground content (shown in magenta). We learn a deep network that refines this image to produce a new high-quality image (right). This process can be repeated to produce a long trajectory of views. We thus call this approach “render-refine-repeat”.
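The control flow of render-refine-repeat is simple enough to sketch in code. The version below is a toy under stated assumptions: images and depth maps are 1-D lists, `toy_render` stands in for the geometric warp by shifting pixels and leaving holes, and `toy_refine` stands in for the learned refinement network by filling holes with the mean of the known values. None of these names or simplifications come from the released code.

```python
def render_refine_repeat(image, depth, camera_steps, render, refine):
    """Generate one frame per camera step, feeding each output back as input."""
    frames = []
    for step in camera_steps:
        warped_image, warped_depth = render(image, depth, step)  # may have holes
        image, depth = refine(warped_image, warped_depth)        # complete them
        frames.append(image)
    return frames


def toy_render(image, depth, shift):
    """Shift a 1-D image left by `shift` pixels; new pixels are unknown (None)."""
    holes = [None] * shift
    return image[shift:] + holes, depth[shift:] + holes


def toy_refine(image, depth):
    """Fill unknown pixels with the mean of the known ones (stand-in for a network)."""
    def fill(row):
        known = [v for v in row if v is not None]
        mean = sum(known) / len(known)
        return [mean if v is None else v for v in row]
    return fill(image), fill(depth)


frames = render_refine_repeat(
    image=[1.0, 2.0, 3.0, 4.0], depth=[4.0, 3.0, 2.0, 1.0],
    camera_steps=[1, 1, 1], render=toy_render, refine=toy_refine)
print(frames)
```

Because the refinement step returns both a complete image and a complete depth map, its output has the same form as the original input, which is what allows the loop to run for as many steps as desired.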
We train this render-refine-repeat synthesis approach using the ACID dataset. In particular, we sample a video from the dataset and then a frame from that video. We then use this method to render several new views moving into the scene along the same camera trajectory as the ground truth video, as shown in the figure below, and compare these rendered frames to the corresponding ground truth video frames to derive a training signal. We also include an adversarial setup that tries to distinguish synthesized frames from real images, encouraging the generated imagery to appear more realistic.
Infinite Nature can synthesize views corresponding to any camera trajectory. During training, we run our system for T steps to generate T views along a camera trajectory calculated from a training video sequence, then compare the resulting synthesized views to the ground truth ones. In the figure, each camera viewpoint is generated from the previous one by performing a warp operation R, followed by the neural refinement operation gθ.
The resulting system can generate compelling flythroughs, as featured on the project webpage, along with a “flight simulator” Colab demo. Unlike prior methods on video synthesis, this method allows the user to interactively control the camera and can generate much longer camera paths.
InfiniteNature-Zero: Learning Flythroughs from Still Photos
One problem with this first approach is that video is difficult to work with as training data. High-quality video with the right kind of camera motion is challenging to find, and the aesthetic quality of an individual video frame generally cannot compare to that of an intentionally captured nature photograph. Therefore, in “InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images”, we build on the render-refine-repeat strategy above, but devise a way to learn perpetual view synthesis from collections of still photos — no videos needed. We call this method InfiniteNature-Zero because it learns from “zero” videos. At first, this might seem like an impossible task — how can we train a model to generate video flythroughs of scenes when all it’s ever seen are isolated photos?
To solve this problem, we had the key insight that if we take an image and render a camera path that forms a cycle — that is, where the path loops back such that the last image is from the same viewpoint as the first — then we know that the last synthesized image along this path should be the same as the input image. Such cycle consistency provides a training constraint that helps the model learn to fill in missing regions and increase image resolution during each step of view generation.
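The cycle constraint can be illustrated with a toy example. In this sketch, which assumes 1-D images, integer camera shifts whose inverse is the negated shift, and an L1 comparison, the names `cycle_poses`, `cycle_consistency_loss`, `roll`, and `blurry_roll` are all hypothetical; the actual training objective in the paper is more involved.

```python
def cycle_poses(forward_poses):
    """Build a camera path that goes out along the poses and then retraces them."""
    return list(forward_poses) + [-p for p in reversed(forward_poses)]


def cycle_consistency_loss(image, forward_poses, step):
    """Mean absolute difference between the input and the view after a full cycle."""
    current = image
    for pose in cycle_poses(forward_poses):
        current = step(current, pose)
    return sum(abs(a - b) for a, b in zip(current, image)) / len(image)


def roll(image, shift):
    """Lossless toy view change: circularly shift the pixels."""
    shift %= len(image)
    return image[shift:] + image[:shift]


def blurry_roll(image, shift):
    """Lossy toy view change: shift, then blend every pixel toward the mean."""
    rolled = roll(image, shift)
    mean = sum(rolled) / len(rolled)
    return [0.5 * v + 0.5 * mean for v in rolled]


photo = [1.0, 2.0, 3.0, 4.0]
print(cycle_consistency_loss(photo, [1, 2], roll))
print(cycle_consistency_loss(photo, [1, 2], blurry_roll))
```

A view-generation step that preserves content returns to the input exactly and incurs no penalty, while one that degrades the image on each step is penalized, so minimizing this loss pushes the model to keep frames sharp and complete without needing any video supervision.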
However, training with these camera cycles is insufficient for generating long and stable view sequences, so as in our original work, we include an adversarial strategy that considers long, non-cyclic camera paths, like the one shown in the figure above. In particular, if we render T frames from a starting frame, we optimize our render-refine-repeat model such that a discriminator network can’t tell which was the starting frame and which was the final synthesized frame. Finally, we add a component trained to generate high-quality sky regions to increase the perceived realism of the results.
With these insights, we trained InfiniteNature-Zero on collections of landscape photos, which are available in large quantities online. Several resulting videos are shown below — these demonstrate beautiful, diverse natural scenery that can be explored along arbitrarily long camera paths. Compared to our prior work — and to prior video synthesis methods — these results exhibit significant improvements in quality and diversity of content (details available in the paper).
Several nature flythroughs generated by InfiniteNature-Zero from single starting photos.
Conclusion
There are a number of exciting future directions for this work. For instance, our methods currently synthesize scene content based only on the previous frame and its depth map; there is no persistent underlying 3D representation. Our work points towards future algorithms that can generate complete, photorealistic, and consistent 3D worlds.
Acknowledgements
Infinite Nature and InfiniteNature-Zero are the result of a collaboration between researchers at Google Research, UC Berkeley, and Cornell University. The key contributors to the work represented in this post include Angjoo Kanazawa, Andrew Liu, Richard Tucker, Zhengqi Li, Noah Snavely, Qianqian Wang, Varun Jampani, and Ameesh Makadia.
Posted by Shunyu Yao, Student Researcher, and Yuan Cao, Research Scientist, Google Research, Brain Team
Recent advances have expanded the applicability of language models (LM) to downstream tasks. On one hand, existing language models that are properly prompted, via chain-of-thought, demonstrate emergent capabilities that carry out self-conditioned reasoning traces to derive answers from questions, excelling at various arithmetic, commonsense, and symbolic reasoning tasks. However, with chain-of-thought prompting, a model is not grounded in the external world and uses its own internal representations to generate reasoning traces, limiting its ability to reactively explore and reason or update its knowledge. On the other hand, recent work uses pre-trained language models for planning and acting in various interactive environments (e.g., text games, web navigation, embodied tasks, robotics), with a focus on mapping text contexts to text actions via the language model’s internal knowledge. However, they do not reason abstractly about high-level goals or maintain a working memory to support acting over long horizons.
In “ReAct: Synergizing Reasoning and Acting in Language Models”, we propose a general paradigm that combines reasoning and acting advances to enable language models to solve various language reasoning and decision making tasks. We demonstrate that the Reason+Act (ReAct) paradigm systematically outperforms reasoning-only and acting-only paradigms, both when prompting bigger language models and when fine-tuning smaller language models. The tight integration of reasoning and acting also presents human-aligned task-solving trajectories that improve interpretability, diagnosability, and controllability.
Model Overview
ReAct enables language models to generate both verbal reasoning traces and text actions in an interleaved manner. While actions lead to observation feedback from an external environment (“Env” in the figure below), reasoning traces do not affect the external environment. Instead, they affect the internal state of the model by reasoning over the context and updating it with useful information to support future reasoning and acting.
Previous methods prompt language models (LM) to either generate self-conditioned reasoning traces or task-specific actions. We propose ReAct, a new paradigm that combines reasoning and acting advances in language models.
ReAct Prompting
We focus on the setup where a frozen language model, PaLM-540B, is prompted with few-shot in-context examples to generate both domain-specific actions (e.g., “search” in question answering, and “go to” in room navigation), and free-form language reasoning traces (e.g., “Now I need to find a cup, and put it on the table”) for task solving.
For tasks where reasoning is of primary importance, we alternate the generation of reasoning traces and actions so that the task-solving trajectory consists of multiple reasoning-action-observation steps. In contrast, for decision making tasks that potentially involve a large number of actions, reasoning traces only need to appear sparsely in the most relevant positions of a trajectory, so we write prompts with sparse reasoning and let the language model decide the asynchronous occurrence of reasoning traces and actions for itself.
As shown below, there are various types of useful reasoning traces, e.g., decomposing task goals to create action plans, injecting commonsense knowledge relevant to task solving, extracting important parts from observations, tracking task progress while maintaining plan execution, handling exceptions by adjusting action plans, and so on.
The synergy between reasoning and acting allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with the external environments (e.g., Wikipedia) to incorporate additional information into reasoning (act to reason).
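The interleaving of thoughts, actions, and observations described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `scripted_model` function stands in for a prompted language model, `toy_env` stands in for an external environment such as a Wikipedia API, and the knowledge base entries (`Zorblat`, `Quix`, `Vela-9`) are invented for the example.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical toy knowledge base standing in for an external environment.
TOY_KB: Dict[str, str] = {
    "Zorblat": "Zorblat is a moon of the planet Quix.",
    "Quix": "Quix orbits the star Vela-9.",
}


def toy_env(action: str, argument: str) -> str:
    """Return an observation for a text action; only 'search' is supported."""
    if action == "search":
        return TOY_KB.get(argument, f"No entry found for {argument!r}.")
    return f"Unknown action {action!r}."


def scripted_model(context: str) -> Tuple[str, str, str]:
    """Stand-in for a prompted LM: returns (thought, action, argument).

    A real system would send `context` to a language model; here the next
    step is chosen from the number of observations seen so far.
    """
    n_obs = context.count("Observation ")
    if n_obs == 0:
        return ("I need to find what Zorblat orbits.", "search", "Zorblat")
    if n_obs == 1:
        return ("Zorblat orbits Quix, so I should look up Quix.", "search", "Quix")
    return ("Quix orbits Vela-9, so the answer is Vela-9.", "finish", "Vela-9")


def react_loop(
    question: str,
    model: Callable[[str], Tuple[str, str, str]],
    env: Callable[[str, str], str],
    max_steps: int = 5,
) -> Tuple[Optional[str], str]:
    """Alternate reasoning traces and actions until the model finishes."""
    context = f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        thought, action, argument = model(context)
        # Thoughts only update the context; they never touch the environment.
        context += f"Thought {step}: {thought}\n"
        context += f"Action {step}: {action}[{argument}]\n"
        if action == "finish":
            return argument, context
        # Actions produce observation feedback that is appended to the context.
        context += f"Observation {step}: {env(action, argument)}\n"
    return None, context


answer, trace = react_loop(
    "Which star does the planet that Zorblat orbits go around?",
    scripted_model,
    toy_env,
)
```

The returned `trace` holds the full reasoning-action-observation trajectory, which is the part a human inspector would read or edit.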
ReAct Fine-tuning
We also explore fine-tuning smaller language models using ReAct-format trajectories. To reduce the need for large-scale human annotation, we use the ReAct prompted PaLM-540B model to generate trajectories, and use trajectories with task success to fine-tune smaller language models (PaLM-8/62B).
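The data selection step described above amounts to keeping only the generated trajectories that solved their task. A minimal sketch, assuming a hypothetical `Trajectory` record with a `success` flag (the field names are illustrative and not taken from the paper):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    task_id: str
    text: str      # interleaved thoughts, actions, and observations
    success: bool  # whether the trajectory solved the task


def select_finetuning_data(trajectories: List[Trajectory]) -> List[str]:
    """Keep only the text of trajectories that ended in task success."""
    return [t.text for t in trajectories if t.success]


generated = [
    Trajectory("q1", "Thought 1: ... Action 1: finish[A]", True),
    Trajectory("q2", "Thought 1: ... Action 1: finish[B]", False),
    Trajectory("q3", "Thought 1: ... Action 1: finish[C]", True),
]
finetuning_texts = select_finetuning_data(generated)
```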
Comparison of four prompting methods, (a) Standard, (b) Chain of thought (CoT, Reason Only), (c) Act-only, and (d) ReAct, solving a HotpotQA question. In-context examples are omitted, and only the task trajectory is shown. ReAct is able to retrieve information to support reasoning, while also using reasoning to target what to retrieve next, demonstrating a synergy of reasoning and acting.
Results
We conduct empirical evaluations of ReAct and state-of-the-art baselines across four different benchmarks: question answering (HotpotQA), fact verification (FEVER), text-based game (ALFWorld), and web page navigation (WebShop). For HotpotQA and FEVER, with access to a Wikipedia API with which the model can interact, ReAct outperforms vanilla action generation models while being competitive with chain of thought reasoning (CoT) performance. The approach with the best results is a combination of ReAct and CoT that uses both internal knowledge and externally obtained information during reasoning.
| Method | HotpotQA (exact match, 6-shot) | FEVER (accuracy, 3-shot) |
| --- | --- | --- |
| Standard | 28.7 | 57.1 |
| Reason-only (CoT) | 29.4 | 56.3 |
| Act-only | 25.7 | 58.9 |
| ReAct | 27.4 | 60.9 |
| Best ReAct + CoT Method | 35.1 | 64.6 |
| Supervised SoTA | 67.5 (using ~140k samples) | 89.5 (using ~90k samples) |
PaLM-540B prompting results on HotpotQA and Fever.
On ALFWorld and WebShop, ReAct with both one-shot and two-shot prompting outperforms imitation and reinforcement learning methods trained with ~10^5 task instances, with an absolute improvement of 34% and 10% in success rates, respectively, over existing baselines.
| Method | AlfWorld (2-shot) | WebShop (1-shot) |
| --- | --- | --- |
| Act-only | 45 | 30.1 |
| ReAct | 71 | 40 |
| Imitation Learning Baselines | 37 (using ~100k samples) | 29.1 (using ~90k samples) |
PaLM-540B prompting task success rate results on AlfWorld and WebShop.
Scaling results for prompting and fine-tuning on HotpotQA with ReAct and different baselines. ReAct consistently achieves the best fine-tuning performance.
A comparison of the ReAct (top) and CoT (bottom) reasoning trajectories on an example from Fever (observation for ReAct is omitted to reduce space). In this case ReAct provided the right answer, and it can be seen that the reasoning trajectory of ReAct is more grounded on facts and knowledge, in contrast to CoT’s hallucination behavior.
We also explore human-in-the-loop interactions with ReAct by allowing a human inspector to edit ReAct’s reasoning traces. We demonstrate that by simply replacing a hallucinating sentence with inspector hints, ReAct can change its behavior to align with inspector edits and successfully complete a task. Solving tasks becomes significantly easier when using ReAct as it only requires the manual editing of a few thoughts, which enables new forms of human-machine collaboration.
A human-in-the-loop behavior correction example with ReAct on AlfWorld. (a) ReAct trajectory fails due to a hallucinating reasoning trace (Act 17). (b) A human inspector edits two reasoning traces (Act 17, 23); ReAct then produces desirable reasoning traces and actions to complete the task.
Conclusion
We present ReAct, a simple yet effective method for synergizing reasoning and acting in language models. Through various experiments that focus on multi-hop question-answering, fact checking, and interactive decision-making tasks, we show that ReAct leads to superior performance with interpretable decision traces.
ReAct demonstrates the feasibility of jointly modeling thought, actions and feedback from the environment within a language model, making it a versatile agent that is capable of solving tasks that require interactions with the environment. We plan to further extend this line of research and leverage the strong potential of the language model for tackling broader embodied tasks, via approaches like massive multitask training and coupling ReAct with equally strong reward models.
Acknowledgements
We would like to thank Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran and Karthik Narasimhan for their great contribution in this work. We would also like to thank Google’s Brain team and the Princeton NLP Group for their joint support and feedback, including project scoping, advising and insightful discussions.
Posted by Peter H. Li, Research Scientist, and Sven Dorkenwald, Student Researcher, Connectomics at Google
Mapping the wiring and firing activity of the human brain is fundamental to deciphering how we think — how we sense the world, learn, decide, remember, and create — as well as what issues can arise in brain disease or dysfunction. Recent efforts have delivered publicly available brain maps (high-resolution 3D mapping of brain cells and their connectivities) at unprecedented quality and scale, such as H01, a 1.4 petabyte nanometer-scale digital reconstruction of a sample of human brain tissue from Harvard / Google, and the cubic millimeter mouse cortex dataset from our colleagues at the MICrONS consortium.
To interpret brain maps at this scale requires multiple layers of analysis, including the identification of synaptic connections, cellular subcompartments, and cell types. Machine learning and computer vision technology have played a central role in enabling these analyses, but deploying such systems is still a laborious process, requiring hours of manual ground truth labeling by expert annotators and significant computational resources. Moreover, some important tasks, such as identifying the cell type from only a small fragment of axon or dendrite, can be challenging even for human experts, and have not yet been effectively automated.
Today, in “Multi-Layered Maps of Neuropil with Segmentation-Guided Contrastive Learning”, we are announcing Segmentation-Guided Contrastive Learning of Representations (SegCLR), a method for training rich, generic representations of cellular morphology (the cell’s shape) and ultrastructure (the cell’s internal structure) without laborious manual effort. SegCLR produces compact vector representations (i.e., embeddings) that are applicable across diverse downstream tasks (e.g., local classification of cellular subcompartments, unsupervised clustering), and are even able to identify cell types from only small fragments of a cell. We trained SegCLR on both the H01 human cortex dataset and the MICrONS mouse cortex dataset, and we are releasing the resulting embedding vectors, about 8 billion in total, for researchers to explore.
From brain cells segmented out of a 3D block of tissue, SegCLR embeddings capture cellular morphology and ultrastructure and can be used to distinguish cellular subcompartments (e.g., dendritic spine versus dendrite shaft) or cell types (e.g., pyramidal versus microglia cell).
Representing Cellular Morphology and Ultrastructure
SegCLR builds on recent advances in self-supervised contrastive learning. We use a standard deep network architecture to encode inputs comprising local 3D blocks of electron microscopy data (about 4 micrometers on a side) into 64-dimensional embedding vectors. The network is trained via a contrastive loss to map semantically related inputs to similar coordinates in the embedding space. This is close to the popular SimCLR setup, except that we also require an instance segmentation of the volume (tracing out individual cells and cell fragments), which we use in two important ways.
First, the input 3D electron microscopy data are explicitly masked by the segmentation, forcing the network to focus only on the central cell within each block. Second, we leverage the segmentation to automatically define which inputs are semantically related: positive pairs for the contrastive loss are drawn from nearby locations on the same segmented cell and trained to have similar representations, while inputs drawn from different cells are trained to have dissimilar representations. Importantly, publicly available automated segmentations of the human and mouse datasets were sufficiently accurate to train SegCLR without requiring laborious review and correction by human experts.
SegCLR is trained to represent rich cellular features without manual labeling. Top: The SegCLR architecture maps local masked 3D views of electron microscopy data to embedding vectors. Only the microscopy volume and a draft automated instance segmentation are required. Bottom: The segmentation is also used to define positive versus negative example pairs, whose representations are pushed closer together (positives, blue arrows) or further apart (negatives, red arrows) during training.
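To make the role of the segmentation in the loss concrete, the sketch below computes an NT-Xent-style contrastive loss (the loss family used by SimCLR) in which positives are defined by a shared segment ID. It is an illustration under stated assumptions rather than the SegCLR training code: the function names, the temperature value, and the treatment of every same-segment example in the batch as a positive are choices made for this example, and the paper's requirement that positives come from nearby locations on the cell is not modeled.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(
    embeddings: List[Sequence[float]],
    segment_ids: List[int],
    temperature: float = 0.1,  # assumed value, not from the source
) -> float:
    """Mean NT-Xent-style loss over all (anchor, positive) pairs.

    Two examples form a positive pair when they share a segment ID,
    i.e. they were cut from the same segmented cell. All other examples
    in the batch act as negatives for that anchor.
    """
    n = len(embeddings)
    losses = []
    for i in range(n):
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(x) for x in logits.values()))
        for j, logit in logits.items():
            if segment_ids[j] == segment_ids[i]:
                # -log( exp(logit) / sum_k exp(logit_k) )
                losses.append(log_denominator - logit)
    if not losses:
        raise ValueError("batch contains no positive pairs")
    return sum(losses) / len(losses)


# Four toy 2-D embeddings: the first two and the last two point the same way.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = contrastive_loss(toy_embeddings, [1, 1, 2, 2])
loss_mismatched = contrastive_loss(toy_embeddings, [1, 2, 1, 2])
```

The loss is small when embeddings from the same segment already lie close together (`loss_matching`) and large when the segment labels disagree with the embedding geometry (`loss_mismatched`), which is the signal that pulls positives together and pushes negatives apart.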
Reducing Annotation Training Requirements by Three Orders of Magnitude
SegCLR embeddings can be used in diverse downstream settings, whether supervised (e.g., training classifiers) or unsupervised (e.g., clustering or content-based image retrieval). In the supervised setting, embeddings simplify the training of classifiers, and can greatly reduce ground truth labeling requirements. For example, we found that for identifying cellular subcompartments (axon, dendrite, soma, etc.) a simple linear classifier trained on top of SegCLR embeddings outperformed a fully supervised deep network trained on the same task, while using only about one thousand labeled examples instead of millions.
We assessed the classification performance for axon, dendrite, soma, and astrocyte subcompartments in the human cortex dataset via mean F1-Score, while varying the number of training examples used. Linear classifiers trained on top of SegCLR embeddings matched or exceeded the performance of a fully supervised deep classifier (horizontal line), while using a fraction of the training data.
Distinguishing Cell Types, Even from Small Fragments
Distinguishing different cell types is an important step towards understanding how brain circuits develop and function in health and disease. Human experts can learn to identify some cortical cell types based on morphological features, but manual cell typing is laborious and ambiguous cases are common. Cell typing also becomes more difficult when only small fragments of cells are available, which is common for many cells in current connectomic reconstructions.
Human experts manually labeled cell types for a small number of proofread cells in each dataset. In the mouse cortex dataset, experts labeled six neuron types (top) and four glia types (not shown). In the human cortex dataset, experts labeled two neuron types (not shown) and four glia types (bottom). (Rows not to scale with each other.)
We found that SegCLR accurately infers human and mouse cell types, even for small fragments. Prior to classification, we collected and averaged embeddings within each cell over a set aggregation distance, defined as the radius from a central point. We found that human cortical cell types can be identified with high accuracy for aggregation radii as small as 10 micrometers, even for types that experts find difficult to distinguish, such as microglia (MGC) versus oligodendrocyte precursor cells (OPC).
SegCLR can classify cell types, even from small fragments. Left: Classification performance over six human cortex cell types for shallow ResNet models trained on SegCLR embeddings for different sized cell fragments. Aggregation radius zero corresponds to very small fragments with only a single embedding. Cell type performance reaches high accuracy (0.938 mean F1-Score) for fragments with aggregation radii of only 10 micrometers (boxed point). Right: Class-wise confusion matrix at 10 micrometers aggregation radius. Darker shading along the diagonal indicates that predicted cell types agree with expert labels in most cases. AC: astrocyte; MGC: microglia cell; OGC: oligodendrocyte cell; OPC: oligodendrocyte precursor cell; E: excitatory neuron; I: inhibitory neuron.
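The aggregation step can be sketched as averaging all embeddings of one cell that fall within a given radius of a central point. This is a simplified illustration: it assumes straight-line (Euclidean) distance in micrometers, whereas the paper may measure distance differently, and the function and variable names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# One embedding location on a cell: (position in micrometers, embedding vector).
Node = Tuple[Sequence[float], Sequence[float]]


def aggregate_embedding(
    center_um: Sequence[float],
    nodes: List[Node],
    radius_um: float,
) -> List[float]:
    """Average the embeddings of all nodes within radius_um of center_um."""
    selected = [
        embedding
        for position, embedding in nodes
        if math.dist(position, center_um) <= radius_um
    ]
    if not selected:
        raise ValueError("no embeddings within the aggregation radius")
    dim = len(selected[0])
    return [sum(e[k] for e in selected) / len(selected) for k in range(dim)]


# Toy cell fragment with three embedding locations (2-D embeddings for brevity).
cell_nodes: List[Node] = [
    ((0.0, 0.0, 0.0), [1.0, 0.0]),
    ((3.0, 4.0, 0.0), [0.0, 1.0]),   # 5 micrometers from the origin
    ((30.0, 0.0, 0.0), [5.0, 5.0]),  # 30 micrometers from the origin
]
single = aggregate_embedding((0.0, 0.0, 0.0), cell_nodes, radius_um=0.0)
local = aggregate_embedding((0.0, 0.0, 0.0), cell_nodes, radius_um=10.0)
```

With a radius of zero only the embedding at the central point is used, which matches the single-embedding case in the figure; a radius of 10 micrometers averages the two nearby embeddings and ignores the distant one.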
In the mouse cortex, ten cell types could be distinguished with high accuracy at aggregation radii of 25 micrometers.
Left: Classification performance over the ten mouse cortex cell types reaches 0.832 mean F1-Score for fragments with aggregation radius 25 micrometers (boxed point). Right: The class-wise confusion matrix at 25 micrometers aggregation radius. Boxes indicate broad groups (glia, excitatory neurons, and inhibitory interneurons). P: pyramidal cell; THLC: thalamocortical axon; BC: basket cell; BPC: bipolar cell; MC: Martinotti cell; NGC: neurogliaform cell.
Finally, we showed how SegCLR can be used for automated analysis of brain connectivity by cell typing the synaptic partners of reconstructed cells throughout the mouse cortex dataset. Knowing the connectivity patterns between specific cell types is fundamental to interpreting large-scale connectomic reconstructions of brain wiring, but this typically requires manual tracing to identify partner cell types. Using SegCLR, we replicated brain connectivity findings that previously relied on intensive manual tracing, while extending their scale in terms of the number of synapses, cell types, and brain areas analyzed. (See the paper for further details.)
SegCLR automated analysis of brain connectivity. Top: An example mouse pyramidal cell, with synapse locations color-coded according to whether the synaptic partner was classified as inhibitory (blue), excitatory (red), or unknown (black). Inset shows higher detail of the soma and proximal dendrites. Bottom: We counted how many upstream synaptic partners were classified as thalamocortical axons, which bring input from sensory systems to the cortex. We found that thalamic input arrives primarily at cortical layer L4, the canonical cortical input layer, and preferentially targets primary visual area V1, rather than higher visual areas (HVA).
What’s Next?
SegCLR captures rich cellular features and can greatly simplify downstream analyses compared to working directly with raw image and segmentation data. We are excited to see what the community can discover using the ~8 billion embeddings we are releasing for the human and mouse cortical datasets (example access code; browsable human and mouse views in Neuroglancer). By reducing complex microscopy data to rich and compact embedding representations, SegCLR opens many novel avenues for biological insight, and may serve as a link to complementary modalities for high-dimensional characterization at the cellular and subcellular levels, such as spatially-resolved transcriptomics.
Posted by Jason Wei and Yi Tay, Research Scientists, Google Research, Brain Team
The field of natural language processing (NLP) has been revolutionized by language models trained on large amounts of text data. Scaling up the size of language models often leads to improved performance and sample efficiency on a range of downstream NLP tasks. In many cases, the performance of a large language model can be predicted by extrapolating the performance trend of smaller models. For instance, the effect of scale on language model perplexity has been empirically shown to span more than seven orders of magnitude.
On the other hand, performance for certain other tasks does not improve in a predictable fashion. For example, the GPT-3 paper showed that the ability of language models to perform multi-digit addition has a flat scaling curve (approximately random performance) for models from 100M to 13B parameters, at which point the performance jumped substantially. Given the growing use of language models in NLP research and applications, it is important to better understand abilities such as these that can arise unexpectedly.
In “Emergent Abilities of Large Language Models,” recently published in the Transactions on Machine Learning Research (TMLR), we discuss the phenomena of emergent abilities, which we define as abilities that are not present in small models but are present in larger models. More specifically, we study emergence by analyzing the performance of language models as a function of language model scale, as measured by total floating point operations (FLOPs), or how much compute was used to train the language model. However, we also explore emergence as a function of other variables, such as dataset size or number of model parameters (see the paper for full details). Overall, we present dozens of examples of emergent abilities that result from scaling up language models. The existence of such emergent abilities raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.
Emergent Prompted Tasks
First we discuss emergent abilities that may arise in prompted tasks. In such tasks, a pre-trained language model is given a prompt for a task framed as next word prediction, and it performs the task by completing the response. Without any further fine-tuning, language models can often perform tasks that were not seen during training.
Example of few-shot prompting on movie review sentiment classification. The model is given one example of a task (classifying a movie review as positive or negative) and then performs the task on an unseen example.
We call a prompted task emergent when it unpredictably surges from random performance to above-random at a specific scale threshold. Below we show three examples of prompted tasks with emergent performance: multi-step arithmetic, taking college-level exams, and identifying the intended meaning of a word. In each case, language models perform poorly with very little dependence on model size up to a threshold at which point their performance suddenly begins to excel.
The ability to perform multi-step arithmetic (left), succeed on college-level exams (middle), and identify the intended meaning of a word in context (right) all emerge only for models of sufficiently large scale. The models shown include LaMDA, GPT-3, Gopher, Chinchilla, and PaLM.
Performance on these tasks only becomes non-random for models of sufficient scale, for instance above 10^22 training FLOPs for the arithmetic and multi-task NLU tasks, and above 10^24 training FLOPs for the word in context tasks. Note that although the scale at which emergence occurs can be different for different tasks and models, no model showed smooth improvement in behavior on any of these tasks. Dozens of other emergent prompted tasks are listed in our paper.
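One way to operationalize this definition is to look for the smallest training compute at which accuracy rises above the random baseline and stays there. The sketch below is only an illustration: the `margin` tolerance is an assumption introduced here (the text gives no numeric criterion), and the data points are invented rather than taken from the paper.

```python
from typing import List, Optional, Tuple


def emergence_threshold(
    points: List[Tuple[float, float]],
    random_baseline: float,
    margin: float = 0.02,  # assumed tolerance around random performance
) -> Optional[float]:
    """Return the smallest training FLOPs with sustained above-random accuracy.

    `points` holds (training FLOPs, accuracy) pairs. Returns None when
    accuracy never leaves the random band, when no near-random region was
    observed, or when the rise above random is not sustained at larger scales.
    """
    ordered = sorted(points)
    above = [accuracy > random_baseline + margin for _, accuracy in ordered]
    if not any(above) or all(above):
        return None
    first = above.index(True)
    if not all(above[first:]):
        return None
    return ordered[first][0]


# Invented scaling curve for a four-way multiple-choice task (random = 0.25).
toy_curve = [(1e20, 0.25), (1e21, 0.26), (1e22, 0.25), (1e23, 0.41), (1e24, 0.63)]
threshold = emergence_threshold(toy_curve, random_baseline=0.25)
```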
Emergent Prompting Strategies
The second class of emergent abilities encompasses prompting strategies that augment the capabilities of language models. Prompting strategies are broad paradigms for prompting that can be applied to a range of different tasks. They are considered emergent when they fail for small models and can only be used by a sufficiently-large model.
One example of an emergent prompting strategy is called “chain-of-thought prompting”, for which the model is prompted to generate a series of intermediate steps before giving the final answer. Chain-of-thought prompting enables language models to perform tasks requiring complex reasoning, such as a multi-step math word problem. Notably, models acquire the ability to do chain-of-thought reasoning without being explicitly trained to do so. An example of chain-of-thought prompting is shown in the figure below.
Chain of thought prompting enables sufficiently large models to solve multi-step reasoning problems.
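The difference between standard and chain-of-thought prompting lies only in how the few-shot exemplars are written. A minimal sketch, assuming a simple `Q:`/`A:` layout and a made-up exemplar (the exact prompt format used in the experiments is not given in this post):

```python
from typing import Dict, List


def build_prompt(
    exemplars: List[Dict[str, str]],
    question: str,
    chain_of_thought: bool,
) -> str:
    """Build a few-shot prompt, optionally including intermediate steps."""
    parts = []
    for ex in exemplars:
        final = f"The answer is {ex['answer']}."
        # Chain-of-thought exemplars spell out the reasoning before the answer.
        answer = f"{ex['rationale']} {final}" if chain_of_thought else final
        parts.append(f"Q: {ex['question']}\nA: {answer}")
    # The new question is left open for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Invented exemplar for illustration.
toy_exemplars = [
    {
        "question": "A shelf holds 3 boxes with 4 books each. How many books are on the shelf?",
        "rationale": "There are 3 boxes and each has 4 books, so 3 * 4 = 12 books.",
        "answer": "12",
    }
]
new_question = "A crate holds 5 bags with 6 apples each. How many apples are in the crate?"
standard_prompt = build_prompt(toy_exemplars, new_question, chain_of_thought=False)
cot_prompt = build_prompt(toy_exemplars, new_question, chain_of_thought=True)
```

Both prompts end with an open `A:` for the new question; only the chain-of-thought version demonstrates the intermediate steps that a sufficiently large model then imitates.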
The empirical results of chain-of-thought prompting are shown below. For smaller models, applying chain-of-thought prompting does not outperform standard prompting, for example, when applied to GSM8K, a challenging benchmark of math word problems. However, for large models (10^24 FLOPs), chain-of-thought prompting substantially improves performance in our tests, reaching a 57% solve rate on GSM8K.
Chain-of-thought prompting is an emergent ability — it fails to improve performance for small language models, but substantially improves performance for large models. Here we illustrate the difference between standard and chain-of-thought prompting at different scales for two language models, LaMDA and PaLM.
Implications of Emergent Abilities
The existence of emergent abilities has a range of implications. For example, because emergent few-shot prompted abilities and strategies are not explicitly encoded in pre-training, researchers may not know the full scope of few-shot prompted abilities of current language models. Moreover, the emergence of new abilities as a function of model scale raises the question of whether further scaling will potentially endow even larger models with new emergent abilities.
Identifying emergent abilities in large language models is a first step in understanding such phenomena and their potential impact on future model capabilities. Why does scaling unlock emergent abilities? Because computational resources are expensive, can emergent abilities be unlocked via other methods without increased scaling (e.g., better model architectures or training techniques)? Will new real-world applications of language models become unlocked when certain abilities emerge? Analyzing and understanding the behaviors of language models, including emergent behaviors that arise from scaling, is an important research question as the field of NLP continues to grow.
Acknowledgements
It was an honor and privilege to work with Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus.
The NVIDIA Grace Hopper Superchip Architecture is the first true heterogeneous accelerated platform for high-performance computing (HPC) and AI workloads. It accelerates applications with the strengths of both GPUs and CPUs while providing the simplest and most productive distributed heterogeneous programming model to date. Scientists and engineers can focus on solving the world’s most important problems.
In this post, you learn all about the Grace Hopper Superchip, and we highlight the performance breakthroughs that NVIDIA Grace Hopper delivers. For more information about the speedups that Grace Hopper achieves over the most powerful PCIe-based accelerated platforms using NVIDIA Hopper H100 GPUs, see the NVIDIA Grace Hopper Superchip Architecture whitepaper.
Performance and productivity for strong-scaling HPC and giant AI workloads
NVIDIA NVLink-C2C is an NVIDIA memory coherent, high-bandwidth, and low-latency superchip interconnect. It is the heart of the Grace Hopper Superchip and delivers up to 900 GB/s total bandwidth. This is 7x higher bandwidth than x16 PCIe Gen5 lanes commonly used in accelerated systems.
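The 7x figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes the standard PCIe Gen5 signaling rate of 32 GT/s per lane with 128b/130b encoding and compares total (bidirectional) bandwidth on both sides; it ignores protocol overhead, so the numbers are theoretical peaks rather than measured throughput.

```python
# PCIe Gen5 link parameters (theoretical peak, protocol overhead ignored).
GIGATRANSFERS_PER_LANE = 32.0      # GT/s per lane
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding
LANES = 16

# One transfer carries one bit per lane, so divide by 8 to get bytes.
pcie_gbps_per_direction = GIGATRANSFERS_PER_LANE * LANES * ENCODING_EFFICIENCY / 8
pcie_gbps_total = 2 * pcie_gbps_per_direction

NVLINK_C2C_GBPS_TOTAL = 900.0      # total bandwidth quoted in the text

ratio = NVLINK_C2C_GBPS_TOTAL / pcie_gbps_total
print(f"PCIe Gen5 x16: {pcie_gbps_per_direction:.1f} GB/s per direction, "
      f"{pcie_gbps_total:.1f} GB/s total")
print(f"NVLink-C2C / PCIe Gen5 x16: {ratio:.1f}x")
```

Under these assumptions the ratio comes out slightly above 7, in line with the claim.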
NVLink-C2C memory coherency increases developer productivity and performance and enables GPUs to access large amounts of memory. CPU and GPU threads can now concurrently and transparently access both CPU-resident and GPU-resident memory, enabling you to focus on algorithms instead of explicit memory management.
Memory coherency enables you to transfer only the data you need, and not migrate entire pages to and from the GPU. It also enables lightweight synchronization primitives across GPU and CPU threads by enabling native atomic operations from both the CPU and GPU. NVLink-C2C with Address Translation Services (ATS) leverages the NVIDIA Hopper Direct Memory Access (DMA) copy engines for accelerating bulk transfers of pageable memory across host and device.
NVLink-C2C enables applications to oversubscribe the GPU’s memory and directly utilize NVIDIA Grace CPU’s memory at high bandwidth. With up to 512 GB of LPDDR5X CPU memory per Grace Hopper Superchip, the GPU has direct high-bandwidth access to 4x more memory than what is available with HBM. Combined with the NVIDIA NVLink Switch System, all GPU threads running on up to 256 NVLink-connected GPUs can now access up to 150 TB of memory at high bandwidth. Fourth-generation NVLink enables accessing peer memory using direct loads, stores, and atomic operations, enabling accelerated applications to solve larger problems more easily than ever.
Together with NVIDIA networking technologies, Grace Hopper Superchips provide the recipe for the next generation of HPC supercomputers and AI factories. Customers can take on larger datasets, more complex models, and new workloads, solving them more quickly than before.
The main innovations of the NVIDIA Grace Hopper Superchip are as follows:
NVIDIA Grace CPU:
Up to 72x Arm Neoverse V2 cores with Armv9.0-A ISA and 4×128-bit SIMD units per core.
Up to 117 MB of L3 Cache.
Up to 512 GB of LPDDR5X memory delivering up to 546 GB/s of memory bandwidth.
Up to 64x PCIe Gen5 lanes.
NVIDIA Scalable Coherency Fabric (SCF) mesh and distributed cache with up to 3.2 TB/s memory bandwidth.
High developer productivity with a single CPU NUMA node.
NVIDIA Hopper GPU:
Up to 144 SMs with fourth-generation Tensor Cores, Transformer Engine, DPX, and 3x higher FP32 and FP64 throughput compared to the NVIDIA A100 GPU.
Up to 96 GB of HBM3 memory delivering up to 3000 GB/s.
60 MB L2 Cache.
NVLink 4 and PCIe 5.
NVIDIA NVLink-C2C:
Hardware-coherent interconnect between the Grace CPU and Hopper GPU.
Up to 900 GB/s total bandwidth, 450 GB/s/dir.
The Extended GPU Memory feature enables the Hopper GPU to address all CPU memory as GPU memory. Each Hopper GPU can address up to 608 GB of memory within a superchip.
NVIDIA NVLink Switch System:
Connects up to 256x NVIDIA Grace Hopper Superchips using NVLink 4.
Each NVLink-connected Hopper GPU can address all HBM3 and LPDDR5X memory of all superchips in the network, for up to 150 TB of GPU addressable memory.
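The memory capacities listed above follow from simple arithmetic on the per-superchip maximums. The sketch below uses only the figures quoted in this post; how the source rounds the aggregate to "up to 150 TB" is not stated, so the comparison at the end is approximate.

```python
LPDDR5X_GB = 512   # maximum CPU memory per superchip (from the text)
HBM3_GB = 96       # maximum GPU memory per superchip (from the text)
SUPERCHIPS = 256   # maximum superchips in one NVLink Switch System

# Memory addressable by one Hopper GPU within its own superchip.
per_superchip_gb = LPDDR5X_GB + HBM3_GB

# Memory addressable across the whole NVLink-connected system.
total_gb = SUPERCHIPS * per_superchip_gb
total_tb_decimal = total_gb / 1000   # 1 TB = 1000 GB
total_tb_binary = total_gb / 1024    # 1 TB = 1024 GB

print(f"Per superchip: {per_superchip_gb} GB")
print(f"256 superchips: {total_gb} GB "
      f"(~{total_tb_decimal:.0f} TB decimal, ~{total_tb_binary:.0f} TB binary)")
```

The per-superchip sum reproduces the 608 GB figure exactly, and the 256-superchip total lands in the neighborhood of the quoted 150 TB.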
Programming model for performance, portability, and productivity
Traditional heterogeneous platforms with PCIe-connected accelerators require users to follow a complex programming model that involves manually managing device memory allocations and data transfer to and from the host.
The NVIDIA Grace Hopper Superchip platform is heterogeneous and easy to program, and NVIDIA is committed to making it accessible to all developers and applications, independent of the programming language of choice.
Both the Grace Hopper Superchip and the platform are built to enable you to pick the right language for the task at hand, and the NVIDIA CUDA LLVM Compiler APIs enable you to bring your preferred programming language to the CUDA platform with the same level of code-generation quality and optimizations as NVIDIA compilers and tools.
The languages provided by NVIDIA for the CUDA platform (Figure 3) include accelerated standard languages like ISO C++, ISO Fortran, and Python. The platform also supports directive-based programming models like OpenACC and OpenMP, as well as the CUDA C++ and CUDA Fortran language extensions. The NVIDIA HPC SDK supports all these approaches, along with a rich set of accelerated libraries and tools for profiling and debugging.
This technology relies heavily on the hardware-accelerated memory coherency provided by NVIDIA NVLink-C2C and NVIDIA Unified Virtual Memory. As shown in Figure 4, in traditional PCIe-connected x86+Hopper systems without ATS, the CPU and the GPU have independent per-process page tables, and system-allocated memory is not directly accessible from the GPU. When a program allocates memory with the system allocator but the page entry is not available in the GPU’s page table, then accessing the memory from a GPU thread fails.
In NVIDIA Grace Hopper Superchip-based systems, ATS enables the CPU and GPU to share a single per-process page table, enabling all CPU and GPU threads to access all system-allocated memory, which can reside on physical CPU or GPU memory. The CPU heap, CPU thread stack, global variables, memory-mapped files, and interprocess memory are accessible to all CPU and GPU threads.
NVIDIA NVLink-C2C hardware-coherency enables the Grace CPU to cache GPU memory at cache-line granularity and for the GPU and CPU to access each other’s memory without page-migrations.
NVLink-C2C also accelerates all atomic operations supported by the CPU and GPU on system-allocated memory. Scoped atomic operations are fully supported and enable fine-grained and scalable synchronization across all threads in the system.
The runtime backs system-allocated memory with physical memory on first touch, either on LPDDR5X or HBM3, depending on whether a CPU or a GPU thread accesses it first. From an OS perspective, the Grace CPU and Hopper GPU are just two separate NUMA nodes. System-allocated memory is migratable so the runtime can change its physical memory backing to improve application performance or deal with memory pressure.
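The first-touch and migration behavior described above can be pictured with a small conceptual model. This is a toy simulation for intuition only, not how the NVIDIA driver or runtime is implemented; the class and method names are hypothetical.

```python
from typing import Dict


class FirstTouchMemory:
    """Toy model of first-touch placement for system-allocated pages."""

    PLACEMENT = {"cpu": "LPDDR5X", "gpu": "HBM3"}

    def __init__(self) -> None:
        # Maps a page number to the physical memory currently backing it.
        self.backing: Dict[int, str] = {}

    def access(self, page: int, accessor: str) -> str:
        """Access a page from 'cpu' or 'gpu' and return its physical backing."""
        if accessor not in self.PLACEMENT:
            raise ValueError(f"unknown accessor {accessor!r}")
        # The page gets physical memory only on first touch, near the toucher.
        if page not in self.backing:
            self.backing[page] = self.PLACEMENT[accessor]
        # Later accesses from either side succeed without moving the page.
        return self.backing[page]

    def migrate(self, page: int, target: str) -> None:
        """Change the physical backing, e.g. under memory pressure."""
        if target not in self.PLACEMENT.values():
            raise ValueError(f"unknown memory {target!r}")
        if page not in self.backing:
            raise KeyError(f"page {page} has not been touched yet")
        self.backing[page] = target


memory = FirstTouchMemory()
first_cpu = memory.access(0, "cpu")   # CPU touches page 0 first
then_gpu = memory.access(0, "gpu")    # GPU reads the same page in place
first_gpu = memory.access(1, "gpu")   # GPU touches page 1 first
memory.migrate(0, "HBM3")             # runtime decides to move page 0
after_migration = memory.access(0, "cpu")
```

The point of the model is that placement is decided by the first accessor, that both processors can keep using a page wherever it lives, and that the backing can change later without the application's pointers changing.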
For PCIe-based platforms such as x86 or Arm, you can use the same Unified Memory programming model as the NVIDIA Grace Hopper model. That is possible through the Heterogeneous Memory Management (HMM) feature, which is a combination of Linux kernel features and NVIDIA driver features that use software to emulate memory coherence between CPUs and GPUs.
On NVIDIA Grace Hopper, these applications transparently benefit from the higher-bandwidth, lower-latency, higher atomic throughput, and hardware acceleration for memory coherency provided by NVLink-C2C, without any software changes.
Superchip architectural features
Here’s a look at the main innovations of the NVIDIA Grace Hopper architecture:
NVIDIA Grace CPU
NVIDIA Hopper GPU
NVLink-C2C
NVLink Switch System
Extended GPU memory
NVIDIA Grace CPU
As the parallel compute capabilities of GPUs continue to triple every generation, a fast and efficient CPU is critical to prevent the serial and CPU-only fractions of modern workloads from dominating performance.
NVIDIA Grace delivers leading per-thread performance, while providing higher energy efficiency than traditional CPUs. The 72 CPU cores deliver up to a 370 (estimated) score on SPECrate 2017_int_base, ensuring high-performance to satisfy the demands of both HPC and AI heterogeneous workloads.
Modern GPU workloads in machine learning and data science need access to huge amounts of memory. Typically, these workloads would have to use multiple GPUs to store the dataset in HBM memory.
The NVIDIA Grace CPU provides up to 512 GB of LPDDR5X memory, which delivers the optimal balance between memory capacity, energy efficiency, and performance. It supplies up to 546 GB/s of LPDDR5X memory bandwidth, which NVLink-C2C makes accessible to the GPU at 900 GB/s total bandwidth.
A single NVIDIA Grace Hopper Superchip provides the Hopper GPU with a total of 608 GB of fast-accessible memory, almost the total amount of slow memory available in a DGX-A100-80, an eight-GPU system of the previous generation.
This is made possible by the NVIDIA SCF shown in Figure 7, a mesh fabric and distributed cache that provides up to 3.2 TB/s of total bisection bandwidth to realize the full performance of CPU cores, memory, system I/Os, and NVLink-C2C. The CPU cores and SCF Cache partitions (SCCs) are distributed throughout the mesh, while Cache Switch Nodes (CSNs) route data through the fabric and serve as interfaces between the CPU cores, cache memory, and the rest of the system.
NVIDIA Hopper GPU
The NVIDIA Hopper GPU is the ninth-generation NVIDIA data center GPU. It is designed to deliver orders-of-magnitude improvements for large-scale AI and HPC applications compared to the previous NVIDIA Ampere GPU generation. The Hopper GPU also features multiple innovations:
New fourth-generation Tensor Cores perform faster matrix computations than ever before on an even broader array of AI and HPC tasks.
A new transformer engine enables H100 to deliver up to 9x faster AI training and up to 30x faster AI inference on large language models compared to the prior-generation NVIDIA A100 GPU.
Improved features for spatial and temporal data locality and asynchronous execution enable applications to always keep all units busy and maximize power efficiency.
Secure Multi-Instance GPU (MIG) partitions the GPU into isolated, right-sized instances to maximize quality of service (QoS) for smaller workloads.
NVIDIA Hopper is the first truly asynchronous GPU. Its Tensor Memory Accelerator (TMA) and asynchronous transaction barrier enable threads to overlap and pipeline independent data movement and data processing, enabling applications to fully utilize all units.
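The benefit of overlapping data movement with data processing can be illustrated with a simple timing model. This is a sketch of a generic two-stage pipeline in plain Python, not CUDA code and not a description of how TMA is implemented. The tile count and per-tile times are hypothetical.

```python
def serialized_time(tiles: int, copy_t: float, compute_t: float) -> float:
    """Each tile is copied and then processed, with no overlap."""
    return tiles * (copy_t + compute_t)


def overlapped_time(tiles: int, copy_t: float, compute_t: float) -> float:
    """The copy of tile i+1 runs while tile i is being processed.

    Only the first copy and the last compute step are fully exposed;
    in between, the slower of the two stages sets the pace.
    """
    if tiles == 0:
        return 0.0
    return copy_t + (tiles - 1) * max(copy_t, compute_t) + compute_t


# Hypothetical numbers: 8 tiles, 1.0 time unit to copy, 1.5 to compute.
TILES, COPY_T, COMPUTE_T = 8, 1.0, 1.5
t_serial = serialized_time(TILES, COPY_T, COMPUTE_T)
t_overlap = overlapped_time(TILES, COPY_T, COMPUTE_T)
print(f"serialized: {t_serial}, overlapped: {t_overlap}, "
      f"gain: {t_serial / t_overlap:.2f}x")
```

In this model the gain is largest when copy and compute times are balanced, because neither stage then leaves the other idle.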
New spatial and temporal locality features like thread block clusters, distributed shared memory, and thread block reconfiguration provide applications with fast access to larger amounts of shared memory and tools. This enables applications to better reuse data while it’s on-chip, further improving application performance.
NVLink-C2C: A high-bandwidth, chip-to-chip interconnect for superchips
NVIDIA Grace Hopper fuses an NVIDIA Grace CPU and NVIDIA Hopper GPU into a single superchip through the NVIDIA NVLink-C2C, a 900 GB/s chip-to-chip coherent interconnect that enables programming the Grace Hopper Superchip with a unified programming model.
The NVLink Chip-2-Chip (C2C) interconnect provides a high-bandwidth direct connection between a Grace CPU and a Hopper GPU to create the Grace Hopper Superchip, which is designed for drop-in acceleration of AI and HPC applications.
With 900 GB/s of bidirectional bandwidth, NVLink-C2C provides 7x the bandwidth of x16 PCIe Gen 5 links at lower latency. NVLink-C2C also uses only 1.3 picojoules per bit transferred, which makes it more than 5x as energy-efficient as PCIe Gen 5.
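A quick back-of-the-envelope calculation shows what these figures imply. The sketch uses only the numbers quoted above. The derived values (the x16 PCIe bandwidth implied by the 7x ratio, the link power at full rate, and the energy-per-bit floor for PCIe) are inferences, not published specifications.

```python
NVLINK_C2C_GBPS = 900.0       # GB/s, bidirectional, from the post
BANDWIDTH_RATIO = 7.0         # from the post: 7x an x16 PCIe link
C2C_ENERGY_PJ_PER_BIT = 1.3   # from the post
EFFICIENCY_RATIO = 5.0        # from the post: more than 5x as efficient

# x16 PCIe bandwidth implied by the 7x claim (bidirectional).
implied_pcie_total_gbps = NVLINK_C2C_GBPS / BANDWIDTH_RATIO

# Power spent on data transfer if the link ran at its full rate.
bits_per_second = NVLINK_C2C_GBPS * 1e9 * 8
link_power_watts = bits_per_second * C2C_ENERGY_PJ_PER_BIT * 1e-12

# "More than 5x" implies PCIe spends more than this per bit.
pcie_energy_floor_pj = EFFICIENCY_RATIO * C2C_ENERGY_PJ_PER_BIT

print(f"Implied x16 PCIe bandwidth: {implied_pcie_total_gbps:.1f} GB/s")
print(f"NVLink-C2C power at full rate: {link_power_watts:.2f} W")
print(f"Implied PCIe energy: more than {pcie_energy_floor_pj:.1f} pJ/bit")
```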
Furthermore, NVLink-C2C is a coherent memory interconnect with native hardware support for system-wide atomic operations. This improves the performance of memory accesses to non-local memory, such as CPU and GPU threads accessing memory resident in the other device. Hardware coherency also improves the performance of synchronization primitives, reducing the time the GPU and CPU spend waiting on each other and increasing total system utilization.
Finally, hardware coherency also simplifies the development of heterogeneous computing applications using popular programming languages and frameworks. For more information, see the NVIDIA Grace Hopper Programming Model section.
NVLink Switch System
The NVIDIA NVLink Switch System combines fourth-generation NVIDIA NVLink technology with the new third-generation NVIDIA NVSwitch. A single level of the NVSwitch connects up to eight Grace Hopper Superchips, and a second level in a fat-tree topology enables networking up to 256 Grace Hopper Superchips with NVLink. A Grace Hopper Superchip pair exchanges data at up to 900 GB/s.
With up to 256 Grace Hopper Superchips, the network delivers up to 115.2 TB/s all-to-all bandwidth. This is 9x the all-to-all bandwidth of the NVIDIA InfiniBand NDR400.
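One way to reproduce these figures is sketched below. The post does not state the formula, so the halving step is an assumption: it treats each transfer as occupying bandwidth at two endpoints. The 100 GB/s per-node figure for NDR400 is the total bandwidth quoted in the InfiniBand section of this post.

```python
SUPERCHIPS = 256
NVLINK_GBPS_PER_SUPERCHIP = 900.0  # GB/s total, from the post
NDR400_GBPS_PER_NODE = 100.0       # GB/s total, from the InfiniBand section


def all_to_all_tbps(nodes: int, per_node_gbps: float) -> float:
    """Aggregate all-to-all bandwidth in TB/s.

    Assumption: every transfer uses bandwidth at both its source and
    its destination, so the per-node totals are summed and halved.
    """
    return nodes * per_node_gbps / 2 / 1000


nvlink_tbps = all_to_all_tbps(SUPERCHIPS, NVLINK_GBPS_PER_SUPERCHIP)
infiniband_tbps = all_to_all_tbps(SUPERCHIPS, NDR400_GBPS_PER_NODE)

print(f"NVLink Switch System: {nvlink_tbps:.1f} TB/s")
print(f"InfiniBand NDR400:    {infiniband_tbps:.1f} TB/s")
print(f"Ratio: {nvlink_tbps / infiniband_tbps:.1f}x")
```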
The fourth-generation NVIDIA NVLink technology enables GPU threads to address up to 150 TB of memory provided by all superchips in the NVLink network using normal memory operations, atomic operations, and bulk transfers. Communication libraries like MPI, NCCL, or NVSHMEM transparently leverage the NVLink Switch System when available.
Extended GPU memory
The NVIDIA Grace Hopper Superchip is designed to accelerate applications with exceptionally large memory footprints, larger than the capacity of the HBM3 and LPDDR5X memory of a single superchip. For more information, see the NVIDIA Grace Hopper Accelerated Applications section.
The Extended GPU Memory (EGM) feature over the high-bandwidth NVLink-C2C enables GPUs to access all the system memory efficiently. EGM provides up to 150 TB of system memory in a multi-node NVSwitch-connected system. With EGM, physical memory can be allocated to be accessible from any GPU thread in the multi-node system. All GPUs can access EGM at the minimum of GPU-GPU NVLink or NVLink-C2C speed.
Memory accesses within a Grace Hopper Superchip configuration go through the local high-bandwidth NVLink-C2C at 900 GB/s total. Remote memory accesses are performed through GPU NVLink and, depending on the memory being accessed, also NVLink-C2C (Figure 11). With EGM, GPU threads can now access all memory resources available over the NVSwitch fabric, both LPDDR5X and HBM3, at 450 GB/s.
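The capacity and bandwidth figures for EGM can be put together in a small model. This is a sketch under stated assumptions: it multiplies the 608 GB per-superchip figure by the 256-superchip maximum to approximate the "up to 150 TB" total, and it uses the 450 GB/s figure quoted above as the GPU NVLink rate for remote accesses. The function name and its parameters are hypothetical.

```python
SUPERCHIP_MEMORY_GB = 608   # from the post: LPDDR5X plus HBM3 per superchip
MAX_SUPERCHIPS = 256        # from the post: largest NVLink-connected system
NVLINK_C2C_GBPS = 900.0     # from the post: local CPU-GPU link, total
REMOTE_NVLINK_GBPS = 450.0  # from the post: access rate over the NVSwitch fabric

total_memory_gb = SUPERCHIP_MEMORY_GB * MAX_SUPERCHIPS


def egm_access_gbps(is_local: bool) -> float:
    """Peak rate at which a GPU thread reaches a given memory location.

    Local accesses use NVLink-C2C only. Remote accesses are limited by
    the slower of GPU NVLink and NVLink-C2C, as the post describes.
    """
    if is_local:
        return NVLINK_C2C_GBPS
    return min(REMOTE_NVLINK_GBPS, NVLINK_C2C_GBPS)


print(f"Addressable memory: {total_memory_gb} GB "
      f"(about {total_memory_gb / 1000:.0f} TB in decimal units)")
print(f"Local access:  {egm_access_gbps(True):.0f} GB/s")
print(f"Remote access: {egm_access_gbps(False):.0f} GB/s")
```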
NVIDIA HGX Grace Hopper
NVIDIA HGX Grace Hopper has a single Grace Hopper Superchip per node, paired with BlueField-3 NICs or OEM-defined I/O and optionally an NVLink Switch System. It can be air- or liquid-cooled and has up to 1,000 W TDP.
NVIDIA HGX Grace Hopper with InfiniBand
NVIDIA HGX Grace Hopper with InfiniBand (Figure 13) is ideal for the scale-out of traditional machine learning (ML) and HPC workloads that are not bottlenecked by the network communication overheads of InfiniBand, which is one of the fastest interconnects available.
Each node contains one Grace Hopper Superchip and one or more PCIe devices like NVMe solid-state drives and BlueField-3 DPUs, NVIDIA ConnectX-7 NICs, or OEM-defined I/O. With 16x PCIe Gen 5 lanes, an NDR400 InfiniBand NIC provides up to 100 GB/s of total bandwidth across the superchips. Combined with NVIDIA BlueField-3 DPUs, this platform is easy to manage and deploy and uses a traditional HPC and AI cluster networking architecture.
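The 100 GB/s figure follows from the link rate, as the sketch below shows. It assumes that "NDR400" denotes 400 Gb/s in each direction and that PCIe Gen 5 signals at 32 GT/s per lane. Encoding and protocol overheads are ignored, so the numbers are upper bounds.

```python
NDR400_GBIT_PER_DIRECTION = 400.0  # Gb/s; assumption about the NDR400 rate
PCIE5_GT_PER_LANE = 32.0           # GT/s; assumption, overheads ignored
PCIE_LANES = 16                    # from the post: 16 PCIe Gen 5 lanes

# InfiniBand NIC: convert gigabits to gigabytes, then count both directions.
ndr_one_way_gbps = NDR400_GBIT_PER_DIRECTION / 8
ndr_total_gbps = 2 * ndr_one_way_gbps

# Host interface: raw per-direction rate of the x16 link.
pcie_one_way_gbps = PCIE5_GT_PER_LANE * PCIE_LANES / 8

# The NIC can only run at line rate if the host link is at least as fast.
headroom_gbps = pcie_one_way_gbps - ndr_one_way_gbps

print(f"NDR400 NIC: {ndr_one_way_gbps:.0f} GB/s per direction, "
      f"{ndr_total_gbps:.0f} GB/s total")
print(f"PCIe Gen 5 x16: {pcie_one_way_gbps:.0f} GB/s per direction, "
      f"{headroom_gbps:.0f} GB/s of headroom")
```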
NVIDIA HGX Grace Hopper with NVLink Switch
NVIDIA HGX Grace Hopper with NVLink Switch is ideal for strong-scaling giant machine learning and HPC workloads. It enables all GPU threads in the NVLink-connected domain to address up to 150 TB of memory at up to 900 GB/s total bandwidth per superchip in a 256-GPU NVLink-connected system. The simple programming model uses pointer load, store, and atomic operations. Its 450 GB/s all-reduce bandwidth and up to 115.2 TB/s bisection bandwidth make this platform ideal for strong-scaling the world’s largest and most challenging AI training and HPC workloads.
NVLink-connected domains are networked with NVIDIA InfiniBand networking, for example, NVIDIA ConnectX-7 NICs or NVIDIA BlueField-3 data processing units (DPUs) paired with NVIDIA Quantum-2 NDR switches or OEM-defined I/O solutions.
Delivering performance breakthroughs
The NVIDIA Grace Hopper Superchip Architecture whitepaper expands on the details covered in this post. It walks you through how Grace Hopper delivers the performance breakthroughs shown in Figure 1 over what are currently the most powerful PCIe-based accelerated platforms, powered by NVIDIA Hopper H100 PCIe GPUs.
Do you have any applications that would be perfect for the NVIDIA Grace Hopper Superchip? Let us know in the comments!
Acknowledgments
We would like to thank Jack Choquette, Ronny Krashinsky, John Hubbard, Mark Hummel, Greg Palmer, Ryan Wells, Alex Ishii, Jonah Alben, and the many NVIDIA architects and engineers who contributed to this post.