
Towards Helpful Robots: Grounding Language in Robotic Affordances

Over the last several years, we have seen significant progress in applying machine learning to robotics. However, robotic systems today are capable of executing only very short, hard-coded commands, such as “Pick up an apple,” because they tend to perform best with clear tasks and rewards. They struggle with learning to perform long-horizon tasks and reasoning about abstract goals, such as a user prompt like “I just worked out, can you get me a healthy snack?”

Meanwhile, recent progress in training language models (LMs) has led to systems that can perform a wide range of language understanding and generation tasks with impressive results. However, these language models are inherently not grounded in the physical world due to the nature of their training process: a language model generally does not interact with its environment nor observe the outcome of its responses. This can result in it generating instructions that may be illogical, impractical or unsafe for a robot to complete in a physical context. For example, when prompted with “I spilled my drink, can you help?” the language model GPT-3 responds with “You could try using a vacuum cleaner,” a suggestion that may be unsafe or impossible for the robot to execute. When asked the same question, the FLAN language model apologizes for the spill with “I’m sorry, I didn’t mean to spill it,” which is not a very useful response. Therefore, we asked ourselves, is there an effective way to combine advanced language models with robot learning algorithms to leverage the benefits of both?

In “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”, we present a novel approach, developed in partnership with Everyday Robots, that leverages advanced language model knowledge to enable a physical agent, such as a robot, to follow high-level textual instructions for physically-grounded tasks, while grounding the language model in tasks that are feasible within a specific real-world context. We evaluate our method, which we call PaLM-SayCan, by placing robots in a real kitchen setting and giving them tasks expressed in natural language. We observe highly interpretable results for temporally-extended complex and abstract tasks, like “I just worked out, please bring me a snack and a drink to recover.” Specifically, we demonstrate that grounding the language model in the real world nearly halves errors over non-grounded baselines. We are also excited to release a robot simulation setup where the research community can test this approach.


With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.


A Dialog Between User and Robot, Facilitated by the Language Model
Our approach uses the knowledge contained in language models (Say) to determine and score actions that are useful towards high-level instructions. It also uses an affordance function (Can) that enables real-world grounding and determines which actions are possible to execute in a given environment. Because we use the PaLM language model, we call this PaLM-SayCan.

Our approach selects skills based on what the language model scores as useful to the high-level instruction and what the affordance model scores as possible.

Our system can be seen as a dialog between the user and robot, facilitated by the language model. The user starts by giving an instruction that the language model turns into a sequence of steps for the robot to execute. This sequence is filtered using the robot’s skillset to determine the most feasible plan given its current state and environment. The model determines the probability of a specific skill successfully making progress toward completing the instruction by multiplying two probabilities: (1) task-grounding (i.e., a skill language description) and (2) world-grounding (i.e., skill feasibility in the current state).
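The scoring rule described above can be sketched in a few lines of Python. This is only an illustration: the skill names and all numeric scores are hypothetical, and `select_skill` is a name introduced here, not part of any released code.

```python
def select_skill(lm_scores, affordance_scores):
    """Combine task-grounding and world-grounding scores by multiplication.

    Returns the best skill and the combined score of every skill.
    """
    combined = {
        skill: lm_scores[skill] * affordance_scores[skill]
        for skill in lm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "I spilled my drink, can you help?"
lm_scores = {
    "find a sponge": 0.50,
    "find a vacuum cleaner": 0.30,
    "go to the table": 0.20,
}
# Hypothetical affordances: no vacuum cleaner is visible in the current scene.
affordance_scores = {
    "find a sponge": 0.80,
    "find a vacuum cleaner": 0.05,
    "go to the table": 0.90,
}

best, combined = select_skill(lm_scores, affordance_scores)
print(best)
```

In this toy setting the vacuum cleaner option is plausible to the language model but is suppressed by its low affordance score, which is the behavior the grounding is meant to provide.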

There are additional benefits of our approach in terms of its safety and interpretability. First, by allowing the LM to score different options rather than generate the most likely output, we effectively constrain the LM to only output one of the pre-selected responses. In addition, the user can easily understand the decision making process by looking at the separate language and affordance scores, rather than a single output.

PaLM-SayCan is also interpretable: at each step, we can see the top options it considers based on their language score (blue), affordance score (red), and combined score (green).

Training Policies and Value Functions
Each skill in the agent’s skillset is defined as a policy with a short language description (e.g., “pick up the can”), represented as embeddings, and an affordance function that indicates the probability of completing the skill from the robot’s current state. To learn the affordance functions, we use sparse reward functions set to 1.0 for a successful execution, and 0.0 otherwise.

We use image-based behavioral cloning (BC) to train the language-conditioned policies and temporal-difference-based (TD) reinforcement learning (RL) to train the value functions. To train the policies, we collected data from 68,000 demos performed by 10 robots over 11 months and added 12,000 successful episodes, filtered from a set of autonomous episodes of learned policies. We then learned the language conditioned value functions using MT-Opt in the Everyday Robots simulator. The simulator complements our real robot fleet with a simulated version of the skills and environment, which is transformed using RetinaGAN to reduce the simulation-to-real gap. We bootstrapped simulation policies’ performance by using demonstrations to provide initial successes, and then continuously improved RL performance with online data collection in simulation.

Given a high-level instruction, our approach combines the probabilities from the language model with the probabilities from the value function (VF) to select the next skill to perform. This process is repeated until the high-level instruction is successfully completed.
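The repeated selection step can be sketched as a greedy loop. The sketch below assumes a dedicated "done" skill that ends the plan, and it uses toy stand-ins (`toy_lm_score`, `toy_affordance`) in place of the language model and the learned value functions; all names and numbers are hypothetical.

```python
def plan(instruction, skills, lm_score, affordance, max_steps=10):
    """Greedily build a plan by repeatedly picking the best-scoring skill."""
    steps = []
    for _ in range(max_steps):
        combined = {
            skill: lm_score(instruction, steps, skill) * affordance(steps, skill)
            for skill in skills
        }
        best = max(combined, key=combined.get)
        if best == "done":  # assumed termination skill
            break
        steps.append(best)
    return steps


# Toy stand-ins for the language model and the value functions (hypothetical).
SKILLS = ["find an apple", "pick up the apple", "bring it to you", "done"]


def toy_lm_score(instruction, steps, skill):
    # Pretend the language model strongly prefers the next step of a fixed plan.
    expected = SKILLS[min(len(steps), len(SKILLS) - 1)]
    return 0.7 if skill == expected else 0.1


def toy_affordance(steps, skill):
    # Picking up the apple is unlikely to succeed before it has been found.
    if skill == "pick up the apple" and "find an apple" not in steps:
        return 0.05
    return 0.9


result = plan("Bring me an apple", SKILLS, toy_lm_score, toy_affordance)
print(result)
```

The `max_steps` cap is a safety choice for the sketch so that the loop always terminates even if "done" never wins.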

Performance on Temporally-Extended, Complex, and Abstract Instructions
To test our approach, we use robots from Everyday Robots paired with PaLM. We place the robots in a kitchen environment containing common objects and evaluate them on 101 instructions to test their performance across various robot and environment states, instruction language complexity and time horizon. Specifically, these instructions were designed to showcase the ambiguity and complexity of language rather than to provide simple, imperative queries, enabling queries such as “I just worked out, how would you bring me a snack and a drink to recover?” instead of “Can you bring me water and an apple?”

We use two metrics to evaluate the system’s performance: (1) the plan success rate, indicating whether the robot chose the right skills for the instruction, and (2) the execution success rate, indicating whether it performed the instruction successfully. We compare two language models, PaLM and FLAN (a smaller language model fine-tuned on instruction answering), with and without the affordance grounding, as well as the underlying policies running directly with natural language (Behavioral Cloning in the table below). The results show that the system using PaLM with affordance grounding (PaLM-SayCan) chooses the correct sequence of skills 84% of the time and executes them successfully 74% of the time, roughly halving planning errors compared to FLAN with affordance grounding and compared to PaLM without robotic grounding. This is particularly exciting because it represents the first time we can see how an improvement in language models translates to a similar improvement in robotics. This result indicates a potential future where robotics is able to ride the wave of progress that we have been observing in language models, bringing these subfields of research closer together.

Algorithm            Plan    Execute
PaLM-SayCan          84%     74%
PaLM                 67%     -
FLAN-SayCan          70%     61%
FLAN                 38%     -
Behavioral Cloning   0%      0%
PaLM-SayCan halves errors compared to PaLM without affordances and compared to FLAN over 101 tasks.
SayCan demonstrated successful planning for 84% of the 101 test instructions when combined with PaLM.
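As a quick check of the "halves errors" claim, the relative reduction in planning errors can be computed directly from the plan success rates in the table. The helper name `relative_error_reduction` is introduced here for illustration only.

```python
# Plan success rates (in percent) copied from the table above.
plan_success = {
    "PaLM-SayCan": 84,
    "PaLM": 67,
    "FLAN-SayCan": 70,
    "FLAN": 38,
}


def relative_error_reduction(baseline_success, improved_success):
    """Fraction of the baseline's errors removed by the improved system."""
    baseline_error = 100 - baseline_success
    improved_error = 100 - improved_success
    return (baseline_error - improved_error) / baseline_error


vs_palm = relative_error_reduction(plan_success["PaLM"], plan_success["PaLM-SayCan"])
vs_flan_saycan = relative_error_reduction(
    plan_success["FLAN-SayCan"], plan_success["PaLM-SayCan"]
)
print(f"vs PaLM without affordances: {vs_palm:.0%}")
print(f"vs FLAN-SayCan: {vs_flan_saycan:.0%}")
```

Planning errors drop from 33% to 16% against PaLM without affordances and from 30% to 16% against FLAN-SayCan, which is about half in both cases.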



Conclusion and Future Work
We’re excited about the progress that we’ve seen with PaLM-SayCan, an interpretable and general approach to leveraging knowledge from language models that enables a robot to follow high-level textual instructions to perform physically-grounded tasks. Our experiments on a number of real-world robotic tasks demonstrate the ability to plan and complete long-horizon, abstract, natural language instructions at a high success rate. We believe that PaLM-SayCan’s interpretability allows for safe real-world user interaction with robots. As we explore future directions for this work, we hope to better understand how information gained via the robot’s real-world experience could be leveraged to improve the language model and to what extent natural language is the right ontology for programming robots. We have open-sourced a robot simulation setup, which we hope will provide researchers with a valuable resource for future research that combines robotic learning with advanced language models. The research community can visit the project’s GitHub page and website to learn more.

We’d like to thank our coauthors Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Kelly Fu, Keerthana Gopalakrishnan, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. We’d also like to thank Yunfei Bai, Matt Bennice, Maarten Bosma, Justin Boyd, Bill Byrne, Kendra Byrne, Noah Constant, Pete Florence, Laura Graesser, Rico Jonschkowski, Daniel Kappler, Hugo Larochelle, Benjamin Lee, Adrian Li, Suraj Nair, Krista Reymann, Jeff Seto, Dhruv Shah, Ian Storz, Razvan Surdulescu, and Vincent Zhao for their help and support in various aspects of the project. And we’d like to thank Tom Small for creating many of the animations in this post.


Digital Art Professor Kate Parsons Inspires Next Generation of Creators This Week ‘In the NVIDIA Studio’

Many artists can edit a video, paint a picture or build a model — but transforming one’s imagination into stunning creations can now involve breakthrough design technologies. Kate Parsons, a digital art professor at Pepperdine University and this week’s featured In the NVIDIA Studio artist, helped bring a music video for How Do I Get to Invincible to life using virtual reality and NVIDIA GeForce RTX GPUs.



NVIDIA GTC to Feature CEO Jensen Huang Keynote Announcing New AI and Metaverse Technologies, 200+ Sessions With Top Tech, Business Execs

Deep Learning Pioneers Yoshua Bengio, Geoff Hinton, Yann LeCun Among the Scores of Industry Experts to Present at World’s Premier AI Conference, Sept. 19-22

SANTA CLARA, Calif., Aug. 15, 2022 …


From Sapling to Forest: Five Sustainability and Employment Initiatives We’re Nurturing in India

For over a decade, NVIDIA has invested in social causes and communities in India as part of our commitment to corporate social responsibility. Bolstering those efforts, we’re unveiling this year’s investments in five projects that have been selected by the NVIDIA Foundation team, focused on the areas of environmental conservation, ecological restoration, social innovation and job …



Rax: Composable Learning-to-Rank Using JAX

Ranking is a core problem across a variety of domains, such as search engines, recommendation systems, or question answering. As such, researchers often utilize learning-to-rank (LTR), a set of supervised machine learning techniques that optimize for the utility of an entire list of items (rather than a single item at a time). A noticeable recent focus is on combining LTR with deep learning. Existing libraries, most notably TF-Ranking, offer researchers and practitioners the necessary tools to use LTR in their work. However, none of the existing LTR libraries work natively with JAX, a new machine learning framework that provides an extensible system of function transformations that compose: automatic differentiation, JIT-compilation to GPU/TPU devices and more.

Today, we are excited to introduce Rax, a library for LTR in the JAX ecosystem. Rax brings decades of LTR research to the JAX ecosystem, making it possible to apply JAX to a variety of ranking problems and combine ranking techniques with recent advances in deep learning built upon JAX (e.g., T5X). Rax provides state-of-the-art ranking losses, a number of standard ranking metrics, and a set of function transformations to enable ranking metric optimization. All this functionality is provided with a well-documented and easy-to-use API that will look and feel familiar to JAX users. Please check out our paper for more technical details.

Learning-to-Rank Using Rax
Rax is designed to solve LTR problems. To this end, Rax provides loss and metric functions that operate on batches of lists, not batches of individual data points as is common in other machine learning problems. An example of such a list is the multiple potential results from a search engine query. The figure below illustrates how tools from Rax can be used to train neural networks on ranking tasks. In this example, the green items (B, F) are very relevant, the yellow items (C, E) are somewhat relevant and the red items (A, D) are not relevant. A neural network is used to predict a relevancy score for each item, then these items are sorted by these scores to produce a ranking. A Rax ranking loss incorporates the entire list of scores to optimize the neural network, improving the overall ranking of the items. After several iterations of stochastic gradient descent, the neural network learns to score the items such that the resulting ranking is optimal: relevant items are placed at the top of the list and non-relevant items at the bottom.

Using Rax to optimize a neural network for a ranking task. The green items (B, F) are very relevant, the yellow items (C, E) are somewhat relevant and the red items (A, D) are not relevant.
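To make the idea of a listwise loss concrete, here is a minimal plain-Python sketch of a softmax cross-entropy loss computed over one whole list. It is not Rax's implementation; the function name, the numeric relevance labels (2, 1, 0 for very, somewhat and not relevant) and the scores are assumptions chosen to mirror the A to F example above.

```python
import math


def softmax_listwise_loss(scores, labels):
    """Softmax cross-entropy over one list: -sum_i label_i * log_softmax(scores)_i."""
    max_score = max(scores)
    # Log of the softmax normalizer, computed stably.
    log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return -sum(y * (s - log_z) for s, y in zip(scores, labels))


# Items A..F with assumed labels: B, F very relevant; C, E somewhat; A, D not.
labels = [0, 2, 1, 0, 1, 2]

# Scores that agree with the labels versus scores that invert them.
good_scores = [0.0, 2.0, 1.0, 0.0, 1.0, 2.0]
bad_scores = [2.0, 0.0, 1.0, 2.0, 1.0, 0.0]

good_loss = softmax_listwise_loss(good_scores, labels)
bad_loss = softmax_listwise_loss(bad_scores, labels)
print(good_loss < bad_loss)
```

Because the loss looks at all scores of the list at once, moving a relevant item up changes the loss contribution of every other item through the shared normalizer.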

Approximate Metric Optimization
The quality of a ranking is commonly evaluated using ranking metrics, e.g., the normalized discounted cumulative gain (NDCG). An important objective of LTR is to optimize a neural network so that it scores highly on ranking metrics. However, ranking metrics like NDCG can present challenges because they are often discontinuous and flat, so stochastic gradient descent cannot directly be applied to these metrics. Rax provides state-of-the-art approximation techniques that make it possible to produce differentiable surrogates to ranking metrics that permit optimization via gradient descent. The figure below illustrates the use of rax.approx_t12n, a function transformation unique to Rax, which allows for the NDCG metric to be transformed into an approximate and differentiable form.

Using an approximation technique from Rax to transform the NDCG ranking metric into a differentiable and optimizable ranking loss (approx_t12n and gumbel_t12n).

First, notice how the NDCG metric (in green) is flat and discontinuous, making it hard to optimize using stochastic gradient descent. By applying the rax.approx_t12n transformation to the metric, we obtain ApproxNDCG, an approximate metric that is now differentiable with well-defined gradients (in red). However, it potentially has many local optima — points where the loss is locally optimal, but not globally optimal — in which the training process can get stuck. When the loss encounters such a local optimum, training procedures like stochastic gradient descent will have difficulty improving the neural network further.
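The contrast between the flat metric and its smooth surrogate can be reproduced in plain Python. The sketch below assumes the common ApproxNDCG construction, in which each item's rank is replaced by a sum of sigmoids over score differences; the function names, the exponential gain `2**label - 1` and the temperature value are choices made for this illustration, not Rax's API.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dcg(labels_in_rank_order):
    return sum(
        (2 ** y - 1) / math.log2(rank + 1)
        for rank, y in enumerate(labels_in_rank_order, start=1)
    )


def ndcg(scores, labels):
    """Exact NDCG: depends on the scores only through the sort order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return dcg([labels[i] for i in order]) / dcg(sorted(labels, reverse=True))


def approx_ndcg(scores, labels, temperature=0.1):
    """Smooth surrogate: rank_i ~ 1 + sum_{j != i} sigmoid((s_j - s_i) / T)."""
    total = 0.0
    for i, (s_i, y_i) in enumerate(zip(scores, labels)):
        approx_rank = 1.0 + sum(
            sigmoid((s_j - s_i) / temperature)
            for j, s_j in enumerate(scores)
            if j != i
        )
        total += (2 ** y_i - 1) / math.log2(1.0 + approx_rank)
    return total / dcg(sorted(labels, reverse=True))


labels = [0, 2, 1]
scores_a = [0.1, 0.9, 0.5]
scores_b = [0.1, 0.9, 0.6]  # same ordering, slightly different scores

print(ndcg(scores_a, labels), ndcg(scores_b, labels))
print(approx_ndcg(scores_a, labels), approx_ndcg(scores_b, labels))
```

Both score vectors sort the items identically, so the exact metric cannot tell them apart, while the surrogate responds to the change in margin and therefore provides a usable gradient signal.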

To overcome this, we can obtain the Gumbel version of ApproxNDCG by using the rax.gumbel_t12n transformation. This Gumbel version introduces noise in the ranking scores, which causes the loss to sample many different rankings that may incur a non-zero cost (in blue). This stochastic treatment may help the loss escape local optima and is often a better choice when training a neural network on a ranking metric. Rax, by design, allows the approximate and Gumbel transformations to be freely used with all metrics that are offered by the library, including metrics with a top-k cutoff value, like recall or precision. In fact, it is even possible to implement your own metrics and transform them to obtain Gumbel-approximate versions that permit optimization without any extra effort.
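The effect of the noise can be illustrated without Rax. The following sketch adds independent Gumbel(0, 1) noise to a list of scores and records the resulting orderings; the function names, the number of samples and the seed are assumptions for this illustration, and the way rax.gumbel_t12n combines the samples into a loss is not reproduced here.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample via inverse transform sampling."""
    u = max(rng.random(), 1e-12)  # guard against log(0)
    return -math.log(-math.log(u))


def sample_rankings(scores, n_samples=100, seed=0):
    """Perturb the scores with Gumbel noise and return the sampled orderings."""
    rng = random.Random(seed)
    rankings = []
    for _ in range(n_samples):
        noisy = [s + sample_gumbel(rng) for s in scores]
        order = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)
        rankings.append(tuple(order))
    return rankings


rankings = sample_rankings([1.0, 0.9, 0.8], n_samples=200)
distinct = set(rankings)
print(len(distinct))
```

With closely spaced scores the noise produces several different orderings, so a loss averaged over these samples sees rankings other than the single one implied by the unperturbed scores.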

Ranking in the JAX Ecosystem
Rax is designed to integrate well in the JAX ecosystem and we prioritize interoperability with other JAX-based libraries. For example, a common workflow for researchers that use JAX is to use TensorFlow Datasets to load a dataset, Flax to build a neural network, and Optax to optimize the parameters of the network. Each of these libraries composes well with the others and the composition of these tools is what makes working with JAX both flexible and powerful. For researchers and practitioners of ranking systems, the JAX ecosystem was previously missing LTR functionality, and Rax fills this gap by providing a collection of ranking losses and metrics. We have carefully constructed Rax to function natively with standard JAX transformations such as jax.jit and jax.grad and various libraries like Flax and Optax. This means that users can freely use their favorite JAX and Rax tools together.

Ranking with T5
While giant language models such as T5 have shown great performance on natural language tasks, how to leverage ranking losses to improve their performance on ranking tasks, such as search or question answering, is under-explored. With Rax, it is possible to fully tap this potential. Rax is written as a JAX-first library, thus it is easy to integrate it with other JAX libraries. Since T5X is an implementation of T5 in the JAX ecosystem, Rax can work with it seamlessly.

To this end, we have an example that demonstrates how Rax can be used in T5X. By incorporating ranking losses and metrics, it is now possible to fine-tune T5 for ranking problems, and our results indicate that enhancing T5 with ranking losses can offer significant performance improvements. For example, on the MS-MARCO QNA v2.1 benchmark we are able to achieve a +1.2% NDCG and +1.7% MRR by fine-tuning a T5-Base model using the Rax listwise softmax cross-entropy loss instead of a pointwise sigmoid cross-entropy loss.

Fine-tuning a T5-Base model on MS-MARCO QNA v2.1 with a ranking loss (softmax, in blue) versus a non-ranking loss (pointwise sigmoid, in red).
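For readers unfamiliar with the second metric reported above, mean reciprocal rank (MRR) averages, over all lists, the reciprocal of the position of the first relevant item. A minimal plain-Python sketch follows; the function names and the toy data are hypothetical and unrelated to the MS-MARCO results.

```python
def reciprocal_rank(scores, labels):
    """1 / rank of the highest-scored relevant item, or 0.0 if none is relevant."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, i in enumerate(order, start=1):
        if labels[i] > 0:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(batch):
    """Average the reciprocal rank over a batch of (scores, labels) lists."""
    return sum(reciprocal_rank(s, y) for s, y in batch) / len(batch)


batch = [
    ([0.9, 0.2, 0.1], [1, 0, 0]),  # relevant item ranked first
    ([0.3, 0.8, 0.1], [1, 0, 0]),  # relevant item ranked second
]
mrr = mean_reciprocal_rank(batch)
print(mrr)
```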

Overall, Rax is a new addition to the growing ecosystem of JAX libraries. Rax is entirely open source and available to everyone on GitHub. More technical details can also be found in our paper. We encourage everyone to explore the examples included in the GitHub repository: (1) optimizing a neural network with Flax and Optax, (2) comparing different approximate metric optimization techniques, and (3) how to integrate Rax with T5X.

Many collaborators within Google made this project possible: Xuanhui Wang, Zhen Qin, Le Yan, Rama Kumar Pasumarthi, Michael Bendersky, Marc Najork, Fernando Diaz, Ryan Doherty, Afroz Mohiuddin, and Samer Hassan.


Get Hands-On Training from NVIDIA Experts at GTC

What if you could spend 8 hours with an AI legend while getting hands-on experience using some of the most advanced GPU and DPU technology available?

As part of the upcoming GPU Technology Conference (GTC), the NVIDIA Deep Learning Institute (DLI) is offering 20 full-day workshops covering a range of deep learning, data science, and accelerated computing topics. In each workshop, you are given access to a fully configured, GPU-accelerated server in the cloud. You gain experience building and deploying an end-to-end project using industry-standard software, tools, and frameworks while learning from some of the most experienced AI practitioners in the industry.

DLI workshops are currently $99 until August 29 and $149 as of August 30. Register now!

All workshops are created and taught by NVIDIA experts. Here are three who are teaching DLI workshops at GTC:

  • Bob Crovella (USA)
  • Adam Grzywaczewski (England)
  • Gwangsoo Hong (Korea)

Bob Crovella, NVIDIA solution architect (USA)  

Photo of Bob Crovella
Bob Crovella

Bob has been a solution architect and field application engineer in the areas of scientific simulation, HPC, and deep learning for almost 25 years at NVIDIA. He and his teams have helped hundreds of customers and partners figure out how to leverage the capabilities of accelerated computing to solve some of the world’s most difficult problems.

After NVIDIA introduced CUDA in 2007, Bob was one of the first to train customers and partners on how to unlock the power of GPUs and has since become one of the leading experts on parallel computing architecture.   

“It’s breathtaking. When I first learned to program CUDA, I was amazed by what the machine is capable of and the power you can unlock with your code. You witness something speed up dramatically, like 10X or 100X faster. And suddenly you have this realization that this thing is everything they said it was. I know this is kind of geeky, right? But through my work and teaching DLI, I get to give that same opportunity to others to experience that kind of excitement—to program the most powerful piece of processing hardware on the planet. It’s not an experience that everyone gets.” 

Bob earned a BS in electrical and electronics engineering from the University of Buffalo and an MS in electrical engineering from Rensselaer Polytechnic Institute. He is currently certified to teach four DLI courses.

At GTC, Bob will be teaching Scaling CUDA C++ Applications to Multiple Nodes on Monday, September 19 from 9 AM to 5 PM PDT. “This is one of the most advanced CUDA programming classes that we offer. We help students take GPU programming to the next level: using multiple computers in a cluster to solve bigger and bigger problems.”

Adam Grzywaczewski, NVIDIA senior deep learning solution architect (England)

Photo of Adam Grzywaczewski
Adam Grzywaczewski

Adam is a senior deep learning solution architect at NVIDIA. Over the last 5 years, he has worked with hundreds of customers helping them design and implement AI solutions. He specializes in large training workloads and neural networks that require hundreds of GPUs to train and run.

“When I first started at NVIDIA, DGX was new. In fact, I have the first prototype of a DGX Station here under my desk. Over time, I have seen customers systematically start to migrate to intensive work on very large systems, very large training jobs, and a surprisingly large number of inference workloads. We are seeing customers have a lot of very serious conversations and engineering work around deployment to production.”

Adam has co-authored two DLI workshops, is certified in six workshops, and has taught the most workshops among the EMEA solution architects in the past year.    

“Our workshops are very focused, and they are designed with a very pragmatic attitude—to solve the problems that they are advertised to solve. We distill huge amounts of knowledge into each course, information that doesn’t exist in such a distilled format anywhere else. And you get direct access to fully configured GPUs. In a course that we just released, the student starts the training process of an extremely large language model and then deploys that model to production. And with just a couple of clicks, teaches that model how to translate and how to answer the questions. It’s actually quite empowering.”

Adam received his BS in information retrieval from Coventry University, his MS in computer science from the Silesian University of Technology, and his PhD from Coventry University.

At GTC, Adam will be teaching Model Parallelism: Building and Deploying Large Neural Networks.

Gwangsoo Hong, NVIDIA solution architect (Korea)

Photo of Gwangsoo Hong
Gwangsoo Hong

Gwangsoo has been a solution architect with NVIDIA for almost 4 years. His current responsibilities include helping customers get the most value out of their NVIDIA full-stack platform. He specializes in computer vision and NLP with deep learning with expertise in GPU acceleration for large-scale models. He is certified in eight DLI workshops and is one of our most sought-after instructors in Korea.

“The part I love the most about being a DLI instructor is working with various students and teaching them about end-to-end deep learning workloads like training, inference, and services; helping them learn about different workloads and application domains; and materializing their ideas. It’s also rewarding to teach students of all backgrounds and ages and see them successfully complete the DLI course. I learn something from each of them. The reaction I get most often from my students is, ‘This can’t be.’”

At GTC, he will also be teaching Model Parallelism: Building and Deploying Large Neural Networks on Wednesday, September 21 from 9 AM to 6 PM KST.

Register now for early discounts

Don’t miss this unique opportunity to take your AI skills to the next level. Registration for the conference is free and the DLI workshops are offered at a special price of $149, or $99 if you register before August 28 (normally $500 per seat).

For the complete list, see GTC Workshops & Training. Some workshops are available in Taiwanese, Korean, and Japanese and are scheduled in those respective time zones.


New NVIDIA Neural Graphics SDKs Make Metaverse Content Creation Available to All

A dozen tools and programs—including new releases NeuralVDB and Kaolin Wisp—make 3D content creation easy and fast for millions of designers and creators.



Upcoming Webinar: Designing Efficient Vision Transformer Networks for Autonomous Vehicles

Explore design principles for efficient transformers in production and how innovative model design can help achieve better accuracy in AV perception.



Top Israel Medical Center Partners with AI Startups to Help Detect Brain Bleeds, Other Critical Cases

Israel’s largest private medical center is working with startups and researchers to bring potentially life-saving AI solutions to real-world healthcare workflows. With more than 1.5 million patients across eight medical centers, Assuta Medical Centers conduct over 100,000 surgeries, 800,000 imaging tests and hundreds of thousands of other health diagnostics and treatments each year. These create …



GFN Thursday Brings Thunder to the Cloud With ‘Rumbleverse’ Arriving on GeForce NOW

It’s time to rumble in Grapital City with Rumbleverse launching today on GeForce NOW. Punch your way into the all-new, free-to-play Brawler Royale from Iron Galaxy Studios and Epic Games Publishing, streaming from the cloud to nearly all devices. That means gamers can tackle, uppercut, body slam and more from any GeForce NOW-compatible device, including …
