
Better Language Models Without Massive Compute

In recent years, language models (LMs) have become more prominent in natural language processing (NLP) research and are also becoming increasingly impactful in practice. Scaling up LMs has been shown to improve performance across a range of NLP tasks. For instance, scaling up language models can improve perplexity across seven orders of magnitude of model sizes, and new abilities such as multi-step reasoning have been observed to arise as a result of model scale. However, one of the challenges of continued scaling is that training new, larger models requires large amounts of computational resources. Moreover, new models are often trained from scratch and do not leverage the weights from previously existing models.

In this blog post, we explore two complementary methods for improving existing language models by a large margin without using massive computational resources. First, in “Transcending Scaling Laws with 0.1% Extra Compute”, we introduce UL2R, which is a lightweight second stage of pre-training that uses a mixture-of-denoisers objective. UL2R improves performance across a range of tasks and even unlocks emergent performance on tasks that previously had close to random performance. Second, in “Scaling Instruction-Finetuned Language Models”, we explore fine-tuning a language model on a collection of datasets phrased as instructions, a process we call “Flan”. This approach not only boosts performance, but also makes the language model more usable for direct user inputs without prompt engineering. Finally, we show that Flan and UL2R can be combined as complementary techniques in a model called Flan-U-PaLM 540B, which outperforms the unadapted PaLM 540B model by 10% across a suite of challenging evaluation benchmarks.

UL2R Training

Traditionally, most language models are pre-trained on either a causal language modeling objective that enables the model to predict the next word in a sequence (e.g., GPT-3 or PaLM) or a denoising objective, where the model learns to recover the original sentence from a corrupted sequence of words (e.g., T5). Although there are some tradeoffs between language modeling objectives, in that causal LMs are better at long-form generation and LMs trained on a denoising objective are better for fine-tuning, in prior work we demonstrated that a mixture-of-denoisers objective that includes both objectives results in better performance in both scenarios.
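To make the distinction between these objectives concrete, here is a minimal sketch of how a single text sequence can be turned into a training example under a causal LM objective, a span-corruption denoising objective, or a simple mixture of the two. The sentinel tokens, span positions, and two-way mixture are illustrative assumptions; the actual UL2 mixture-of-denoisers uses several carefully configured denoisers rather than this toy setup.

```python
import random

SENTINELS = ["<extra_id_0>", "<extra_id_1>"]  # illustrative T5-style sentinel tokens


def causal_lm_example(tokens):
    """Causal LM objective: the model predicts the continuation given a prefix."""
    split = len(tokens) // 2
    return {"inputs": tokens[:split], "targets": tokens[split:]}


def span_corruption_example(tokens, starts=(2, 7), span_len=3):
    """Denoising objective: contiguous spans are replaced by sentinels and must be reconstructed.

    For simplicity the corrupted spans are fixed and non-overlapping; real implementations
    sample span locations and lengths.
    """
    inputs, targets, prev = [], [], 0
    for sentinel, start in zip(SENTINELS, starts):
        inputs.extend(tokens[prev:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:start + span_len])
        prev = start + span_len
    inputs.extend(tokens[prev:])
    return {"inputs": inputs, "targets": targets}


def mixture_of_denoisers_example(tokens):
    """UL2-style mixture: each example is routed to one of several denoising modes."""
    mode = random.choice(["causal", "span"])  # the real mixture uses several denoiser configurations
    return causal_lm_example(tokens) if mode == "causal" else span_corruption_example(tokens)


if __name__ == "__main__":
    toks = "the quick brown fox jumps over the lazy dog and runs away".split()
    print(mixture_of_denoisers_example(toks))
```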

However, pre-training a large language model on a different objective from scratch can be computationally prohibitive. Hence, we propose UL2 Repair (UL2R), an additional stage of continued pre-training with the UL2 objective that only requires a relatively small amount of compute. We apply UL2R to PaLM and call the resulting new language model U-PaLM.

In empirical evaluations, we found that scaling curves improve substantially with only a small amount of UL2 training. For instance, we show that by using UL2R on the intermediate checkpoint of PaLM 540B, we reach the performance of the final PaLM 540B checkpoint while using 2x less compute (or a difference of 4.4 million TPUv4 hours). Naturally, applying UL2R to the final PaLM 540B checkpoint also leads to substantial improvements, as described in the paper.

Compute versus model performance of PaLM 540B and U-PaLM 540B on 26 NLP benchmarks (listed in Table 8 in the paper). U-PaLM 540B continues training PaLM for a very small amount of compute but provides a substantial gain in performance.

Another benefit that we observed from using UL2R is that on some tasks, performance is much better than that of models trained purely on the causal language modeling objective. For instance, there are many BIG-Bench tasks that have been described as “emergent abilities”, i.e., abilities that can only be observed in sufficiently large language models. Although the way that emergent abilities are most commonly found is by scaling up the size of the LM, we found that UL2R can actually elicit emergent abilities without increasing the scale of the LM.

For instance, in the Navigate task from BIG-Bench, which measures the model’s ability to perform state tracking, all models with less than 10²³ training FLOPs achieve approximately random performance, except U-PaLM, whose performance is more than 10 points above random. Another example is the Snarks task from BIG-Bench, which measures the model’s ability to detect sarcasm. Again, whereas all models with less than 10²⁴ training FLOPs achieve approximately random performance, U-PaLM achieves well above random, even at the 8B and 62B model sizes.

For two abilities from BIG-Bench that demonstrate emergent task performance, U-PaLM achieves emergence at a smaller model size due to its use of the UL2R objective.

Instruction Fine-Tuning

In our second paper, we explore instruction fine-tuning, which involves fine-tuning LMs on a collection of NLP datasets phrased as instructions. In prior work, we applied instruction fine-tuning to a 137B-parameter model on 62 NLP tasks, such as answering a trivia question, classifying the sentiment of a movie review, or translating a sentence to Spanish.

In this work we fine-tune a 540B parameter language model on more than 1.8K tasks. Moreover, whereas previous efforts only fine-tuned a LM with few-shot exemplars (e.g., MetaICL) or zero-shot without exemplars (e.g., FLAN, T0), we fine-tune on a combination of both. We also include chain of thought fine-tuning data, which enables the model to perform multi-step reasoning. We call our improved methodology “Flan”, for fine-tuning language models. Notably, even with fine-tuning on 1.8K tasks, Flan only uses a small portion of compute compared to pre-training (e.g., for PaLM 540B, Flan only requires 0.2% of the pre-training compute).

We fine-tune language models on 1.8K tasks phrased as instructions, and evaluate them on unseen tasks, which are not included in fine-tuning. We fine-tune both with and without exemplars (i.e., zero-shot and few-shot) and with and without chain of thought, enabling generalization across a range of evaluation scenarios.
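As a rough illustration of these formats, the sketch below renders one underlying task example in the zero-shot, few-shot, and chain-of-thought styles. The template wording and field names are our own assumptions, not the exact Flan templates.

```python
def format_example(question, answer, exemplars=None, chain_of_thought=None):
    """Render one training example as an instruction prompt.

    exemplars: optional list of (question, answer) pairs for the few-shot format.
    chain_of_thought: optional reasoning string inserted before the final answer.
    """
    parts = []
    if exemplars:  # few-shot: prepend worked examples before the actual question
        for q, a in exemplars:
            parts.append(f"Q: {q}\nA: {a}\n")
    parts.append(f"Q: {question}\nA:")
    prompt = "\n".join(parts)
    target = f" {chain_of_thought} The answer is {answer}." if chain_of_thought else f" {answer}"
    return {"input": prompt, "target": target}


# Zero-shot, few-shot, and chain-of-thought renderings of the same kind of task.
zero_shot = format_example("What is the capital of France?", "Paris")
few_shot = format_example("What is the capital of France?", "Paris",
                          exemplars=[("What is the capital of Japan?", "Tokyo")])
cot = format_example("If I have 3 apples and buy 2 more, how many do I have?", "5",
                     chain_of_thought="I start with 3 apples and add 2, giving 3 + 2 = 5.")
print(few_shot["input"])
```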

In the paper, we instruction fine-tune LMs of a range of sizes to investigate the joint effect of scaling both the size of the LM and the number of fine-tuning tasks. For instance, we use the PaLM class of LMs, which includes models with 8B, 62B, and 540B parameters. We evaluate our models on four challenging benchmark evaluation suites (MMLU, BBH, TyDiQA, and MGSM), and find that scaling both the number of parameters and the number of fine-tuning tasks improves performance on unseen tasks.

Both scaling up to a 540B parameter model and using 1.8K fine-tuning tasks improves the performance on unseen tasks. The y-axis is the normalized average over four evaluation suites (MMLU, BBH, TyDiQA, and MGSM).
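For intuition, a normalized average rescales each suite’s score before averaging so that no single benchmark dominates the aggregate. The exact normalization used in the paper may differ; the snippet below, which subtracts an assumed per-suite baseline, is only meant to illustrate the idea.

```python
def normalized_average(scores, baselines):
    """Average benchmark scores after normalizing each suite against an assumed baseline.

    scores and baselines map suite name -> accuracy; the normalization scheme here
    (score minus baseline) is an illustrative assumption, not the paper's definition.
    """
    return sum(scores[k] - baselines[k] for k in scores) / len(scores)


# Hypothetical numbers for the four suites, for illustration only.
print(normalized_average(
    {"MMLU": 0.70, "BBH": 0.55, "TyDiQA": 0.60, "MGSM": 0.45},
    {"MMLU": 0.25, "BBH": 0.25, "TyDiQA": 0.00, "MGSM": 0.00},
))
```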

In addition to better performance, instruction fine-tuning a LM enables it to respond to user instructions at inference time, without few-shot exemplars or prompt engineering. This makes LMs more user-friendly across a range of inputs. For instance, LMs without instruction fine-tuning can sometimes repeat the input or fail to follow instructions, but instruction fine-tuning mitigates such errors.

Our instruction–fine-tuned language model, Flan-PaLM, responds better to instructions compared to the PaLM model without instruction fine-tuning.

Putting Them Together

Finally, we show that UL2R and Flan can be combined to train the Flan-U-PaLM model. Since Flan uses new data from NLP tasks and enables zero-shot instruction following, we apply Flan as the second method after UL2R. We again evaluate on the four benchmark suites, and find that the Flan-U-PaLM model outperforms PaLM models with just UL2R (U-PaLM) or just Flan (Flan-PaLM). Further, Flan-U-PaLM achieves a new state-of-the-art on the MMLU benchmark with a score of 75.4% when combined with chain of thought and self-consistency.
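Self-consistency works by sampling several chain-of-thought completions at nonzero temperature and taking a majority vote over the final answers. The sketch below shows the idea; the sampling function is a hypothetical stand-in for however one decodes from the model, and the vote-counting details are an assumption for illustration.

```python
import random
from collections import Counter


def self_consistency(sample_completion, question, num_samples=8):
    """Majority vote over final answers extracted from sampled chain-of-thought completions.

    sample_completion(question) -> (reasoning_text, final_answer) is a hypothetical
    stand-in for sampling one chain-of-thought answer from the model.
    """
    answers = [sample_completion(question)[1] for _ in range(num_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / num_samples  # the voted answer and its vote share


# Usage with a toy sampler that occasionally makes an arithmetic slip.
def toy_sampler(question):
    return ("some reasoning...", random.choice(["17", "17", "17", "19"]))


print(self_consistency(toy_sampler, "What is 8 + 9?"))
```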

Combining UL2R and Flan (Flan-U-PaLM) leads to the best performance compared to just using UL2R (U-PaLM) or just Flan (Flan-PaLM). Performance is the normalized average over four evaluation suites (MMLU, BBH, TyDiQA, and MGSM).

Overall, UL2R and Flan are two complementary methods for improving pre-trained language models. UL2R adapts the LM to a mixture-of-denoisers objective using the same data, whereas Flan leverages training data from over 1.8K NLP tasks to teach the model to follow instructions. As LMs become even larger, techniques such as UL2R and Flan that improve general performance without large amounts of compute may become increasingly attractive.

Acknowledgements

It was a privilege to collaborate on these two papers with Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le, Mostafa Dehghani, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Ed H. Chi, Jeff Dean, Jacob Devlin, and Adam Roberts.

Google at NeurIPS 2022

This week marks the beginning of the 36th annual Conference on Neural Information Processing Systems (NeurIPS 2022), the biggest machine learning conference of the year, which is being held in New Orleans, LA. NeurIPS 2022 will be held in person with additional options for virtual attendees, and includes invited talks, demonstrations and presentations of some of the latest in machine learning research. This year, NeurIPS is also offering a new track, called Spotlight Papers, which will provide opportunities to highlight papers presented in prestigious journals that would otherwise not have been eligible for submission.

Google is proud to be a Diamond level sponsor of NeurIPS this year and will have a significant presence with more than 175 accepted papers, additionally contributing to and learning from the broader academic research community through numerous talks, posters, workshops, and tutorials. You can learn more about our work being presented in the list below (Google affiliations highlighted in bold).

Organizing Committee

General Chairs include: Sanmi Koyejo

Program Chairs include: Alekh Agarwal

Workshop Chairs include: Hanie Sedghi

Tutorial Chairs include: Adji Bousso Dieng, Jessica Schrouff

Affinity Workshop Chairs include: Adji Bousso Dieng, Jessica Schrouff

Program Committee, Senior Area Chairs include: Corinna Cortes, Claudio Gentile, Mohammad Ghavamzadeh, Amir Globerson, Elad Hazan, Katherine Heller, Satyen Kale, Been Kim, Sanjiv Kumar, Hugo Larochelle, Sergey Levine, Yishay Mansour, Mehryar Mohri, Tara Sainath, Dale Schuurmans, Daniel Tarlow

NeurIPS Foundation Board Secretary: Michael Mozer

NeurIPS Foundation Board Members include: Corinna Cortes, Isabelle Guyon, Sanmi Koyejo, Hugo Larochelle

NeurIPS Foundation Advisory Board Members include: Peter Bartlett, Zoubin Ghahramani, John C. Platt, Fernando Pereira, Dale Schuurmans

Keynote Speakers

The Data-Centric Era: How ML is Becoming an Experimental Science
Isabelle Guyon

The Forward-Forward Algorithm for Training Deep Neural Networks
Geoffrey Hinton

Outstanding Paper Award

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi

EXPO Day Workshops

Graph Neural Networks in Tensorflow: A Practical Guide
Workshop Organizers include: Bryan Perozzi, Sami Abu-el-Haija

A Hands-On Introduction to Tensorflow and Jax
Workshop Organizers include: Josh Gordon

Affinity Workshops

LatinX in AI (LXAI)
Platinum Sponsor
Networking & Social Chairs include: Andres Muñoz Medina
Program Committee includes: Johan Obando Ceron

Queer in AI
Panelists include: Sara Beery, Talia Ringer

Women in Machine Learning (WiML)
Platinum Sponsor
Workshop Organizers and Mentorship Chairs include: Beliz Gunel
Mentors include: Adam Roberts, Eleni Triantafillou, Zelda Mariet, Clara Hu, Rosanne Liu, Alekh Agarwal, Vinod Prabhakaran, Rose Yu, Katherine Heller

Workshops

New in ML
Workshop Organizers include: Isabelle Guyon

AI for Accelerated Materials Design (AI4Mat)
Workshop Organizers include: Benjamin Sanchez-Lengeling

All Things Attention: Bridging Different Perspectives on Attention
Invited Speakers and Panelists include: Vidhya Navalpakkam

Efficient Natural Language and Speech Processing (ENLSP-II): The Future of Pre-trained Models
Invited Speakers include: Tara Sainath, Anna Huang
Invited Panelists include: Mohammad Norouzi
Program Committee includes: Wenhu Chen

Federated Learning: Recent Advances and New Challenges
Program Committee includes: Kallista Bonawitz, Zachary Charles, Wenshuo Guo, Peter Kairouz, Zhaozhuo Xu, Zheng Xu

Gaussian Processes, Spatiotemporal Modeling, and Decision-Making Systems
Workshop Organizers include: Zi Wang
Invited Speakers include: Jasper Snoek, Carolina Osorio
Advisory Board includes: Zoubin Ghahramani

Has it Trained Yet? A Workshop for Algorithmic Efficiency in Practical Neural Network Training
Workshop Organizers include: Zachary Nado, George Dahl, Naman Agarwal, Aakanksha Chowdhery
Invited Speakers include: Aakanksha Chowdhery, Priya Goyal

Human in the Loop Learning (HiLL)
Workshop Organizers include: Fisher Yu, Vittorio Ferrari
Invited Speakers include: Dorsa Sadigh, Igor Mordatch, Ding Zhao

INTERPOLATE — First Workshop on Interpolation Regularizers and Beyond
Workshop Organizers include: Yann Dauphin
Invited Speakers include: Chelsea Finn
Panelists include: Chelsea Finn, Dustin Tran
Program Committee includes: Wang Chen, Kimin Lee

LaReL: Language and Reinforcement Learning
Invited Speakers include: Dorsa Sadigh, Igor Mordatch

Medical Imaging Meets NeurIPS
Program Committee includes: Chenyu You

Memory in Artificial and Real Intelligence (MemARI)
Program Committee includes: Benjamin Eysenbach, Otilia Stretcu

Meta-Learning
Workshop Organizers include: Eleni Triantafillou
Invited Speakers include: Lucas Beyer, Chelsea Finn
Program Committee includes: Ishita Dasgupta, Praneet Dutta, Benjamin Eysenbach, Maximilian Igl, Louis Kirsch, Parsa Mahmoudieh, Marc Pickett, Eleni Triantafillou

New Frontiers in Graph Learning (GLFrontiers)
Workshop Organizers include: Hanjun Dai

Offline Reinforcement Learning Workshop: Offline RL as a “Launchpad”
Workshop Organizers include: Rishabh Agarwal, Aviral Kumar, George Tucker
Invited Speakers include: Dorsa Sadigh

Score-Based Methods
Invited Speakers include: Mohammad Norouzi
Invited Panelists include: Jascha Sohl-Dickstein

Synthetic Data for Empowering ML Research
Invited Speakers include: Mehryar Mohri
Invited Panelists include: Katrina Ligett
Program Committee includes: Jinsung Yoon

Table Representation Learning
Workshop Organizers include: Pengcheng Yin
Invited Speakers include: Xinyun Chen, Carsten Binnig
Panelists include: Julian Eisenschlos
Program Committee includes: Wenhu Chen, Xinyun Chen, Beliz Gunel

A Causal View on Dynamical Systems
Program Committee includes: Rose Yu

Algorithmic Fairness Through the Lens of Causality and Privacy
Workshop Organizers include: Awa Dieng
Invited Speakers include: Nicolas Papernot
Roundtable Leads include: David Madras, Negar Rostamzadeh, Nyalleng Moroosi
Program Committee includes: Matt Kusner

Broadening Research Collaborations in ML
Workshop Organizers include: Rosanne Liu, Pablo Samuel Castro, Sunipa Dev

Decentralization and Trustworthy Machine Learning in Web3: Methodologies, Platforms, and Applications
Invited Speakers include: Peter Kairouz

Distribution Shifts (DistShift): Connecting Methods and Applications
Workshop Organizers include: Becca Roelofs, Chelsea Finn, Jacob Eisenstein, Pang Wei Koh
Invited Speakers include: Sarah Beery

Foundation Models for Decision Making
Workshop Organizers include: Sherry Yang, Yilun Du, Igor Mordatch, Shixiang Shane Gu, Ofir Nachum
Invited Speakers include: Dorsa Sadigh, Dale Schuurmans, Machel Reid
Program Committee includes: Bo Dai, Aleksandra Faust, Hiroki Furuta, Kati Goshvadi, Izzeddin Gur, Austin Huang, Kimin Lee, Kuang-Huei Lee, Lisa Lee, Yingjie Miao, Jordi Orbay, Ted Xiao

Gaze Meets ML
Program Committee includes: Peter Mattson, Mehdi Moradi

I Can’t Believe It’s Not Better: Understanding Deep Learning Through Empirical Falsification
Workshop Organizers include: Javier Antorán
Panelists include: Kevin Murphy

Interactive Learning for Natural Language Processing
Invited Speakers include: Anca Dragan
Program Committees include: Julia Kreutzer, Shunyu Yao

Machine Learning and the Physical Sciences
Workshop Organizers include: Adji Bousso Dieng
Invited Speakers include: Ekin Doğuş Çubuk

Machine Learning for Systems
Workshop Organizers include: Martin Maas, Azade Nova, Dan Zhang
Invited Speakers include: Jeff Dean
Program Committee includes: Milad Hashemi, Kevin Swersky

Machine Learning in Structural Biology
Invited Speakers include: David Fleet

MATH-AI: Toward Human-Level Mathematical Reasoning
Workshop Organizers include: Swaroop Mishra, Yuhuai Wu
Invited Speakers include: Talia Ringer

OPT 2022: Optimization for Machine Learning
Workshop Organizers include: Courtney Paquette

Reinforcement Learning for Real Life (RL4RealLife)
Workshop Organizers include: Minmin Chen
Invited Panelists include: Pablo Samuel Castro
Program Committee includes: Victor Carbune, Bo Chang, Yinlam Chow, Konstantina Christakopoulou, Bo Dai, Hanjun Dai, Aleksandra Faust, Joshua Greaves‎, Chih-wei Hsu, Rahul Kidambi, Srivatsan Krishnan, Iou-Jen Liu, Cong Lu, Jincheng Mei, Chao Qin

Self-Supervised Learning – Theory and Practice
Invited Speakers include: Mathilde Caron

Symmetry and Geometry in Neural Representations (NeurReps)
Invited Speakers include: Noah Shutty
Program Committee includes: Ondrej Biza, Noah Shutty

Temporal Graph Learning Workshop
Invited Speakers include: Mehran Kazemi

Transfer Learning for Natural Language Processing
Workshop Organizers include: Deepak Ramachandran, Sebastian Ruder
Invited Speakers include: Jonas Pfeiffer
Invited Debaters include: Ellie Pavlick
Program Committee includes: Patrick Fernandes, Jonas Pfeiffer, Jiao Sun, Tu Vu, Xinyi Wang, Xin Xu

Cultures of AI and AI for Culture
Workshop Organizers include: Rida Qadri, Fernando Diaz

Deep Reinforcement Learning Workshop
Workshop Organizers include: Karol Hausman, Ted Xiao, Zeyu Zheng
Invited Speakers include: Igor Mordatch
Advisory Board includes: Chelsea Finn

Empowering Communities: A Participatory Approach to AI for Mental Health
Program Committee includes: Diana Mincu, Subhrajit Roy, Martin Seneviratne

HCAI@NeurIPS 2022, Human Centered AI
Keynote Speaker includes: Fernanda Viegas

Learning Meaningful Representations of Life
Workshop Organizers include: Adji Bousso Dieng

Machine Learning for Creativity and Design
Workshop Organizers include: Yingtao Tian

Machine Learning Safety
Workshop Organizers include: Nicholas Carlini
Invited Speakers include: Dorsa Sadigh

Neuro Causal and Symbolic AI (nCSI)
Workshop Organizers include: Thomas Kipf

Robot Learning Workshop: Trustworthy Robotics
Workshop Organizers include: Alex Bewley, Jonathan Tompson
Invited Speakers include: Karol Hausman, Brian Ichter, Been Kim, Leila Takayama, Andy Zeng
Program Committee includes: Vincent Vanhoucke

The Symbiosis of Deep Learning and Differential Equations II
Workshop Organizers include: Winnie Xu
Invited Speakers include: Rose Yu

Tackling Climate Change with Machine Learning
Workshop Organizers include: Emma Strubell

Trustworthy and Socially Responsible Machine Learning
Invited Speakers include: Been Kim, Dorsa Sadigh, Milind Tambe

Vision Transformers: Theory and Applications
Invited Speakers include: Cordelia Schmid, Ming-Hsuan Yang

Tutorials

Advances in Bayesian Optimization
Tutorial Organizers include: Virginia Aglietti

Creative Culture and Machine Learning
Tutorial Organizers include: Negar Rostamzadeh

Fair and Socially Responsible ML for Recommendations: Challenges and Perspectives
Invited Panelists include: Fernando Diaz

Lifelong Learning Machines
Invited Panelists include: Christopher Summerfield

The Role of Meta-learning for Few-Shot Learning
Tutorial Organizers include: Eleni Triantafillou
Invited Panelists include: Neil Houlsby, Priyanka Agrawal

Competitions

NeurIPS 2022 Competition Track: Overview & Results
Invited Speakers include: Isabelle Guyon

Causal Insights for Learning Paths in Education
Competition Organizers include: Zichao (Jack) Wang

IGLU: Interactive Grounded Language Understanding in a Collaborative Environment
Competition Organizers include: Negar Arabzadeh

Cross-Domain MetaDL: Any-Way Any-Shot Learning Competition with Novel Datasets from Practical Domains
Competition Organizers include: Isabelle Guyon

Reconnaissance Blind Chess: An Unsolved Challenge for Multi-Agent Decision Making Under Uncertainty
Competition Organizers include: Bo Li

VisDA 2022 Challenge: Sim2Real Domain Adaptation for Industrial Recycling
Competition Organizers include: Dina Bashkirova

Spotlight Papers

CoPur: Certifiably Robust Collaborative Inference via Feature Purification
Jing Liu, Chulin Xie, Oluwasanmi O Koyejo, Bo Li

Machine Learning on Graphs: A Model and Comprehensive Taxonomy
Ines Chami*, Sami Abu-El-Haija, Bryan Perozzi, Christopher Ré, Kevin Murphy

Sparse Winning Tickets are Data-Efficient Image Recognizers
Mukund Varma T, Xuxi Chen, Zhenyu Zhang, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang

Federated Learning from Pre-trained Models: A Contrastive Learning Approach
Yue Tan, Guodong Long, Jie Ma, Lu Liu, Tianyi Zhou, Jing Jiang

Improving Multi-task Generalization via Regularizing Spurious Correlation
Ziniu Hu*, Zhe Zhao, Xinyang Yi, Tiansheng Yao, Lichan Hong, Yizhou Sun, Ed H. Chi

The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning
Yunhao Tang, Mark Rowland, Rémi Munos, Bernardo Ávila Pires, Will Dabney, Marc G. Bellemare

Residual Multiplicative Filter Networks for Multiscale Reconstruction
Shayan Shekarforoush, David B. Lindell, David J. Fleet, Marcus A Brubaker

Differentially Private Learning with Margin Guarantees
Raef Bassily, Mehryar Mohri, Ananda Theertha Suresh

Optimal Query Complexities for Dynamic Trace Estimation
David P. Woodruff*, Fred Zhang*, Qiuyi Zhang

Papers

From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent
Ayush Sekhari, Satyen Kale, Jason D. Lee, Chris De Sa, Karthik Sridharan

On the Global Convergence Rates of Decentralized Softmax Gradient Play in Markov Potential Games
Runyu Zhang, Jincheng Mei, Bo Dai, Dale Schuurmans, Na Li

Matryoshka Representation Learning
Aditya Kusupati, Gantavya Bhatt, Aniket Rege, Matthew Wallingford, Aditya Sinha, Vivek Ramanujan, William Howard-Snyder, Kaifeng Chen, Sham Kakade, Prateek Jain, Ali Farhadi

Efficient Risk-Averse Reinforcement Learning
Ido Greenberg, Yinlam Chow, Mohammad Ghavamzadeh, Shie Mannor

Operator Splitting Value Iteration
Amin Rakhsha, Andrew Wang, Mohammad Ghavamzadeh, Amir-massoud Farahmand

Cluster Randomized Designs for One-Sided Bipartite Experiments
Jennifer Brennan*, Vahab Mirrokni, Jean Pouget-Abadie

A Unified Sequence Interface for Vision Tasks
Ting Chen, Saurabh Saxena, Lala Li, Tsung-Yi Lin*, David J. Fleet, Geoffrey Hinton

Cryptographic Hardness of Learning Halfspaces with Massart Noise
Ilias Diakonikolas, Daniel M. Kane, Pasin Manurangsi, Lisheng Ren

Better Best of Both Worlds Bounds for Bandits with Switching Costs
Idan Amir, Guy Azov, Tomer Koren, Roi Livni

Fast Neural Kernel Embeddings for General Activations
Insu Han, Amir Zandieh, Jaehoon Lee, Roman Novak, Lechao Xiao, Amin Karbasi

Hierarchical Agglomerative Graph Clustering in Poly-Logarithmic Depth
Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi

Improving Zero-Shot Generalization in Offline Reinforcement Learning Using Generalized Similarity Functions
Bogdan Mazoure*, Ilya Kostrikov, Ofir Nachum, Jonathan Tompson

Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples
Maura Pintor, Luca Demetrio, Angelo Sotgiu, Ambra Demontis, Nicholas Carlini, Battista Biggio, Fabio Roli

Learning Energy Networks with Generalized Fenchel-Young Losses
Mathieu Blondel, Felipe Llinares-López, Robert Dadashi, Léonard Hussenot, Matthieu Geist

Learning Robust Dynamics Through Variational Sparse Gating
Arnav Kumar Jain, Shiva Kanth Sujit, Shruti Joshi, Vincent Michalski, Danijar Hafner, Samira Ebrahimi Kahou

Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures
Arnav Kumar Jain, Shiva Kanth Sujit, Shruti Joshi, Vincent Michalski, Danijar Hafner, Samira Ebrahimi Kahou

So3krates: Equivariant Attention for Interactions on Arbitrary Length-Scales in Molecular Systems
J. Thorben Frank, Oliver T. Unke, Klaus-Robert Müller

Spectral Bias in Practice: The Role of Function Frequency in Generalization
Sara Fridovich-Keil*, Raphael Gontijo-Lopes, Rebecca Roelofs

Delving into Out-of-Distribution Detection with Vision-Language Representations
Yifei Ming, Ziyang Cai, Jiuxiang Gu, Yiyou Sun, Wei Li, Yixuan Li

Path Independent Equilibrium Models Can Better Exploit Test-Time Computation
Cem Anil, Ashwini Pokle, Kaiqu Liang, Johannes Treutlein, Yuhuai Wu, Shaojie Bai, J. Zico Kolter, Roger Grosse

On Optimal Learning Under Targeted Data Poisoning
Steve Hanneke, Amin Karbasi, Mohammad Mahmoody, Idan Mehalel, Shay Moran

Learning With Little Mixing
Ingvar Ziemann, Stephen Tu

Block-Recurrent Transformers
DeLesley Hutchins, Imanol Schlag*, Yuhuai Wu, Ethan Dyer, Behnam Neyshabur

TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets
Chengrun Yang, Gabriel Bender, Hanxiao Liu, Pieter-Jan Kindermans, Madeleine Udell, Yifeng Lu, Quoc Le, Da Huang

Regret Bounds for Multilabel Classification in Sparse Label Regimes
Robert Busa-Fekete, Heejin Choi, Krzysztof Dembczynski, Claudio Gentile, Henry William Reeve, Balazs Szorenyi

Robust Reinforcement Learning Using Offline Data
Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh

Contrastive Learning as Goal-Conditioned Reinforcement Learning
Benjamin Eysenbach, Tianjun Zhang, Sergey Levine, Ruslan Salakhutdinov

Beyond Rewards: A Hierarchical Perspective on Offline Multiagent Behavioral Analysis
Shayegan Omidshafiei, Andrei Kapishnikov, Yannick Assogba, Lucas Dixon, Been Kim

Revisiting Neural Scaling Laws in Language and Vision
Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai

Polynomial Neural Fields for Subband Decomposition and Manipulation
Guandao Yang*, Sagie Benaim, Varun Jampani, Kyle Genova, Jonathan T. Barron, Thomas Funkhouser, Bharath Hariharan, Serge Belongie

First Is Better Than Last for Language Data Influence
Chih-Kuan Yeh, Ankur Taly, Mukund Sundararajan, Frederick Liu, Pradeep Ravikumar

The Privacy Onion Effect: Memorization Is Relative
Nicholas Carlini, Matthew Jagielski, Chiyuan Zhang, Nicolas Papernot, Andreas Terzis, Florian Tramer

Deep Hierarchical Planning from Pixels (see blog post)
Danijar Hafner, Kuang-Huei Lee, Ian Fischer, Pieter Abbeel

Discovered Policy Optimisation
Chris Lu, Jakub Grudzien Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, Jakob Foerster

Semi-supervised Active Linear Regression
Fnu Devvrit, Nived Rajaraman, Pranjal Awasthi

Pruning’s Effect on Generalization Through the Lens of Training and Regularization
Tian Jin, Daniel M. Roy, Michael Carbin, Jonathan Frankle, Gintare Karolina Dziugaite

Exploring Length Generalization in Large Language Models
Cem Anil*, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur

Fast Stochastic Composite Minimization and an Accelerated Frank-Wolfe Algorithm Under Parallelization
Benjamin Dubois-Taine, Francis Bach, Quentin Berthet, Adrien Taylor

Global Normalization for Streaming Speech Recognition in a Modular Framework
Ehsan Variani, Ke Wu, Michael Riley, David Rybach, Matt Shannon, Cyril Allauzen

Learning Predictions for Algorithms with Predictions
Mikhail Khodak, Maria-Florina Balcan, Ameet Talwalkar, Sergei Vassilvitskii

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts (see blog post)
Basil Mustafa, Carlos Riquelme, Joan Puigcerver, Rodolphe Jenatton, Neil Houlsby

Incrementality Bidding via Reinforcement Learning Under Mixed and Delayed Rewards
Ashwinkumar Badanidiyuru, Zhe Feng, Tianxi Li, Haifeng Xu*

Solving Quantitative Reasoning Problems with Language Models (see blog post)
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam Neyshabur, Guy Gur-Ari, Vedant Misra

Anonymized Histograms in Intermediate Privacy Models
Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi

Efficient and Stable Fully Dynamic Facility Location
Sayan Bhattacharya, Nikos Parotsidis, Silvio Lattanzi

Are All Losses Created Equal: A Neural Collapse Perspective
Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, Zhihui Zhu

Universal Rates for Interactive Learning
Steve Hanneke, Amin Karbasi, Shay Moran, Grigoris Velegkas

Nearly Optimal Algorithms for Linear Contextual Bandits with Adversarial Corruptions
Jiafan He, Dongruo Zhou, Tong Zhang, Quanquan Gu

Multiclass Learnability Beyond the PAC Framework: Universal Rates and Partial Concept Classes
Alkis Kalavasis, Grigoris Velegkas, Amin Karbasi

Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning
Cenk Baykal, Nishanth Dikkala, Rina Panigrahy, Cyrus Rashtchian, Xin Wang

Pre-trained Language Models for Interactive Decision-Making
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, Yuke Zhu

Submodular Maximization in Clean Linear Time
Wenxin Li, Moran Feldman, Ehsan Kazemi, Amin Karbasi

Reinforcement Learning with Logarithmic Regret and Policy Switches
Grigoris Velegkas, Zhuoran Yang, Amin Karbasi

Algorithms with Prediction Portfolios
Michael Dinitz, Sungjin Im, Thomas Lavastida, Benjamin Moseley, Sergei Vassilvitskii

Understanding and Improving Robustness of Vision Transformers Through Patch-Based Negative Augmentation
Yao Qin, Chiyuan Zhang, Ting Chen, Balaji Lakshminarayanan, Alex Beutel, Xuezhi Wang

Best of Both Worlds Model Selection
Aldo Pacchiano, Christoph Dann, Claudio Gentile

Fair Wrapping for Black-Box Predictions
Alexander Soen, Ibrahim Alabdulmohsin, Sanmi Koyejo, Yishay Mansour, Nyalleng Moorosi, Richard Nock, Ke Sun, Lexing Xie

A Reduction to Binary Approach for Debiasing Multiclass Datasets
Ibrahim Alabdulmohsin, Jessica Schrouff, Oluwasanmi Koyejo

Weighted Distillation with Unlabeled Examples
Fotis Iliopoulos, Vasilis Kontonis, Cenk Baykal, Gaurav Menghani, Khoa Trinh, Erik Vee

A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases
James Harrison, Luke Metz, Jascha Sohl-Dickstein

Post-hoc Estimators for Learning to Defer to an Expert
Harikrishna Narasimhan, Wittawat Jitkrittum, Aditya Krishna Menon, Ankit Singh Rawat, Sanjiv Kumar

Model-Based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity
Alekh Agarwal, Tong Zhang

On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL
Jinglin Chen, Aditya Modi, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal

Towards Learning Universal Hyperparameter Optimizers with Transformers (see blog post)
Yutian Chen, Xingyou Song, Chansoo Lee, Zi Wang, Qiuyi Zhang, David Dohan, Kazuya Kawakami, Greg Kochanski, Arnaud Doucet, Marc’aurelio Ranzato, Sagi Perel, Nando de Freitas

Reproducibility in Optimization: Theoretical Framework and Limits
Kwangjun Ahn*, Prateek Jain, Ziwei Ji, Satyen Kale, Praneeth Netrapalli, Gil I. Shamir

Confident Adaptive Language Modeling
Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler

Reinforcement Learning with Neural Radiance Fields
Danny Driess, Ingmar Schubert, Pete Florence, Yunzhu Li, Marc Toussaint

Invariant and Transportable Representations for Anti-Causal Domain Shifts
Yibo Jiang, Victor Veitch

Simple Mechanisms for Welfare Maximization in Rich Advertising Auctions
Gagan Aggarwal, Kshipra Bhawalkar, Aranyak Mehta, Divyarthi Mohan, Alexandros Psomas

STaR: Bootstrapping Reasoning with Reasoning
Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman

Stochastic Online Learning with Feedback Graphs: Finite-Time and Asymptotic Optimality
Teodor V. Marinov, Mehryar Mohri, Julian Zimmert

The Curse of Unrolling: Rate of Differentiating Through Optimization
Damien Scieur, Quentin Bertrand, Gauthier Gidel, Fabian Pedregosa

Visual Prompting via Image Inpainting
Amir Bar, Yossi Gandelsman, Trevor Darrell, Amir Globerson, Alexei A Efros

Multi-Class H-Consistency Bounds
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, Yutao Zhong

Anonymous Bandits for Multi-User Systems
Hossein Esfandiari, Vahab Mirrokni, Jon Schneider

Understanding the Eluder Dimension
Gene Li, Pritish Kamath, Dylan J. Foster, Nathan Srebro

Why So Pessimistic? Estimating Uncertainties for Offline RL Through Ensembles, and Why Their Independence Matters
Seyed Kamyar Seyed Ghasemipour, Shixiang Shane Gu, Ofir Nachum

A Best-of-Both-Worlds Algorithm for Bandits with Delayed Feedback
Saeed Masoudian, Julian Zimmert, Yevgeny Seldin

A Theoretical View on Sparsely Activated Networks
Cenk Baykal, Nishanth Dikkala, Rina Panigrahy, Cyrus Rashtchian, Xin Wang

Chain of Thought Prompting Elicits Reasoning in Large Language Models (see blog post)
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou

Decoupled Context Processing for Context Augmented Language Modeling
Zonglin Li, Ruiqi Guo, Sanjiv Kumar

Exploring Through Random Curiosity with General Value Functions
Aditya Ramesh, Louis Kirsch, Sjoerd van Steenkiste, Jürgen Schmidhuber

Object Scene Representation Transformer
Mehdi S. M. Sajjadi, Daniel Duckworth, Aravindh Mahendran, Sjoerd van Steenkiste, Filip Pavetić, Mario Lučić, Leonidas J. Guibas, Klaus Greff, Thomas Kipf

Joint Model-Policy Optimization of a Lower Bound for Model-Based RL
Benjamin Eysenbach, Alexander Khazatsky, Sergey Levine, Ruslan Salakhutdinov

A Fourier Approach to Mixture Learning
Mingda Qiao*, Guru Guruganesh, Ankit Singh Rawat, Avinava Dubey, Manzil Zaheer

Why Neural Networks Find Simple Solutions: The Many Regularizers of Geometric Complexity
Benoit Dherin, Michael Munn, Mihaela Rosca, David Barrett

Do Current Multi-task Optimization Methods in Deep Learning Even Help?
Derrick Xin, Behrooz Ghorbani, Ankush Garg, Orhan Firat, Justin Gilmer

Associating Objects and Their Effects in Video Through Coordination Games
Erika Lu, Forrester Cole, Weidi Xie, Tali Dekel, William Freeman, Andrew Zisserman, Michael Rubinstein

Increasing Confidence in Adversarial Robustness Evaluations
Roland S. Zimmermann*, Wieland Brendel, Florian Tramèr, Nicholas Carlini

The Role of Baselines in Policy Gradient Optimization
Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, Dale Schuurmans

Scaling Multimodal Pre-training via Cross-Modality Gradient Harmonization
Junru Wu, Yi Liang, Feng Han, Hassan Akbari, Zhangyang Wang, Cong Yu*

S3GC: Scalable Self-Supervised Graph Clustering
Fnu Devvrit*, Aditya Sinha, Inderjit Dhillon, Prateek Jain

Algorithms and Hardness for Learning Linear Thresholds from Label Proportions
Rishi Saket

ALMA: Hierarchical Learning for Composite Multi-Agent Tasks
Shariq Iqbal, Robby Costales, Fei Sha

DC-BENCH: Dataset Condensation Benchmark
Justin Cui, Ruochen Wang, Si Si, Cho-Jui Hsieh

Does GNN Pre-training Help Molecular Representation?
Ruoxi Sun, Hanjun Dai, Adams Yu

Drawing Out of Distribution with Neuro-Symbolic Generative Models
Yichao Liang, Joshua B. Tenenbaum, Tuan Anh Le, N. Siddharth

Mixture-of-Experts with Expert Choice Routing (see blog post)
Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew Dai, Zhifeng Chen, Quoc Le, James Laudon

Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback
Tiancheng Jin, Tal Lancewicki, Haipeng Luo, Yishay Mansour, Aviv Rosenberg

Precise Learning Curves and Higher-Order Scalings for Dot-Product Kernel Regression
Lechao Xiao, Jeffrey Pennington, Theodor Misiakiewicz, Hong Hu, Yue Lu

Rate-Optimal Online Convex Optimization in Adaptive Linear Control
Asaf Cassel, Alon Cohen, Tomer Koren

Private Isotonic Regression
Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi

Sketching Based Representations for Robust Image Classification with Provable Guarantees
Nishanth Dikkala, Sankeerth Rao Karingula, Raghu Meka, Jelani Nelson, Rina Panigrahy, Xin Wang

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
Elad Ben Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

Near-Optimal Private and Scalable k-Clustering
Vincent Cohen-Addad, Alessandro Epasto, Vahab Mirrokni, Shyam Narayanan*, Peilin Zhong

When Does Differentially Private Learning Not Suffer in High Dimensions?
Xuechen Li, Daogao Liu, Tatsunori Hashimoto, Huseyin A Inan, Janardhan Kulkarni, YinTat Lee, Abhradeep Guha Thakurta

End-to-End Learning to Index and Search in Large Output Spaces
Nilesh Gupta, Patrick H. Chen, Hsiang-Fu Yu, Cho-Jui Hsieh, Inderjit S. Dhillon

A Boosting Approach to Reinforcement Learning
Nataly Brukhim, Elad Hazan, Karan Singh

FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction
Samiul Alam, Luyang Liu, Ming Yan, Mi Zhang

Non-Convex Online Learning via Algorithmic Equivalence
Udaya Ghai, Zhou Lu, Elad Hazan

Is this the Right Neighborhood? Accurate and Query Efficient Model Agnostic Explanations
Amit Dhurandhar, Karthikeyan Natesan Ramamurthy, Karthikeyan Shanmugam

SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos
Gamaleldin F. Elsayed, Aravindh Mahendran, Sjoerd van Steenkiste, Klaus Greff, Michael C. Mozer, Thomas Kipf

UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes
Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, Neil Houlsby

Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions
Courtney Paquette, Elliot Paquette, Ben Adlam, Jeffrey Pennington

Multi-game Decision Transformers (see blog post)
Kuang-Huei Lee, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Winnie Xu, Sergio Guadarrama, Ian Fischer, Eric Jang, Henryk Michalewski, Igor Mordatch

Subsidiary Prototype Alignment for Universal Domain Adaptation
Jogendra Nath Kundu, Suvaansh Bhambri, Akshay Ravindra Kulkarni, Hiran Sarkar, Varun Jampani, Venkatesh Babu Radhakrishnan

SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections
Mark Boss*, Andreas Engelhardt*, Abhishek Kar, Yuanzhen Li, Deqing Sun, Jonathan T. Barron, Hendrik P. A. Lensch, Varun Jampani

Chefs’ Random Tables: Non-Trigonometric Random Features
Valerii Likhosherstov, Krzysztof Marcin Choromanski, Avinava Dubey, Frederick Liu, Tamas Sarlos, Adrian Weller

Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks
Mansheej Paul, Brett W Larsen, Surya Ganguli, Jonathan Frankle, Gintare Karolina Dziugaite

DP-PCA: Statistically Optimal and Differentially Private PCA
Xiyang Liu, Weihao Kong, Prateek Jain, Sewoong Oh

Emergent Communication: Generalization and Overfitting in Lewis Games
Mathieu Rita, Corentin Tallec, Paul Michel, Jean-Bastien Grill, Olivier Pietquin, Emmanuel Dupoux, Florian Strub

Handcrafted Backdoors in Deep Neural Networks
Sanghyun Hong, Nicholas Carlini, Alexey Kurakin

I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification
Muhammad Ferjad Naeem, Yongqin Xian, Luc Van Gool, Federico Tombari

Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams
Sergey Denisov, Brendan McMahan, Keith Rush, Adam Smith, Abhradeep Guha Thakurta

Optimal Scaling for Locally Balanced Proposals in Discrete Spaces
Haoran Sun*, Hanjun Dai, Dale Schuurmans

Near-Optimal Correlation Clustering with Privacy
Vincent Cohen-Addad, Chenglin Fan, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers
Albert Q. Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, Mateja Jamnik

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
Felix Chern, Blake Hechtman, Andy Davis, Ruiqi Guo, David Majnemer, Sanjiv Kumar

When Does Dough Become a Bagel? Analyzing the Remaining Mistakes on ImageNet
Vijay Vasudevan, Benjamin Caine, Raphael Gontijo-Lopes, Sara Fridovich-Keil, Rebecca Roelofs

DASCO: Dual-Generator Adversarial Support Constrained Offline Reinforcement Learning
Quan Vuong, Aviral Kumar, Sergey Levine, Yevgen Chebotar

A Characterization of Semi-Supervised Adversarially Robust PAC Learnability
Idan Attias, Steve Hanneke, Yishay Mansour

Back Razor: Memory-Efficient Transfer Learning by Self-Sparsified Backpropagation
Ziyu Jiang, Xuxi Chen, Xueqin Huang, Xianzhi Du, Denny Zhou, Zhangyang Wang

Subquadratic Kronecker Regression with Applications to Tensor Decomposition
Matthew Fahrbach, Gang Fu, Mehrdad Ghadiri

Zero-Shot Transfer Learning Within a Heterogeneous Graph via Knowledge Transfer Networks
Minji Yoon*, John Palowitch, Dustin Zelle, Ziniu Hu*, Ruslan Salakhutdinov, Bryan Perozzi

Differentially Private Graph Learning via Sensitivity-Bounded Personalized PageRank
Alessandro Epasto, Vahab Mirrokni, Bryan Perozzi, Anton Tsitsulin, Peilin Zhong

Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress (see blog post)
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron Courville, Marc G. Bellemare

Private and Communication-Efficient Algorithms for Entropy Estimation
Gecia Bravo-Hermsdorff, Robert Busa-Fekete, Mohammad Ghavamzadeh, Andres Munoz Medina, Umar Syed

Oracle Inequalities for Model Selection in Offline Reinforcement Learning
Jonathan Lee, George Tucker, Ofir Nachum, Bo Dai, Emma Brunskill

Diagnosing Failures of Fairness Transfer Across Distribution Shift in Real-World Medical Settings
Jessica Schrouff*, Natalie Harris, Oluwasanmi O Koyejo, Ibrahim Alabdulmohsin, Eva Schnider*, Krista Opsahl-Ong, Alexander Brown, Subhrajit Roy, Diana Mincu, Christina Chen, Awa Dieng, Yuan Liu, Vivek Natarajan, Alan Karthikesalingam, Katherine A Heller, Silvia Chiappa, Alexander D’Amour

LASSIE: Learning Articulated Shapes from Sparse Image Ensemble via 3D Part Discovery
Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani

Patching Open-Vocabulary Models by Interpolating Weights
Gabriel Ilharco, Mitchell Wortsman, Samir Yitzhak Gadre, Shuran Song, Hannaneh Hajishirzi, Simon Kornblith, Ali Farhadi, Ludwig Schmidt

TUSK: Task-Agnostic Unsupervised Keypoints
Yuhe Jin, Weiwei Sun, Jan Hosang, Eduard Trulls, Kwang Moo Yi

Active Learning of Classifiers with Label and Seed Queries
Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, Andrea Paudice, Maximilian Thiessen

Autoformalization with Large Language Models
Yuhuai Wu, Albert Q. Jiang, Wenda Li, Markus N. Rabe, Charles Staats, Mateja Jamnik, Christian Szegedy

Benign Underfitting of Stochastic Gradient Descent
Tomer Koren, Roi Livni, Yishay Mansour, Uri Sherman

Chain of Thought Imitation with Procedure Cloning
Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum

Efficient and Modular Implicit Differentiation
Mathieu Blondel, Quentin Berthet, Marco Cuturi, Roy Frostig, Stephan Hoyer, Felipe Llinares-López, Fabian Pedregosa, Jean-Philippe Vert

Insights into Pre-training via Simpler Synthetic Tasks
Yuhuai Wu, Felix Li, Percy Liang

Self-Supervised Learning with an Information Maximization Criterion
Serdar Ozsoy, Shadi Hamdan, Sercan Ö. Arik, Deniz Yuret, Alper T. Erdogan

Trimmed Maximum Likelihood Estimation for Robust Generalized Linear Model
Weihao Kong, Rajat Sen, Pranjal Awasthi, Abhimanyu Das

Using Embeddings for Causal Estimation of Peer Influence in Social Networks
Irina Cristali, Victor Veitch

VCT: A Video Compression Transformer
Fabian Mentzer, George Toderici, David Minnen, Sung-Jin Hwang, Sergi Caelles, Mario Lucic, Eirikur Agustsson

Video Diffusion Models
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet

Large Language Models are Zero-Shot Reasoners
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa

Improved Coresets for Euclidean k-Means
Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, Omar Ali Sheikh-Omar

On the Adversarial Robustness of Mixture of Experts
Joan Puigcerver, Rodolphe Jenatton, Carlos Riquelme Ruiz, Pranjal Awasthi, Srinadh Bhojanapalli

Stars: Tera-Scale Graph Building for Clustering and Learning
CJ Carey, Jonathan Halcrow, Rajesh Jayaram, Vahab Mirrokni, Warren Schudy, Peilin Zhong

VER: Scaling On-Policy RL Leads to the Emergence of Navigation in Embodied Rearrangement
Erik Wijmans, Irfan Essa, Dhruv Batra

TaSIL: Taylor Series Imitation Learning
Daniel Pfrommer, Thomas TCK Zhang, Stephen Tu, Nikolai Matni

RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks
Leo Kozachkov, Michaela M Ennis, Jean-Jacques Slotine

Integral Probability Metrics PAC-Bayes Bounds
Ron Amit, Baruch Epstein, Shay Moran, Ron Meir

D2NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video
Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, Forrester Cole, Cengiz Oztireli

Posted Pricing and Dynamic Prior-Independent Mechanisms with Value Maximizers
Yuan Deng, Vahab Mirrokni, Hanrui Zhang

Transformer Memory as a Differentiable Search Index
Yi Tay, Vinh Q. Tran, Mostafa Dehghani, Jianmo Ni, Dara Bahri, Harsh Mehta, Zhen Qin, Kai Hui, Zhe Zhao, Jai Gupta, Tal Schuster, William W. Cohen, Donald Metzler



*Work done while at Google.  

Conversation Summaries in Google Chat

Information overload is a significant challenge for many organizations and individuals today. It can be overwhelming to keep up with incoming chat messages and documents that arrive in our inboxes every day. This has been exacerbated by the increase in virtual work and remains a challenge as many teams transition to a hybrid work environment with a mix of those working both virtually and in an office. One solution that can address information overload is summarization — for example, to help users improve their productivity and better manage so much information, we recently introduced auto-generated summaries in Google Docs.

Today, we are excited to introduce conversation summaries in Google Chat for messages in Spaces. When these summaries are available, a card with automatically generated summaries is shown as users enter Spaces with unread messages. The card includes a list of summaries for the different topics discussed in Spaces. This feature is enabled by our state-of-the-art abstractive summarization model, Pegasus, which generates useful and concise summaries for chat conversations, and is currently available to selected premium Google Workspace business customers.

Conversation summaries provide a helpful digest of conversations in Spaces, allowing users to quickly catch up on unread messages and navigate to the most relevant threads.

Conversation Summarization Modeling

The goal of text summarization is to provide helpful and concise summaries for different types of text, such as documents, articles, or spoken conversations. A good summary covers the key points succinctly, and is fluent and grammatically correct. One approach to summarization is to extract key parts from the text and concatenate them together into a summary (i.e., extractive summarization). Another approach is to use natural language generation (NLG) techniques to summarize using novel words and phrases not necessarily present in the original text. This is referred to as abstractive summarization and is considered closer to how a person would generally summarize text. A main challenge with abstractive summarization, however, is that it sometimes struggles to generate accurate and grammatically correct summaries, especially in real world applications.
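To make the extractive/abstractive distinction concrete, here is a minimal sketch: an extractive summarizer that scores and concatenates existing sentences with a crude word-frequency heuristic, next to the call shape of an abstractive seq2seq model. The scoring heuristic, the example sentences, and the Hugging Face-style `generate` interface are illustrative assumptions, not the production system.

```python
from collections import Counter


def extractive_summary(sentences, max_sentences=2):
    """Extractive: select existing sentences, scored by how many frequent document words they contain."""
    freqs = Counter(word for s in sentences for word in s.lower().split())

    def score(sentence):
        words = sentence.lower().split()
        return sum(freqs[w] for w in words) / (len(words) + 1)

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(top)


def abstractive_summary(model, tokenizer, text, max_length=64):
    """Abstractive: a seq2seq model writes new sentences not necessarily found in the input.

    Assumes a Hugging Face-style interface (e.g., a Pegasus checkpoint fine-tuned for
    summarization); the production model and serving API differ.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    summary_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


# The extractive path runs on its own with toy sentences:
sentences = [
    "The team discussed the launch timeline for the new feature.",
    "Alice said the launch is blocked on a pending security review.",
    "Bob volunteered to follow up with the security team this week.",
]
print(extractive_summary(sentences))
```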

ForumSum Dataset

The majority of abstractive summarization datasets and research focuses on single-speaker text documents, like news and scientific articles, mainly due to the abundance of human-written summaries for such documents. On the other hand, datasets of human-written summaries for other types of text, like chat or multi-speaker conversations, are very limited.

To address this we created ForumSum, a diverse and high-quality conversation summarization dataset with human-written summaries. The conversations in the dataset are collected from a wide variety of public internet forums, and are cleaned up and filtered to ensure high quality and safe content (more details in the paper).

An example from the ForumSum dataset.

Each utterance in the conversation starts on a new line and contains an author name and message text separated by a colon. Human annotators are then given detailed instructions to write a 1-3 sentence summary of the conversation. These instructions went through multiple iterations to ensure annotators wrote high quality summaries. We have collected summaries for over six thousand conversations, with an average of more than 6 speakers and 10 utterances per conversation. ForumSum provides quality training data for the conversation summarization problem: it has a variety of topics, number of speakers, and number of utterances commonly encountered in a chat application.
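Below is a minimal sketch of what a ForumSum-style training example might look like and how the "author: message" lines could be parsed into (author, message) pairs. The example conversation, summary, and field names are our own illustrative assumptions; the released dataset's exact serialization may differ.

```python
example = {
    "conversation": (
        "alice: has anyone tried the new build on staging?\n"
        "bob: yes, the deploy went fine but the login page is slow\n"
        "carol: same here, I think it's the new auth check\n"
        "bob: filing a bug, will link it in this thread"
    ),
    "summary": "Alice asked about the new staging build; Bob and Carol reported that the "
               "login page is slow, likely due to the new auth check, and Bob is filing a bug.",
}


def parse_utterances(conversation: str):
    """Split a conversation into (author, message) pairs, one utterance per line."""
    pairs = []
    for line in conversation.splitlines():
        author, _, message = line.partition(": ")
        pairs.append((author, message))
    return pairs


print(parse_utterances(example["conversation"]))
```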

Conversation Summarization Model Design

As we have written previously, the Transformer is a popular model architecture for sequence-to-sequence tasks, like abstractive summarization, where the inputs are the document words and the outputs are the summary words. Pegasus combined transformers with self-supervised pre-training customized for abstractive summarization, making it a great model choice for conversation summarization. First, we fine-tune Pegasus on the ForumSum dataset where the input is the conversation words and the output is the summary words. Second, we use knowledge distillation to distill the Pegasus model into a hybrid architecture of a transformer encoder and a recurrent neural network (RNN) decoder. The resulting model has lower latency and memory footprint while maintaining similar quality as the Pegasus model.
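A rough sketch of the distillation step, assuming a PyTorch-style setup: the student (the smaller Transformer-encoder/RNN-decoder model) is trained both to predict the gold summary tokens and to match the frozen Pegasus teacher's per-token distributions. The temperature, loss weighting, and tensor shapes are illustrative assumptions rather than the production recipe.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, gold_ids, temperature=2.0, alpha=0.5):
    """Combine cross-entropy on gold summaries with a KL term toward the teacher's distribution.

    Shapes: logits are (batch, seq_len, vocab); gold_ids is (batch, seq_len).
    The temperature and alpha values are illustrative assumptions.
    """
    ce = F.cross_entropy(student_logits.flatten(0, 1), gold_ids.flatten())
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1 - alpha) * kl


# Toy check with random tensors standing in for real model outputs.
B, T, V = 2, 5, 100
student = torch.randn(B, T, V, requires_grad=True)
teacher = torch.randn(B, T, V)
gold = torch.randint(0, V, (B, T))
loss = distillation_loss(student, teacher, gold)
loss.backward()
print(float(loss))
```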

Quality and User Experience

A good summary captures the essence of the conversation while being fluent and grammatically correct. Based on human evaluation and user feedback, we learned that the summarization model generates useful and accurate summaries most of the time. But occasionally the model generates low quality summaries. After looking into issues reported by users, we found that there are two main types of low quality summaries. The first one is misattribution, when the model confuses which person or entity said or performed a certain action. The second one is misrepresentation, when the model’s generated summary misrepresents or contradicts the chat conversation.

To address low quality summaries and improve the user experience, we have made progress in several areas:

  1. Improving ForumSum: While ForumSum provides a good representation of chat conversations, we noticed certain patterns and language styles in Google Chat conversations that differ from ForumSum, e.g., how users mention other users and the use of abbreviations and special symbols. After exploring examples reported by users, we concluded that these out-of-distribution language patterns contributed to low quality summaries. To address this, we first performed data formatting and clean-ups to reduce mismatches between chat and ForumSum conversations whenever possible. Second, we added more training data to ForumSum to better represent these style mismatches. Collectively, these changes resulted in reduction of low quality summaries.
  2. Controlled triggering: To make sure summaries bring the most value to our users, we first need to make sure that the chat conversation is worthy of summarization. For example, we found that there is less value in generating a summary when the user is actively engaged in a conversation and does not have many unread messages, or when the conversation is too short.
  3. Detecting low quality summaries: While the two methods above limited low quality and low value summaries, we still developed methods to detect and abstain from showing such summaries to the user when they are generated. These are a set of heuristics and models to measure the overall quality of summaries and whether they suffer from misattribution or misrepresentation issues.

Finally, while the hybrid model provided significant performance improvements, the latency to generate summaries was still noticeable to users when they opened Spaces with unread messages. To address this issue, we instead generate and update summaries whenever there is a new message sent, edited or deleted. Then summaries are cached ephemerally to ensure they surface smoothly when users open Spaces with unread messages.
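A simplified sketch of that serving pattern: regenerate and cache summaries on message events so that opening a Space only reads from the cache. The class and method names are hypothetical, and the triggering checks here (minimum conversation length, unread count, a cache TTL) are placeholder stand-ins for the richer heuristics described above.

```python
import time


class ConversationSummaryCache:
    """Keep a per-space summary fresh by updating on message events, not on page open."""

    def __init__(self, summarize_fn, min_messages=5, ttl_seconds=24 * 3600):
        self.summarize_fn = summarize_fn      # e.g., the distilled summarization model
        self.min_messages = min_messages      # skip very short conversations
        self.ttl_seconds = ttl_seconds        # summaries are cached only ephemerally
        self._cache = {}                      # space_id -> (summary, timestamp)

    def on_message_event(self, space_id, messages):
        """Called whenever a message is sent, edited, or deleted in a space."""
        if len(messages) < self.min_messages:
            self._cache.pop(space_id, None)   # not worth summarizing
            return
        self._cache[space_id] = (self.summarize_fn(messages), time.time())

    def get_summary(self, space_id, unread_count):
        """Called when a user opens a space; returns a cached summary or None."""
        if unread_count == 0:                 # user is caught up, little value in a summary
            return None
        entry = self._cache.get(space_id)
        if entry is None or time.time() - entry[1] > self.ttl_seconds:
            return None
        return entry[0]


# Usage with a stand-in summarizer:
cache = ConversationSummaryCache(lambda msgs: f"{len(msgs)} messages about the weekly launch.")
cache.on_message_event("space-1", [f"msg {i}" for i in range(8)])
print(cache.get_summary("space-1", unread_count=8))
```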

Conclusion and Future Work

We are excited to apply state-of-the-art abstractive summarization models to help our Workspace users improve their productivity in Spaces. While this is great progress, we believe there are many opportunities to further improve the experience and the overall quality of summaries. Future directions we are exploring include better modeling and summarizing entangled conversations that include multiple topics, and developing metrics that better measure the factual consistency between chat conversations and summaries.

Acknowledgements

The authors would like to thank the many people across Google that contributed to this work: Ahmed Chowdhury, Alejandro Elizondo, Anmol Tukrel, Benjamin Lee, Chao Wang, Chris Carroll, Don Kim, Jackie Tsay, Jennifer Chou, Jesse Sliter, John Sipple, Kate Montgomery, Maalika Manoharan, Mahdis Mahdieh, Mia Chen, Misha Khalman, Peter Liu, Robert Diersing, Sarah Read, Winnie Yeung, Yao Zhao, and Yonghui Wu.

But what is a convolution?

The Data Cards Playbook: A Toolkit for Transparency in Dataset Documentation

As machine learning (ML) research moves toward large-scale models capable of numerous downstream tasks, a shared understanding of a dataset’s origin, development, intent, and evolution becomes increasingly important for the responsible and informed development of ML models. However, knowledge about datasets, including use and implementations, is often distributed across teams, individuals, and even time. Earlier this year at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT), we published Data Cards, a dataset documentation framework aimed at increasing transparency across dataset lifecycles. Data Cards are transparency artifacts that provide structured summaries of ML datasets with explanations of processes and rationale that shape the data and describe how the data may be used to train or evaluate models. At minimum, Data Cards include the following: (1) upstream sources, (2) data collection and annotation methods, (3) training and evaluation methods, (4) intended use, and (5) decisions affecting model performance.

In practice, two critical factors determine the success of a transparency artifact: the ability to identify the information that decision-makers use, and the establishment of the processes and guidance needed to acquire that information. We started to explore this idea in our paper with three “scaffolding” frameworks designed to adapt Data Cards to a variety of datasets and organizational contexts. These frameworks helped us create boundary infrastructures, which are the processes and engagement models that complement the technical and functional infrastructure necessary to communicate information between communities of practice. Boundary infrastructures enable dataset stakeholders to find common ground used to provide diverse input into decisions for the creation, documentation, and use of datasets.

Today, we introduce the Data Cards Playbook, a self-guided toolkit for a variety of teams to navigate transparency challenges with their ML datasets. The Playbook applies a human-centered design approach to documentation — from planning a transparency strategy and defining the audience to writing reader-centric summaries of complex datasets — to ensure that the usability and utility of the documented datasets are well understood. We’ve created participatory activities to navigate typical obstacles in setting up a dataset transparency effort, frameworks that can scale data transparency to new data types, and guidance that researchers, product teams and companies can use to produce Data Cards that reflect their organizational principles.

The Data Cards Playbook incorporates the latest in fairness, accountability, and transparency research.

The Data Cards Playbook

We created the Playbook using a multi-pronged approach that included surveys, artifact analysis, interviews, and workshops. We studied what Googlers wanted to know about datasets and models, and how they used that information in their day-to-day work. Over the past two years, we deployed templates for transparency artifacts used by fifteen teams at Google, and when bottlenecks arose, we partnered with these teams to determine appropriate workarounds. We then created over twenty Data Cards that describe image, language, tabular, video, audio, and relational datasets in production settings, some of which are now available on GitHub. This multi-faceted approach provided insights into the documentation workflows, collaborative information-gathering practices, information requests from downstream stakeholders, and review and assessment practices for each Google team.

Moreover, we spoke with design, policy, and technology experts across the industry and academia to get their unique feedback on the Data Cards we created. We also incorporated our learnings from a series of workshops at ACM FAccT in 2021. Within Google, we evaluated the effectiveness and scalability of our solutions with ML researchers, data scientists, engineers, AI ethics reviewers, product managers, and leadership. In the Data Cards Playbook, we’ve translated successful approaches into repeatable practices that can easily be adapted to unique team needs.

Activities, Foundations, and Transparency Patterns

The Data Cards Playbook is modeled after sprints and co-design practices, so cross-functional teams and their stakeholders can work together to define transparency with an eye for real-world problems they experience when creating dataset documentation and governance solutions. The thirty-three available Activities invite broad, critical perspectives from a wide variety of stakeholders, so Data Cards can be useful for decisions across the dataset lifecycle. We partnered with researchers from the Responsible AI team at Google to create activities that can reflect considerations of fairness and accountability. For example, we’ve adapted Evaluation Gaps in ML practices into a worksheet for more complete dataset documentation.

Download readily-available activity templates to use the Data Cards Playbook in your organization.

We’ve formed Transparency Patterns with evidence-based guidance to help anticipate challenges faced when producing transparent documentation, offer best practices that improve transparency, and make Data Cards useful for readers from different backgrounds. The challenges and their workarounds are based on data and insights from Googlers, industry experts, and academic research.

Patterns help unblock teams by recommending practices, cautioning against common pitfalls, and suggesting alternatives to roadblocks.

The Playbook also includes Foundations, which are scalable concepts and frameworks that explore fundamental aspects of transparency as new contexts of data modalities and ML arise. Each Foundation supports different product development stages and includes key takeaways, actions for teams, and handy resources.

Playbook Modules

The Playbook is organized into four modules: (1) Ask, (2) Inspect, (3) Answer, and (4) Audit. Each module contains a growing compendium of materials teams can use within their workflows to tackle transparency challenges that frequently co-occur. Since Data Cards were created with scalability and extensibility in mind, modules leverage divergence-convergence thinking that teams may already use, so documentation isn’t an afterthought. The Ask and Inspect modules help create and evaluate Data Card templates for organizational needs and principles. The Answer and Audit modules help data teams complete the templates and evaluate the resulting Data Cards.

In Ask, teams define transparency and optimize their dataset documentation for cross-functional decision-making. Participatory activities create opportunities for Data Card readers to have a say in what constitutes transparency in the dataset’s documentation. These address specific challenges and are rated for different intensities and durations so teams can mix-and-match activities around their needs.

The Inspect module contains activities to identify gaps and opportunities in dataset transparency and processes from user-centric and dataset-centric perspectives. It supports teams in refining, validating, and operationalizing Data Card templates across an organization so readers can arrive at reasonable conclusions about the datasets described.

The Answer module contains transparency patterns and dataset-exploration activities to answer challenging and ambiguous questions. Topics covered include preparing for transparency, writing reader-centric summaries in documentation, unpacking the usability and utility of datasets, and maintaining a Data Card over time.

The Audit module helps data teams and organizations set up processes to evaluate completed Data Cards before they are published. It also contains guidance to measure and track how a transparency effort for multiple datasets scales within organizations.

In Practice

A data operations team at Google used an early version of the Lenses and Scopes Activities from the Ask module to create a customized Data Card template. Interestingly, we saw them use this template across their workflow until datasets were handed off. They used Data Cards to take dataset requests from research teams, track the various processes used to create the datasets, collect metadata from vendors responsible for annotations, and manage approvals. Their experiences of iterating with experts and managing updates are reflected in our Transparency Patterns.

Another data governance group used a more advanced version of the activities to interview stakeholders for their ML health-related initiative. Using these descriptions, they identified the stakeholders with whom to co-create their Data Card schema. Voting on Lenses was used to rule out typical documentation questions and to identify atypical documentation needs specific to their data type that are important for decisions frequently made by ML leadership and tactical roles within their team. These questions were then used to customize existing metadata schemas in their data repositories.

Conclusion

We present the Data Cards Playbook, a continuous and contextual approach to dataset transparency that deliberately considers all relevant materials and contexts. With this, we hope to establish and promote practice-oriented foundations for transparency to pave the path for researchers to develop ML systems and datasets that are responsible and benefit society.

In addition to the four Playbook modules described, we’re also open-sourcing a card builder, which generates interactive Data Cards from a Markdown file. You can see the builder in action in the GEM Benchmark project’s Data Cards. The Data Cards created were a result of activities from this Playbook, in which the GEM team identified improvements across all dimensions, and created an interactive collection tool designed around scopes.

We acknowledge that this is not a comprehensive solution for fairness, accountability, or transparency in itself. We’ll continue to improve the Playbook using lessons learned. We hope the Data Cards Playbook can become a robust platform for collaboratively advancing transparency research, and invite you to make this your own.

Acknowledgements

This work was done in collaboration with Reena Jana, Vivian Tsai, and Oddur Kjartansson. We want to thank Donald Gonzalez, Dan Nanas, Parker Barnes, Laura Rosenstein, Diana Akrong, Monica Caraway, Ding Wang, Danielle Smalls, Aybuke Turker, Emily Brouillet, Andrew Fuchs, Sebastian Gehrmann, Cassie Kozyrkov, Alex Siegman, and Anthony Keene for their immense contributions; and Meg Mitchell and Timnit Gebru for championing this work.

We also want to thank Adam Boulanger, Lauren Wilcox, Roxanne Pinto, Parker Barnes, and Ayça Çakmakli for their feedback; Tulsee Doshi, Dan Liebling, Meredith Morris, Lucas Dixon, Fernanda Viegas, Jen Gennai, and Marian Croak for their support. This work would not have been possible without our workshop and study participants, and numerous partners, whose insights and experiences have shaped this Playbook.

Categories
Offsites

Mixture-of-Experts with Expert Choice Routing

The capacity of a neural network to absorb information is limited by the number of its parameters, and as a consequence, finding more effective ways to increase model parameters has become a trend in deep learning research. Mixture-of-experts (MoE), a type of conditional computation where parts of the network are activated on a per-example basis, has been proposed as a way of dramatically increasing model capacity without a proportional increase in computation. In sparsely-activated variants of MoE models (e.g., Switch Transformer, GLaM, V-MoE), a subset of experts is selected on a per-token or per-example basis, thus creating sparsity in the network. Such models have demonstrated better scaling in multiple domains and better retention capability in a continual learning setting (e.g., Expert Gate). However, a poor expert routing strategy can cause certain experts to be under-trained, leaving them either under- or over-specialized.

In “Mixture-of-Experts with Expert Choice Routing”, presented at NeurIPS 2022, we introduce a novel MoE routing algorithm called Expert Choice (EC). We discuss how this novel approach can achieve optimal load balancing in an MoE system while allowing heterogeneity in token-to-expert mapping. Compared to token-based routing and other routing methods in traditional MoE networks, EC demonstrates very strong training efficiency and downstream task scores. Our method resonates with one of the visions for Pathways, which is to enable heterogeneous mixture-of-experts via Pathways MPMD (multi program, multi data) support.

Overview of MoE Routing

MoE operates by adopting a number of experts, each as a sub-network, and activating only one or a few experts for each input token. A gating network must be chosen and optimized in order to route each token to the most suited expert(s). Depending on how tokens are mapped to experts, MoE can be sparse or dense. Sparse MoE only selects a subset of experts when routing each token, reducing computational cost as compared to a dense MoE. For example, recent work has implemented sparse routing via k-means clustering, linear assignment to maximize token-expert affinities, or hashing. Google also recently announced GLaM and V-MoE, both of which advance the state of the art in natural language processing and computer vision via sparsely gated MoE with top-k token routing, demonstrating better performance scaling with sparsely activated MoE layers. Many of these prior works used a token choice routing strategy in which the routing algorithm picks the best one or two experts for each token.

Token Choice Routing. The routing algorithm picks the top-1 or top-2 experts with highest affinity scores for each token. The affinity scores can be trained together with model parameters.

The independent token choice approach often leads to an imbalanced load across experts and to under-utilization. In order to mitigate this, previous sparsely gated networks introduced additional auxiliary losses as regularization to prevent too many tokens from being routed to a single expert, but the effectiveness was limited. As a result, token choice routing needs to overprovision expert capacity by a significant margin (2x–8x of the calculated capacity) to avoid dropping tokens when there is a buffer overflow.

In addition to load imbalance, most prior works allocate a fixed number of experts to each token using a top-k function, regardless of the relative importance of different tokens. We argue that different tokens should be received by a variable number of experts, conditioned on token importance or difficulty.

Expert Choice Routing

To address the above issues, we propose a heterogeneous MoE that employs the expert choice routing method illustrated below. Instead of having tokens select the top-k experts, the experts with predetermined buffer capacity are assigned to the top-k tokens. This method guarantees even load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance. EC routing speeds up training convergence by over 2x in an 8B/64E (8 billion activated parameters, 64 experts) model, compared to the top-1 and top-2 gating counterparts in Switch Transformer, GShard, and GLaM.

Expert Choice Routing. Experts with predetermined buffer capacity are assigned top-k tokens, thus guaranteeing even load balancing. Each token can be received by a variable number of experts.

In EC routing, we set expert capacity k as the average tokens per expert in a batch of input sequences multiplied by a capacity factor, which determines the average number of experts that can be received by each token. To learn the token-to-expert affinity, our method produces a token-to-expert score matrix that is used to make routing decisions. The score matrix indicates the likelihood of a given token in a batch of input sequences being routed to a given expert.

Similar to Switch Transformer and GShard, we apply an MoE and gating function in the dense feedforward (FFN) layer, as it is the most computationally expensive part of a Transformer-based network. After producing the token-to-expert score matrix, a top-k function is applied along the token dimension for each expert to pick the most relevant tokens. A permutation function is then applied based on the generated token indices to create a hidden value with an additional expert dimension. The data is split across multiple experts such that all experts can execute the same computational kernel concurrently on a subset of tokens. Because a fixed expert capacity can be determined, we no longer overprovision expert capacity due to load imbalance, which significantly reduces training and inference step time, by around 20% compared to GLaM.
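
To make the routing computation concrete, below is a minimal NumPy sketch of expert choice routing along these lines; the function and parameter names (e.g., `w_gate`) are illustrative assumptions rather than the implementation used in the paper.

```python
import numpy as np

def expert_choice_routing(tokens, w_gate, num_experts, capacity_factor=2.0):
    """Illustrative sketch: each expert selects its top-k tokens."""
    n, _ = tokens.shape
    # Expert capacity k: average tokens per expert, scaled by the capacity factor.
    k = int(n / num_experts * capacity_factor)

    # Token-to-expert affinity scores, normalized over the expert dimension.
    logits = tokens @ w_gate                                  # [n, num_experts]
    scores = np.exp(logits - logits.max(axis=-1, keepdims=True))
    scores = scores / scores.sum(axis=-1, keepdims=True)

    dispatch = []
    for e in range(num_experts):
        # Top-k along the token dimension: the expert picks its most relevant tokens,
        # so load is balanced by construction and each token may be picked by a
        # variable number of experts.
        top_tokens = np.argsort(-scores[:, e])[:k]
        dispatch.append((top_tokens, scores[top_tokens, e]))
    return dispatch

# Example: 16 tokens, hidden size 8, 4 experts, capacity factor 2 -> k = 8 tokens per expert.
rng = np.random.default_rng(0)
routing = expert_choice_routing(rng.normal(size=(16, 8)), rng.normal(size=(8, 4)), num_experts=4)
```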

Evaluation

To illustrate the effectiveness of Expert Choice routing, we first look at training efficiency and convergence. We use EC with a capacity factor of 2 (EC-CF2) to match the activated parameter size and computational cost on a per-token basis to GShard top-2 gating and run both for a fixed number of steps. EC-CF2 reaches the same perplexity as GShard top-2 in less than half the steps and, in addition, we find that each GShard top-2 step is 20% slower than our method.

We also scale the number of experts while fixing the expert size to 100M parameters for both EC and GShard top-2 methods. We find that both work well in terms of perplexity on the evaluation dataset during pre-training — having more experts consistently improves training perplexity.

Evaluation results on training convergence: EC routing yields 2x faster convergence at 8B/64E scale compared to top-2 gating used in GShard and GLaM (top). EC training perplexity scales better with the scaling of number of experts (bottom).

To validate whether improved perplexity directly translates to better performance in downstream tasks, we perform fine-tuning on 11 selected tasks from GLUE and SuperGLUE. We compare three MoE methods: Switch Transformer top-1 gating (ST Top-1), GShard top-2 gating (GS Top-2), and a version of our method (EC-CF2) that matches the activated parameters and computational cost of GS Top-2. The EC-CF2 method consistently outperforms the related methods and yields an average accuracy increase of more than 2% in a large 8B/64E setting. Comparing our 8B/64E model against its dense counterpart, our method achieves better fine-tuning results, increasing the average score by 3.4 points.

Our empirical results indicate that capping the number of experts for each token hurts the fine-tuning score by 1 point on average, confirming that allowing a variable number of experts per token is indeed helpful. We also compute statistics on token-to-expert routing, particularly the fraction of tokens routed to a given number of experts. We find that a majority of tokens are routed to one or two experts, 23% are routed to three or four experts, and only about 3% of tokens are routed to more than four experts, verifying our hypothesis that expert choice routing learns to allocate a variable number of experts to tokens.

Final Thoughts

We propose a new routing method for sparsely activated mixture-of-experts models. This method addresses load imbalance and under-utilization of experts in conventional MoE methods, and enables the selection of different numbers of experts for each token. Our model demonstrates more than 2x training efficiency improvement when compared to the state-of-the-art GShard and Switch Transformer models, and achieves strong gains when fine-tuning on 11 datasets in the GLUE and SuperGLUE benchmark.

Our approach for expert choice routing enables heterogeneous MoE with straightforward algorithmic innovations. We hope that this may lead to more advances in this space at both the application and system levels.

Acknowledgements

Many collaborators across Google Research supported this work. We particularly thank Nan Du, Andrew Dai, Yanping Huang, and Zhifeng Chen for the initial groundwork on MoE infrastructure and Tarzan datasets. We greatly appreciate Hanxiao Liu and Quoc Le for contributing the initial ideas and discussions. Tao Lei, Vincent Zhao, Da Huang, Chang Lan, Daiyi Peng, and Yifeng Lu contributed significantly to the implementations and evaluations. Claire Cui, James Laudon, Martin Abadi, and Jeff Dean provided invaluable feedback and resource support.

Categories
Offsites

Multi-layered Mapping of Brain Tissue via Segmentation Guided Contrastive Learning

Mapping the wiring and firing activity of the human brain is fundamental to deciphering how we think — how we sense the world, learn, decide, remember, and create — as well as what issues can arise in brain disease or dysfunction. Recent efforts have delivered publicly available brain maps (high-resolution 3D mapping of brain cells and their connectivities) at unprecedented quality and scale, such as H01, a 1.4 petabyte nanometer-scale digital reconstruction of a sample of human brain tissue from Harvard / Google, and the cubic millimeter mouse cortex dataset from our colleagues at the MICrONS consortium.

To interpret brain maps at this scale requires multiple layers of analysis, including the identification of synaptic connections, cellular subcompartments, and cell types. Machine learning and computer vision technology have played a central role in enabling these analyses, but deploying such systems is still a laborious process, requiring hours of manual ground truth labeling by expert annotators and significant computational resources. Moreover, some important tasks, such as identifying the cell type from only a small fragment of axon or dendrite, can be challenging even for human experts, and have not yet been effectively automated.

Today, in “Multi-Layered Maps of Neuropil with Segmentation-Guided Contrastive Learning”, we are announcing Segmentation-Guided Contrastive Learning of Representations (SegCLR), a method for training rich, generic representations of cellular morphology (the cell’s shape) and ultrastructure (the cell’s internal structure) without laborious manual effort. SegCLR produces compact vector representations (i.e., embeddings) that are applicable across diverse downstream tasks (e.g., local classification of cellular subcompartments, unsupervised clustering), and are even able to identify cell types from only small fragments of a cell. We trained SegCLR on both the H01 human cortex dataset and the MICrONS mouse cortex dataset, and we are releasing the resulting embedding vectors, about 8 billion in total, for researchers to explore.

From brain cells segmented out of a 3D block of tissue, SegCLR embeddings capture cellular morphology and ultrastructure and can be used to distinguish cellular subcompartments (e.g., dendritic spine versus dendrite shaft) or cell types (e.g., pyramidal versus microglia cell).

Representing Cellular Morphology and Ultrastructure

SegCLR builds on recent advances in self-supervised contrastive learning. We use a standard deep network architecture to encode inputs comprising local 3D blocks of electron microscopy data (about 4 micrometers on a side) into 64-dimensional embedding vectors. The network is trained via a contrastive loss to map semantically related inputs to similar coordinates in the embedding space. This is close to the popular SimCLR setup, except that we also require an instance segmentation of the volume (tracing out individual cells and cell fragments), which we use in two important ways.

First, the input 3D electron microscopy data are explicitly masked by the segmentation, forcing the network to focus only on the central cell within each block. Second, we leverage the segmentation to automatically define which inputs are semantically related: positive pairs for the contrastive loss are drawn from nearby locations on the same segmented cell and trained to have similar representations, while inputs drawn from different cells are trained to have dissimilar representations. Importantly, publicly available automated segmentations of the human and mouse datasets were sufficiently accurate to train SegCLR without requiring laborious review and correction by human experts.
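
As a rough illustration of how the segmentation supplies the training signal, here is a minimal NumPy sketch of a segmentation-guided contrastive loss over a batch of view embeddings; the actual SegCLR loss, pair-sampling scheme, and architecture differ (see the paper), and the names below are placeholders.

```python
import numpy as np

def segmentation_guided_contrastive_loss(embeddings, segment_ids, temperature=0.1):
    """Illustrative sketch: views drawn from the same segmented cell are positives."""
    # embeddings: [n, d] embeddings of masked 3D views; segment_ids: [n] id of the
    # segmented cell each view was drawn from.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T / temperature            # pairwise similarities
    np.fill_diagonal(sim, -np.inf)             # exclude self-similarity

    pos_mask = segment_ids[:, None] == segment_ids[None, :]
    np.fill_diagonal(pos_mask, False)

    # Log-softmax over all other views: the loss pulls same-cell views together
    # and pushes views from different cells apart.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    pos_counts = np.maximum(pos_mask.sum(axis=1), 1)
    loss_per_view = -np.where(pos_mask, log_prob, 0.0).sum(axis=1) / pos_counts
    return loss_per_view.mean()
```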

SegCLR is trained to represent rich cellular features without manual labeling. Top: The SegCLR architecture maps local masked 3D views of electron microscopy data to embedding vectors. Only the microscopy volume and a draft automated instance segmentation are required. Bottom: The segmentation is also used to define positive versus negative example pairs, whose representations are pushed closer together (positives, blue arrows) or further apart (negatives, red arrows) during training.

Reducing Annotation Training Requirements by Three Orders of Magnitude

SegCLR embeddings can be used in diverse downstream settings, whether supervised (e.g., training classifiers) or unsupervised (e.g., clustering or content-based image retrieval). In the supervised setting, embeddings simplify the training of classifiers, and can greatly reduce ground truth labeling requirements. For example, we found that for identifying cellular subcompartments (axon, dendrite, soma, etc.) a simple linear classifier trained on top of SegCLR embeddings outperformed a fully supervised deep network trained on the same task, while using only about one thousand labeled examples instead of millions.

We assessed the classification performance for axon, dendrite, soma, and astrocyte subcompartments in the human cortex dataset via mean F1-Score, while varying the number of training examples used. Linear classifiers trained on top of SegCLR embeddings matched or exceeded the performance of a fully supervised deep classifier (horizontal line), while using a fraction of the training data.

Distinguishing Cell Types, Even from Small Fragments

Distinguishing different cell types is an important step towards understanding how brain circuits develop and function in health and disease. Human experts can learn to identify some cortical cell types based on morphological features, but manual cell typing is laborious and ambiguous cases are common. Cell typing also becomes more difficult when only small fragments of cells are available, which is common for many cells in current connectomic reconstructions.

Human experts manually labeled cell types for a small number of proofread cells in each dataset. In the mouse cortex dataset, experts labeled six neuron types (top) and four glia types (not shown). In the human cortex dataset, experts labeled two neuron types (not shown) and four glia types (bottom). (Rows not to scale with each other.)

We found that SegCLR accurately infers human and mouse cell types, even for small fragments. Prior to classification, we collected and averaged embeddings within each cell over a set aggregation distance, defined as the radius from a central point. We found that human cortical cell types can be identified with high accuracy for aggregation radii as small as 10 micrometers, even for types that experts find difficult to distinguish, such as microglia (MGC) versus oligodendrocyte precursor cells (OPC).
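
For intuition, the aggregation step can be sketched as follows; the function and argument names are our own illustration rather than the released code, and classification is then performed on the pooled embedding.

```python
import numpy as np

def aggregate_embeddings(embeddings, positions, center, radius_um):
    """Illustrative sketch: average all embeddings within a radius of a central point.

    embeddings: [n, 64] SegCLR embeddings; positions: [n, 3] coordinates in micrometers.
    """
    distances = np.linalg.norm(positions - center, axis=1)
    selected = embeddings[distances <= radius_um]
    # Radius 0 keeps only the single embedding at the center; larger radii pool more
    # local context, which improves cell type accuracy for small fragments.
    return selected.mean(axis=0)
```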

SegCLR can classify cell types, even from small fragments. Left: Classification performance over six human cortex cell types for shallow ResNet models trained on SegCLR embeddings for different sized cell fragments. Aggregation radius zero corresponds to very small fragments with only a single embedding. Cell type performance reaches high accuracy (0.938 mean F1-Score) for fragments with aggregation radii of only 10 micrometers (boxed point). Right: Class-wise confusion matrix at 10 micrometers aggregation radius. Darker shading along the diagonal indicates that predicted cell types agree with expert labels in most cases. AC: astrocyte; MGC: microglia cell; OGC: oligodendrocyte cell; OPC: oligodendrocyte precursor cell; E: excitatory neuron; I: inhibitory neuron.

In the mouse cortex, ten cell types could be distinguished with high accuracy at aggregation radii of 25 micrometers.

Left: Classification performance over the ten mouse cortex cell types reaches 0.832 mean F1-Score for fragments with aggregation radius 25 micrometers (boxed point). Right: The class-wise confusion matrix at 25 micrometers aggregation radius. Boxes indicate broad groups (glia, excitatory neurons, and inhibitory interneurons). P: pyramidal cell; THLC: thalamocortical axon; BC: basket cell; BPC: bipolar cell; MC: Martinotti cell; NGC: neurogliaform cell.

In additional cell type applications, we used unsupervised clustering of SegCLR embeddings to reveal further neuronal subtypes, and demonstrated how uncertainty estimation can be used to restrict classification to high confidence subsets of the dataset, e.g., when only a few cell types have expert labels.

Revealing Patterns of Brain Connectivity

Finally, we showed how SegCLR can be used for automated analysis of brain connectivity by cell typing the synaptic partners of reconstructed cells throughout the mouse cortex dataset. Knowing the connectivity patterns between specific cell types is fundamental to interpreting large-scale connectomic reconstructions of brain wiring, but this typically requires manual tracing to identify partner cell types. Using SegCLR, we replicated brain connectivity findings that previously relied on intensive manual tracing, while extending their scale in terms of the number of synapses, cell types, and brain areas analyzed. (See the paper for further details.)

SegCLR automated analysis of brain connectivity. Top: An example mouse pyramidal cell, with synapse locations color-coded according to whether the synaptic partner was classified as inhibitory (blue), excitatory (red), or unknown (black). Inset shows higher detail of the soma and proximal dendrites. Bottom: We counted how many upstream synaptic partners were classified as thalamocortical axons, which bring input from sensory systems to the cortex. We found that thalamic input arrives primarily at cortical layer L4, the canonical cortical input layer, and preferentially targets primary visual area V1, rather than higher visual areas (HVA).

What’s Next?

SegCLR captures rich cellular features and can greatly simplify downstream analyses compared to working directly with raw image and segmentation data. We are excited to see what the community can discover using the ~8 billion embeddings we are releasing for the human and mouse cortical datasets (example access code; browsable human and mouse views in Neuroglancer). By reducing complex microscopy data to rich and compact embedding representations, SegCLR opens many novel avenues for biological insight, and may serve as a link to complementary modalities for high-dimensional characterization at the cellular and subcellular levels, such as spatially-resolved transcriptomics.

Categories
Offsites

ReAct: Synergizing Reasoning and Acting in Language Models

Recent advances have expanded the applicability of language models (LMs) to downstream tasks. On one hand, existing language models that are properly prompted, via chain-of-thought, demonstrate emergent capabilities to generate self-conditioned reasoning traces that derive answers from questions, excelling at various arithmetic, commonsense, and symbolic reasoning tasks. However, with chain-of-thought prompting, a model is not grounded in the external world and uses its own internal representations to generate reasoning traces, limiting its ability to reactively explore and reason or update its knowledge. On the other hand, recent work uses pre-trained language models for planning and acting in various interactive environments (e.g., text games, web navigation, embodied tasks, robotics), with a focus on mapping text contexts to text actions via the language model’s internal knowledge. However, these approaches do not reason abstractly about high-level goals or maintain a working memory to support acting over long horizons.

In “ReAct: Synergizing Reasoning and Acting in Language Models”, we propose a general paradigm that combines reasoning and acting advances to enable language models to solve various language reasoning and decision making tasks. We demonstrate that the Reason+Act (ReAct) paradigm systematically outperforms reasoning-only and acting-only paradigms, both when prompting bigger language models and when fine-tuning smaller language models. The tight integration of reasoning and acting also presents human-aligned task-solving trajectories that improve interpretability, diagnosability, and controllability.

Model Overview

ReAct enables language models to generate both verbal reasoning traces and text actions in an interleaved manner. While actions lead to observation feedback from an external environment (“Env” in the figure below), reasoning traces do not affect the external environment. Instead, they affect the internal state of the model by reasoning over the context and updating it with useful information to support future reasoning and acting.

Previous methods prompt language models (LM) to either generate self-conditioned reasoning traces or task-specific actions. We propose ReAct, a new paradigm that combines reasoning and acting advances in language models.

ReAct Prompting

We focus on the setup where a frozen language model, PaLM-540B, is prompted with few-shot in-context examples to generate both domain-specific actions (e.g., “search” in question answering, and “go to” in room navigation), and free-form language reasoning traces (e.g., “Now I need to find a cup, and put it on the table”) for task solving.

For tasks where reasoning is of primary importance, we alternate the generation of reasoning traces and actions so that the task-solving trajectory consists of multiple reasoning-action-observation steps. In contrast, for decision making tasks that potentially involve a large number of actions, reasoning traces only need to appear sparsely in the most relevant positions of a trajectory, so we write prompts with sparse reasoning and let the language model decide the asynchronous occurrence of reasoning traces and actions for itself.
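
To make this interleaving concrete, a minimal, hypothetical sketch of such a loop is shown below; the helpers `llm` and `run_action`, the `finish[...]` action format, and the step cap are our own illustrative assumptions, not the paper’s implementation.

```python
def react_episode(question, few_shot_prompt, llm, run_action, max_steps=8):
    """Illustrative ReAct-style loop interleaving reasoning traces, actions, and observations."""
    trajectory = few_shot_prompt + f"Question: {question}\n"
    for step in range(1, max_steps + 1):
        # The model first emits a free-form reasoning trace, then a domain-specific action.
        thought = llm(trajectory + f"Thought {step}:")
        action = llm(trajectory + f"Thought {step}: {thought}\nAction {step}:")
        trajectory += f"Thought {step}: {thought}\nAction {step}: {action}\n"
        if action.startswith("finish["):
            return action[len("finish["):-1], trajectory      # final answer
        # Acting grounds the model: the observation from the environment is appended
        # to the context so later reasoning can use externally retrieved information.
        observation = run_action(action)
        trajectory += f"Observation {step}: {observation}\n"
    return None, trajectory
```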

As shown below, there are various types of useful reasoning traces, e.g., decomposing task goals to create action plans, injecting commonsense knowledge relevant to task solving, extracting important parts from observations, tracking task progress while maintaining plan execution, handling exceptions by adjusting action plans, and so on.

The synergy between reasoning and acting allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting (reason to act), while also interacting with the external environments (e.g., Wikipedia) to incorporate additional information into reasoning (act to reason).

ReAct Fine-tuning

We also explore fine-tuning smaller language models using ReAct-format trajectories. To reduce the need for large-scale human annotation, we use the ReAct prompted PaLM-540B model to generate trajectories, and use trajectories with task success to fine-tune smaller language models (PaLM-8/62B).
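
A minimal sketch of this data-collection step is shown below, reusing the hypothetical `react_episode` loop sketched earlier; `task_succeeded` is an assumed stand-in for each task’s success check.

```python
def collect_finetuning_data(tasks, few_shot_prompt, llm, run_action, task_succeeded):
    """Illustrative sketch: keep only successful ReAct trajectories for fine-tuning."""
    dataset = []
    for task in tasks:
        answer, trajectory = react_episode(task, few_shot_prompt, llm, run_action)
        # Only trajectories that actually solve the task are used to fine-tune
        # the smaller student models.
        if task_succeeded(task, answer):
            dataset.append(trajectory)
    return dataset
```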

Comparison of four prompting methods, (a) Standard, (b) Chain of thought (CoT, Reason Only), (c) Act-only, and (d) ReAct, solving a HotpotQA question. In-context examples are omitted, and only the task trajectory is shown. ReAct is able to retrieve information to support reasoning, while also using reasoning to target what to retrieve next, demonstrating a synergy of reasoning and acting.

Results

We conduct empirical evaluations of ReAct and state-of-the-art baselines across four different benchmarks: question answering (HotPotQA), fact verification (Fever), text-based game (ALFWorld), and web page navigation (WebShop). For HotPotQA and Fever, with access to a Wikipedia API with which the model can interact, ReAct outperforms vanilla action generation models while being competitive with chain of thought reasoning (CoT) performance. The approach with the best results is a combination of ReAct and CoT that uses both internal knowledge and externally obtained information during reasoning.

Method                       HotpotQA (exact match, 6-shot)    FEVER (accuracy, 3-shot)
Standard                     28.7                              57.1
Reason-only (CoT)            29.4                              56.3
Act-only                     25.7                              58.9
ReAct                        27.4                              60.9
Best ReAct + CoT Method      35.1                              64.6
Supervised SoTA              67.5 (using ~140k samples)        89.5 (using ~90k samples)

PaLM-540B prompting results on HotpotQA and Fever.

On ALFWorld and WebShop, ReAct with both one-shot and two-shot prompting outperforms imitation and reinforcement learning methods trained with ~10^5 task instances, with an absolute improvement of 34% and 10% in success rates, respectively, over existing baselines.

Method                          AlfWorld (2-shot)            WebShop (1-shot)
Act-only                        45                           30.1
ReAct                           71                           40
Imitation Learning Baselines    37 (using ~100k samples)     29.1 (using ~90k samples)

PaLM-540B prompting task success rate results on AlfWorld and WebShop.

Scaling results for prompting and fine-tuning on HotPotQA with ReAct and different baselines. ReAct consistently achieves the best fine-tuning performance.

A comparison of the ReAct (top) and CoT (bottom) reasoning trajectories on an example from Fever (observation for ReAct is omitted to reduce space). In this case ReAct provided the right answer, and its reasoning trajectory is more grounded in facts and knowledge, in contrast to CoT’s hallucination behavior.

We also explore human-in-the-loop interactions with ReAct by allowing a human inspector to edit ReAct’s reasoning traces. We demonstrate that by simply replacing a hallucinating sentence with inspector hints, ReAct can change its behavior to align with inspector edits and successfully complete a task. Solving tasks becomes significantly easier when using ReAct as it only requires the manual editing of a few thoughts, which enables new forms of human-machine collaboration.

A human-in-the-loop behavior correction example with ReAct on AlfWorld. (a) ReAct trajectory fails due to a hallucinating reasoning trace (Act 17). (b) A human inspector edits two reasoning traces (Act 17, 23), ReAct then produces desirable reasoning traces and actions to complete the task.

Conclusion

We present ReAct, a simple yet effective method for synergizing reasoning and acting in language models. Through various experiments that focus on multi-hop question-answering, fact checking, and interactive decision-making tasks, we show that ReAct leads to superior performance with interpretable decision traces.

ReAct demonstrates the feasibility of jointly modeling thoughts, actions, and feedback from the environment within a language model, making it a versatile agent that is capable of solving tasks that require interactions with the environment. We plan to further extend this line of research and leverage the strong potential of the language model for tackling broader embodied tasks, via approaches like massive multitask training and coupling ReAct with equally strong reward models.

Acknowledgements

We would like to thank Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, and Karthik Narasimhan for their great contributions to this work. We would also like to thank Google’s Brain team and the Princeton NLP Group for their joint support and feedback, including project scoping, advising, and insightful discussions.

Categories
Offsites

Infinite Nature: Generating 3D Flythroughs from Still Photos

We live in a world of great natural beauty — of majestic mountains, dramatic seascapes, and serene forests. Imagine seeing this beauty as a bird does, flying past richly detailed, three-dimensional landscapes. Can computers learn to synthesize this kind of visual experience? Such a capability would allow for new kinds of content for games and virtual reality experiences: for instance, relaxing within an immersive flythrough of an infinite nature scene. But existing methods that synthesize new views from images tend to allow for only limited camera motion.

In a research effort we call Infinite Nature, we show that computers can learn to generate such rich 3D experiences simply by viewing nature videos and photographs. Our latest work on this theme, InfiniteNature-Zero (presented at ECCV 2022) can produce high-resolution, high-quality flythroughs starting from a single seed image, using a system trained only on still photographs, a breakthrough capability not seen before. We call the underlying research problem perpetual view generation: given a single input view of a scene, how can we synthesize a photorealistic set of output views corresponding to an arbitrarily long, user-controlled 3D path through that scene? Perpetual view generation is very challenging because the system must generate new content on the other side of large landmarks (e.g., mountains), and render that new content with high realism and in high resolution.

Example flythrough generated with InfiniteNature-Zero. It takes a single input image of a natural scene and synthesizes a long camera path flying into that scene, generating new scene content as it goes.

Background: Learning 3D Flythroughs from Videos

To establish the basics of how such a system could work, we’ll describe our first version, “Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image” (presented at ICCV 2021). In that work we explored a “learn from video” approach, where we collected a set of online videos captured from drones flying along coastlines, with the idea that we could learn to synthesize new flythroughs that resemble these real videos. This set of online videos is called the Aerial Coastline Imagery Dataset (ACID). In order to learn how to synthesize scenes that respond dynamically to any desired 3D camera path, however, we couldn’t simply treat these videos as raw collections of pixels; we also had to compute their underlying 3D geometry, including the camera position at each frame.

The basic idea is that we learn to generate flythroughs step-by-step. Given a starting view, like the first image in the figure below, we first compute a depth map using single-image depth prediction methods. We then use that depth map to render the image forward to a new camera viewpoint, shown in the middle, resulting in a new image and depth map from that new viewpoint.

However, this intermediate image has some problems — it has holes where we can see behind objects into regions that weren’t visible in the starting image. It is also blurry, because we are now closer to objects, but are stretching the pixels from the previous frame to render these now-larger objects.

To handle these problems, we learn a neural image refinement network that takes this low-quality intermediate image and outputs a complete, high-quality image and corresponding depth map. These steps can then be repeated, with this synthesized image as the new starting point. Because we refine both the image and the depth map, this process can be iterated as many times as desired — the system automatically learns to generate new scenery, like mountains, islands, and oceans, as the camera moves further into the scene.
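
A minimal sketch of this render-refine-repeat loop is shown below; `predict_depth`, `render_to_view`, and `refine` are placeholders for the single-image depth network, the re-rendering (warping) step, and the neural refinement network, and the details differ from the actual system.

```python
def generate_flythrough(image, cameras, predict_depth, render_to_view, refine):
    """Illustrative render-refine-repeat sketch for a user-controlled camera path."""
    frames = []
    depth = predict_depth(image)
    for camera in cameras:
        # Warp the current image and depth map into the next viewpoint; the result
        # has holes (disocclusions) and blur from stretched pixels.
        warped_image, warped_depth = render_to_view(image, depth, camera)
        # The refinement network fills in missing content and restores detail,
        # producing a clean image/depth pair that seeds the next step.
        image, depth = refine(warped_image, warped_depth)
        frames.append(image)
    return frames
```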

Our Infinite Nature methods take an input view and its corresponding depth map (left). Using this depth map, the system renders the input image to a new desired viewpoint (center). This intermediate image has problems, such as missing pixels revealed behind foreground content (shown in magenta). We learn a deep network that refines this image to produce a new high-quality image (right). This process can be repeated to produce a long trajectory of views. We thus call this approach “render-refine-repeat”.

We train this render-refine-repeat synthesis approach using the ACID dataset. In particular, we sample a video from the dataset and then a frame from that video. We then use this method to render several new views moving into the scene along the same camera trajectory as the ground truth video, as shown in the figure below, and compare these rendered frames to the corresponding ground truth video frames to derive a training signal. We also include an adversarial setup that tries to distinguish synthesized frames from real images, encouraging the generated imagery to appear more realistic.

Infinite Nature can synthesize views corresponding to any camera trajectory. During training, we run our system for T steps to generate T views along a camera trajectory calculated from a training video sequence, then compare the resulting synthesized views to the ground truth ones. In the figure, each camera viewpoint is generated from the previous one by performing a warp operation R, followed by the neural refinement operation gθ.

The resulting system can generate compelling flythroughs, as featured on the project webpage, along with a “flight simulator” Colab demo. Unlike prior methods on video synthesis, this method allows the user to interactively control the camera and can generate much longer camera paths.

InfiniteNature-Zero: Learning Flythroughs from Still Photos

One problem with this first approach is that video is difficult to work with as training data. High-quality video with the right kind of camera motion is challenging to find, and the aesthetic quality of an individual video frame generally cannot compare to that of an intentionally captured nature photograph. Therefore, in “InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images”, we build on the render-refine-repeat strategy above, but devise a way to learn perpetual view synthesis from collections of still photos — no videos needed. We call this method InfiniteNature-Zero because it learns from “zero” videos. At first, this might seem like an impossible task — how can we train a model to generate video flythroughs of scenes when all it’s ever seen are isolated photos?

To solve this problem, we had the key insight that if we take an image and render a camera path that forms a cycle — that is, where the path loops back such that the last image is from the same viewpoint as the first — then we know that the last synthesized image along this path should be the same as the input image. Such cycle consistency provides a training constraint that helps the model learn to fill in missing regions and increase image resolution during each step of view generation.
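
A minimal sketch of this cycle constraint is shown below, under the assumption that `step` performs one render-refine-repeat iteration and `reconstruction_loss` is an image distance such as L1 or a perceptual loss; both names are placeholders rather than the paper’s implementation.

```python
def cycle_consistency_loss(image, depth, cycle_cameras, step, reconstruction_loss):
    """Illustrative sketch: a camera path that loops back should reproduce the input."""
    current_image, current_depth = image, depth
    for camera in cycle_cameras:          # path that returns to the starting viewpoint
        current_image, current_depth = step(current_image, current_depth, camera)
    # After completing the cycle, the synthesized view should match the input photo,
    # which trains the model to fill in missing regions and restore resolution.
    return reconstruction_loss(current_image, image)
```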

However, training with these camera cycles is insufficient for generating long and stable view sequences, so as in our original work, we include an adversarial strategy that considers long, non-cyclic camera paths, like the one shown in the figure above. In particular, if we render T frames from a starting frame, we optimize our render-refine-repeat model such that a discriminator network can’t tell which was the starting frame and which was the final synthesized frame. Finally, we add a component trained to generate high-quality sky regions to increase the perceived realism of the results.

With these insights, we trained InfiniteNature-Zero on collections of landscape photos, which are available in large quantities online. Several resulting videos are shown below — these demonstrate beautiful, diverse natural scenery that can be explored along arbitrarily long camera paths. Compared to our prior work — and to prior video synthesis methods — these results exhibit significant improvements in quality and diversity of content (details available in the paper).

Several nature flythroughs generated by InfiniteNature-Zero from single starting photos.

Conclusion

There are a number of exciting future directions for this work. For instance, our methods currently synthesize scene content based only on the previous frame and its depth map; there is no persistent underlying 3D representation. Our work points towards future algorithms that can generate complete, photorealistic, and consistent 3D worlds.

Acknowledgements

Infinite Nature and InfiniteNature-Zero are the result of a collaboration between researchers at Google Research, UC Berkeley, and Cornell University. The key contributors to the work represented in this post include Angjoo Kanazawa, Andrew Liu, Richard Tucker, Zhengqi Li, Noah Snavely, Qianqian Wang, Varun Jampani, and Ameesh Makadia.

Categories
Offsites

Beyond Tabula Rasa: Reincarnating Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning that focuses on training intelligent agents using related experiences so they can learn to solve decision making tasks, such as playing video games, flying stratospheric balloons, and designing hardware chips. Due to the generality of RL, the prevalent trend in RL research is to develop agents that can efficiently learn tabula rasa, that is, from scratch without using previously learned knowledge about the problem. However, in practice, tabula rasa RL systems are typically the exception rather than the norm for solving large-scale RL problems. Large-scale RL systems, such as OpenAI Five, which achieves human-level performance on Dota 2, undergo multiple design changes (e.g., algorithmic or architectural changes) during their developmental cycle. This modification process can last months and necessitates incorporating such changes without re-training from scratch, which would be prohibitively expensive. 

Furthermore, the inefficiency of tabula rasa RL research can exclude many researchers from tackling computationally-demanding problems. For example, the quintessential benchmark of training a deep RL agent on 50+ Atari 2600 games in ALE for 200M frames (the standard protocol) requires 1,000+ GPU days. As deep RL moves towards more complex and challenging problems, the computational barrier to entry in RL research will likely become even higher.

To address the inefficiencies of tabula rasa RL, we present “Reincarnating Reinforcement Learning: Reusing Prior Computation To Accelerate Progress” at NeurIPS 2022. Here, we propose an alternative approach to RL research, where prior computational work, such as learned models, policies, logged data, etc., is reused or transferred between design iterations of an RL agent or from one agent to another. While some sub-areas of RL leverage prior computation, most RL agents are still largely trained from scratch. Until now, there has been no broader effort to leverage prior computational work for the training workflow in RL research. We have also released our code and trained agents to enable researchers to build on this work.

Tabula rasa RL vs. Reincarnating RL (RRL). While tabula rasa RL focuses on learning from scratch, RRL is based on the premise of reusing prior computational work (e.g., prior learned agents) when training new agents or improving existing agents, even in the same environment. In RRL, new agents need not be trained from scratch, except for initial forays into new problems.

Why Reincarnating RL?

Reincarnating RL (RRL) is a more compute- and sample-efficient workflow than training from scratch. RRL can democratize research by allowing the broader community to tackle complex RL problems without requiring excessive computational resources. Furthermore, RRL can enable a benchmarking paradigm where researchers continually improve and update existing trained agents, especially on problems where improving performance has real-world impact, such as balloon navigation or chip design. Finally, real-world RL use cases will likely be in scenarios where prior computational work is available (e.g., existing deployed RL policies).

RRL as an alternative research workflow. Imagine a researcher who has trained an agent A1 for some time, but now wants to experiment with better architectures or algorithms. While the tabula rasa workflow requires retraining another agent from scratch, RRL provides the more viable option of transferring the existing agent A1 to another agent and training this agent further, or simply fine-tuning A1.

While there have been some ad hoc large-scale reincarnation efforts with limited applicability, e.g., model surgery in Dota2, policy distillation in Rubik’s cube, PBT in AlphaStar, RL fine-tuning a behavior-cloned policy in AlphaGo / Minecraft, RRL has not been studied as a research problem in its own right. To this end, we argue for developing general-purpose RRL approaches as opposed to prior ad-hoc solutions.

Case Study: Policy to Value Reincarnating RL

Different RRL problems can be instantiated depending on the kind of prior computational work provided. As a step towards developing broadly applicable RRL approaches, we present a case study on the setting of Policy to Value reincarnating RL (PVRL) for efficiently transferring an existing sub-optimal policy (teacher) to a standalone value-based RL agent (student). While a policy directly maps a given environment state (e.g., a game screen in Atari) to an action, value-based agents estimate the effectiveness of an action at a given state in terms of achievable future rewards, which allows them to learn from previously collected data.

For a PVRL algorithm to be broadly useful, it should satisfy the following requirements:

  • Teacher Agnostic: The student shouldn’t be constrained by the existing teacher policy’s architecture or training algorithm.
  • Weaning off the teacher: It is undesirable to maintain dependency on past suboptimal teachers for successive reincarnations.
  • Compute / Sample Efficient: Reincarnation is only useful if it is cheaper than training from scratch.

Given the PVRL algorithm requirements, we evaluate whether existing approaches, designed with closely related goals, will suffice. We find that such approaches either result in small improvements over tabula rasa RL or degrade in performance when weaning off the teacher.

To address these limitations, we introduce a simple method, QDagger, in which the agent distills knowledge from the suboptimal teacher via an imitation algorithm while simultaneously using its environment interactions for RL. We start with a deep Q-network (DQN) agent trained for 400M environment frames (a week of single-GPU training) and use it as the teacher for reincarnating student agents trained on only 10M frames (a few hours of training), where the teacher is weaned off over the first 6M frames. For benchmark evaluation, we report the interquartile mean (IQM) metric from the RLiable library. As shown below for the PVRL setting on Atari games, we find that the QDagger RRL method outperforms prior approaches.
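
As a rough illustration, a QDagger-style objective can be thought of as a standard RL loss plus a teacher-distillation term whose weight is annealed to zero as the teacher is weaned off; the sketch below is our own, with hypothetical names and a simple linear schedule rather than the exact formulation in the paper.

```python
import numpy as np

def qdagger_style_loss(student_q, teacher_q, td_error, frames_seen,
                       wean_frames=6_000_000, temperature=1.0):
    """Illustrative sketch of combining an RL loss with an annealed distillation loss.

    student_q, teacher_q: [batch, num_actions] Q-values for the sampled states.
    td_error: [batch] temporal-difference errors from the usual value-based RL update.
    frames_seen: environment frames consumed so far, used to wean off the teacher.
    """
    def softmax(q):
        z = np.exp((q - q.max(axis=1, keepdims=True)) / temperature)
        return z / z.sum(axis=1, keepdims=True)

    # Distillation term: match the student's policy to the teacher's softened policy.
    teacher_pi = softmax(teacher_q)
    student_log_pi = np.log(softmax(student_q) + 1e-8)
    distill_loss = -(teacher_pi * student_log_pi).sum(axis=1).mean()

    # Weaning schedule: the teacher's influence decays linearly over the first frames.
    distill_weight = max(0.0, 1.0 - frames_seen / wean_frames)

    rl_loss = (td_error ** 2).mean()
    return rl_loss + distill_weight * distill_loss
```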

Benchmarking PVRL algorithms on Atari, with teacher-normalized scores aggregated across 10 games. Tabula rasa DQN (–·–) obtains a normalized score of 0.4. Standard baseline approaches include kickstarting, JSRL, rehearsal, offline RL pre-training and DQfD. Among all methods, only QDagger surpasses teacher performance within 10 million frames and outperforms the teacher in 75% of the games.

Reincarnating RL in Practice

We further examine the RRL approach on the Arcade Learning Environment, a widely used deep RL benchmark. First, we take a Nature DQN agent that uses the RMSProp optimizer and fine-tune it with the Adam optimizer to create a DQN (Adam) agent. While it is possible to train a DQN (Adam) agent from scratch, we demonstrate that fine-tuning Nature DQN with the Adam optimizer matches the from-scratch performance using 40x less data and compute.

Reincarnating DQN (Adam) via Fine-Tuning. The vertical separator corresponds to loading network weights and replay data for fine-tuning. Left: Tabula rasa Nature DQN nearly converges in performance after 200M environment frames. Right: Fine-tuning this Nature DQN agent using a reduced learning rate with the Adam optimizer for 20 million frames obtains similar results to DQN (Adam) trained from scratch for 400M frames.

Given the DQN (Adam) agent as a starting point, fine-tuning is restricted to the 3-layer convolutional architecture. So, we consider a more general reincarnation approach that leverages recent architectural and algorithmic advances without training from scratch. Specifically, we use QDagger to reincarnate another RL agent that uses a more advanced RL algorithm (Rainbow) and a better neural network architecture (Impala-CNN ResNet) from the fine-tuned DQN (Adam) agent.

Reincarnating a different architecture / algorithm via QDagger. The vertical separator is the point at which we apply offline pre-training using QDagger for reincarnation. Left: Fine-tuning DQN with Adam. Right: Comparison of a tabula rasa Impala-CNN Rainbow agent (sky blue) to an Impala-CNN Rainbow agent (pink) trained using QDagger RRL from the fine-tuned DQN (Adam). The reincarnated Impala-CNN Rainbow agent consistently outperforms its scratch counterpart. Note that further fine-tuning DQN (Adam) results in diminishing returns (yellow).

Overall, these results indicate that past research could have been accelerated by incorporating a RRL approach to designing agents, instead of re-training agents from scratch. Our paper also contains results on the Balloon Learning Environment, where we demonstrate that RRL allows us to make progress on the problem of navigating stratospheric balloons using only a few hours of TPU-compute by reusing a distributed RL agent trained on TPUs for more than a month.

Discussion

Fairly comparing reincarnation approaches requires using the exact same computational work and workflow. Furthermore, the RRL research findings that generalize broadly concern how effective an algorithm is given access to existing computational work; for example, we successfully applied QDagger, developed using Atari, for reincarnation on the Balloon Learning Environment. As such, we speculate that research in reincarnating RL can branch out in two directions:

  • Standardized benchmarks with open-sourced computational work: Akin to NLP and vision, where typically a small set of pre-trained models are common, research in RRL may also converge to a small set of open-sourced computational work (e.g., pre-trained teacher policies) on a given benchmark.
  • Real-world domains: Since obtaining higher performance has real-world impact in some domains, it incentivizes the community to reuse state-of-the-art agents and try to improve their performance.

See our paper for a broader discussion on scientific comparisons, generalizability and reproducibility in RRL. Overall, we hope that this work motivates researchers to release computational work (e.g., model checkpoints) on which others could directly build. In this regard, we have open-sourced our code and trained agents with their final replay buffers. We believe that reincarnating RL can substantially accelerate research progress by building on prior computational work, as opposed to always starting from scratch.

Acknowledgements

This work was done in collaboration with Pablo Samuel Castro, Aaron Courville and Marc Bellemare. We’d like to thank Tom Small for the animated figure used in this post. We are also grateful for feedback by the anonymous NeurIPS reviewers and several members of the Google Research team, DeepMind and Mila.