Categories
Offsites

AdaTape: Foundation model with adaptive computation and dynamic read-and-write

Adaptive computation refers to the ability of a machine learning system to adjust its behavior in response to changes in the environment. While conventional neural networks have a fixed function and computation capacity, i.e., they spend the same number of FLOPs for processing different inputs, a model with adaptive and dynamic computation modulates the computational budget it dedicates to processing each input, depending on the complexity of the input.

Adaptive computation in neural networks is appealing for two key reasons. First, the mechanism that introduces adaptivity provides an inductive bias that can play a key role in solving some challenging tasks. For instance, enabling different numbers of computational steps for different inputs can be crucial in solving arithmetic problems that require modeling hierarchies of different depths. Second, it gives practitioners the ability to tune the cost of inference through greater flexibility offered by dynamic computation, as these models can be adjusted to spend more FLOPs processing a new input.

Neural networks can be made adaptive by using different functions or computation budgets for various inputs. A deep neural network can be thought of as a function that outputs a result based on both the input and its parameters. To implement adaptive function types, a subset of parameters are selectively activated based on the input, a process referred to as conditional computation. Adaptivity based on the function type has been explored in studies on mixture-of-experts, where the sparsely activated parameters for each input sample are determined through routing.

Another area of research in adaptive computation involves dynamic computation budgets. Unlike in standard neural networks, such as T5, GPT-3, PaLM, and ViT, whose computation budget is fixed for different samples, recent research has demonstrated that adaptive computation budgets can improve performance on tasks where transformers fall short. Many of these works achieve adaptivity by using dynamic depth to allocate the computation budget. For example, the Adaptive Computation Time (ACT) algorithm was proposed to provide an adaptive computational budget for recurrent neural networks. The Universal Transformer extends the ACT algorithm to transformers by making the computation budget dependent on the number of transformer layers used for each input example or token. Recent studies, like PonderNet, follow a similar approach while improving the dynamic halting mechanisms.

In the paper “Adaptive Computation with Elastic Input Sequence”, we introduce a new model that utilizes adaptive computation, called AdaTape. This model is a Transformer-based architecture that uses a dynamic set of tokens to create elastic input sequences, providing a unique perspective on adaptivity in comparison to previous works. AdaTape uses an adaptive tape reading mechanism to determine a varying number of tape tokens that are added to each input based on input’s complexity. AdaTape is very simple to implement, provides an effective knob to increase the accuracy when needed, but is also much more efficient compared to other adaptive baselines because it directly injects adaptivity into the input sequence instead of the model depth. Finally, Adatape offers better performance on standard tasks, like image classification, as well as algorithmic tasks, while maintaining a favorable quality and cost tradeoff.

Adaptive computation transformer with elastic input sequence

AdaTape uses both the adaptive function types and a dynamic computation budget. Specifically, for a batch of input sequences after tokenization (e.g., a linear projection of non-overlapping patches from an image in the vision transformer), AdaTape uses a vector representing each input to dynamically select a variable-sized sequence of tape tokens.

AdaTape uses a bank of tokens, called a “tape bank”, to store all the candidate tape tokens that interact with the model through the adaptive tape reading mechanism. We explore two different methods for creating the tape bank: an input-driven bank and a learnable bank.

The general idea of the input-driven bank is to extract a bank of tokens from the input while employing a different approach than the original model tokenizer for mapping the raw input to a sequence of input tokens. This enables dynamic, on-demand access to information from the input that is obtained using a different point of view, e.g., a different image resolution or a different level of abstraction.

In some cases, tokenization in a different level of abstraction is not possible, thus an input-driven tape bank is not feasible, such as when it’s difficult to further split each node in a graph transformer. To address this issue, AdaTape offers a more general approach for generating the tape bank by using a set of trainable vectors as tape tokens. This approach is referred to as the learnable bank and can be viewed as an embedding layer where the model can dynamically retrieve tokens based on the complexity of the input example. The learnable bank enables AdaTape to generate a more flexible tape bank, providing it with the ability to dynamically adjust its computation budget based on the complexity of each input example, e.g., more complex examples retrieve more tokens from the bank, which let the model not only use the knowledge stored in the bank, but also spend more FLOPs processing it, since the input is now larger.

Finally, the selected tape tokens are appended to the original input and fed to the following transformer layers. For each transformer layer, the same multi-head attention is used across all input and tape tokens. However, two different feed-forward networks (FFN) are used: one for all tokens from the original input and the other for all tape tokens. We observed slightly better quality by using separate feed-forward networks for input and tape tokens.

An overview of AdaTape. For different samples, we pick a variable number of different tokens from the tape bank. The tape bank can be driven from input, e.g., by extracting some extra fine-grained information or it can be a set of trainable vectors. Adaptive tape reading is used to recursively select different sequences of tape tokens, with variable lengths, for different inputs. These tokens are then simply appended to inputs and fed to the transformer encoder.

AdaTape provides helpful inductive bias

We evaluate AdaTape on parity, a very challenging task for the standard Transformer, to study the effect of inductive biases in AdaTape. With the parity task, given a sequence 1s, 0s, and -1s, the model has to predict the evenness or oddness of the number of 1s in the sequence. Parity is the simplest non-counter-free or periodic regular language, but perhaps surprisingly, the task is unsolvable by the standard Transformer.

Evaluation on the parity task. The standard Transformer and Universal Transformer were unable to perform this task, both showing performance at the level of a random guessing baseline.

Despite being evaluated on short, simple sequences, both the standard Transformer and Universal Transformers were unable to perform the parity task as they are unable to maintain a counter within the model. However, AdaTape outperforms all baselines, as it incorporates a lightweight recurrence within its input selection mechanism, providing an inductive bias that enables the implicit maintenance of a counter, which is not possible in standard Transformers.

Evaluation on image classification

We also evaluate AdaTape on the image classification task. To do so, we trained AdaTape on ImageNet-1K from scratch. The figure below shows the accuracy of AdaTape and the baseline methods, including A-ViT, and the Universal Transformer ViT (UViT and U2T) versus their speed (measured as number of images, processed by each code, per second). In terms of quality and cost tradeoff, AdaTape performs much better than the alternative adaptive transformer baselines. In terms of efficiency, larger AdaTape models (in terms of parameter count) are faster than smaller baselines. Such results are consistent with the finding from previous work that shows that the adaptive model depth architectures are not well suited for many accelerators, like the TPU.

We evaluate AdaTape by training on ImageNet from scratch. For A-ViT, we not only report their results from the paper but also re-implement A-ViT by training from scratch, i.e., A-ViT(Ours).

A study of AdaTape’s behavior

In addition to its performance on the parity task and ImageNet-1K, we also evaluated the token selection behavior of AdaTape with an input-driven bank on the JFT-300M validation set. To better understand the model’s behavior, we visualized the token selection results on the input-driven bank as heatmaps, where lighter colors mean that position is more frequently selected. The heatmaps reveal that AdaTape more frequently picks the central patches. This aligns with our prior knowledge, as central patches are typically more informative — especially in the context of datasets with natural images, where the main object is in the middle of the image. This result highlights the intelligence of AdaTape, as it can effectively identify and prioritize more informative patches to improve its performance.

We visualize the tape token selection heatmap of AdaTape-B/32 (left) and AdaTape-B/16 (right). The hotter / lighter color means the patch at this position is more frequently selected.

Conclusion

AdaTape is characterized by elastic sequence lengths generated by the adaptive tape reading mechanism. This also introduces a new inductive bias that enables AdaTape to have the potential to solve tasks that are challenging for both standard transformers and existing adaptive transformers. By conducting comprehensive experiments on image recognition benchmarks, we demonstrate that AdaTape outperforms standard transformers and adaptive architecture transformers when computation is held constant.

Acknowledgments

One of the authors of this post, Mostafa Dehghani, is now at Google DeepMind.

Categories
Offsites

Multimodal medical AI

Medicine is an inherently multimodal discipline. When providing care, clinicians routinely interpret data from a wide range of modalities including medical images, clinical notes, lab tests, electronic health records, genomics, and more. Over the last decade or so, AI systems have achieved expert-level performance on specific tasks within specific modalities — some AI systems processing CT scans, while others analyzing high magnification pathology slides, and still others hunting for rare genetic variations. The inputs to these systems tend to be complex data such as images, and they typically provide structured outputs, whether in the form of discrete grades or dense image segmentation masks. In parallel, the capacities and capabilities of large language models (LLMs) have become so advanced that they have demonstrated comprehension and expertise in medical knowledge by both interpreting and responding in plain language. But how do we bring these capabilities together to build medical AI systems that can leverage information from all these sources?

In today’s blog post, we outline a spectrum of approaches to bringing multimodal capabilities to LLMs and share some exciting results on the tractability of building multimodal medical LLMs, as described in three recent research papers. The papers, in turn, outline how to introduce de novo modalities to an LLM, how to graft a state-of-the-art medical imaging foundation model onto a conversational LLM, and first steps towards building a truly generalist multimodal medical AI system. If successfully matured, multimodal medical LLMs might serve as the basis of new assistive technologies spanning professional medicine, medical research, and consumer applications. As with our prior work, we emphasize the need for careful evaluation of these technologies in collaboration with the medical community and healthcare ecosystem.

A spectrum of approaches

Several methods for building multimodal LLMs have been proposed in recent months [1, 2, 3], and no doubt new methods will continue to emerge for some time. For the purpose of understanding the opportunities to bring new modalities to medical AI systems, we’ll consider three broadly defined approaches: tool use, model grafting, and generalist systems.

The spectrum of approaches to building multimodal LLMs range from having the LLM use existing tools or models, to leveraging domain-specific components with an adapter, to joint modeling of a multimodal model.

Tool use

In the tool use approach, one central medical LLM outsources analysis of data in various modalities to a set of software subsystems independently optimized for those tasks: the tools. The common mnemonic example of tool use is teaching an LLM to use a calculator rather than do arithmetic on its own. In the medical space, a medical LLM faced with a chest X-ray could forward that image to a radiology AI system and integrate that response. This could be accomplished via application programming interfaces (APIs) offered by subsystems, or more fancifully, two medical AI systems with different specializations engaging in a conversation.

This approach has some important benefits. It allows maximum flexibility and independence between subsystems, enabling health systems to mix and match products between tech providers based on validated performance characteristics of subsystems. Moreover, human-readable communication channels between subsystems maximize auditability and debuggability. That said, getting the communication right between independent subsystems can be tricky, narrowing the information transfer, or exposing a risk of miscommunication and information loss.

Model grafting

A more integrated approach would be to take a neural network specialized for each relevant domain, and adapt it to plug directly into the LLM — grafting the visual model onto the core reasoning agent. In contrast to tool use where the specific tool(s) used are determined by the LLM, in model grafting the researchers may choose to use, refine, or develop specific models during development. In two recent papers from Google Research, we show that this is in fact feasible. Neural LLMs typically process text by first mapping words into a vector embedding space. Both papers build on the idea of mapping data from a new modality into the input word embedding space already familiar to the LLM. The first paper, “Multimodal LLMs for health grounded in individual-specific data”, shows that asthma risk prediction in the UK Biobank can be improved if we first train a neural network classifier to interpret spirograms (a modality used to assess breathing ability) and then adapt the output of that network to serve as input into the LLM.

The second paper, “ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders”, takes this same tack, but applies it to full-scale image encoder models in radiology. Starting with a foundation model for understanding chest X-rays, already shown to be a good basis for building a variety of classifiers in this modality, this paper describes training a lightweight medical information adapter that re-expresses the top layer output of the foundation model as a series of tokens in the LLM’s input embeddings space. Despite fine-tuning neither the visual encoder nor the language model, the resulting system displays capabilities it wasn’t trained for, including semantic search and visual question answering.

Our approach to grafting a model works by training a medical information adapter that maps the output of an existing or refined image encoder into an LLM-understandable form.

Model grafting has a number of advantages. It uses relatively modest computational resources to train the adapter layers but allows the LLM to build on existing highly-optimized and validated models in each data domain. The modularization of the problem into encoder, adapter, and LLM components can also facilitate testing and debugging of individual software components when developing and deploying such a system. The corresponding disadvantages are that the communication between the specialist encoder and the LLM is no longer human readable (being a series of high dimensional vectors), and the grafting procedure requires building a new adapter for not just every domain-specific encoder, but also every revision of each of those encoders.

Generalist systems

The most radical approach to multimodal medical AI is to build one integrated, fully generalist system natively capable of absorbing information from all sources. In our third paper in this area, “Towards Generalist Biomedical AI”, rather than having separate encoders and adapters for each data modality, we build on PaLM-E, a recently published multimodal model that is itself a combination of a single LLM (PaLM) and a single vision encoder (ViT). In this set up, text and tabular data modalities are covered by the LLM text encoder, but now all other data are treated as an image and fed to the vision encoder.

Med-PaLM M is a large multimodal generative model that flexibly encodes and interprets biomedical data including clinical language, imaging, and genomics with the same model weights.

We specialize PaLM-E to the medical domain by fine-tuning the complete set of model parameters on medical datasets described in the paper. The resulting generalist medical AI system is a multimodal version of Med-PaLM that we call Med-PaLM M. The flexible multimodal sequence-to-sequence architecture allows us to interleave various types of multimodal biomedical information in a single interaction. To the best of our knowledge, it is the first demonstration of a single unified model that can interpret multimodal biomedical data and handle a diverse range of tasks using the same set of model weights across all tasks (detailed evaluations in the paper).

This generalist-system approach to multimodality is both the most ambitious and simultaneously most elegant of the approaches we describe. In principle, this direct approach maximizes flexibility and information transfer between modalities. With no APIs to maintain compatibility across and no proliferation of adapter layers, the generalist approach has arguably the simplest design. But that same elegance is also the source of some of its disadvantages. Computational costs are often higher, and with a unitary vision encoder serving a wide range of modalities, domain specialization or system debuggability could suffer.

The reality of multimodal medical AI

To make the most of AI in medicine, we’ll need to combine the strength of expert systems trained with predictive AI with the flexibility made possible through generative AI. Which approach (or combination of approaches) will be most useful in the field depends on a multitude of as-yet unassessed factors. Is the flexibility and simplicity of a generalist model more valuable than the modularity of model grafting or tool use? Which approach gives the highest quality results for a specific real-world use case? Is the preferred approach different for supporting medical research or medical education vs. augmenting medical practice? Answering these questions will require ongoing rigorous empirical research and continued direct collaboration with healthcare providers, medical institutions, government entities, and healthcare industry partners broadly. We look forward to finding the answers together.

Categories
Offsites

In search of a generalizable method for source-free domain adaptation

Deep learning has recently made tremendous progress in a wide range of problems and applications, but models often fail unpredictably when deployed in unseen domains or distributions. Source-free domain adaptation (SFDA) is an area of research that aims to design methods for adapting a pre-trained model (trained on a “source domain”) to a new “target domain”, using only unlabeled data from the latter.

Designing adaptation methods for deep models is an important area of research. While the increasing scale of models and training datasets has been a key ingredient to their success, a negative consequence of this trend is that training such models is increasingly computationally expensive, out of reach for certain practitioners and also harmful for the environment. One avenue to mitigate this issue is through designing techniques that can leverage and reuse already trained models for tackling new tasks or generalizing to new domains. Indeed, adapting models to new tasks is widely studied under the umbrella of transfer learning.

SFDA is a particularly practical area of this research because several real-world applications where adaptation is desired suffer from the unavailability of labeled examples from the target domain. In fact, SFDA is enjoying increasing attention [1, 2, 3, 4]. However, albeit motivated by ambitious goals, most SFDA research is grounded in a very narrow framework, considering simple distribution shifts in image classification tasks.

In a significant departure from that trend, we turn our attention to the field of bioacoustics, where naturally-occurring distribution shifts are ubiquitous, often characterized by insufficient target labeled data, and represent an obstacle for practitioners. Studying SFDA in this application can, therefore, not only inform the academic community about the generalizability of existing methods and identify open research directions, but can also directly benefit practitioners in the field and aid in addressing one of the biggest challenges of our century: biodiversity preservation.

In this post, we announce “In Search for a Generalizable Method for Source-Free Domain Adaptation”, appearing at ICML 2023. We show that state-of-the-art SFDA methods can underperform or even collapse when confronted with realistic distribution shifts in bioacoustics. Furthermore, existing methods perform differently relative to each other than observed in vision benchmarks, and surprisingly, sometimes perform worse than no adaptation at all. We also propose NOTELA, a new simple method that outperforms existing methods on these shifts while exhibiting strong performance on a range of vision datasets. Overall, we conclude that evaluating SFDA methods (only) on the commonly-used datasets and distribution shifts leaves us with a myopic view of their relative performance and generalizability. To live up to their promise, SFDA methods need to be tested on a wider range of distribution shifts, and we advocate for considering naturally-occurring ones that can benefit high-impact applications.

Distribution shifts in bioacoustics

Naturally-occurring distribution shifts are ubiquitous in bioacoustics. The largest labeled dataset for bird songs is Xeno-Canto (XC), a collection of user-contributed recordings of wild birds from across the world. Recordings in XC are “focal”: they target an individual captured in natural conditions, where the song of the identified bird is at the foreground. For continuous monitoring and tracking purposes, though, practitioners are often more interested in identifying birds in passive recordings (“soundscapes”), obtained through omnidirectional microphones. This is a well-documented problem that recent work shows is very challenging. Inspired by this realistic application, we study SFDA in bioacoustics using a bird species classifier that was pre-trained on XC as the source model, and several “soundscapes” coming from different geographical locations — Sierra Nevada (S. Nevada); Powdermill Nature Reserve, Pennsylvania, USA; Hawai’i; Caples Watershed, California, USA; Sapsucker Woods, New York, USA (SSW); and Colombia — as our target domains.

This shift from the focalized to the passive domain is substantial: the recordings in the latter often feature much lower signal-to-noise ratio, several birds vocalizing at once, and significant distractors and environmental noise, like rain or wind. In addition, different soundscapes originate from different geographical locations, inducing extreme label shifts since a very small portion of the species in XC will appear in a given location. Moreover, as is common in real-world data, both the source and target domains are significantly class imbalanced, because some species are significantly more common than others. In addition, we consider a multi-label classification problem since there may be several birds identified within each recording, a significant departure from the standard single-label image classification scenario where SFDA is typically studied.

Illustration of the “focal → soundscapes” shift. In the focalized domain, recordings are typically composed of a single bird vocalization in the foreground, captured with high signal-to-noise ratio (SNR), though there may be other birds vocalizing in the background. On the other hand, soundscapes contain recordings from omnidirectional microphones and can be composed of multiple birds vocalizing simultaneously, as well as environmental noises from insects, rain, cars, planes, etc.

Audio files           

     Focal domain
     

     

     Soundscape domain1
     

Spectogram images                 
Illustration of the distribution shift from the focal domain (left) to the soundscape domain (right), in terms of the audio files (top) and spectrogram images (bottom) of a representative recording from each dataset. Note that in the second audio clip, the bird song is very faint; a common property in soundscape recordings where bird calls aren’t at the “foreground”. Credits: Left: XC recording by Sue Riffe (CC-BY-NC license). Right: Excerpt from a recording made available by Kahl, Charif, & Klinck. (2022) “A collection of fully-annotated soundscape recordings from the Northeastern United States” [link] from the SSW soundscape dataset (CC-BY license).

State-of-the-art SFDA models perform poorly on bioacoustics shifts

As a starting point, we benchmark six state-of-the-art SFDA methods on our bioacoustics benchmark, and compare them to the non-adapted baseline (the source model). Our findings are surprising: without exception, existing methods are unable to consistently outperform the source model on all target domains. In fact, they often underperform it significantly.

As an example, Tent, a recent method, aims to make models produce confident predictions for each example by reducing the uncertainty of the model’s output probabilities. While Tent performs well in various tasks, it doesn’t work effectively for our bioacoustics task. In the single-label scenario, minimizing entropy forces the model to choose a single class for each example confidently. However, in our multi-label scenario, there’s no such constraint that any class should be selected as being present. Combined with significant distribution shifts, this can cause the model to collapse, leading to zero probabilities for all classes. Other benchmarked methods like SHOT, AdaBN, Tent, NRC, DUST and Pseudo-Labelling, which are strong baselines for standard SFDA benchmarks, also struggle with this bioacoustics task.

Evolution of the test mean average precision (mAP), a standard metric for multilabel classification, throughout the adaptation procedure on the six soundscape datasets. We benchmark our proposed NOTELA and Dropout Student (see below), as well as SHOT, AdaBN, Tent, NRC, DUST and Pseudo-Labelling. Aside from NOTELA, all other methods fail to consistently improve the source model.

Introducing NOisy student TEacher with Laplacian Adjustment (NOTELA)

Nonetheless, a surprisingly positive result stands out: the less celebrated Noisy Student principle appears promising. This unsupervised approach encourages the model to reconstruct its own predictions on some target dataset, but under the application of random noise. While noise may be introduced through various channels, we strive for simplicity and use model dropout as the only noise source: we therefore refer to this approach as Dropout Student (DS). In a nutshell, it encourages the model to limit the influence of individual neurons (or filters) when making predictions on a specific target dataset.

DS, while effective, faces a model collapse issue on various target domains. We hypothesize this happens because the source model initially lacks confidence in those target domains. We propose improving DS stability by using the feature space directly as an auxiliary source of truth. NOTELA does this by encouraging similar pseudo-labels for nearby points in the feature space, inspired by NRC’s method and Laplacian regularization. This simple approach is visualized below, and consistently and significantly outperforms the source model in both audio and visual tasks.

NOTELA in action. The audio recordings are forwarded through the full model to obtain a first set of predictions, which are then refined through Laplacian regularization, a form of post-processing based on clustering nearby points. Finally, the refined predictions are used as targets for the noisy model to reconstruct.

Conclusion

The standard artificial image classification benchmarks have inadvertently limited our understanding of the true generalizability and robustness of SFDA methods. We advocate for broadening the scope and adopt a new assessment framework that incorporates naturally-occurring distribution shifts from bioacoustics. We also hope that NOTELA serves as a robust baseline to facilitate research in that direction. NOTELA’s strong performance perhaps points to two factors that can lead to developing more generalizable models: first, developing methods with an eye towards harder problems and second, favoring simple modeling principles. However, there is still future work to be done to pinpoint and comprehend existing methods’ failure modes on harder problems. We believe that our research represents a significant step in this direction, serving as a foundation for designing SFDA methods with greater generalizability.

Acknowledgements

One of the authors of this post, Eleni Triantafillou, is now at Google DeepMind. We are posting this blog post on behalf of the authors of the NOTELA paper: Malik Boudiaf, Tom Denton, Bart van Merriënboer, Vincent Dumoulin*, Eleni Triantafillou* (where * denotes equal contribution). We thank our co-authors for the hard work on this paper and the rest of the Perch team for their support and feedback.


1Note that in this audio clip, the bird song is very faint; a common property in soundscape recordings where bird calls aren’t at the “foreground”. 

Categories
Offsites

Google at ICML 2023

Groups across Google actively pursue research in the field of machine learning (ML), ranging from theory and application. We build ML systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. We aim to build a more collaborative ecosystem with the broader ML research community through open-sourcing tools and datasets, publishing our work, and actively participating in conferences.

Google is proud to be a Diamond Sponsor of the 40th International Conference on Machine Learning (ICML 2023), a premier annual conference, which is being held this week in Honolulu, Hawaii. As a leader in ML research, Google has a strong presence at this year’s conference with over 120 accepted papers and active involvement in a number of workshops and tutorials. Google is also proud to be a Platinum Sponsor for both the LatinX in AI and Women in Machine Learning workshops. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.

Registered for ICML 2023? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Visit the @GoogleAI Twitter account to find out about Google booth activities (e.g., demos and Q&A sessions). See Google DeepMind’s blog to learn about their technical participation at ICML 2023.

Take a look below to learn more about the Google research being presented at ICML 2023 (Google affiliations in bold).

Board and Organizing Committee

Board Members include: Corinna Cortes, Hugo Larochelle

Tutorial Chairs include: Hanie Sedghi

Google Research booth activities

Presenters: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer

Title: Unsupervised Graph Embedding @ Google (paper, EXPO workshop)

Tuesday, July 25th at 10:30 AM HST

Presenters: Zheng Xu

Title: Federated Learning of Gboard Language Models with Differential Privacy (paper 1, paper 2, blog post)

Tuesday, July 25th at 3:30 PM HST

Presenters: Thomas Kipf

Title: Self-supervised scene understanding (paper 1, paper 2)

Wednesday, July 26th at 10:30 AM HST

Presenters: Johannes von Oswald, Max Vladymyrov

Title: Transformers learn in-context by gradient descent (paper)

Wednesday, July 26th at 3:30 PM HST

Accepted papers

Scaling Vision Transformers to 22 Billion Parameters (see blog post)

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin F. Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Patrick Collier, Alexey Gritsenko, Vighnesh Birodkar, Cristina Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, Neil Houlsby

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, Yossi Matias

Best of Both Worlds Policy Optimization

Christoph Dann, Chen-Yu Wei, Julian Zimmert

Inflow, Outflow, and Reciprocity in Machine Learning

Mukund Sundararajan, Walid Krichene

Transformers Learn In-Context by Gradient Descent

Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models

Luke Vilnis, Yury Zemlyanskiy, Patrick Murray*, Alexandre Passos*, Sumit Sanghai

Differentially Private Hierarchical Clustering with Provable Approximation Guarantees (see blog post)

Jacob Imola*, Alessandro Epasto, Mohammad Mahdian, Vincent Cohen-Addad, Vahab Mirrokni

Multi-Epoch Matrix Factorization Mechanisms for Private Machine Learning

Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, Abhradeep Thakurta

Random Classification Noise Does Not Defeat All Convex Potential Boosters Irrespective of Model Choice

Yishay Mansour, Richard Nock, Robert Williamson

Simplex Random Features

Isaac Reid, Krzysztof Choromanski, Valerii Likhosherstov, Adrian Weller

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

Kenton Lee, Mandar Joshi, Iulia Turc, Hexiang Hu, Fangyu Liu, Julian Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, Kristina Toutanova

Mu2SLAM: Multitask, Multilingual Speech and Language Models

Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna

Robust Budget Pacing with a Single Sample

Santiago Balseiro, Rachitesh Kumar*, Vahab Mirrokni, Balasubramanian Sivan, Di Wang

A Statistical Perspective on Retrieval-Based Models

Soumya Basu, Ankit Singh Rawat, Manzil Zaheer

Approximately Optimal Core Shapes for Tensor Decompositions

Mehrdad Ghadiri, Matthew Fahrbach, Gang Fu, Vahab Mirrokni

Efficient List-Decodable Regression Using Batches

Abhimanyu Das, Ayush Jain*, Weihao Kong, Rajat Sen

Efficient Training of Language Models Using Few-Shot Learning

Sashank J. Reddi, Sobhan Miryoosefi, Stefani Karp, Shankar Krishnan, Satyen Kale, Seungyeon Kim, Sanjiv Kumar

Fully Dynamic Submodular Maximization Over Matroids

Paul Duetting, Federico Fusco, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

GFlowNet-EM for Learning Compositional Latent Variable Models

Edward J Hu, Nikolay Malkin, Moksh Jain, Katie Everett, Alexandros Graikos, Yoshua Bengio

Improved Online Learning Algorithms for CTR Prediction in Ad Auctions

Zhe Feng, Christopher Liaw, Zixin Zhou

Large Language Models Struggle to Learn Long-Tail Knowledge

Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel

Multi-channel Autobidding with Budget and ROI Constraints

Yuan Deng, Negin Golrezaei, Patrick Jaillet, Jason Cheuk Nam Liang, Vahab Mirrokni

Multi-layer Neural Networks as Trainable Ladders of Hilbert Spaces

Zhengdao Chen

On User-Level Private Convex Optimization

Badih Ghazi, Pritish Kamath, Ravi Kumar, Raghu Meka, Pasin Manurangsi, Chiyuan Zhang

PAC Generalization via Invariant Representations

Advait U Parulekar, Karthikeyan Shanmugam, Sanjay Shakkottai

Regularization and Variance-Weighted Regression Achieves Minimax Optimality in Linear MDPs: Theory and Practice

Toshinori Kitamura, Tadashi Kozuno, Yunhao Tang, Nino Vieillard, Michal Valko, Wenhao Yang, Jincheng Mei, Pierre Menard, Mohammad Gheshlaghi Azar, Remi Munos, Olivier Pietquin, Matthieu Geist,Csaba Szepesvari, Wataru Kumagai, Yutaka Matsuo

Speeding Up Bellman Ford via Minimum Violation Permutations

Silvio Lattanzi, Ola Svensson, Sergei Vassilvitskii

Statistical Indistinguishability of Learning Algorithms

Alkis Kalavasis, Amin Karbasi, Shay Moran, Grigoris Velegkas

Test-Time Adaptation with Slot-Centric Models

Mihir Prabhudesai, Anirudh Goyal, Sujoy Paul, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gaurav Aggarwal, Thomas Kipf, Deepak Pathak, Katerina Fragkiadaki>

Algorithms for Bounding Contribution for Histogram Estimation Under User-Level Privacy

Yuhan Liu*, Ananda Theertha Suresh, Wennan Zhu, Peter Kairouz, Marco Gruteser

Bandit Online Linear Optimization with Hints and Queries

Aditya Bhaskara, Ashok Cutkosky, Ravi Kumar, Manish Purohit

CLUTR: Curriculum Learning via Unsupervised Task Representation Learning

Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, Ion Stoica

CSP: Self-Supervised Contrastive Spatial Pre-training for Geospatial-Visual Representations

Gengchen Mai, Ni Lao, Yutong He, Jiaming Song, Stefano Ermon

Ewald-Based Long-Range Message Passing for Molecular Graphs

Arthur Kosmala, Johannes Gasteiger, Nicholas Gao, Stephan Günnemann

Fast (1+ε)-Approximation Algorithms for Binary Matrix Factorization

Ameya Velingker, Maximilian Vötsch, David Woodruff, Samson Zhou

Federated Linear Contextual Bandits with User-Level Differential Privacy

Ruiquan Huang, Huanyu Zhang, Luca Melis, Milan Shen, Meisam Hejazinia, Jing Yang

Investigating the Role of Model-Based Learning in Exploration and Transfer

Jacob C Walker, Eszter Vértes, Yazhe Li, Gabriel Dulac-Arnold, Ankesh Anand, Theophane Weber, Jessica B Hamrick

Label Differential Privacy and Private Training Data Release

Robert Busa-Fekete, Andres Munoz, Umar Syed, Sergei Vassilvitskii

Lifelong Language Pretraining with Distribution-Specialized Experts

Wuyang Chen*, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, Claire Cui

Multi-User Reinforcement Learning with Low Rank Rewards

Dheeraj Mysore Nagaraj, Suhas S Kowshik, Naman Agarwal, Praneeth Netrapalli, Prateek Jain

Multi-View Masked World Models for Visual Robotic Manipulation

Younggyo Seo, Junsu Kim, Stephen James, Kimin Lee, Jinwoo Shin, Pieter Abbeel

PaLM-E: An Embodied Multimodal Language Model (see blog post)

Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter,Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence

Private Federated Learning with Autotuned Compression

Enayat Ullah*, Christopher A. Choquette-Choo, Peter Kairouz, Sewoong Oh

Refined Regret for Adversarial MDPs with Linear Function Approximation

Yan Dai, Haipeng Luo, Chen-Yu Wei, Julian Zimmert

Scaling Up Dataset Distillation to ImageNet-1K with Constant Memory

Justin Cui, Ruoche Wan, Si Si, Cho-Jui Hsieh

SGD with AdaGrad Stepsizes: Full Adaptivity with High Probability to Unknown Parameters, Unbounded Gradients and Affine Variance

Amit Attia, Tomer Koren

The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation

Mark Rowland, Yunhao Tang, Clare Lyle, Rémi Munos, Marc G. Bellemare, Will Dabney

Unveiling The Mask of Position-Information Pattern Through the Mist of Image Features

Chieh Hubert Lin, Hung-Yu Tseng, Hsin-Ying Lee, Maneesh Kumar Singh, Ming-Hsuan Yang

User-Level Private Stochastic Convex Optimization with Optimal Rates

Raef Bassily, Ziteng Sun

A Simple Zero-Shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models

James Urquhart Allingham*, Jie Ren, Michael W Dusenberry, Xiuye Gu, Yin Cui, Dustin Tran, Jeremiah Zhe Liu, Balaji Lakshminarayanan

Can Large Language Models Reason About Program Invariants?

Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, Pengcheng Yin

Concurrent Shuffle Differential Privacy Under Continual Observation

Jay Tenenbaum, Haim Kaplan, Yishay Mansour, Uri Stemmer

Constant Matters: Fine-Grained Error Bound on Differentially Private Continual Observation

Hendrik Fichtenberger, Monika Henzinger, Jalaj Upadhyay

Cross-Entropy Loss Functions: Theoretical Analysis and Applications

Anqi Mao, Mehryar Mohri, Yutao Zhong

Efficient Rate Optimal Regret for Adversarial Contextual MDPs Using Online Function Approximation

Orin Levy, Alon Cohen, Asaf Cassel, Yishay Mansour

Fairness in Streaming Submodular Maximization Over a Matroid Constraint

Marwa El Halabi, Federico Fusco, Ashkan Norouzi-Fard, Jakab Tardos, Jakub Tarnawski

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (see blog post)

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le, Barret Zoph, Jason Wei, Adam Roberts

Graph Reinforcement Learning for Network Control via Bi-level Optimization

Daniele Gammelli, James Harrison, Kaidi Yang, Marco Pavone, Filipe Rodrigues, Francisco C. Pereira

Learning-Augmented Private Algorithms for Multiple Quantile Release

Mikhail Khodak*, Kareem Amin, Travis Dick, Sergei Vassilvitskii

LegendreTron: Uprising Proper Multiclass Loss Learning

Kevin H Lam, Christian Walder, Spiridon Penev, Richard Nock

Measuring the Impact of Programming Language Distribution

Gabriel Orlanski*, Kefan Xiao, Xavier Garcia, Jeffrey Hui, Joshua Howland, Jonathan Malmaud, Jacob Austin, Rishabh Singh, Michele Catasta*

Multi-task Differential Privacy Under Distribution Skew

Walid Krichene, Prateek Jain, Shuang Song, Mukund Sundararajan, Abhradeep Thakurta, Li Zhang

Muse: Text-to-Image Generation via Masked Generative Transformers

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, José Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan

On the Convergence of Federated Averaging with Cyclic Client Participation

Yae Jee Cho, Pranay Sharma, Gauri Joshi, Zheng Xu, Satyen Kale, Tong Zhang

Optimal Stochastic Non-smooth Non-convex Optimization Through Online-to-Non-convex Conversion

Ashok Cutkosky, Harsh Mehta, Francesco Orabona

Out-of-Domain Robustness via Targeted Augmentations

Irena Gao, Shiori Sagawa, Pang Wei Koh, Tatsunori Hashimoto, Percy Liang

Polynomial Time and Private Learning of Unbounded Gaussian Mixture Models

Jamil Arbas, Hassan Ashtiani, Christopher Liaw

Pre-computed Memory or On-the-Fly Encoding? A Hybrid Approach to Retrieval Augmentation Makes the Most of Your Compute

Michiel de Jong, Yury Zemlyanskiy, Nicholas FitzGerald, Joshua Ainslie, Sumit Sanghai, Fei Sha, William W. Cohen

Scalable Adaptive Computation for Iterative Generation

Allan Jabri*, David J. Fleet, Ting Chen

Scaling Spherical CNNs

Carlos Esteves, Jean-Jacques Slotine, Ameesh Makadia

STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition

Yucheng Lu, Shivani Agrawal, Suvinay Subramanian, Oleg Rybakov, Christopher De Sa, Amir Yazdanbakhsh

Stratified Adversarial Robustness with Rejection

Jiefeng Chen, Jayaram Raghuram, Jihye Choi, Xi Wu, Yingyu Liang, Somesh Jha

When Does Privileged information Explain Away Label Noise?

Guillermo Ortiz-Jimenez*, Mark Collier, Anant Nawalgaria, Alexander D’Amour, Jesse Berent, Rodolphe Jenatton, Effrosyni Kokiopoulou

Adaptive Computation with Elastic Input Sequence

Fuzhao Xue*, Valerii Likhosherstov, Anurag Arnab, Neil Houlsby, Mostafa Dehghani, Yang You

Can Neural Network Memorization Be Localized?

Pratyush Maini, Michael C. Mozer, Hanie Sedghi, Zachary C. Lipton, J. Zico Kolter, Chiyuan Zhang

Controllability-Aware Unsupervised Skill Discovery

Seohong Park, Kimin Lee, Youngwoon Lee, Pieter Abbeel

Efficient Learning of Mesh-Based Physical Simulation with Bi-Stride Multi-Scale Graph Neural Network

Yadi Cao, Menglei Chai, Minchen Li, Chenfanfu Jiang

Federated Heavy Hitter Recovery Under Linear Sketching

Adria Gascon, Peter Kairouz, Ziteng Sun, Ananda Theertha Suresh

Graph Generative Model for Benchmarking Graph Neural Networks

Minji Yoon, Yue Wu, John Palowitch, Bryan Perozzi, Russ Salakhutdinov

H-Consistency Bounds for Pairwise Misranking Loss Surrogates

Anqi Mao, Mehryar Mohri, Yutao Zhong

Improved Regret for Efficient Online Reinforcement Learning with Linear Function Approximation

Uri Sherman, Tomer Koren, Yishay Mansour

Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames

Ondrej Biza*, Sjoerd van Steenkiste, Mehdi S. M. Sajjadi, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Thomas Kipf

Multi-task Off-Policy Learning from Bandit Feedback

Joey Hong, Branislav Kveton, Manzil Zaheer, Sumeet Katariya, Mohammad Ghavamzadeh

Optimal No-Regret Learning for One-Sided Lipschitz Functions

Paul Duetting, Guru Guruganesh, Jon Schneider, Joshua Ruizhi Wang

Policy Mirror Ascent for Efficient and Independent Learning in Mean Field Games

Batuhan Yardim, Semih Cayci, Matthieu Geist, Niao He

Regret Minimization and Convergence to Equilibria in General-Sum Markov Games

Liad Erez, Tal Lancewicki, Uri Sherman, Tomer Koren, Yishay Mansour

Reinforcement Learning Can Be More Efficient with Multiple Rewards

Christoph Dann, Yishay Mansour, Mehryar Mohri

Reinforcement Learning with History-Dependent Dynamic Contexts

Guy Tennenholtz, Nadav Merlis, Lior Shani, Martin Mladenov, Craig Boutlier

User-Defined Event Sampling and Uncertainty Quantification in Diffusion Models for Physical Dynamical Systems

Marc Anton Finzi*, Anudhyan Boral, Andrew Gordon Wilson, Fei Sha, Leonardo Zepeda-Nunez

Discrete Key-Value Bottleneck

Frederik Träuble, Anirudh Goyal, Nasim Rahaman, Michael Curtis Mozer, Kenji Kawaguchi, Yoshua Bengio, Bernhard Schölkopf

DSGD-CECA: Decentralized SGD with Communication-Optimal Exact Consensus Algorithm

Lisang Ding, Kexin Jin, Bicheng Ying, Kun Yuan, Wotao Yin

Exphormer: Sparse Transformers for Graphs

Hamed Shirzad, Ameya Velingker, Balaji Venkatachalam, Danica J. Sutherland, Ali Kemal Sinop

Fast, Differentiable and Sparse Top-k: A Convex Analysis Perspective

Michael Eli Sander*, Joan Puigcerver, Josip Djolonga, Gabriel Peyré, Mathieu Blondel

Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation

Aditya Mate, Bryan Wilder, Aparna Taneja, Milind Tambe

In Search for a Generalizable Method for Source Free Domain Adaptation

Malik Boudiaf*, Tom Denton, Bart van Merrienboer, Vincent Dumoulin, Eleni Triantafillou

Learning Rate Schedules in the Presence of Distribution Shift

Matthew Fahrbach, Adel Javanmard, Vahab Mirrokni, Pratik Worah

Not All Semantics Are Created Equal: Contrastive Self-Supervised Learning with Automatic Temperature Individualization

Zi-Hao Qiu, Quanqi Hu, Zhuoning Yuan, Denny Zhou, Lijun Zhang, Tianbao Yang

On the Relationship Between Explanation and Prediction: A Causal View

Amir-Hossein Karimi*, Krikamol Muandet, Simon Kornblith, Bernhard Schölkopf, Been Kim

On the Role of Attention in Prompt-Tuning

Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, Christos Thrampoulidis

PLay: Parametrically Conditioned Layout Generation Using Latent Diffusion

Chin-Yi Cheng, Forrest Huang, Gang Li, Yang Li

The Power of Learned Locally Linear Models for Nonlinear Policy Optimization

Daniel Pfrommer, Max Simchowitz, Tyler Westenbroek, Nikolai Matni, Stephen Tu

Relevant Walk Search for Explaining Graph Neural Networks

Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus Robert Muller,Shinichi Nakajima

Repository-Level Prompt Generation for Large Language Models of Code

Disha Shrivastava, Hugo Larochelle, Daniel Tarlow

Robust and Private Stochastic Linear Bandits

Vasileios Charisopoulos*, Hossein Esfandiari, Vahab Mirrokni

Simple Diffusion: End-to-End Diffusion for High Resolution Images

Emiel Hoogeboom, Jonathan Heek, Tim Salimans

Tied-Augment: Controlling Representation Similarity Improves Data Augmentation

Emirhan Kurtulus, Zichao Li, Yann Dauphin, Ekin D. Cubuk

Why Is Public Pre-Training Necessary for Private Model Training?

Arun Ganesh, Mahdi Haghifam*, Milad Nasr, Sewoong Oh, Thomas Steinke, Om Thakkar, Abhradeep Guha Thakurta, Lun Wang

A Connection Between One-Step RL and Critic Regularization in Reinforcement Learning

Benjamin Eysenbach, Matthieu Geist, Sergey Levine, Ruslan Salakhutdinov

Beyond Uniform Lipschitz Condition in Differentially Private Optimization

Rudrajit Das*, Satyen Kale, Zheng Xu, Tong Zhang, Sujay Sanghavi

Efficient Graph Field Integrators Meet Point Clouds

Krzysztof Choromanski, Arijit Sehanobish, Han Lin, Yunfan Zhao, Eli Berger, Tetiana Parshakova, Alvin Pan, David Watkins, Tianyi Zhang, Valerii Likhosherstov, Somnath Basu Roy Chowdhury, Avinava Dubey, Deepali Jain, Tamas Sarlos, Snigdha Chaturvedi, Adrian Weller

Fast as CHITA: Neural Network Pruning with Combinatorial Optimization

Riade Benbaki, Wenyu Chen, Xiang Meng, Hussein Hazimeh, Natalia Ponomareva, Zhe Zhao, Rahul Mazumder

Jump-Start Reinforcement Learning (see blog post)

Ikechukwu Uchendu*, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, Sergey Levine, Karol Hausman

Learning in POMDPs is Sample-Efficient with Hindsight Observability

Jonathan Lee, Alekh Agarwal, Christoph Dann, Tong Zhang

Low-Variance Gradient Estimation in Unrolled Computation Graphs with ES-Single

Paul Vicol

Masked Trajectory Models for Prediction, Representation, and Control

Philipp Wu, Arjun Majumdar, Kevin Stone, Yixin Lin, Igor Mordatch, Pieter Abbeel, Aravind Rajeswaran

Overcoming Simplicity Bias in Deep Networks Using a Feature Sieve

Rishabh Tiwari, Pradeep Shenoy

Pairwise Ranking Losses of Click-Through Rates Prediction for Welfare Maximization in Ad Auctions

Boxiang Lyu, Zhe Feng, Zachary Robertson, Sanmi Koyejo

Predictive Flows for Faster Ford-Fulkerson

Sami Davies, Benjamin Moseley, Sergei Vassilvitskii, Yuyan Wang

Scaling Laws for Multilingual Neural Machine Translation

Patrick Fernandes, Behrooz Ghorbani, Xavier Garcia, Markus Freitag, Orhan Firat

Sequential Monte Carlo Learning for Time Series Structure Discovery

Feras Saad, Brian Patton, Matthew Douglas Hoffman, Rif A. Saurous, Vikash Mansinghka

Stochastic Gradient Succeeds for Bandits

Jincheng Mei, Zixin Zhong, Bo Dai, Alekh Agarwal, Csaba Szepesvari, Dale Schuurmans

Subset-Based Instance Optimality in Private Estimation

Travis Dick, Alex Kulesza, Ziteng Sun, Ananda Theertha Suresh

The Unreasonable Effectiveness of Few-Shot Learning for Machine Translation

Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, Orhan Firat

Tutorials

Self-Supervised Learning in Vision: from Research Advances to Best Practices

Xinlei Chen, Ishan Misra, Randall Balestriero, Mathilde Caron, Christoph Feichtenhofer, Mark Ibrahim

How to DP-fy ML: A Practical Tutorial to Machine Learning with Differential Privacy (see blog post)

Sergei Vassilvitskii, Natalia Ponomareva, Zheng Xu

Recent Advances in the Generalization Theory of Neural Networks

Tengyu Ma, Alex Damian

EXPO Day workshops

Graph Neural Networks in Tensorflow: A Practical Guide

Workshop Organizers include: Bryan Perozzi, Anton Tsitsulin, Brandon Mayer, Jonathan Halcrow

Google sponsored affinity workshops

LatinX in AI (LAXAI)

Platinum Sponsor

Keynote Speaker: Monica Ribero

Panelist: Yao Qin

Women in Machine Learning (WiML)

Platinum Sponsor

Panelists: Yao Qin

Workshops

Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities

Organizer: Peter Kairouz, Zheng Xu

Speaker: Brendan McMahan

Interpretable Machine Learning in Healthcare (IMLH)

Organizer: Ramin Zabih

Knowledge and Logical Reasoning in the Era of Data-Driven Learning

Organizer: Beliz Günel

The Many Facets of Preference-Based Learning (MFPL)

Organizer: Robert Busa-Fekete, Mohammad Ghavamzadeh

The Synergy of Scientific and Machine Learning Modelling (SynS & ML)

Speaker: Sercan Arik

Theory of Mind in Communicating Agents

Organizer: Pei Zhou

Artificial Intelligence & Human Computer Interaction

Organizer: Yang Li, Forrest Huang

Data-Centric Machine Learning Research (DMLR)

Organizer: Alicia Parrish, Najoung Kim

Speaker: Peter Mattson

Neural Compression: from Information Theory to Applications

Speaker: Johannes Ballé

Panelist: George Toderici

Neural Conversational AI Workshop – What’s Left to TEACH (Trustworthy, Enhanced, Adaptable, Capable and Human-centric) Chatbots?

Organizer: Ahmad Beirami

Spurious Correlations, Invariance and Stability (SCIS)

Organizer: Amir Feder


* Work done while at Google

Categories
Offsites

Using societal context knowledge to foster the responsible application of AI

AI-related products and technologies are constructed and deployed in a societal context: that is, a dynamic and complex collection of social, cultural, historical, political and economic circumstances. Because societal contexts by nature are dynamic, complex, non-linear, contested, subjective, and highly qualitative, they are challenging to translate into the quantitative representations, methods, and practices that dominate standard machine learning (ML) approaches and responsible AI product development practices.

The first phase of AI product development is problem understanding, and this phase has tremendous influence over how problems (e.g., increasing cancer screening availability and accuracy) are formulated for ML systems to solve as well many other downstream decisions, such as dataset and ML architecture choice. When the societal context in which a product will operate is not articulated well enough to result in robust problem understanding, the resulting ML solutions can be fragile and even propagate unfair biases.

When AI product developers lack access to the knowledge and tools necessary to effectively understand and consider societal context during development, they tend to abstract it away. This abstraction leaves them with a shallow, quantitative understanding of the problems they seek to solve, while product users and society stakeholders — who are proximate to these problems and embedded in related societal contexts — tend to have a deep qualitative understanding of those same problems. This qualitative–quantitative divergence in ways of understanding complex problems that separates product users and society from developers is what we call the problem understanding chasm.

This chasm has repercussions in the real world: for example, it was the root cause of racial bias discovered by a widely used healthcare algorithm intended to solve the problem of choosing patients with the most complex healthcare needs for special programs. Incomplete understanding of the societal context in which the algorithm would operate led system designers to form incorrect and oversimplified causal theories about what the key problem factors were. Critical socio-structural factors, including lack of access to healthcare, lack of trust in the health care system, and underdiagnosis due to human bias, were left out while spending on healthcare was highlighted as a predictor of complex health need.

To bridge the problem understanding chasm responsibly, AI product developers need tools that put community-validated and structured knowledge of societal context about complex societal problems at their fingertips — starting with problem understanding, but also throughout the product development lifecycle. To that end, Societal Context Understanding Tools and Solutions (SCOUTS) — part of the Responsible AI and Human-Centered Technology (RAI-HCT) team within Google Research — is a dedicated research team focused on the mission to “empower people with the scalable, trustworthy societal context knowledge required to realize responsible, robust AI and solve the world’s most complex societal problems.” SCOUTS is motivated by the significant challenge of articulating societal context, and it conducts innovative foundational and applied research to produce structured societal context knowledge and to integrate it into all phases of the AI-related product development lifecycle. Last year we announced that Jigsaw, Google’s incubator for building technology that explores solutions to threats to open societies, leveraged our structured societal context knowledge approach during the data preparation and evaluation phases of model development to scale bias mitigation for their widely used Perspective API toxicity classifier. Going forward SCOUTS’ research agenda focuses on the problem understanding phase of AI-related product development with the goal of bridging the problem understanding chasm.

Bridging the AI problem understanding chasm

Bridging the AI problem understanding chasm requires two key ingredients: 1) a reference frame for organizing structured societal context knowledge and 2) participatory, non-extractive methods to elicit community expertise about complex problems and represent it as structured knowledge. SCOUTS has published innovative research in both areas.

An illustration of the problem understanding chasm.

A societal context reference frame

An essential ingredient for producing structured knowledge is a taxonomy for creating the structure to organize it. SCOUTS collaborated with other RAI-HCT teams (TasC, Impact Lab), Google DeepMind, and external system dynamics experts to develop a taxonomic reference frame for societal context. To contend with the complex, dynamic, and adaptive nature of societal context, we leverage complex adaptive systems (CAS) theory to propose a high-level taxonomic model for organizing societal context knowledge. The model pinpoints three key elements of societal context and the dynamic feedback loops that bind them together: agents, precepts, and artifacts.

  • Agents: These can be individuals or institutions.
  • Precepts: The preconceptions — including beliefs, values, stereotypes and biases — that constrain and drive the behavior of agents. An example of a basic precept is that “all basketball players are over 6 feet tall.” That limiting assumption can lead to failures in identifying basketball players of smaller stature.
  • Artifacts: Agent behaviors produce many kinds of artifacts, including language, data, technologies, societal problems and products.

The relationships between these entities are dynamic and complex. Our work hypothesizes that precepts are the most critical element of societal context and we highlight the problems people perceive and the causal theories they hold about why those problems exist as particularly influential precepts that are core to understanding societal context. For example, in the case of racial bias in a medical algorithm described earlier, the causal theory precept held by designers was that complex health problems would cause healthcare expenditures to go up for all populations. That incorrect precept directly led to the choice of healthcare spending as the proxy variable for the model to predict complex healthcare need, which in turn led to the model being biased against Black patients who, due to societal factors such as lack of access to healthcare and underdiagnosis due to bias on average, do not always spend more on healthcare when they have complex healthcare needs. A key open question is how can we ethically and equitably elicit causal theories from the people and communities who are most proximate to problems of inequity and transform them into useful structured knowledge?

Illustrative version of societal context reference frame.
Taxonomic version of societal context reference frame.

Working with communities to foster the responsible application of AI to healthcare

Since its inception, SCOUTS has worked to build capacity in historically marginalized communities to articulate the broader societal context of the complex problems that matter to them using a practice called community based system dynamics (CBSD). System dynamics (SD) is a methodology for articulating causal theories about complex problems, both qualitatively as causal loop and stock and flow diagrams (CLDs and SFDs, respectively) and quantitatively as simulation models. The inherent support of visual qualitative tools, quantitative methods, and collaborative model building makes it an ideal ingredient for bridging the problem understanding chasm. CBSD is a community-based, participatory variant of SD specifically focused on building capacity within communities to collaboratively describe and model the problems they face as causal theories, directly without intermediaries. With CBSD we’ve witnessed community groups learn the basics and begin drawing CLDs within 2 hours.

Data 4 Black Lives community members learning system dynamics.

There is a huge potential for AI to improve medical diagnosis. But the safety, equity, and reliability of AI-related health diagnostic algorithms depends on diverse and balanced training datasets. An open challenge in the health diagnostic space is the dearth of training sample data from historically marginalized groups. SCOUTS collaborated with the Data 4 Black Lives community and CBSD experts to produce qualitative and quantitative causal theories for the data gap problem. The theories include critical factors that make up the broader societal context surrounding health diagnostics, including cultural memory of death and trust in medical care.

The figure below depicts the causal theory generated during the collaboration described above as a CLD. It hypothesizes that trust in medical care influences all parts of this complex system and is the key lever for increasing screening, which in turn generates data to overcome the data diversity gap.

Causal loop diagram of the health diagnostics data gap

These community-sourced causal theories are a first step to bridge the problem understanding chasm with trustworthy societal context knowledge.

Conclusion

As discussed in this blog, the problem understanding chasm is a critical open challenge in responsible AI. SCOUTS conducts exploratory and applied research in collaboration with other teams within Google Research, external community, and academic partners across multiple disciplines to make meaningful progress solving it. Going forward our work will focus on three key elements, guided by our AI Principles:

  1. Increase awareness and understanding of the problem understanding chasm and its implications through talks, publications, and training.
  2. Conduct foundational and applied research for representing and integrating societal context knowledge into AI product development tools and workflows, from conception to monitoring, evaluation and adaptation.
  3. Apply community-based causal modeling methods to the AI health equity domain to realize impact and build society’s and Google’s capability to produce and leverage global-scale societal context knowledge to realize responsible AI.
SCOUTS flywheel for bridging the problem understanding chasm.

Acknowledgments

Thank you to John Guilyard for graphics development, everyone in SCOUTS, and all of our collaborators and sponsors.

Categories
Offsites

SimPer: Simple self-supervised learning of periodic targets

Learning from periodic data (signals that repeat, such as a heart beat or the daily temperature changes on Earth’s surface) is crucial for many real-world applications, from monitoring weather systems to detecting vital signs. For example, in the environmental remote sensing domain, periodic learning is often needed to enable nowcasting of environmental changes, such as precipitation patterns or land surface temperature. In the health domain, learning from video measurement has shown to extract (quasi-)periodic vital signs such as atrial fibrillation and sleep apnea episodes.

Approaches like RepNet highlight the importance of these types of tasks, and present a solution that recognizes repetitive activities within a single video. However, these are supervised approaches that require a significant amount of data to capture repetitive activities, all labeled to indicate the number of times an action was repeated. Labeling such data is often challenging and resource-intensive, requiring researchers to manually capture gold-standard temporal measurements that are synchronized with the modality of interest (e.g., video or satellite imagery).

Alternatively, self-supervised learning (SSL) methods (e.g., SimCLR and MoCo v2), which leverage a large amount of unlabeled data to learn representations that capture periodic or quasi-periodic temporal dynamics, have demonstrated success in solving classification tasks. However, they overlook the intrinsic periodicity (i.e., the ability to identify if a frame is part of a periodic process) in data and fail to learn robust representations that capture periodic or frequency attributes. This is because periodic learning exhibits characteristics that are distinct from prevailing learning tasks.

Feature similarity is different in the context of periodic representations as compared to static features (e.g., images). For example, videos that are offset by short time delays or are reversed should be similar to the original sample, whereas videos that have been upsampled or downsampled by a factor x should be different from the original sample by a factor of x.

To address these challenges, in “SimPer: Simple Self-Supervised Learning of Periodic Targets”, published at the eleventh International Conference on Learning Representations (ICLR 2023), we introduced a self-supervised contrastive framework for learning periodic information in data. Specifically, SimPer leverages the temporal properties of periodic targets using temporal self-contrastive learning, where positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. We propose periodic feature similarity that explicitly defines how to measure similarity in the context of periodic learning. Moreover, we design a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). Next, we demonstrate that SimPer effectively learns period feature representations compared to state-of-the-art SSL methods, highlighting its intriguing properties including better data efficiency, robustness to spurious correlations, and generalization to distribution shifts. Finally, we are excited to release the SimPer code repo with the research community.

The SimPer framework

SimPer introduces a temporal self-contrastive learning framework. Positive and negative samples are obtained through periodicity-invariant and periodicity-variant augmentations from the same input instance. For temporal video examples, periodicity-invariant changes are cropping, rotation or flipping, whereas periodicity-variant changes involve increasing or decreasing the speed of a video.

To explicitly define how to measure similarity in the context of periodic learning, SimPer proposes periodic feature similarity. This construction allows us to formulate training as a contrastive learning task. A model can be trained with data without any labels and then fine-tuned if necessary to map the learned features to specific frequency values.

Given an input sequence x, we know there’s an underlying associated periodic signal. We then transform x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo- speed or frequency labels for the unlabeled input x.

Conventional similarity measures such as cosine similarity emphasize strict proximity between two feature vectors, and are sensitive to index shifted features (which represent different time stamps), reversed features, and features with changed frequencies. In contrast, periodic feature similarity should be high for samples with small temporal shifts and or reversed indexes, while capturing a continuous similarity change when the feature frequency varies. This can be achieved via a similarity metric in the frequency domain, such as the distance between two Fourier transforms.

To harness the intrinsic continuity of augmented samples in the frequency domain, SimPer designs a generalized contrastive loss that extends the classic InfoNCE loss to a soft regression variant that enables contrasting over continuous labels (frequency). This makes it suitable for regression tasks, where the goal is to recover a continuous signal, such as a heart beat.

SimPer constructs negative views of data through transformations in the frequency domain. The input sequence x has an underlying associated periodic signal. SimPer transforms x to create a series of speed or frequency altered samples, which changes the underlying periodic target, thus creating different negative views. Although the original frequency is unknown, we effectively devise pseudo speed or frequency labels for unlabeled input x (periodicity-variant augmentations τ). SimPer takes transformations that do not change the identity of the input and defines these as periodicity-invariant augmentations σ, thus creating different positive views of the sample. Then, it sends these augmented views to the encoder f, which extracts corresponding features.

Results

To evaluate SimPer’s performance, we benchmarked it against state-of-the-art SSL schemes (e.g., SimCLR, MoCo v2, BYOL, CVRL) on a set of six diverse periodic learning datasets for common real-world tasks in human behavior analysis, environmental remote sensing, and healthcare. Specifically, below we present results on heart rate measurement and exercise repetition counting from video. The results show that SimPer outperforms the state-of-the-art SSL schemes across all six datasets, highlighting its superior performance in terms of data efficiency, robustness to spurious correlations, and generalization to unseen targets.

Here we show quantitative results on two representative datasets using SimPer pre-trained using various SSL methods and fine-tuned on the labeled data. First, we pre-train SimPer using the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) dataset, a human photoplethysmography and heart rate prediction dataset, and compare its performance to state-of-the-art SSL methods. We observe that SimPer outperforms SimCLR, MoCo v2, BYOL, and CVRL methods. The results on the human action counting dataset, Countix, further confirm the benefits of SimPer over others methods as it notably outperforms the supervised baseline. For the feature evaluation results and performance on other datasets, please refer to the paper.

Results of SimCLR, MoCo v2, BYOL, CVRL and SimPer on the Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy (UBFC) and Countix datasets. Heart rate and repetition count performance is reported as mean absolute error (MAE).

Conclusion and applications

We present SimPer, a self-supervised contrastive framework for learning periodic information in data. We demonstrate that by combining a temporal self-contrastive learning framework, periodicity-invariant and periodicity-variant augmentations, and continuous periodic feature similarity, SimPer provides an intuitive and flexible approach for learning strong feature representations for periodic signals. Moreover, SimPer can be applied to various fields, ranging from environmental remote sensing to healthcare.

Acknowledgements

We would like to thank Yuzhe Yang, Xin Liu, Ming-Zher Poh, Jiang Wu, Silviu Borac, and Dina Katabi for their contributions to this work.

Categories
Offsites

Symbol tuning improves in-context learning in language models

A key feature of human intelligence is that humans can learn to perform new tasks by reasoning using only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way that prompts are given, indicating that they are not reasoning in a robust manner. For instance, language models often require heavy prompt engineering or phrasing tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.

In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input–label mappings. We experiment with symbol tuning across Flan-PaLM models and observe benefits across various settings.

  • Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
  • Symbol-tuned models are much stronger at algorithmic reasoning tasks.
  • Finally, symbol-tuned models show large improvements in following flipped-labels presented in-context, meaning that they are more capable of using in-context information to override prior knowledge.
An overview of symbol tuning, where models are fine-tuned on tasks where natural language labels are replaced with arbitrary symbols. Symbol tuning relies on the intuition that when instruction and relevant labels are not available, models must use in-context examples to learn the task.

Motivation

Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context examples. One shortcoming, however, is that models are not forced to learn to use the examples because the task is redundantly defined in the evaluation example via instructions and natural language labels. For example, on the left in the figure above, although the examples can help the model understand the task (sentiment analysis), they are not strictly necessary since the model could ignore the examples and just read the instruction that indicates what the task is.

In symbol tuning, the model is fine-tuned on examples where the instructions are removed and natural language labels are replaced with semantically-unrelated labels (e.g., “Foo,” “Bar,” etc.). In this setup, the task is unclear without looking at the in-context examples. For example, on the right in the figure above, multiple in-context examples would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context examples, symbol-tuned models should have better performance on tasks that require reasoning between in-context examples and their labels.

Datasets and task types used for symbol tuning.

Symbol-tuning procedure

We selected 22 publicly-available natural language processing (NLP) datasets that we use for our symbol-tuning procedure. These tasks have been widely used in the past, and we only chose classification-type tasks since our method requires discrete labels. We then remap labels to a random label from a set of ~30K arbitrary labels selected from one of three categories: integers, character combinations, and words.

For our experiments, we symbol tune Flan-PaLM, the instruction-tuned variants of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B at 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.

We use a set of ∼300K arbitrary symbols from three categories (integers, character combinations, and words). ∼30K symbols are used during tuning and the rest are held out for evaluation.

Experimental setup

We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.

In-context learning

In the symbol-tuning procedure, models must learn to reason with in-context examples in order to successfully perform tasks because prompts are modified to ensure that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should perform better in settings where tasks are unclear and require reasoning between in-context examples and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning required between inputs and labels in order to learn the task (based on the availability of instructions/relevant labels)

Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context examples. When these features are not available, models must reason with the given in-context examples to successfully perform the task.

Symbol tuning improves performance across all settings for models 62B and larger, with small improvements in settings with relevant natural language labels (+0.8% to +4.2%) and substantial improvements in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms FlanPaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform as well as large models on these tasks (effectively saving ∼10X inference compute).

Large-enough symbol-tuned models are better at in-context learning than baselines, especially in settings where relevant labels are not available. Performance is shown as average model accuracy (%) across eleven tasks.

Algorithmic reasoning

We also experiment on algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) List functions — identify a transformation function (e.g., remove the last element in a list) between input and output lists containing non-negative integers; and 2) simple turing concepts — reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).

On the list function and simple turing concept tasks, symbol tuning results in an average performance improvement of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ∼10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in-context for unseen task types, as symbol tuning did not include any algorithmic data.

Symbol-tuned models achieve higher performance on list function tasks and simple turing concept tasks. (A–E): categories of list functions tasks. (F): simple turing concepts task.

Flipped labels

In the flipped-label experiment, labels of in-context and evaluation examples are flipped, meaning that prior knowledge and input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), thereby allowing us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in-context, instruction tuning degraded this ability.

We see that there is a similar trend across all model sizes — symbol-tuned models are much more capable of following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve similar or better than average performance as pre-training–only models.

Symbol-tuned models are much better at following flipped labels presented in-context than instruction-tuned models are.

Conclusion

We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based off of the intuition that when models cannot use instructions or relevant labels to determine a presented task, it must do so by instead learning from in-context examples. We tuned four language models using our symbol-tuning procedure, utilizing a tuning mixture of 22 datasets and approximately 30K arbitrary symbols as labels.

We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol-tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.

Future work

Through symbol tuning, we aim to increase the degree to which models can examine and learn from input–label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in-context.

Acknowledgements

The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.

Categories
Offsites

An open-source gymnasium for machine learning assisted computer architecture design

Computer Architecture research has a long history of developing simulators and tools to evaluate and shape the design of computer systems. For example, the SimpleScalar simulator was introduced in the late 1990s and allowed researchers to explore various microarchitectural ideas. Computer architecture simulators and tools, such as gem5, DRAMSys, and many more have played a significant role in advancing computer architecture research. Since then, these shared resources and infrastructure have benefited industry and academia and have enabled researchers to systematically build on each other’s work, leading to significant advances in the field.

Nonetheless, computer architecture research is evolving, with industry and academia turning towards machine learning (ML) optimization to meet stringent domain-specific requirements, such as ML for computer architecture, ML for TinyML accelerationDNN accelerator datapath, memory controllers, power consumption, security, and privacy. Although prior work has demonstrated the benefits of ML in design optimization, the lack of strong, reproducible baselines hinders fair and objective comparison across different methods and poses several challenges to their deployment. To ensure steady progress, it is imperative to understand and tackle these challenges collectively.

To alleviate these challenges, in “ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design”, accepted at ISCA 2023, we introduced ArchGym, which includes a variety of computer architecture simulators and ML algorithms. Enabled by ArchGym, our results indicate that with a sufficiently large number of samples, any of a diverse collection of ML algorithms are capable of finding the optimal set of architecture design parameters for each target problem; no one solution is necessarily better than another. These results further indicate that selecting the optimal hyperparameters for a given ML algorithm is essential for finding the optimal architecture design, but choosing them is non-trivial. We release the code and dataset across multiple computer architecture simulations and ML algorithms.

Challenges in ML-assisted architecture research

ML-assisted architecture research poses several challenges, including:

  1. For a specific ML-assisted computer architecture problem (e.g., finding an optimal solution for a DRAM controller) there is no systematic way to identify optimal ML algorithms or hyperparameters (e.g., learning rate, warm-up steps, etc.). There is a wider range of ML and heuristic methods, from random walk to reinforcement learning (RL), that can be employed for design space exploration (DSE). While these methods have shown noticeable performance improvement over their choice of baselines, it is not evident whether the improvements are because of the choice of optimization algorithms or hyperparameters.

    Thus, to ensure reproducibility and facilitate widespread adoption of ML-aided architecture DSE, it is necessary to outline a systematic benchmarking methodology.

  2. While computer architecture simulators have been the backbone of architectural innovations, there is an emerging need to address the trade-offs between accuracy, speed, and cost in architecture exploration. The accuracy and speed of performance estimation widely varies from one simulator to another, depending on the underlying modeling details (e.g., cycleaccurate vs. MLbased proxy models). While analytical or ML-based proxy models are nimble by virtue of discarding low-level details, they generally suffer from high prediction error. Also, due to commercial licensing, there can be strict limits on the number of runs collected from a simulator. Overall, these constraints exhibit distinct performance vs. sample efficiency trade-offs, affecting the choice of optimization algorithm for architecture exploration.

    It is challenging to delineate how to systematically compare the effectiveness of various ML algorithms under these constraints.

  3. Finally, the landscape of ML algorithms is rapidly evolving and some ML algorithms need data to be useful. Additionally, rendering the outcome of DSE into meaningful artifacts such as datasets is critical for drawing insights about the design space.

    In this rapidly evolving ecosystem, it is consequential to ensure how to amortize the overhead of search algorithms for architecture exploration. It is not apparent, nor systematically studied how to leverage exploration data while being agnostic to the underlying search algorithm.

ArchGym design

ArchGym addresses these challenges by providing a unified framework for evaluating different ML-based search algorithms fairly. It comprises two main components: 1) the ArchGym environment and 2) the ArchGym agent. The environment is an encapsulation of the architecture cost model — which includes latency, throughput, area, energy, etc., to determine the computational cost of running the workload, given a set of architectural parameters — paired with the target workload(s). The ArchGym agent is an encapsulation of the ML algorithm used for the search and consists of hyperparameters and a guiding policy. The hyperparameters are intrinsic to the algorithm for which the model is to be optimized and can significantly influence performance. The policy, on the other hand, determines how the agent selects a parameter iteratively to optimize the target objective.

Notably, ArchGym also includes a standardized interface that connects these two components, while also saving the exploration data as the ArchGym Dataset. At its core, the interface entails three main signals: hardware state, hardware parameters, and metrics. These signals are the bare minimum to establish a meaningful communication channel between the environment and the agent. Using these signals, the agent observes the state of the hardware and suggests a set of hardware parameters to iteratively optimize a (user-defined) reward. The reward is a function of hardware performance metrics, such as performance, energy consumption, etc. 

ArchGym comprises two main components: the ArchGym environment and the ArchGym agent. The ArchGym environment encapsulates the cost model and the agent is an abstraction of a policy and hyperparameters. With a standardized interface that connects these two components, ArchGym provides a unified framework for evaluating different ML-based search algorithms fairly while also saving the exploration data as the ArchGym Dataset.

ML algorithms could be equally favorable to meet user-defined target specifications

Using ArchGym, we empirically demonstrate that across different optimization objectives and DSE problems, at least one set of hyperparameters exists that results in the same hardware performance as other ML algorithms. A poorly selected (random selection) hyperparameter for the ML algorithm or its baseline can lead to a misleading conclusion that a particular family of ML algorithms is better than another. We show that with sufficient hyperparameter tuning, different search algorithms, even random walk (RW), are able to identify the best possible normalized reward. However, note that finding the right set of hyperparameters may require exhaustive search or even luck to make it competitive.

With a sufficient number of samples, there exists at least one set of hyperparameters that results in the same performance across a range of search algorithms. Here the dashed line represents the maximum normalized reward. Cloud-1, cloud-2, stream, and random indicate four different memory traces for DRAMSys (DRAM subsystem design space exploration framework).

Dataset construction and high-fidelity proxy model training

Creating a unified interface using ArchGym also enables the creation of datasets that can be used to design better data-driven ML-based proxy architecture cost models to improve the speed of architecture simulation. To evaluate the benefits of datasets in building an ML model to approximate architecture cost, we leverage ArchGym’s ability to log the data from each run from DRAMSys to create four dataset variants, each with a different number of data points. For each variant, we create two categories: (a) Diverse Dataset (DD), which represents the data collected from different agents (ACO, GA, RW, and BO), and (b) ACO only, which shows the data collected exclusively from the ACO agent, both of which are released along with ArchGym. We train a proxy model on each dataset using random forest regression with the objective to predict the latency of designs for a DRAM simulator. Our results show that:

  1. As we increase the dataset size, the average normalized root mean squared error (RMSE) slightly decreases.
  2. However, as we introduce diversity in the dataset (e.g., collecting data from different agents), we observe 9× to 42× lower RMSE across different dataset sizes.

Diverse dataset collection across different agents using ArchGym interface.
The impact of a diverse dataset and dataset size on the normalized RMSE.

The need for a community-driven ecosystem for ML-assisted architecture research

While, ArchGym is an initial effort towards creating an open-source ecosystem that (1) connects a broad range of search algorithms to computer architecture simulators in an unified and easy-to-extend manner, (2) facilitates research in ML-assisted computer architecture, and (3) forms the scaffold to develop reproducible baselines, there are a lot of open challenges that need community-wide support. Below we outline some of the open challenges in ML-assisted architecture design. Addressing these challenges requires a well coordinated effort and a community driven ecosystem.

Key challenges in ML-assisted architecture design.

We call this ecosystem Architecture 2.0. We outline the key challenges and a vision for building an inclusive ecosystem of interdisciplinary researchers to tackle the long-standing open problems in applying ML for computer architecture research. If you are interested in helping shape this ecosystem, please fill out the interest survey.

Conclusion

ArchGym is an open source gymnasium for ML architecture DSE and enables an standardized interface that can be readily extended to suit different use cases. Additionally, ArchGym enables fair and reproducible comparison between different ML algorithms and helps to establish stronger baselines for computer architecture research problems.

We invite the computer architecture community as well as the ML community to actively participate in the development of ArchGym. We believe that the creation of a gymnasium-type environment for computer architecture research would be a significant step forward in the field and provide a platform for researchers to use ML to accelerate research and lead to new and innovative designs.

Acknowledgements

This blogpost is based on joint work with several co-authors at Google and Harvard University. We would like to acknowledge and highlight Srivatsan Krishnan (Harvard) who contributed several ideas to this project in collaboration with Shvetank Prakash (Harvard), Jason Jabbour (Harvard), Ikechukwu Uchendu (Harvard), Susobhan Ghosh (Harvard), Behzad Boroujerdian (Harvard), Daniel Richins (Harvard), Devashree Tripathy (Harvard), and Thierry Thambe (Harvard).  In addition, we would also like to thank James Laudon, Douglas Eck, Cliff Young, and Aleksandra Faust for their support, feedback, and motivation for this work. We would also like to thank John Guilyard for the animated figure used in this post. Amir Yazdanbakhsh is now a Research Scientist at Google DeepMind and Vijay Janapa Reddi is an Associate Professor at Harvard.

Categories
Offsites

So why is the “central limit” a normal distribution?

Categories
Offsites

Google at ACL 2023

This week, the 61st annual meeting of the Association for Computational Linguistics (ACL), a premier conference covering a broad spectrum of research areas that are concerned with computational approaches to natural language, is taking place online.

As a leader in natural language processing and understanding, and a Diamond Level sponsor of ACL 2023, Google will showcase the latest research in the field with over 50 publications, and active involvement in a variety of workshops and tutorials.

If you’re registered for ACL 2023, we hope that you’ll visit the Google booth to learn more about the projects at Google that go into solving interesting problems for billions of people. You can also learn more about Google’s participation below (Google affiliations in bold).

Board and Organizing Committee

Area chairs include: Dan Garrette

Workshop chairs include: Annie Louis

Publication chairs include: Lei Shu

Program Committee includes: Vinodkumar Prabhakaran, Najoung Kim, Markus Freitag

Spotlight papers

NusaCrowd: Open Source Initiative for Indonesian NLP Resources

Samuel Cahyawijaya, Holy Lovenia, Alham Fikri Aji, Genta Winata, Bryan Wilie, Fajri Koto, Rahmad Mahendra, Christian Wibisono, Ade Romadhony, Karissa Vincentio, Jennifer Santoso, David Moeljadi, Cahya Wirawan, Frederikus Hudi, Muhammad Satrio Wicaksono, Ivan Parmonangan, Ika Alfina, Ilham Firdausi Putra, Samsul Rahmadani, Yulianti Oenang, Ali Septiandri, James Jaya, Kaustubh Dhole, Arie Suryani, Rifki Afina Putri, Dan Su, Keith Stevens, Made Nindyatama Nityasya, Muhammad Adilazuarda, Ryan Hadiwijaya, Ryandito Diandaru, Tiezheng Yu, Vito Ghifari, Wenliang Dai, Yan Xu, Dyah Damapuspita, Haryo Wibowo, Cuk Tho, Ichwanul Karo Karo, Tirana Fatyanosa, Ziwei Ji, Graham Neubig, Timothy Baldwin, Sebastian Ruder, Pascale Fung, Herry Sujaini, Sakriani Sakti, Ayu Purwarianti

Optimizing Test-Time Query Representations for Dense Retrieval

Mujeen Sung, Jungsoo Park, Jaewoo Kang, Danqi Chen, Jinhyuk Lee

PropSegmEnt: A Large-Scale Corpus for Proposition-Level Segmentation and Entailment Recognition

Sihao Chen*, Senaka Buthpitiya, Alex Fabrikant, Dan Roth, Tal Schuster

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh*, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

Large Language Models with Controllable Working Memory

Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix Yu, Sanjiv Kumar

OpineSum: Entailment-Based Self-Training for Abstractive Opinion Summarization

Annie Louis, Joshua Maynez

RISE: Leveraging Retrieval Techniques for Summarization Evaluation

David Uthus, Jianmo Ni

Follow the Leader(board) with Confidence: Estimating p-Values from a Single Test Set with Item and Response Variance


Shira Wein*, Christopher Homan, Lora Aroyo, Chris Welty

SamToNe: Improving Contrastive Loss for Dual Encoder Retrieval Models with Same Tower Negatives

Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dornbach, Imed Zitouni, Enrique Alfonseca, Zhe Dong

Papers

Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability

Eleftheria Briakou, Colin Cherry, George Foster

Prompting PaLM for Translation: Assessing Strategies and Performance

David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, George Foster

Query Refinement Prompts for Closed-Book Long-Form QA

Reinald Kim Amplayo, Kellie Webster, Michael Collins, Dipanjan Das, Shashi Narayan

To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering

Dheeru Dua*, Emma Strubell, Sameer Singh, Pat Verga

FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation (see blog post)

Parker Riley, Timothy Dozat, Jan A. Botha, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, Noah Constant

Conditional Generation with a Question-Answering Blueprint

Shashi Narayan, Joshua Maynez, Reinald Kim Amplayo, Kuzman Ganchev, Annie Louis, Fantine Huot, Anders Sandholm, Dipanjan Das, Mirella Lapata

Coreference Resolution Through a Seq2Seq Transition-Based System

Bernd Bohnet, Chris Alberti, Michael Collins

Cross-Lingual Transfer with Language-Specific Subnetworks for Low-Resource Dependency Parsing

Rochelle Choenni, Dan Garrette, Ekaterina Shutova

DAMP: Doubly Aligned Multilingual Parser for Task-Oriented Dialogue

William Held*, Christopher Hidey, Fei Liu, Eric Zhu, Rahul Goel, Diyi Yang, Rushin Shah

RARR: Researching and Revising What Language Models Say, Using Language Models

Luyu Gao*, Zhuyun Dai, Panupong Pasupat, Anthony Chen*, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Y. Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, Kelvin Guu

Benchmarking Large Language Model Capabilities for Conditional Generation

Joshua Maynez, Priyanka Agrawal, Sebastian Gehrmann

Crosslingual Generalization Through Multitask Fine-Tuning

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M. Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel

DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering

Ella Neeman, Roee Aharoni, Or Honovich, Leshem Choshen, Idan Szpektor, Omri Abend

Resolving Indirect Referring Expressions for Entity Selection

Mohammad Javad Hosseini, Filip Radlinski, Silvia Pareti, Annie Louis

SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models

Akshita Jha*, Aida Mostafazadeh Davani, Chandan K Reddy, Shachi Dave, Vinodkumar Prabhakaran, Sunipa Dev

The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks

Nikil Selvam, Sunipa Dev, Daniel Khashabi, Tushar Khot, Kai-Wei Chang

Character-Aware Models Improve Visual Text Rendering

Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, Noah Constant

Cold-Start Data Selection for Better Few-Shot Language Model Fine-Tuning: A Prompt-Based Uncertainty Propagation Approach

Yue Yu, Rongzhi Zhang, Ran Xu, Jieyu Zhang, Jiaming Shen, Chao Zhang

Covering Uncommon Ground: Gap-Focused Question Generation for Answer Assessment

Roni Rabin, Alexandre Djerbetian, Roee Engelberg, Lidan Hackmon, Gal Elidan, Reut Tsarfaty, Amir Globerson

FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction

Chen-Yu Lee, Chun-Liang Li, Hao Zhang, Timothy Dozat, Vincent Perot, Guolong Su, Xiang Zhang, Kihyuk Sohn, Nikolay Glushinev, Renshen Wang, Joshua Ainslie, Shangbang Long, Siyang Qin, Yasuhisa Fujii, Nan Hua, Tomas Pfister

Dialect-Robust Evaluation of Generated Text

Jiao Sun*, Thibault Sellam, Elizabeth Clark, Tu Vu*, Timothy Dozat, Dan Garrette, Aditya Siddhant, Jacob Eisenstein, Sebastian Gehrmann

MISGENDERED: Limits of Large Language Models in Understanding Pronouns

Tamanna Hossain, Sunipa Dev, Sameer Singh

LAMBADA: Backward Chaining for Automated Reasoning in Natural Language

Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, Deepak Ramachandran

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Jeremiah Milbauer*, Annie Louis, Mohammad Javad Hosseini, Alex Fabrikant, Donald Metzler, Tal Schuster

Modular Visual Question Answering via Code Generation (see blog post)

Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer and Huan Sun

Better Zero-Shot Reasoning with Self-Adaptive Prompting

Xingchen Wan*, Ruoxi Sun, Hanjun Dai, Sercan Ö. Arik, Tomas Pfister

Factually Consistent Summarization via Reinforcement Learning with Textual Entailment Feedback

Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Léonard Hussenot, Orgad Keller, Nikola Momchev, Sabela Ramos, Piotr Stanczyk, Nino Vieillard, Olivier Bachem, Gal Elidan, Avinatan Hassidim, Olivier Pietquin, Idan Szpektor

Natural Language to Code Generation in Interactive Data Science Notebooks

Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Oleksandr Polozov, Charles Sutton

Teaching Small Language Models to Reason

Lucie Charlotte Magister*, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn

Using Domain Knowledge to Guide Dialog Structure Induction via Neural Probabilistic Soft Logic

Connor Pryor*, Quan Yuan, Jeremiah Liu, Mehran Kazemi, Deepak Ramachandran, Tania Bedrax-Weiss, Lise Getoor

A Needle in a Haystack: An Analysis of High-Agreement Workers on MTurk for Summarization

Lining Zhang, Simon Mille, Yufang Hou, Daniel Deutsch, Elizabeth Clark, Yixin Liu, Saad Mahamood, Sebastian Gehrmann, Miruna Clinciu, Khyathi Raghavi Chandu and João Sedoc

Industry Track papers

Federated Learning of Gboard Language Models with Differential Privacy

Zheng Xu, Yanxiang Zhang, Galen Andrew, Christopher Choquette, Peter Kairouz, Brendan McMahan, Jesse Rosenstock, Yuanbo Zhang

KAFA: Rethinking Image Ad Understanding with Knowledge-Augmented Feature Adaptation of Vision-Language Models

Zhiwei Jia*, Pradyumna Narayana, Arjun Akula, Garima Pruthi, Hao Su, Sugato Basu, Varun Jampani

ACL Findings papers

Multilingual Summarization with Factual Consistency Evaluation

Roee Aharoni, Shashi Narayan, Joshua Maynez, Jonathan Herzig, Elizabeth Clark, Mirella Lapata

Parameter-Efficient Fine-Tuning for Robust Continual Multilingual Learning

Kartikeya Badola, Shachi Dave, Partha Talukdar

FiDO: Fusion-in-Decoder Optimized for Stronger Performance and Faster Inference

Michiel de Jong*, Yury Zemlyanskiy, Joshua Ainslie, Nicholas FitzGerald, Sumit Sanghai, Fei Sha, William Cohen

A Simple, Yet Effective Approach to Finding Biases in Code Generation

Spyridon Mouselinos, Mateusz Malinowski, Henryk Michalewski

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

Mirac Suzgun, Nathan Scales, Nathanael Scharli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, Jason Wei

QueryForm: A Simple Zero-Shot Form Entity Query Framework

Zifeng Wang*, Zizhao Zhang, Jacob Devlin, Chen-Yu Lee, Guolong Su, Hao Zhang, Jennifer Dy, Vincent Perot, Tomas Pfister

ReGen: Zero-Shot Text Classification via Training Data Generation with Progressive Dense Retrieval

Yue Yu, Yuchen Zhuang, Rongzhi Zhang, Yu Meng, Jiaming Shen, Chao Zhang

Multilingual Sequence-to-Sequence Models for Hebrew NLP

Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty

Triggering Multi-Hop Reasoning for Question Answering in Language Models Using Soft Prompts and Random Walks

Kanishka Misra*, Cicero Nogueira dos Santos, Siamak Shakeri

Tutorials

Complex Reasoning in Natural Language
Wenting Zhao, Mor Geva, Bill Yuchen Lin, Michihiro Yasunaga, Aman Madaan, Tao Yu

Generating Text from Language Models
Afra Amini, Ryan Cotterell, John Hewitt, Clara Meister, Tiago Pimentel

Workshops

Simple and Efficient Natural Language Processing (SustaiNLP)

Organizers include: Tal Schuster

Workshop on Online Abuse and Harms (WOAH)

Organizers include: Aida Mostafazadeh Davani

Document-Grounded Dialogue and Conversational Question Answering (DialDoc)

Organizers include: Roee Aharoni

NLP for Conversational AI

Organizers include: Abhinav Rastogi

Computation and Written Language (CAWL)

Organizers include: Kyle Gorman, Brian Roark, Richard Sproat

Computational Morphology and Phonology (SIGMORPHON)

Speakers include: Kyle Gorman

Workshop on Narrative Understanding (WNU)

Organizers include: Elizabeth Clark


* Work done while at Google