Mapping Africa’s Buildings with Satellite Imagery

An accurate record of building footprints is important for a range of applications, from population estimation and urban planning to humanitarian response and environmental science. After a disaster, such as a flood or an earthquake, authorities need to estimate how many households have been affected. Ideally there would be up-to-date census information for this, but in practice such records may be out of date or unavailable. Instead, data on the locations and density of buildings can be a valuable alternative source of information.

A good way to collect such data is through satellite imagery, which can map the distribution of buildings across the world, particularly in areas that are isolated or difficult to access. However, detecting buildings with computer vision methods in some environments can be a challenging task. Because satellite imaging involves photographing the earth from several hundred kilometres above the ground, even at high resolution (30–50 cm per pixel), a small building or tent shelter occupies only a few pixels. The task is even more difficult for informal settlements, or rural areas where buildings constructed with natural materials can visually blend into the surroundings. There are also many types of natural and artificial features that can be easily confused with buildings in overhead imagery.

Objects that can confuse computer vision models for building identification (clockwise from top left) pools, rocks, enclosure walls and shipping containers.

In “Continental-Scale Building Detection from High-Resolution Satellite Imagery”, we address these challenges, using new methods for detecting buildings that work in rural and urban settings across different terrains, such as savannah, desert, and forest, as well as informal settlements and refugee facilities. We use this building detection model to create the Open Buildings dataset, a new open-access data resource containing the locations and footprints of 516 million buildings with coverage across most of the African continent. The dataset will support several practical, scientific and humanitarian applications, ranging from disaster response or population mapping to planning services such as new medical facilities or studying human impact on the natural environment.

Model Development
We built a training dataset for the building detection model by manually labelling 1.75 million buildings in 100k images. The figure below shows some examples of how we labelled images in the training data, taking into account confounding characteristics of different areas across the African continent. In rural areas, for example, it was necessary to identify different types of dwelling places and to disambiguate them from natural features, while in urban areas we needed to develop labelling policies for dense and contiguous structures.

(1) Example of a compound containing both dwelling places as well as smaller outbuildings such as grain stores. (2) Example of a round, thatched-roof structure that can be difficult for a model to distinguish from trees, and where it is necessary to use cues from pathways, clearings and shadows to disambiguate. (3) Example of several contiguous buildings for which the boundaries cannot be easily distinguished.

We trained the model to detect buildings in a bottom-up way, first by classifying each pixel as building or non-building, and then grouping these pixels together into individual instances. The detection pipeline was based on the U-Net model, which is commonly used in satellite image analysis. One advantage of U-Net is that it is a relatively compact architecture, and so can be applied to large quantities of imaging data without a heavy compute burden. This is critical, because the final task of applying this to continental-scale satellite imagery means running the model on many billions of image tiles.

Example of segmenting buildings in satellite imagery. Left: Source image; Center: Semantic segmentation, with each pixel assigned a confidence score that it is a building vs. non-building; Right: Instance segmentation, obtained by thresholding and grouping together connected components.

Initial experiments with the basic model had low precision and recall, for example due to the variety of natural and artificial features with building-like appearance. We found a number of methods that improved performance. One was the use of mixup as a regularisation method, where random training images are blended together by taking a weighted average. Though mixup was originally proposed for image classification, we modified it to be used for semantic segmentation. Regularisation is important in general for this building segmentation task, because even with 100k training images, the training data do not capture the full variation of terrain, atmospheric and lighting conditions that the model is presented with at test time, and hence, there is a tendency to overfit. This is mitigated by mixup as well as random augmentation of training images.

Another method that we found to be effective was the use of unsupervised self-training. We prepared a set of 100 million satellite images from across Africa, and filtered these to a subset of 8.7 million images that mostly contained buildings. This dataset was used for self-training using the Noisy Student method, in which the output of the best building detection model from the previous stage is used as a ‘teacher’ to then train a ‘student’ model that makes similar predictions from augmented images. In practice, we found that this reduced false positives and sharpened the detection output. The student model gave higher confidence to buildings and lower confidence to background.

Difference in model output between the student and teacher models for a typical image. In panel (d), red areas are those that the student model finds more likely to be buildings than the teacher model, and blue areas more likely to be background.

One problem that we faced initially was that our model had a tendency to create “blobby” detections, without clearly delineated edges and with a tendency for neighbouring buildings to be merged together. To address this, we applied another idea from the original U-Net paper, which is to use distance weighting to adapt the loss function to emphasise the importance of making correct predictions near boundaries. During training, distance weighting places greater emphasis at the edges by adding weight to the loss — particularly where there are instances that nearly touch. For building detection, this encourages the model to correctly identify the gaps in between buildings, which is important so that many close structures are not merged together. We found that the original U-Net distance weighting formulation was helpful but slow to compute. So, we developed an alternative based on Gaussian convolution of edges, which was both faster and more effective.

Distance weighting schemes to emphasise nearby edges: U-Net (left) and Gaussian convolution of edges (right).

Our technical report has more details on each of these methods.

We evaluated the performance of the model on several different regions across the continent, in different categories: urban, rural, and medium-density. In addition, with the goal of preparing for potential humanitarian applications, we tested the model on regions with displaced persons and refugee settlements. Precision and recall did vary between regions, so achieving consistent performance across the continent is an ongoing challenge.

Precision-recall curves, measured at 0.5 intersection-over-union threshold.

When visually inspecting the detections for low-scoring regions, we noted various causes. In rural areas, label errors were problematic. For example, single buildings within a mostly-empty area can be difficult for labellers to spot. In urban areas, the model had a tendency to split large buildings into separate instances. The model also underperformed in desert terrain, where buildings were hard to distinguish against the background.

We carried out an ablation study to understand which methods contributed most to the final performance, measured in mean average precision (mAP). Distance weighting, mixup and the use of ImageNet pre-training were the biggest factors for the performance of the supervised learning baseline. The ablated models that did not use these methods had a mAP difference of -0.33, -0.12 and -0.07 respectively. Unsupervised self-training gave a further significant boost of +0.06 mAP.

Ablation study of training methods. The first row shows the mAP performance of the best model combined with self-training, and the second row shows the best model with supervised learning only (the baseline). By disabling each training optimization from the baseline in turn, we observe the impact on mAP test performance. Distance weighting has the most significant effect.

Generating the Open Buildings Dataset
To create the final dataset, we applied our best building detection model to satellite imagery across the African continent (8.6 billion image tiles covering 19.4 million km2, 64% of the continent), which resulted in the detection of 516M distinct structures.

Each building’s outline was simplified as a polygon and associated with a Plus Code, which is a geographic identifier made up of numbers and letters, akin to a street address, and useful for identifying buildings in areas that don’t have formal addressing systems. We also include confidence scores and guidance on suggested thresholds to achieve particular precision levels.

The sizes of the structures vary as shown below, tending towards small footprints. The inclusion of small structures is important, for example, to support analyses of informal settlements or refugee facilities.

Distribution of building footprint sizes.

The data is freely available and we look forward to hearing how it is used. In the future, we may add new features and regions, depending on usage and feedback.

This work is part of our AI for Social Good efforts and was led by Google Research, Ghana. Thanks to the co-authors of this work: Wojciech Sirko, Sergii Kashubin, Marvin Ritter, Abigail Annkah, Yasser Salah Edine Bouchareb, Yann Dauphin, Daniel Keysers, Maxim Neumann and Moustapha Cisse. We are grateful to Abdoulaye Diack, Sean Askay, Ruth Alcantara and Francisco Moneo for help with coordination. Rob Litzke, Brian Shucker, Yan Mayster and Michelina Pallone provided valuable assistance with geo infrastructure.


Advances in TF-Ranking

In December 2018, we introduced TF-Ranking, an open-source TensorFlow-based library for developing scalable neural learning-to-rank (LTR) models, which are useful in settings where users expect to receive an ordered list of items in response to their query. LTR models — unlike standard classification models that classify one item at a time — receive an entire list of items as an input, and learn an ordering that maximizes the utility of the entire list. While search and recommendation systems are the most common applications of LTR models, since its release, we have seen TF-Ranking being applied in diverse domains beyond search, including e-commerce, SAT solvers, and smart city planning.

The goal of learning-to-rank (LTR) is to learn a function f() that takes as an input a list of items (documents, products, movies, etc.) and outputs the list of items in the optimal order (descending order of relevance). Here, green shade indicates item relevance level, and the red item marked with ‘x’ is non-relevant.

In May 2021, we published a major release of TF-Ranking that enables full support for natively building LTR models using Keras, a high-level API of TensorFlow 2. Our native Keras ranking model has a brand-new workflow design, including a flexible ModelBuilder, a DatasetBuilder to set up training data, and a Pipeline to train the model with the provided dataset. These components make building a customized LTR model easier than ever, and facilitate rapid exploration of new model structures for production and research. If RaggedTensors are your tool of choice, TF-Ranking is now working with them as well. In addition, our most recent release, which incorporates the Orbit training library, contains a long list of advances — the culmination of two and half years of neural LTR research. Below we share a few of the key improvements available in the latest TF-Ranking version.

Workflow to build and train a native Keras ranking model. Blue modules are provided by TF-Ranking, and green modules are customizable.

Learning-to-Rank with TFR-BERT
Recently, pretrained language models like BERT have achieved state-of-the-art performance on various language understanding tasks. To capture the expressiveness of these models, TF-Ranking implements a novel TFR-BERT architecture that couples BERT with the power of LTR to optimize the ordering of list inputs. As an example, consider a query and a list of n documents that one might like to rank in response to this query. Instead of learning an independent BERT representation for each <query, document> pair, LTR models apply a ranking loss to jointly learn a BERT representation that maximizes the utility of the entire ranked list with respect to the ground-truth labels.

The figure below illustrates this process. First, we flatten a list of n documents to rank in response to a query into a list <query, document> tuples. These tuples are fed into a pre-trained language model (e.g., BERT). The pooled BERT outputs for the entire document list are then jointly fine-tuned with one of the specialized ranking losses available in TF-Ranking. Our experience shows that this TFR-BERT architecture delivers significant improvements in pretrained language model performance, leading to state-of-the-art performance for several popular ranking tasks, especially when multiple pretrained language models are ensembled. Our users can now get started with TFR-BERT using this simple example.

An illustration of the TFR-BERT architecture, in which a joint LTR model over a list of n documents is constructed using BERT representations of individual <query, document> pairs.

Interpretable Learning-to-Rank
Transparency and interpretability are important factors in deploying LTR models in ranking systems that can be involved in determining the outcomes of processes such as loan eligibility assessment, advertisement targeting, or guiding medical treatment decisions. In such cases, the contribution of each individual feature to the final ranking should be examinable and understandable to ensure transparency, accountability and fairness of the outcomes.

One possible way to achieve this is using generalized additive models (GAMs) — intrinsically interpretable machine learning models that are linearly composed of smooth functions of individual features. However, while GAMs have been extensively studied on regression and classification tasks, it is less clear how to apply them in a ranking setting. For instance, while GAMs can be straightforwardly applied to model each individual item in the list, modeling both item interactions and the context in which these items are ranked is a more challenging research problem. To this end, we have developed a neural ranking GAM — an extension of generalized additive models to ranking problems.

Unlike standard GAMs, a neural ranking GAM can take into account both the features of the ranked items and the context features (e.g., query or user profile) to derive an interpretable, compact model. This ensures that not only the contribution of each item-level feature is interpretable, but also the contribution of the context features. For example, in the figure below, using a neural ranking GAM makes visible how distance, price, and relevance, in the context of a given user device, contribute to the final ranking of the hotel. Neural ranking GAMs are now available as a part of TF-Ranking,

An example of applying neural ranking GAM for local search. For each input feature (e.g., price, distance), a sub-model produces a sub-score that can be examined, providing transparency. Context features (e.g., user device type) can be utilized to derive importance weights of submodels.

Neural Ranking or Gradient Boosting?
While neural models have achieved state of the art performance in multiple domains, specialized gradient boosted decision trees (GBDTs) like LambdaMART remained the baseline to beat in a variety of open LTR datasets. The success of GBDTs in open datasets is due to several reasons. First, due to their relatively small size, neural models are prone to overfitting on these datasets. Second, since GBDTs partition their input feature space using decision trees, they are naturally more resilient to variations in numerical scales in ranking data, which often contain features with Zipfian or otherwise skewed distributions. However, GBDTs do have their limitations in more realistic ranking scenarios, which often combine both textual and numerical features. For instance, GBDTs cannot be directly applied to large discrete feature spaces, such as raw document text. They are also, in general, less scalable than neural ranking models.

Therefore, since the TF-Ranking release, our team has significantly deepened the understanding of how best to leverage neural models in ranking with numerical features. This culminated in a Data Augmented Self-Attentive Latent Cross (DASALC) model, described in an ICLR 2021 paper, which is the first to establish parity, and in some cases statistically significant improvements, of neural ranking models over strong LambdaMART baselines on open LTR datasets. This achievement is made possible through a combination of techniques, which include data augmentation, neural feature transformation, self-attention for modeling document interactions, listwise ranking loss, and model ensembling similar to boosting in GBDTs. The architecture of the DASALC model was entirely implemented using the TF-Ranking library.

All in all, we believe that the new Keras-based TF-Ranking version will make it easier to conduct neural LTR research and deploy production-grade ranking systems. We encourage everyone to try out the latest version and follow this introductory example for a hands-on experience. While we are very excited about this new release, our research and development journey is far from over, so we will continue to advance our understanding of learning-to-rank problems and share these advances with our users.

This project was only possible thanks to the current and past members of the TF-Ranking team: Honglei Zhuang, ‎Le Yan, Rama Pasumarthi, Rolf Jagerman, Zhen Qin, Shuguang Han, Sebastian Bruch, Nathan Cordeiro, Marc Najork and Patrick McGregor. We also extend special thanks to our collaborators from the Tensorflow team: Zhenyu Tan, Goldie Gadde, Rick Chao, Yuefeng Zhou‎, Hongkun Yu, and Jing Li.


함께 자라기: 우리는 함께 성장할 수 있을까?

우리는 점점 협업이 중요해지는 시대에 살고 있습니다. 도메인과 기술, 각각의 분야는 갈수록 세밀해지고 고도화되고 있기 때문에, 혼자서 이 모든 것을 다 알기란 불가능에 가까워지고 있습니다. 그래서 한명의 천재보다는 훌륭한 팀이 더 좋은 결과들을 만들어 내는 시대입니다.


출처: pixabay

면접에서 커뮤니케이션 스킬 역시 중요하게 평가되고 있죠. ‘팀원과의 협업에서 어려움이 있을 때 어떻게 하셨나요?’ 이런 질문들은 흔하게 접하셨을 것 같습니다. 여기에서 저는 개인적으로 ‘팀으로 일하면서 팀원 모두의 성장을 위해서 무엇을 해보았나요?’ 이 질문을 좋아합니다. 개인이 성장하는 것이 선형적이라면, 팀으로 성장하는 것은 기하급수적으로 볼 수 있기 때문입니다.

이번에 소개하는 책의 저자께서도 이 책을 읽으며, 다음과 같은 질문들로 생각이 나아갈 수 있기를 기대하고 있습니다.

  • 우리가 정말 함께 자랄 수 있을까?
  • 우리가 정말 매일매일 함께 자랄 수 있을까?

함께 자라기 : 애자일로 가는 길


출처: 알라딘 ‘함께 자라기’

이번 책은 애자일 컨설팅으로 알려져 있는 김창준님의 ≪함께 자라기≫ 입니다. 이 책은 그 동안 블로그와 페이스 북 등에서 공유해오시던 효과적으로 배우는 방법과 협업에 대한 다양한 글들을 엮은 결과입니다. 이 책의 특징 중 하나는 연구, 논문 등의 자료를 기반으로 조금 더 구체적이고 분석적으로 성장과 협업에 대해서 바라 본다는 것 입니다.

그럼 책의 내용들을 조금 더 살펴보겠습니다. 1장 자라기 에서는 성장을주제로 다양한 이야기를 하고 있습니다.


저는 시스템과 프로세스가 중요하다고 생각을 합니다. 적합한 사람들을 뽑는 것이 무엇보다 중요하지만, 이 사람들이 마음껏 능력을 펼칠 수 있는 조직의 시스템도 그에 못지 않게 중요합니다.

조직은 개인이 자신의 전문성을 좀 더 발전시키고 관리할 수 있게 최대한 지원을 해야 합니다. 그것이 윈윈하는 길입니다. 뽑고 나서 잘 교육하고 성장하게 도와주는 것 이상으로 중요한 것이 또 있습니다. 시스템입니다. 아무리 훌륭한 사람을 뽑아도 조직의 시스템과 문화에 문제가 있으면 그런 사람은 묻혀버리기 쉽고, 반대로 실력이 평범한 사람일지라도 좋은 시스템 속에서 뛰어난 성과를 낼 수도 있습니다.

  • 잘 뽑는 것 이상으로 중요한 것 중에서

프로세스와 시스템은 아래 더글러스의 말에서 B와 C단계에 해당하는 일 입니다. 이렇게 한 단계 혹은 한 차원 높게 개선을 함으로써 그 조직은 계속해서 발전할 수 있는 것이죠. 항상 일을 함에 있어서 언제 무엇에 집중해야 할지를 생각하는 것이 필요합니다. 일례로 스타트업에서는 빠르게 A 작업을 해내는 것이 중요한 반면, 대기업에서는 더 빠르게 확장할 수 있도록 B작업, 즉 프로세스를 개선하는데 집중해야 하는 것이죠.

더글러스는 작업을 세 가지 수준으로 구분합니다. A, B, C 작업입니다.
A 작업은 원래 그 조직이 하기로 되어 있는 일을 하는 걸 말합니다.
B 작업은 A 작업을 개선하는 걸 말합니다. 제품을 만드는 사이클에서 시간과 품질을 개선하는 것이죠
C 작업은 B 작업을 개선하는 것 입니다. 개선 사이클 자체의 시간과 품질을 개선하는 것입니다. … 한마디로 개선하는 능력을 개선하는 걸 말합니다.
더글러스는 “우리가 더 잘하는 것을 더 잘하게 될수록 우리는 더 잘하는 걸 더 잘 그리고 더 빨리 하게 될 것이다”

  • 복리의 비밀 중에서

의도적 수련


출처: 함께자라기 ‘제자리걸음에서 벗어나기’ 중에서

의도적 수련은 자신의 실력에 맞춰서 가장 빠르게 배울 수 있는 방법 중에 하나입니다. 위 그림처럼, ‘작업 난이도’ 와 ‘실력’ 을 유사한 수준으로 맞춰서 일에 몰입할 수 있도록 하는 것이죠. 너무 쉬운 일이라면, 스스로 퀘스트를 부여하면서 더 문제를 어렵게 만들거나 어려운 일의 경우에는 주변의 도움을 받기도 하고, 문제를 구조적으로 접근함녀서 난이도를 낮추는 방법 등을 제시하고 있습니다.

의도적 수련이 되려면 나의 실력과 작업의 난이도가 비슷해야 합니다. 이것은 미하이 칙센트미하이의 몰입이론(무슨 활동을 하냐가 중요한게 아니라 뭘 하든지 몰입해서 하면 만족도가 올라갔다)과도 일치하는 부분인데요, … 우리가 주목해야 할 부분은 C 영역입니다. 난이도와 실력이 엇비슷하게 맞는 부분이죠. 미하이는 이 부분에서 인간이 몰입을 경험한다고 합니다. 그리고 바로 이때 최고 수준의 집중력을 보이고, 그 덕분에 퍼포먼스나 학습 능력이 최대치가 될 수 있다고 합니다. 또한 그때 최고 수준의 행복감을 경험한다는 흥미로운 사실을 발견하기도 했습니다. 비슷한 이야기를 언어학자인 크라센이 입력가설을 통해 말합니다. i+1 이론이라고 하는데, 현재 언어 학습자의 언어 수준을 i라고 할 때 딱 한 단계 높은 i+1 수준의 입력이 주어질 때에만 언어 능력이 유의미하게 진전한다는 이론이죠.

  • 의도적 수련의 필수조건, 적절한 난이도 중에서

다음으로 2장 함께 에서는 협업에 대한 다양한 주제들을 다루고 있습니다.

심리적 안전감

성공적인 팀의 특징들 중에서 가장 중요하다고 이야기 되는 요소가 바로 ‘심리적 안전감’ 입니다. 이 ‘심리적 안전감’ 하나의 주제만을 가지고 다양한 이야기하는 ≪두려움 없는 조직≫ 이라는 책도 있죠. 어떻게 보면 뻔하게 보이기도 하지만 그 만큼 심리적 안전감을 팀 내에 정착시키는 것은 어렵기도 합니다.

구글은 데이터 중심 회사답게 데이터 기반으로 뛰어난 관리자의 특징을 찾는 옥시전 프로젝트 이후에도 뛰어난 팀의 특징을 찾기 위해 2년간 노력했습니다. 이름하여 아리스토텔레스 프로젝트 입니다.

  1. 팀에 누가 있는지 (전문가, 내향/외향, 지능 등) 보다 팀원들이 서로 어떻게 상호작용하고 자신의 일을 어떻게 바라보는지가 훨씬 중요했다.
  2. 5가지 성공적 팀의 특징을 찾았는데, 그중 압도적으로 높은 예측력을 보인 변수는 팀의 심리적 안전감이었다.
  3. 팀 토론 등 특별히 고안된 활동을 통해 심리적 안전감을 개선할 수 있었다.
    • 구글이 밝힌 탁월한 팀의 비밀 중에서

심리적 안전감은 보통 조직문화를 기반으로 하고 있다고 이야기합니다. 조직문화 중에서도 특히 ‘투명성’ 에 연결이 됩니다. 아래 사례처럼, 실수를 투명하게 공개하고 더 나은 방향으로 모두 나아갈 수 있는 것. 그 외에도 회사 내에서 정보가 투명하게 흐르게 되면 서로 간의 신뢰가 생기기 때문입니다. 이 신뢰가 곧 심리적 안전감으로 직결되게 되죠.

마이클 프레제는 회사에서의 실수 문화에 대해 연구를 했습니다. 그에 따르면 실수 문화에는 크게 두 가지가 있습니다. 실수 예방과 실수 관리. 실수 예방은 행동에서 실수로 가는 경로를 차단하려고 합니다. 즉, 실수를 저지르지 말라고 요구합니다. 근데, 사실 이것이 불가능에 가깝습니다. 전문가도 1시간에 평균 3~5개의 실수를 저지른다고 합니다. … 실수 예방 문화에서는 실수를 한 사람을 비난하고, 처벌하고, 따라서 실수를 감추고 그에 대해 논의하기 꺼리며 문제가 생겼을 때 협력도 덜하게 됩니다. 실수에서 배우지 못하겠지요. 반대로 실수 관리 문화에서는 실수가 나쁜 결과를 내기 전에 빨리 회복하도록 돕고, 실수를 공개하고, 실수에 대해 서로 이야기하고 거기에서 배우는 분위기가 생깁니다.
이 부분이 굉장히 중요합니다. 실수 연구의 역사를 보면, 초기에는 기술적인 부분만 보다가 그 다음에는 인간적인 부분 (결국 80%가 사람 실수라든지)을 보다가 이제는 문화적인 부분을 이야기합니다. 심리적 안전감이라고 하는 것이 이 문화의 일부입니다.

  • 두 가지의 실수 문화 중에서


다음은 개발자들끼리 많이 진행하는 짝 프로그래밍에 대한 이야기 입니다. 그 동안 많이 해봤음에도, 왜 효과적인지 잘 모르고 있다가 이 책을 읽으면서 깨닫게 되는 사례 중에 하나였습니다. 짝 프로그래밍까지 가지 않더라도 문제에 대해서 설명하다가 스스로 좋은 방법을 찾기도 하는데, 이것 역시 설명의 과정에서 추상화를 시키면서 스스로 이해도가 높아지기 때문이 아닐까 싶습니다.

짝 프로그래밍은 두 사람이 한 컴퓨터를 사용해 함께 프로그래밍하는 것입니다. 생각할수록 짝 프로그래밍의 구성은 절묘합니다. 두 사람이라는 구성은 대화를 통해 추상화를 높이게 합니다. 한 컴퓨터라는 구성은 구체화를 통해 검증하게 합니다. 미루고 헤아리는 것) 이 빈번히 교차합니다. 그리고 그 사이에서 “아하”가 터져 나옵니다. … 자신이 작성하는 코드의 추상성을 높이고 싶다면 혼자서 고민하지 말고 다른 사람들과 협동하고, 대화하세요. 같이 그림도 그려보고 함께 소스코드를 편집하세요. 인간에게는 다른 인간과 소통하고 협력할 수 있는 놀라운 능력이 있습니다. 대화는 기적입니다.

  • 대화하는 프로그래밍 중에서

새로운 방법론의 도입

아마 많은 이런 경험이 많이 있으실 것 같습니다. 같이 일을 하면서 새로운 프레임워크 혹은 애자일 등의 방법론 혹은 도구를 도입하는 것이죠. 무난하게 도입을 한 경우도 있을 것이고, 생각하지 못한 반대의견을 맞닥뜨린 경우도 있을 것 입니다. 어떻게 하는 것이 가장 좋은 방법인지 모르겠지만, 동료분들과 이야기를 충분히 하고 니즈를 이해해야 한다는 것 입니다. 이 도구가 왜 좋은지 보다는 동료분들이 어떤 생각을 가지고 있는지 알아보는 것이 어떨까요?

그리고 이렇게 대화를 하면서, 중간의 매개체가 될 수 있다면 단순히 도구를 도입하려는 시도에서 더 나아가 팀에서 필요로 하는 것이 무엇인지 제대로 이해하고 더 좋은 방안을 제시할 수 있을 것 입니다.

팀장 자리에 있으면 새로운 아이디어 전파가 쉬울 거라고 생각하는 것은 환상입니다. … 그 중 어떤 분들은 이미 나름의 객관적 수치들을 수집하고 계시죠. 그런 분들을 만나면 저는 다음과 같은 질문을 던집니다. “상대방에 대해 얼마나 이해를 하고 계신가요? 얼마나 대화를 해보셨나요?” 십중팔구는 “그분이랑은 별로 이야기 못 해봤습니다.” 란 답이 돌아옵니다. 만약 그렇다면 앞으로도 설득에 성공할 확률은 낫다고 봐야 합니다.

  • 객관성의 주관성 중에서

복잡한 분야일수록 어떤 특정 기법의 효과보다도 치료자 효과가 더 큰 영향을 미칠 것입니다. 그렇다면 어떻게 해야 할까요? 슈퍼슈링크들을 찾고 그들을 연구하고 육성해야 합니다. … 소프트웨어 개발 방법론, 새 프로젝트를 진행할 때에 우리가 어떤 방법론을 쓰느냐는 문제보다도 누가 참여하는가가 훨씬 더 압도적으로 중요한 문제가 아닐까요? 여러분은 어떻게 생각하시나요? 저는 이렇게 생각합니다. 예를 들어 애자일 방법론 도입을 원하는 팀장이라면 “나는 어떤 팀장인가”를 먼저 자문해봐야 하지 않을까 싶습니다.

  • 당신의 조직에 새 방법론이 먹히지 않는 이유 중에서

다음은 전문가들끼리 팀이 구성되었을 때, 가장 효과적일지에 대한 이야기가 있습니다. 분야가 겹치지 않는 상황에서는 전문가들이 서로의 전문성을 믿고 각자 최고의 결과를 만들어 낼 수 있지만, 비슷한 분야에서 전문가들이 같이 일을 하는 것은 개인에서 협업을 하게되는 상황이기도 합니다. 이때에는 필연적으로 생산성이 떨어지는 순간들이 있게 되는 것 같습니다. 협업에는 연습이 필요하기 때문이죠.

회사에서의 올스타는 어떨까요? 그로이스버그(Groysberg) 등의 연구에 따르면 이런 스타들이 한 명씩 팀에 추가될 때마다 팀의 추가적 성과 향상은 한계효용(점차 줄어듬)을 보이며 어느 수준을 지나면 음의 방향으로 작용한다(즉, 전체 팀의 성과를 깎아먹음)”고 합니다. … 성과를 깎아먹는 경향은 특히 전문가들이 전문성이 서로 유사할 때 도드라졌습니다. 이 연구는 그 원인 중 하나로 전문가들의 에고(ego)를 꼽습니다.

  • 전문가팀이 실패하는 이유 중에서


마지막 3장에서는 애자일에 대한 이야기가 간단하게 다루어집니다. 사실 앞의 1장, 2장에서도 ‘애자일’ 이라는 용어만 쓰지 않았지, 주제는 애자일에 포함되는 이야기였기 때문이죠.

그 동안 일을 해오면서, 아래의 사례처럼 ‘고객 참여’는 무엇보다 중요한 요소 입니다. 고객 참여에는 다양한 방식이 있을 것 입니다. 고객이 바로 옆에서 도움을 줄 수도 있고, CS를 통해서 피드백을 받을 수도 있고, 인터뷰를 진행할 수도 있습니다. 고객이 무엇을 원하는지 알아볼 수 있는 선구안은 정말 흔하지 않기 때문에, 고객 참여를 통해서 니즈를 발견하고 빠르게 개발해나가는 것이 중요하죠.

성숙도가 낮은 조직의 경우 (성숙도 4 이하), 고객 참여 (0.94), 통계적으로 유의미한 실천법 딱 하나입니다. 고객 참여. 그리고 기여도는 0.94로 아까 전체로 볼 때보다 더 높습니다. 거의 1 입니다. 성숙도가 낮아도 고객 참여를 잘하면 프로젝트 성공도가 한 칸 올라간다는 뜻 입니다. … 성숙도가 높은 조직을 보시죠. 짧은 반복 개발 주기가 1등입니다. 고객 참여보다 더 기여도가 높습니다. 그 말은 성숙도가 높은 조직에서는 고객 참여보다 짧은 반복 개발 주기가 성공에 더 도움이 될 수 있다는 뜻입니다. 그만큼 짧은 반복 개발 주기를 통해 고객 참여가 잘 안 될 때를 어느 정도 보완할 수 있다는 뜻일 수도 있겠습니다.

  • 성숙도가 낮다면 고객 참여는 필수 중에서



출처: 존잡생각 Ep.18 회사에서 본인을 빠르게 성장시키는 방법 – People Scaling

포스트를 작성하면서 협업에 대해서 생각을 하다보니, 최근에 자주 보고 있는 존잡생각 이라는 샌드버드 CEO인 김동선 대표님의 유투브 채널에서 다뤘던 내용이 생각났습니다. 저 문장이 협업의 측면에서 핵심이 되는 요소라고 생각합니다. 문제가 되는 약점은 고쳐야 하지만, 기본적으로 개개인이 가진 강점을 기반으로 팀으로서의 합이 최대치가 되도록 하는 것이죠.

이렇게 팀이 성장하는 방향으로, 함께 자랄 수 있기를 바랍니다!


Applying Advanced Speech Enhancement in Cochlear Implants

For the ~466 million people in the world who are deaf or hard of hearing, the lack of easy access to accessibility services can be a barrier to participating in spoken conversations encountered daily. While hearing aids can help alleviate this, simply amplifying sound is insufficient for many. One additional option that may be available is the cochlear implant (CI), which is an electronic device that is surgically inserted into a part of the inner ear, called the cochlea, and stimulates the auditory nerve electrically via external sound processors. While many individuals with these cochlear implants can learn to interpret these electrical stimulations as audible speech, the listening experience can be quite varied and particularly challenging in noisy environments.

Modern cochlear implants drive electrodes with pulsatile signals (i.e., discrete stimulation pulses) that are computed by external sound processors. The main challenge still facing the CI field is how to best process sounds — to convert sounds to pulses on electrodes — in a way that makes them more intelligible to users. Recently, to stimulate progress on this problem, scientists in industry and academia organized a CI Hackathon to open the problem up to a wider range of ideas.

In this post, we share exploratory research demonstrating that a speech enhancement preprocessor — specifically, a noise suppressor — can be used at the input of a CI’s processor to enhance users’ understanding of speech in noisy environments. We also discuss how we built on this work in our entry for the CI Hackathon and how we will continue developing this work.

Improving CIs with Noise Suppression
In 2019, a small internal project demonstrated the benefits of noise suppression at the input of a CI’s processor. In this project, participants listened to 60 pre-recorded and pre-processed audio samples and ranked them by their listening comfort. CI users listened to the audio using their devices’ existing strategy for generating electrical pulses.

Audio without background noise
Audio with background noise
Audio with background noise + noise suppression

Background audio clip from “IMG_0991.MOV” by Kenny MacCarthy, license: CC-BY 2.0.

As shown below, both listening comfort and intelligibility usually increased, sometimes dramatically, when speech with noise (the lightest bar) was processed with noise suppression.

CI users in an early research study have improved listening comfort — qualitatively scored from “very poor” (0.0) to “OK” (0.5) to “very good” (1.0) — and speech intelligibility (i.e., the fraction of words in a sentence correctly transcribed) when trying to listen to noisy audio samples of speech with noise suppression applied.

For the CI Hackathon, we built on the project above, continuing to leverage our use of a noise suppressor while additionally exploring an approach to compute the pulses too

Overview of the Processing Approach
The hackathon considered a CI with 16 electrodes. Our approach decomposes the audio into 16 overlapping frequency bands, corresponding to the positions of the electrodes in the cochlea. Next, because the dynamic range of sound easily spans multiple orders of magnitude more than what we expect the electrodes to represent, we aggressively compress the dynamic range of the signal by applying “per-channel energy normalization” (PCEN). Finally, the range-compressed signals are used to create the electrodogram (i.e., what the CI displays on the electrodes).

In addition, the hackathon required a submission be evaluated in multiple audio categories, including music, which is an important but notoriously difficult category of sounds for CI users to enjoy. However, the speech enhancement network was trained to suppress non-speech sounds, including both noise and music, so we needed to take extra measures to avoid suppressing instrumental music (note that in general, music suppression might be preferred by some users in certain contexts). To do this, we created a “mix” of the original audio with the noise-suppressed audio so that enough of the music would pass through to remain audible. We varied in real-time the fraction of original audio mixed from 0% to 40% (0% if all of the input is estimated as speech, up to 40% as more of the input is estimated as non-speech) based on the estimate from the open-source YAMNet classifier on every ~1 second window of audio of whether the input is speech or non-speech.

The Conv-TasNet Speech Enhancement Model
To implement a speech enhancement module that suppresses non-speech sounds, such as noise and music, we use the Conv-TasNet model, which can separate different kinds of sounds. To start, the raw audio waveforms are transformed and processed into a form that can be used by a neural network. The model transforms short, 2.5 millisecond frames of input audio with a learnable analysis transform to generate features optimized for sound separation. The network then produces two “masks” from those features: one mask for speech and one mask for noise. These masks indicate the degree to which each feature corresponds to either speech or noise. Separated speech and noise are reconstructed back to the audio domain by multiplying the masks with the analysis features, applying a synthesis transform back to audio-domain frames, and stitching the resulting short frames together. As a final step, the speech and noise estimates are processed by a mixture consistency layer, which improves the quality of the estimated waveforms by ensuring that they sum up to the original input mixture waveform.

Block diagram of the speech enhancement system, which is based on Conv-TasNet.

The model is both causal and low latency: for each 2.5 milliseconds of input audio, the model produces estimates of separated speech and noise, and thus could be used in real-time. For the hackathon, to demonstrate what could be possible with increased compute power in future hardware, we chose to use a model variant with 2.9 million parameters. This model size is too large to be practically implemented in a CI today, but demonstrates what kind of performance would be possible with more capable hardware in the future.

Listening to the Results
As we optimized our models and overall solution, we used the hackathon-provided vocoder (which required a fixed temporal spacing of electrical pulses) to produce audio simulating what CI users might perceive. We then conducted blind A-B listening tests as typical hearing users.

Listening to the vocoder simulations below, the speech in the reconstructed sounds — from the vocoder processing the electrodograms — is reasonably intelligible when the input sound doesn’t contain too much background noise, however there is still room to improve the clarity of the speech. Our submission performed well in the speech-in-noise category and achieved second place overall.

Simulated audio with fixed temporal spacing
Vocoder simulation of what CI users might perceive from audio from an electrodogram with fixed temporal spacing, with background noise and noise suppression applied.

A bottleneck on quality is that the fixed temporal spacing of stimulation pulses sacrifices fine-time structure in the audio. A change to the processing to produce pulses timed to peaks in the filtered sound waveforms captures more information about the pitch and structure of sound than is conventionally represented in implant stimulation patterns.

Simulated audio with adaptive spacing and fine time structure
Vocoder simulation, using the same vocoder as above, but on an electrodogram from the modified processing that synchronizes stimulation pulses to peaks of the sound waveform.

It’s important to note that this second vocoder output is overly optimistic about how well it might sound to a real CI user. For instance, the simple vocoder used here does not model how current spread in the cochlea blurs the stimulus, making it harder to resolve different frequencies. But this at least suggests that preserving fine-time structure is valuable and that the electrodogram itself is not the bottleneck.

Ideally, all processing approaches would be evaluated by a broad range of CI users, with the electrodograms implemented directly on their CIs rather than relying upon vocoder simulations.

Conclusion and a Call to Collaborate
We are planning to follow up on this experience in two main directions. First, we plan to explore the application of noise suppression to other hearing-accessibility modalities, including hearing aids, transcription, and vibrotactile sensory substitution. Second, we’ll take a deeper dive into the creation of electrodogram patterns for cochlear implants, exploiting fine temporal structure that is not accommodated in the usual CIS (continous interleaved sampling) patterns that are standard in the industry. According to Louizou: “It remains a puzzle how some single-channel patients can perform so well given the limited spectral information they receive”. Therefore, using fine temporal structure might be a critical step towards achieving an improved CI experience.

Google is committed to building technology with and for people with disabilities. If you are interested in collaborating to improve the state of the art in cochlear implants (or hearing aids), please reach out to

We would like to thank the Cochlear Impact hackathon organizers for giving us this opportunity and partnering with us. The participating team within Google is Samuel J. Yang, Scott Wisdom, Pascal Getreuer, Chet Gnegy, Mihajlo Velimirović, Sagar Savla, and Richard F. Lyon with guidance from Dan Ellis and Manoj Plakal.


Multi-task Prediction of Organ Dysfunction in ICUs

The intensive care unit (ICU) of a hospital looks after the most medically vulnerable patients, many of whom require organ support, such as mechanical ventilation or dialysis. While always critical, the demand on ICU services during the COVID-19 pandemic has further underscored the importance of data-driven decision-making in healthcare. Furthermore, the ability to accurately predict the clinical outcomes of ICU patients has the potential to guide therapy and may inform decisions about most effective care, including staffing and triage support.

Applying machine learning (ML) to electronic health records (EHRs) has shown promise in predicting clinical outcomes. However, many of these ML models are based on single-task learning (ST), where the models are trained only to predict a specific adverse event, such as an organ dysfunction or the need for a life-support intervention. Of greater benefit would be to train multi-task models, which take into account a variety of competing risks along with the interdependencies between organ systems that factor into patient outcomes in a realistic setting.

In “Multi-task prediction of organ dysfunction in the ICU using sequential sub-network routing”, we propose a multi-task learning (MTL) architecture, called Sequential Sub-Network Routing (SeqSNR), that better captures the complexity of a realistic setting. Inspired by a clinician’s holistic approach to diagnosing problems, SeqSNR is designed to use flexible parameter sharing and routing to find related tasks and encourage cross-learning between them. We successfully applied SeqSNR to the task of continuous adverse event prediction in an ICU setting and showed advantages over single-task and naïve multi-tasking, especially in low training data scenarios.

Data and Labels
In this study, we used the freely available, open access, de-identified MIMIC-III EHR dataset, which includes a patient cohort consisting of 36,498 adults across 52,038 critical care admissions at the Beth Israel Deaconess Medical Center between 2001 and 2012. Similar to our previous studies, we employed a version of the MIMIC-III dataset that was mapped to the Fast Healthcare Interoperability Resource (FHIR) standard and used a comprehensive set of features, including a sequence of vital signs, laboratory results, past medications, procedures, diagnoses, and more.

The MIMIC-III database contains multi-modal recordings from ICU patients. Unlike most datasets in ML, the input and targets are often not explicitly defined and must be inferred from the data. So, using a combination of automated rule-based methods and clinical review, we defined a suite of diverse endpoints, including critical care interventions, specific organ dysfunctions, and overall patient outcomes.

The task given to the model was to predict the onset of a selection of adverse events within 24–48 hours for every hour after a patient’s admission into the ICU. The defined adverse events included acute kidney injury (AKI), continuous renal replacement therapy (CRRT) dialysis, administration of vasopressors and inotropes, mechanical ventilation (MV), mortality, and remaining length of stay (LoS).

The SeqSNR Algorithm
While multi-task learning captures the interdependencies between organ systems and balances competing risks, it can be challenging to implement successfully. In practice, jointly-trained tasks often impair one another, an effect called “negative transfer”. The intuition behind SeqSNR was that modular ‘sub-networks’ would mitigate this issue by automatically optimizing how information is shared across multiple tasks.

SeqSNR is a time series adaptation of the SNR architecture and is a combination of a deep embedding layer followed by stacked recurrent neural network (RNN) layers. Modularisation is achieved by splitting both the embedding layer and the RNN stack into multiple modules connected by routing variables that are learned during the training phase. The routing connections are always created between blocks in one layer and the next. This approach minimizes negative transfer by ensuring that data of low relevance to a particular task layer is filtered out. In essence, this means that each task utilizes a different path through the model.

A high-level overview of the SeqSNR architecture.

SeqSNR shows a modest improvement in discriminative performance overall relative to single-task and naïve multitasking. However, it’s performance improvement is more significant in scenarios with few training labels.

Because the prevalence of different outcomes varied widely in the dataset (e.g. ~38% of patients had MV, but CRRT dialysis is present for only ~3%), many accuracy metrics are not suitable. Instead, we report the area under the precision recall curve (AU PRC), which is more reliable given imbalanced data. Moreover, we performed the Wilcoxon Signed Rank Tests to draw statistically significant conclusions for pairwise comparisons of ST learning, shared-bottom (SB) multi-task learning (i.e., naïve multi-task learning), and SeqSNR across bootstrapped samples from the held-out test set. The performance differences between the three architectures were modest, but SeqSNR outperformed both ST and SB in four out of six tasks (p-values are reported in the paper).

Comparison of single task (ST), shared bottom (SB) and SeqSNR performance on the MIMIC-III dataset.

Label Efficiency
We hypothesized that multi-task learning could assist in low-data scenarios by using easy-to-label auxiliary tasks to boost the performance of the main tasks. We formulated prediction tasks with only a portion of the training labels available for the primary prediction task, but kept the entire dataset for the “helper tasks”. The latter are chosen because they are reliably encoded in the EHR and are straightforward to timestamp. An example of such a helper task is length of stay, since the start and end of admissions are accurately timestamped in MIMIC-III. On the other hand, the start and end of mechanical ventilation events are not reliably timestamped. So, we defined a set of rules based on expert-defined heuristics to determine the ventilation times using multiple sources of mechanical ventilator–related settings along with physiological measurements in the EHR dataset that are indicative of MV.

The development of these rules for a new clinical endpoint was time-consuming and involved manual review of the dataset by experts. The difficulty in exhaustively labeling the dataset led us to test the model performance with only 1–10% of the data labeled, which resulted in a decline in model performance. The “helper tasks” are useful in this scenario since they are 100% labeled and can be used with the primary tasks (1–10% labeled) to jointly train the multi-task model for improved overall performance.

We chose AKI, mechanical ventilation, CRRT Dialysis, and vasoactive medications as primary endpoints using 1%, 5%, and 10% of the training labels, along with 100% of labels for the helper tasks — labs and vitals, mortality, and LoS. Performance of both ST and SeqSNR decreased as the percentage of labels for the primary endpoint was reduced, but SeqSNR outperformed ST across all tasks and all training data reduction percentages, with a statistically significant boost in performance for all cases.

Label efficiency results showing the discriminative performance when the training dataset for the primary endpoint is reduced to 1%, 5% and 10% while the helper tasks have access to all training labels.

This is a useful finding, given the difficulties of annotating endpoint labels in EHR datasets, which frequently necessitates human evaluation by doctors. The ability to use numerous endpoints, some of which may be easier to label (like duration of stay or mortality), could lessen the need for manual curation on more difficult endpoints that are annotated differently (like mechanical ventilation).

Subgroup Performance
While the version of the MIMIC-III dataset used contained labels for gender and age, it did not contain information on race and the information on ethnicity was limited. We computed the performance of all selected models across age and gender subgroups. We observed that in the scenarios with few instances in the dataset, the MTL models (both SB models and SeqSNR) often outperform ST. Even though there are exceptions, on average all models seem to be relatively balanced across age and gender subgroups. We invite the reader to refer to the supplemental section of our paper for a detailed performance breakdown.

Next Steps
This work is a proof of concept for SeqSNR on a set of canonical EHR prediction tasks. The code for this architecture is publicly available here. And will hopefully stimulate further research in EHR multi-tasking and other deep learning architectures inspired by clinical reasoning.

In future, it will be important to evaluate the performance of SeqSNR on different combinations of tasks, different time horizons and different datasets. One other area of potential growth in this project is to expand subgroup analysis by including datasets with additional population information, race, ethnicity, etc. Another area we are exploring is expanding subgroup analysis by including datasets with additional population information, such as race, ethnicity, etc. We also emphasize that these are prototype models designed to showcase methodologies, and more rigorous evaluation would be needed to bring these tools into deployment.

This work involved collaborative efforts from a multidisciplinary team of researchers, software engineers, clinicians, and cross-functional contributors. We thank our co-authors: Eric Loreaux, Anne Mottram, Ivan Protsyuk, Natalie Harris, Sebastien Baur, Yuan Xue, Jessica Schrouff, Ali Connell, Alan Karthikesalingam, Martin Seneviratne from Google, Nenad Tomasev from Deepmind, and Hugh Montgomery from University College London. We also thank Zhe Zhao from Google Research and Kathryn Rough, Cian Hughes, Megumi Morigami and Doris Wong from Google Health for their input and review, and the MIMIC team for curating this open access dataset for the research community.


Google at ICML 2021

Groups across Google are actively pursuing research across the field of machine learning, ranging from theory to application. With scalable tools and architectures, we build machine learning systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, and more.

Google is proud to be a Platinum Sponsor of the thirty-eighth International Conference on Machine Learning (ICML 2021), a premier annual event happening this week. As a leader in machine learning research — with over 100 accepted publications and Googlers participating in workshops — we look forward to our continued partnership with the broader machine learning research community.

Registered for ICML 2021? We hope you’ll visit the Google virtual booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Take a look below to learn more about the Google research being presented at ICML 2021 (Google affiliations in bold).

Organizing Committee
ICML Board Members include: Corinna Cortes, Hugo Larochelle, Shakir Mohamed
ICML Emeritus Board includes: William Cohen, Andrew McCallum
Tutorial Co-Chair member: Quoc Lee

Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth
Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot
Joel Z. Leibo, Edgar Duéñez-Guzmán, Alexander Sasha Vezhnevets, John P. Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charles Beattie, Igor Mordatch, Thore Graepel

On the Optimality of Batch Policy Optimization Algorithms
Chenjun Xiao, Yifan Wu, Tor Lattimore, Bo Dai, Jincheng Mei, Lihong Li*, Csaba Szepesvari, Dale Schuurmans

Low-Rank Sinkhorn Factorization
Meyer Scetbon, Marco Cuturi, Gabriel Peyré

Oops I Took A Gradient: Scalable Sampling for Discrete Distributions
Will Grathwohl, Kevin Swersky, Milad Hashemi, David Duvenaud, Chris J. Maddison

PID Accelerated Value Iteration Algorithm
Amir-Massoud Farahmand, Mohammad Ghavamzadeh

Dueling Convex Optimization
Aadirupa Saha, Tomer Koren, Yishay Mansour

What Are Bayesian Neural Network Posteriors Really Like?
Pavel Izmailov, Sharad Vikram, Matthew D. Hoffman, Andrew Gordon Wilson

Offline Reinforcement Learning with Pseudometric Learning
Robert Dadashi, Shideh Rezaeifar, Nino Vieillard, Léonard Hussenot, Olivier Pietquin, Matthieu Geist

Revisiting Rainbow: Promoting More Insightful and Inclusive Deep Reinforcement Learning Research (see blog post)
Johan S. Obando-Ceron, Pablo Samuel Castro

EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
Seyed Kamyar Seyed Ghasemipour*, Dale Schuurmans, Shixiang Shane Gu

Variational Data Assimilation with a Learned Inverse Observation Operator
Thomas Frerix, Dmitrii Kochkov, Jamie A. Smith, Daniel Cremers, Michael P. Brenner, Stephan Hoyer

Tilting the Playing Field: Dynamical Loss Functions for Machine Learning
Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

Model-Based Reinforcement Learning via Latent-Space Collocation
Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, Sergey Levine

Momentum Residual Neural Networks
Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

OmniNet: Omnidirectional Representations from Transformers
Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Donald Metzler

Synthesizer: Rethinking Self-Attention for Transformer Models
Yi Tay, Dara Bahri, Donald Metzler, Da-Cheng Juan, Zhe Zhao, Che Zheng

Towards Domain-Agnostic Contrastive Learning
Vikas Verma, Minh-Thang Luong, Kenji Kawaguchi, Hieu Pham, Quoc V. Le

Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning
Shariq Iqbal, Christian A. Schroeder de Witt, Bei Peng, Wendelin Böhmer, Shimon Whiteson, Fei Sha

LIME: Learning Inductive Bias for Primitives of Mathematical Reasoning
Yuhuai Wu, Markus Rabe, Wenda Li, Jimmy Ba, Roger Grosse, Christian Szegedy

Emergent Social Learning via Multi-agent Reinforcement Learning
Kamal Ndousse, Douglas Eck, Sergey Levine, Natasha Jaques

Improved Contrastive Divergence Training of Energy-Based Models
Yilun Du, Shuang Li, Joshua Tenenbaum, Igor Mordatch

Characterizing Structural Regularities of Labeled Data in Overparameterized Models
Ziheng Jiang*, Chiyuan Zhang, Kunal Talwar, Michael Mozer

Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills
Yevgen Chebotar, Karol Hausman, Yao Lu, Ted Xiao, Dmitry Kalashnikov, Jake Varley, Alex Irpan, Benjamin Eysenbach, Ryan Julian, Chelsea Finn, Sergey Levine

PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning
Angelos Filos, Clare Lyle, Yarin Gal, Sergey Levine, Natasha Jaques, Gregory Farquhar

EfficientNetV2: Smaller Models and Faster Training
Mingxing Tan, Quoc V. Le

Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies
Paul Vicol, Luke Metz, Jascha Sohl-Dickstein

Federated Composite Optimization
Honglin Yuan*, Manzil Zaheer, Sashank Reddi

Light RUMs
Flavio Chierichetti, Ravi Kumar, Andrew Tomkins

Catformer: Designing Stable Transformers via Sensitivity Analysis
Jared Quincy Davis, Albert Gu, Krzysztof Choromanski, Tri Dao, Christopher Re, Chelsea Finn, Percy Liang

Representation Matters: Offline Pretraining for Sequential Decision Making
Mengjiao Yang, Ofir Nachum

Variational Empowerment as Representation Learning for Goal-Conditioned Reinforcement Learning
Jongwook Choi*, Archit Sharma*, Honglak Lee, Sergey Levine, Shixiang Shane Gu

Beyond Variance Reduction: Understanding the True Impact of Baselines on Policy Optimization
Wesley Chung, Valentin Thomas, Marlos C. Machado, Nicolas Le Roux

Whitening and Second Order Optimization Both Make Information in the Dataset Unusable During Training, and Can Reduce or Prevent Generalization
Neha S. Wadia, Daniel Duckworth, Samuel S. Schoenholz, Ethan Dyer, Jascha Sohl-Dickstein

Understanding Invariance via Feedforward Inversion of Discriminatively Trained Classifiers
Piotr Teterwak*, Chiyuan Zhang, Dilip Krishnan, Michael C. Mozer

Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning
Hiroki Furuta, Tatsuya Matsushima, Tadashi Kozuno, Yutaka Matsuo, Sergey Levine, Ofir Nachum, Shixiang Shane Gu

Hyperparameter Selection for Imitation Learning
Leonard Hussenot, Marcin Andrychowicz, Damien Vincent, Robert Dadashi, Anton Raichuk, Lukasz Stafiniak, Sertan Girgin, Raphael Marinier, Nikola Momchev, Sabela Ramos, Manu Orsini, Olivier Bachem, Matthieu Geist, Olivier Pietquin

Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces
Ankit Singh Rawat, Aditya Krishna Menon, Wittawat Jitkrittum, Sadeep Jayasumana, Felix X. Yu, Sashank J. Reddi, Sanjiv Kumar

Revenue-Incentive Tradeoffs in Dynamic Reserve Pricing
Yuan Deng, Sebastien Lahaie, Vahab Mirrokni, Song Zuo

Debiasing a First-Order Heuristic for Approximate Bi-Level Optimization
Valerii Likhosherstov, Xingyou Song, Krzysztof Choromanski, Jared Davis, Adrian Weller

Characterizing the Gap Between Actor-Critic and Policy Gradient
Junfeng Wen, Saurabh Kumar, Ramki Gummadi, Dale Schuurmans

Composing Normalizing Flows for Inverse Problems
Jay Whang, Erik Lindgren, Alexandros Dimakis

Online Policy Gradient for Model Free Learning of Linear Quadratic Regulators with √T Regret
Asaf Cassel, Tomer Koren

Learning to Price Against a Moving Target
Renato Paes Leme, Balasubramanian Sivan, Yifeng Teng, Pratik Worah

Fairness and Bias in Online Selection
Jose Correa, Andres Cristi, Paul Duetting, Ashkan Norouzi-Fard

The Impact of Record Linkage on Learning from Feature Partitioned Data
Richard Nock, Stephen Hardy, Wilko Henecka, Hamish Ivey-Law, Jakub Nabaglo, Giorgio Patrini, Guillaume Smith, Brian Thorne

Reserve Price Optimization for First Price Auctions in Display Advertising
Zhe Feng*, Sébastien Lahaie, Jon Schneider, Jinchao Ye

A Regret Minimization Approach to Iterative Learning Control
Naman Agarwal, Elad Hazan, Anirudha Majumdar, Karan Singh

A Statistical Perspective on Distillation
Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Seungyeon Kim, Sanjiv Kumar

Best Model Identification: A Rested Bandit Formulation
Leonardo Cella, Massimiliano Pontil, Claudio Gentile

Generalised Lipschitz Regularisation Equals Distributional Robustness
Zac Cranko, Zhan Shi, Xinhua Zhang, Richard Nock, Simon Kornblith

Stochastic Multi-Armed Bandits with Unrestricted Delay Distributions
Tal Lancewicki, Shahar Segal, Tomer Koren, Yishay Mansour

Regularized Online Allocation Problems: Fairness and Beyond
Santiago Balseiro, Haihao Lu, Vahab Mirrokni

Implicit Rate-Constrained Optimization of Non-decomposable Objectives
Abhishek Kumar, Harikrishna Narasimhan, Andrew Cotter

Leveraging Non-uniformity in First-Order Non-Convex Optimization
Jincheng Mei, Yue Gao, Bo Dai, Csaba Szepesvari, Dale Schuurmans

Dynamic Balancing for Model Selection in Bandits and RL
Ashok Cutkosky, Christoph Dann, Abhimanyu Das, Claudio Gentile, Aldo Pacchiano, Manish Purohit

Adversarial Dueling Bandits
Aadirupa Saha, Tomer Koren, Yishay Mansour

Optimizing Black-Box Metrics with Iterative Example Weighting
Gaurush Hiranandani*, Jatin Mathur, Harikrishna Narasimhan, Mahdi Milani Fard, Oluwasanmi Koyejo

Relative Deviation Margin Bounds
Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh

MC-LSTM: Mass-Conserving LSTM
Pieter-Jan Hoedt, Frederik Kratzert, Daniel Klotz, Christina Halmich, Markus Holzleitner, Grey Nearing, Sepp Hochreiter, Günter Klambauer

12-Lead ECG Reconstruction via Koopman Operators
Authors:Tomer Golany, Kira Radinsky, Daniel Freedman, Saar Minha

Finding Relevant Information via a Discrete Fourier Expansion
Mohsen Heidari, Jithin Sreedharan, Gil Shamir, Wojciech Szpankowski

LEGO: Latent Execution-Guided Reasoning for Multi-Hop Question Answering on Knowledge Graphs
Hongyu Ren, Hanjun Dai, Bo Dai, Xinyun Chen, Michihiro Yasunaga, Haitian Sun, Dale Schuurmans, Jure Leskovec, Denny Zhou

SpreadsheetCoder: Formula Prediction from Semi-structured Context
Xinyun Chen, Petros Maniatis, Rishabh Singh, Charles Sutton, Hanjun Dai, Max Lin, Denny Zhou

Combinatorial Blocking Bandits with Stochastic Delays
Alexia Atsidakou, Orestis Papadigenopoulos, Soumya Basu, Constantine Caramani, Sanjay Shakkottai

Beyond log2(T) Regret for Decentralized Bandits in Matching Markets
Soumya Basu, Karthik Abinav Sankararaman, Abishek Sankararaman

Robust Pure Exploration in Linear Bandits with Limited Budget
Ayya Alieva, Ashok Cutkosky, Abhimanyu Das

Latent Programmer: Discrete Latent Codes for Program Synthesis
Joey Hong, David Dohan, Rishabh Singh, Charles Sutton, Manzil Zaheer

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (see blog post)
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig

On Linear Identifiability of Learned Representations
Geoffrey Roeder, Luke Metz, Diederik P. Kingma

Hierarchical Clustering of Data Streams: Scalable Algorithms and Approximation Guarantees
Anand Rajagopalan, Fabio Vitale, Danny Vainstein, Gui Citovsky, Cecilia M Procopiuc, Claudio Gentile

Differentially Private Quantiles
Jennifer Gillenwater, Matthew Joseph, Alex Kulesza

Active Covering
Heinrich Jiang, Afshin Rostamizadeh

Sharf: Shape-Conditioned Radiance Fields from a Single View
Konstantinos Rematas, Ricardo Martin-Brualla, Vittorio Ferrari

Learning a Universal Template for Few-Shot Dataset Generalization
Eleni Triantafillou*, Hugo Larochelle, Richard Zemel, Vincent Dumoulin

Private Alternating Least Squares: Practical Private Matrix Completion with Tighter Rates
Steve Chien, Prateek Jain, Walid Krichene, Steffen Rendle, Shuang Song, Abhradeep Thakurta, Li Zhang

Differentially-Private Clustering of Easy Instances
Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, Eliad Tsfadia

Label-Only Membership Inference Attacks
Christopher A. Choquette-Choo, Florian Tramèr, Nicholas Carlini, Nicolas Papernot

Neural Feature Matching in Implicit 3D Representations
Yunlu Chen, Basura Fernando, Hakan Bilen, Thomas Mensink, Efstratios Gavves

Locally Private k-Means in One Round
Alisa Chang, Badih Ghazi, Ravi Kumar, Pasin Manurangsi

Large-Scale Meta-Learning with Continual Trajectory Shifting
Jaewoong Shin, Hae Beom Lee, Boqing Gong, Sung Ju Hwang

Statistical Estimation from Dependent Data
Vardis Kandiros, Yuval Dagan, Nishanth Dikkala, Surbhi Goel, Constantinos Daskalakis

Oneshot Differentially Private Top-k Selection
Gang Qiao, Weijie J. Su, Li Zhang

Unsupervised Part Representation by Flow Capsules
Sara Sabour, Andrea Tagliasacchi, Soroosh Yazdani, Geoffrey E. Hinton, David J. Fleet

Private Stochastic Convex Optimization: Optimal Rates in L1 Geometry
Hilal Asi, Vitaly Feldman, Tomer Koren, Kunal Talwar

Practical and Private (Deep) Learning Without Sampling or Shuffling
Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, Zheng Xu

Differentially Private Aggregation in the Shuffle Model: Almost Central Accuracy in Almost a Single Message
Badih Ghazi, Ravi Kumar, Pasin Manurangsi, Rasmus Pagh, Amer Sinha

Leveraging Public Data for Practical Private Query Release
Terrance Liu, Giuseppe Vietri, Thomas Steinke, Jonathan Ullman, Zhiwei Steven Wu

Meta-Thompson Sampling
Branislav Kveton, Mikhail Konobeev, Manzil Zaheer, Chih-wei Hsu, Martin Mladenov, Craig Boutilier, Csaba Szepesvári

Implicit-PDF: Non-Parametric Representation of Probability Distributions on the Rotation Manifold
Kieran A Murphy, Carlos Esteves, Varun Jampani, Srikumar Ramalingam, Ameesh Makadia

Improving Ultrametrics Embeddings Through Coresets
Vincent Cohen-Addad, Rémi de Joannis de Verclos, Guillaume Lagarde

A Discriminative Technique for Multiple-Source Adaptation
Corinna Cortes, Mehryar Mohri, Ananda Theertha Suresh, Ningshan Zhang

Self-Supervised and Supervised Joint Training for Resource-Rich Machine Translation
Yong Cheng, Wei Wang*, Lu Jiang, Wolfgang Macherey

Correlation Clustering in Constant Many Parallel Rounds
Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, Jakub Tarnawski

Hierarchical Agglomerative Graph Clustering in Nearly-Linear Time
Laxman Dhulipala, David Eisenstat, Jakub Łącki, Vahab Mirrokni, Jessica Shi

Meta-Learning Bidirectional Update Rules
Mark Sandler, Max Vladymyrov, Andrey Zhmoginov, Nolan Miller, Andrew Jackson, Tom Madams, Blaise Aguera y Arcas

Discretization Drift in Two-Player Games
Mihaela Rosca, Yan Wu, Benoit Dherin, David G.T. Barrett

Reasoning Over Virtual Knowledge Bases With Open Predicate Relations
Haitian Sun*, Pat Verga, Bhuwan Dhingra, Ruslan Salakhutdinov, William W. Cohen

Learn2Hop: Learned Optimization on Rough Landscapes
Amil Merchant, Luke Metz, Samuel Schoenholz, Ekin Cubuk

Locally Adaptive Label Smoothing Improves Predictive Churn
Dara Bahri, Heinrich Jiang

Overcoming Catastrophic Forgetting by Bayesian Generative Regularization
Patrick H. Chen, Wei Wei, Cho-jui Hsieh, Bo Dai

Workshops (only Google affiliations are noted)
LatinX in AI (LXAI) Research at ICML 2021
Hosts: Been Kim, Natasha Jaques

Uncertainty and Robustness in Deep Learning
Organizers: Balaji Lakshminarayanan, Jasper Snoek Invited Speaker: Dustin Tran

Reinforcement Learning for Real Life
Organizers: Minmin Chen, Lihong Li Invited Speaker: Ed Chi

Interpretable Machine Learning in Healthcare
Organizers: Alan Karthikesalingam Invited Speakers: Abhijit Guha Roy, Jim Winkens

The Neglected Assumptions in Causal Inference
Organizer: Alexander D’Amour

ICML Workshop on Algorithmic Recourse
Invited Speakers: Been Kim, Berk Ustun

A Blessing in Disguise: The Prospects and Perils of Adversarial Machine Learning
Invited Speaker: Nicholas Carlini

Overparameterization: Pitfalls and Opportunities
Organizers: Yasaman Bahri, Hanie Sedghi

Information-Theoretic Methods for Rigorous, Responsible, and Reliable Machine Learning (ITR3)
Invited Speaker: Thomas Steinke

Beyond First-Order Methods in Machine Learning Systems
Invited Speaker: Courtney Paquette

ICML 2021 Workshop: Self-Supervised Learning for Reasoning and Perception
Invited Speaker: Chelsea Finn

Workshop on Reinforcement Learning Theory
Invited Speaker: Bo Dai

Tutorials (only Google affiliations are noted)
Responsible AI in Industry: Practical Challenges and Lessons Learned
Organizers: Ben Packer

Online and Non-stochastic Control
Organizers: Elad Hazan

Random Matrix Theory and ML (RMT +ML)
Organizers: Fabian Pedregosa, Jeffrey Pennington, Courntey Paquette Self-Attention for Computer Vision Organizers: Prajit Ramachandran, Ashish Vaswani

* Indicates work done while at Google


Why aren’t you making math videos? (Also, now there’s a 3b1b podcast)


High Fidelity Image Generation Using Diffusion Models

Natural image synthesis is a broad class of machine learning (ML) tasks with wide-ranging applications that pose a number of design challenges. One example is image super-resolution, in which a model is trained to transform a low resolution image into a detailed high resolution image (e.g., RAISR). Super-resolution has many applications that can range from restoring old family portraits to improving medical imaging systems. Another such image synthesis task is class-conditional image generation, in which a model is trained to generate a sample image from an input class label. The resulting generated sample images can be used to improve performance of downstream models for image classification, segmentation, and more.

Generally, these image synthesis tasks are performed by deep generative models, such as GANs, VAEs, and autoregressive models. Yet each of these generative models has its downsides when trained to synthesize high quality samples on difficult, high resolution datasets. For example, GANs often suffer from unstable training and mode collapse, and autoregressive models typically suffer from slow synthesis speed.

Alternatively, diffusion models, originally proposed in 2015, have seen a recent revival in interest due to their training stability and their promising sample quality results on image and audio generation. Thus, they offer potentially favorable trade-offs compared to other types of deep generative models. Diffusion models work by corrupting the training data by progressively adding Gaussian noise, slowly wiping out details in the data until it becomes pure noise, and then training a neural network to reverse this corruption process. Running this reversed corruption process synthesizes data from pure noise by gradually denoising it until a clean sample is produced. This synthesis procedure can be interpreted as an optimization algorithm that follows the gradient of the data density to produce likely samples.

Today we present two connected approaches that push the boundaries of the image synthesis quality for diffusion models — Super-Resolution via Repeated Refinements (SR3) and a model for class-conditioned synthesis, called Cascaded Diffusion Models (CDM). We show that by scaling up diffusion models and with carefully selected data augmentation techniques, we can outperform existing approaches. Specifically, SR3 attains strong image super-resolution results that surpass GANs in human evaluations. CDM generates high fidelity ImageNet samples that surpass BigGAN-deep and VQ-VAE2 on both FID score and Classification Accuracy Score by a large margin.

SR3: Image Super-Resolution
SR3 is a super-resolution diffusion model that takes as input a low-resolution image, and builds a corresponding high resolution image from pure noise. The model is trained on an image corruption process in which noise is progressively added to a high-resolution image until only pure noise remains. It then learns to reverse this process, beginning from pure noise and progressively removing noise to reach a target distribution through the guidance of the input low-resolution image..

<!– –>

With large scale training, SR3 achieves strong benchmark results on the super-resolution task for face and natural images when scaling to resolutions 4x–8x that of the input low-resolution image. These super-resolution models can further be cascaded together to increase the effective super-resolution scale factor, e.g., stacking a 64×64 → 256×256 and a 256×256 → 1024×1024 face super-resolution model together in order to perform a 64×64 → 1024×1024 super-resolution task.

We compare SR3 with existing methods using human evaluation study. We conduct a Two-Alternative Forced Choice Experiment where subjects are asked to choose between the reference high resolution image, and the model output when asked the question, “Which image would you guess is from a camera?” We measure the performance of the model through confusion rates (% of time raters choose the model outputs over reference images, where a perfect algorithm would achieve a 50% confusion rate). The results of this study are shown in the figure below.

Above: We achieve close to 50% confusion rate on the task of 16×16 → 128×128 faces, outperforming state-of-the-art face super-resolution methods PULSE and FSRGAN. Below: We also achieve a 40% confusion rate on the much more difficult task of 64×64 → 256×256 natural images, outperforming the regression baseline by a large margin.

CDM: Class-Conditional ImageNet Generation
Having shown the effectiveness of SR3 in performing natural image super-resolution, we go a step further and use these SR3 models for class-conditional image generation. CDM is a class-conditional diffusion model trained on ImageNet data to generate high-resolution natural images. Since ImageNet is a difficult, high-entropy dataset, we built CDM as a cascade of multiple diffusion models. This cascade approach involves chaining together multiple generative models over several spatial resolutions: one diffusion model that generates data at a low resolution, followed by a sequence of SR3 super-resolution diffusion models that gradually increase the resolution of the generated image to the highest resolution. It is well known that cascading improves quality and training speed for high resolution data, as shown by previous studies (for example in autoregressive models and VQ-VAE-2) and in concurrent work for diffusion models. As demonstrated by our quantitative results below, CDM further highlights the effectiveness of cascading in diffusion models for sample quality and usefulness in downstream tasks, such as image classification.

Example of the cascading pipeline that includes a sequence of diffusion models: the first generates a low resolution image, and the rest perform upsampling to the final high resolution image. Here the pipeline is for class-conditional ImageNet generation, which begins with a class-conditional diffusion model at 32×32 resolution, followed by 2x and 4x class-conditional super-resolution using SR3.
Selected generated images from our 256×256 cascaded class-conditional ImageNet model.

Along with including the SR3 model in the cascading pipeline, we also introduce a new data augmentation technique, which we call conditioning augmentation, that further improves the sample quality results of CDM. While the super-resolution models in CDM are trained on original images from the dataset, during generation they need to perform super-resolution on the images generated by a low-resolution base model, which may not be of sufficiently high quality in comparison to the original images. This leads to a train-test mismatch for the super-resolution models. Conditioning augmentation refers to applying data augmentation to the low-resolution input image of each super-resolution model in the cascading pipeline. These augmentations, which in our case include Gaussian noise and Gaussian blur, prevents each super-resolution model from overfitting to its lower resolution conditioning input, eventually leading to better higher resolution sample quality for CDM.

Altogether, CDM generates high fidelity samples superior to BigGAN-deep and VQ-VAE-2 in terms of both FID score and Classification Accuracy Score on class-conditional ImageNet generation. CDM is a pure generative model that does not use a classifier to boost sample quality, unlike other models such as ADM and VQ-VAE-2. See below for quantitative results on sample quality.

Class-conditional ImageNet FID scores at the 256×256 resolution for methods that do not use extra classifiers to boost sample quality. BigGAN-deep is reported at its best truncation value. (Lower is better.)
ImageNet classification accuracy scores at the 256×256 resolution, measuring the validation set accuracy of a classifier trained on generated data. CDM generated data attains significant gains over existing methods, closing the gap in classification accuracy between real and generated data. (Higher is better.)

With SR3 and CDM, we have pushed the performance of diffusion models to state-of-the-art on super-resolution and class-conditional ImageNet generation benchmarks. We are excited to further test the limits of diffusion models for a wide variety of generative modeling problems. For more information on our work, please visit Image Super-Resolution via Iterative Refinement and Cascaded Diffusion Models for High Fidelity Image Generation.

We thank our co-authors William Chan, Mohammad Norouzi, Tim Salimans, and David Fleet, and we are grateful for research discussions and assistance from Ben Poole, Jascha Sohl-Dickstein, Doug Eck, and the rest of the Google Research, Brain Team.


Speeding Up Reinforcement Learning with a New Physics Simulation Engine

Reinforcement learning (RL) is a popular method for teaching robots to navigate and manipulate the physical world, which itself can be simplified and expressed as interactions between rigid bodies1 (i.e., solid physical objects that do not deform when a force is applied to them). In order to facilitate the collection of training data in a practical amount of time, RL usually leverages simulation, where approximations of any number of complex objects are composed of many rigid bodies connected by joints and powered by actuators. But this poses a challenge: it frequently takes millions to billions of simulation frames for an RL agent to become proficient at even simple tasks, such as walking, using tools, or assembling toy blocks.

While progress has been made to improve training efficiency by recycling simulation frames, some RL tools instead sidestep this problem by distributing the generation of simulation frames across many simulators. These distributed simulation platforms yield impressive results that train very quickly, but they must run on compute clusters with thousands of CPUs or GPUs which are inaccessible to most researchers.

In “Brax – A Differentiable Physics Engine for Large Scale Rigid Body Simulation”, we present a new physics simulation engine that matches the performance of a large compute cluster with just a single TPU or GPU. The engine is designed to both efficiently run thousands of parallel physics simulations alongside a machine learning (ML) algorithm on a single accelerator and scale millions of simulations seamlessly across pods of interconnected accelerators. We’ve open sourced the engine along with reference RL algorithms and simulation environments that are all accessible via Colab. Using this new platform, we demonstrate 100-1000x faster training compared to a traditional workstation setup.

Three typical RL workflows. The left shows a typical workstation flow: on a single machine, with the environment on CPU, training takes hours or days. The middle shows a typical distributed simulation flow: training takes minutes by farming simulation out to thousands of machines. The right shows the Brax flow: learning and large batch simulation occur side by side on a single CPU/GPU chip.

Physics Simulation Engine Design Opportunities
Rigid body physics are used in video games, robotics, molecular dynamics, biomechanics, graphics and animation, and other domains. In order to accurately model such systems, simulators integrate forces from gravity, motor actuation, joint constraints, object collisions, and others to simulate the motion of a physical system across time.

Simulation of three spherical bodies, a wall, two joints, and one actuator. For each simulation timestep, forces and torques are integrated together to update the positions, rotations, and velocities of each physical body.

Taking a closer look at how most physics simulation engines are designed today, there are a few large opportunities to improve efficiency. As we noted above, a typical robotics learning pipeline places a single learner in a tight feedback with many simulations in parallel, but upon analyzing this architecture, one finds that:

  1. This layout imposes an enormous latency bottleneck. Because the data must travel over the network within a datacenter, the learner must wait for 10,000+ nanoseconds to fetch experience from the simulator. Were this experience instead already on the same device as the learner’s neural network, latency would drop to <1 nanosecond.
  2. The computation necessary for training the agent (one simulation step, followed by one update of the agent’s neural network) is overshadowed by the computation spent packaging the data (i.e., marshalling data within the engine, then into a wire format such as protobuf, then into TCP buffers, and then undoing all these steps on the learner side).
  3. The computations happening within each simulator are remarkably similar, but not exactly the same.

Brax Design
In response to these observations, Brax is designed so that its physics calculations are exactly the same across each of its thousands of parallel environments by ensuring that the simulation is free of branches (i.e., simulation “if” logic that diverges as a result of the environment state). An example of a branch in a physics engine is the application of a contact force between a ball and a wall: different code paths will execute depending on whether the ball is touching the wall. That is, if the ball contacts the wall, separate code for simulating the ball’s bounce off the wall will execute. Brax employs a mix of the following three strategies to avoid branching:

  • Replace the discrete branching logic with a continuous function, such as approximating the ball-wall contact force using a signed distance function. This approach results in the most efficiency gains.
  • Evaluate the branch during JAX’s just-in-time compile. Many branches based on static properties of the environment, such as whether it’s even possible for two objects to collide, may be evaluated prior to simulation time.
  • Run both sides of the branch during simulation but then select only the required results. Because this executes some code that isn’t ultimately used, it wastes operations compared to the above.

Once the calculations are guaranteed to be exactly uniform, the entire training architecture can be reduced in complexity to be executed on a single TPU or GPU. Doing so removes the computational overhead and latency of cross-machine communication. In practice, these changes lower the cost of training by 100x-1000x for comparable workloads.

Brax Environments
Environments are tiny packaged worlds that define a task for an RL agent to learn. Environments contain not only the means to simulate a world, but also functions, such as how to observe the world and the definition of the goal in that world.

A few standard benchmark environments have emerged in recent years for testing new RL algorithms and for evaluating the impact of those algorithms using metrics commonly understood by research scientists. Brax includes four such ready-to-use environments that come from the popular OpenAI gym: Ant, HalfCheetah, Humanoid, and Reacher.


From left to right: Ant, HalfCheetah, Humanoid, and Reacher are popular baseline environments for RL research.

Brax also includes three novel environments: dexterous manipulation of an object (a popular challenge in robotics), generalized locomotion (an agent that goes to a target placed anywhere around it), and a simulation of an industrial robot arm.

Left: Grasp, a claw hand that learns dexterous manipulation. Middle: Fetch, a toy, box-like dog learns a general goal-based locomotion policy. Right: Simulation of UR5e, an industrial robot arm.

Performance Benchmarks
The first step for analyzing Brax’s performance is to measure the speed at which it can simulate large batches of environments, because this is the critical bottleneck to overcome in order for the learner to consume enough experience to learn quickly.

These two graphs below show how many physics steps (updates to the state of the environment) Brax can produce as it is tasked with simulating more and more environments in parallel. The graph on the left shows that Brax scales the number of steps per second linearly with the number of parallel environments, only hitting memory bandwidth bottlenecks at 10,000 environments, which is not only enough for training single agents, but also suitable for training entire populations of agents. The graph on the right shows two things: first, that Brax performs well not only on TPU, but also on high-end GPUs (see the V100 and P100 curves), and second, that by leveraging JAX’s device parallelism primitives, Brax scales seamlessly across multiple devices, reaching hundreds of millions of physics steps per second (see the TPUv3 8×8 curve, which is 64 TPUv3 chips directly connected to each other over a high speed interconnect) .

Left: Scaling of the simulation steps per second for each Brax environment on a 4×2 TPU v3. Right: Scaling of the simulation steps per second for several accelerators on the Ant environment.

Another way to analyze Brax’s performance is to measure its impact on the time it takes to run a reinforcement learning experiment on a single workstation. Here we compare Brax training the popular Ant benchmark environment to its OpenAI counterpart, powered by the MuJoCo physics engine.

In the graph below, the blue line represents a standard workstation setup, where a learner runs on the GPU and the simulator runs on the CPU. We see that the time it takes to train an ant to run with reasonable proficiency (a score of 4000 on the y axis) drops from about 3 hours for the blue line, to about 10 seconds using Brax on accelerator hardware. It’s interesting to note that even on CPU alone (the grey line), Brax performs more than an order of magnitude faster, benefitting from learner and simulator both sitting in the same process.

Brax’s optimized PPO versus a standard GPU-backed PPO learning the MuJoCo-Ant-v2 environment, evaluated for 10 million steps. Note the x-axis is log-wallclock-time in seconds. Shaded region indicates lowest and highest performing seeds over 5 replicas, and solid line indicates mean.

Physics Fidelity
Designing a simulator that matches the behavior of the real world is a known hard problem that this work does not address. Nevertheless, it is useful to compare Brax to a reference simulator to ensure it is producing output that is at least as valid. In this case, we again compare Brax to MuJoCo, which is well-regarded for its simulation quality. We expect to see that, all else being equal, a policy has a similar reward trajectory whether trained in MuJoCo or Brax.

MuJoCo-Ant-v2 vs. Brax Ant, showing the number of environment steps plotted against the average episode score achieved for the environment. Both environments were trained with the same standard implementation of SAC. Shaded region indicates lowest and highest performing seeds over five runs, and solid line indicates the mean.

These curves show that as the reward rises at about the same rate for both simulators, both engines compute physics with a comparable level of complexity or difficulty to solve. And as both curves top out at about the same reward, we have confidence that the same general physical limits apply to agents operating to the best of their ability in either simulation.

We can also measure Brax’s ability to conserve linear momentum, angular momentum, and energy.

Linear momentum (left), angular momentum (middle), and energy (right) non-conservation scaling for Brax as well as several other physics engines. The y-axis indicates drift from the expected calculation (higher is smaller drift, which is better), and the x axis indicates the amount of time being simulated.

This measure of physics simulation quality was first proposed by the authors of MuJoCo as a way to understand how the simulation drifts off course as it is tasked with computing larger and larger time steps. Here, Brax performs similarly as its neighbors.

We invite researchers to perform a more qualitative measure of Brax’s physics fidelity by training their own policies in the Brax Training Colab. The learned trajectories are recognizably similar to those seen in OpenAI Gym.

Our work makes fast, scalable RL and robotics research much more accessible — what was formerly only possible via large compute clusters can now be run on workstations, or for free via hosted Google Colaboratory. Our Github repository includes not only the Brax simulation engine, but also a host of reference RL algorithms for fast training. We can’t wait to see what kind of new research Brax enables.

We’d like to thank our paper co-authors: Anton Raichuk, Sertan Girgin, Igor Mordatch, and Olivier Bachem. We also thank Erwin Coumans for advice on building physics engines, Blake Hechtman and James Bradbury for providing optimization help with JAX and XLA, and Luke Metz and Shane Gu for their advice. We’d also like to thank Vijay Sundaram, Wright Bagwell, Matt Leffler, Gavin Dodd, Brad Mckee, and Logan Olson, for helping to incubate this project.

1 Due to the complexity of the real world, there is also ongoing research exploring the physics of deformable bodies


Starting to think about AI Fairness

The topic of AI fairness metrics is as important to society as it is confusing. Confusing it is due to a number of reasons: terminological proliferation, abundance of formulae, and last not least the impression that everyone else seems to know what they’re talking about. This text hopes to counteract some of that confusion by starting from a common-sense approach of contrasting two basic positions: On the one hand, the assumption that dataset features may be taken as reflecting the underlying concepts ML practitioners are interested in; on the other, that there inevitably is a gap between concept and measurement, a gap that may be bigger or smaller depending on what is being measured. In contrasting these fundamental views, we bring together concepts from ML, legal science, and political philosophy.