Category: Offsites

Stuff I find valuable at key websites

Characterizing Emergent Phenomena in Large Language Models

Post author By
Post date November 13, 2022
No Comments on Characterizing Emergent Phenomena in Large Language Models

Posted by Jason Wei and Yi Tay, Research Scientists, Google Research, Brain Team

The field of natural language processing (NLP) has been revolutionized by language models trained on large amounts of text data. Scaling up the size of language models often leads to improved performance and sample efficiency on a range of downstream NLP tasks. In many cases, the performance of a large language model can be predicted by extrapolating the performance trend of smaller models. For instance, the effect of scale on language model perplexity has been empirically shown to span more than seven orders of magnitude.

On the other hand, performance for certain other tasks does not improve in a predictable fashion. For example, the GPT-3 paper showed that the ability of language models to perform multi-digit addition has a flat scaling curve (approximately random performance) for models from 100M to 13B parameters, at which point the performance jumped substantially. Given the growing use of language models in NLP research and applications, it is important to better understand abilities such as these that can arise unexpectedly.

In “Emergent Abilities of Large Language Models,” recently published in the Transactions on Machine Learning Research (TMLR), we discuss the phenomena of emergent abilities, which we define as abilities that are not present in small models but are present in larger models. More specifically, we study emergence by analyzing the performance of language models as a function of language model scale, as measured by total floating point operations (FLOPs), or how much compute was used to train the language model. However, we also explore emergence as a function of other variables, such as dataset size or number of model parameters (see the paper for full details). Overall, we present dozens of examples of emergent abilities that result from scaling up language models. The existence of such emergent abilities raises the question of whether additional scaling could potentially further expand the range of capabilities of language models.

Emergent Prompted Tasks

First we discuss emergent abilities that may arise in prompted tasks. In such tasks, a pre-trained language model is given a prompt for a task framed as next word prediction, and it performs the task by completing the response. Without any further fine-tuning, language models can often perform tasks that were not seen during training.

Example of few-shot prompting on movie review sentiment classification. The model is given one example of a task (classifying a movie review as positive or negative) and then performs the task on an unseen example.

We call a prompted task emergent when it unpredictably surges from random performance to above-random at a specific scale threshold. Below we show three examples of prompted tasks with emergent performance: multi-step arithmetic, taking college-level exams, and identifying the intended meaning of a word. In each case, language models perform poorly with very little dependence on model size up to a threshold at which point their performance suddenly begins to excel.

The ability to perform multi-step arithmetic (left), succeed on college-level exams (middle), and identify the intended meaning of a word in context (right) all emerge only for models of sufficiently large scale. The models shown include LaMDA, GPT-3, Gopher, Chinchilla, and PaLM.

Performance on these tasks only becomes non-random for models of sufficient scale — for instance, above 10²² training FLOPs for the arithmetic and multi-task NLU tasks, and above 10²⁴ training FLOPs for the word in context tasks. Note that although the scale at which emergence occurs can be different for different tasks and models, no model showed smooth improvement in behavior on any of these tasks. Dozens of other emergent prompted tasks are listed in our paper.

Emergent Prompting Strategies

The second class of emergent abilities encompasses prompting strategies that augment the capabilities of language models. Prompting strategies are broad paradigms for prompting that can be applied to a range of different tasks. They are considered emergent when they fail for small models and can only be used by a sufficiently-large model.

One example of an emergent prompting strategy is called “chain-of-thought prompting”, for which the model is prompted to generate a series of intermediate steps before giving the final answer. Chain-of-thought prompting enables language models to perform tasks requiring complex reasoning, such as a multi-step math word problem. Notably, models acquire the ability to do chain-of-thought reasoning without being explicitly trained to do so. An example of chain-of-thought prompting is shown in the figure below.

Chain of thought prompting enables sufficiently large models to solve multi-step reasoning problems.

The empirical results of chain-of-thought prompting are shown below. For smaller models, applying chain-of-thought prompting does not outperform standard prompting, for example, when applied to GSM8K, a challenging benchmark of math word problems. However, for large models (10²⁴ FLOPs), chain-of-thought prompting substantially improves performance in our tests, reaching a 57% solve rate on GSM8K.

Chain-of-thought prompting is an emergent ability — it fails to improve performance for small language models, but substantially improves performance for large models. Here we illustrate the difference between standard and chain-of-thought prompting at different scales for two language models, LaMDA and PaLM.

Implications of Emergent Abilities

The existence of emergent abilities has a range of implications. For example, because emergent few-shot prompted abilities and strategies are not explicitly encoded in pre-training, researchers may not know the full scope of few-shot prompted abilities of current language models. Moreover, the emergence of new abilities as a function of model scale raises the question of whether further scaling will potentially endow even larger models with new emergent abilities.

Identifying emergent abilities in large language models is a first step in understanding such phenomena and their potential impact on future model capabilities. Why does scaling unlock emergent abilities? Because computational resources are expensive, can emergent abilities be unlocked via other methods without increased scaling (e.g., better model architectures or training techniques)? Will new real-world applications of language models become unlocked when certain abilities emerge? Analyzing and understanding the behaviors of language models, including emergent behaviors that arise from scaling, is an important research question as the field of NLP continues to grow.

Acknowledgements

It was an honor and privilege to work with Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus.

Offsites

Researchers thought this was a bug (Borwein integrals)

Post author By
Post date November 4, 2022
No Comments on Researchers thought this was a bug (Borwein integrals)

Offsites

Natural Language Assessment: A New Framework to Promote Education

Post author By
Post date October 26, 2022
No Comments on Natural Language Assessment: A New Framework to Promote Education

Posted by Kedem Snir, Software Engineer, and Gal Elidan, Senior Staff Research Scientist, Google Research

Whether it’s a professional honing their skills or a child learning to read, coaches and educators play a key role in assessing the learner’s answer to a question in a given context and guiding them towards a goal. These interactions have unique characteristics that set them apart from other forms of dialogue, yet are not available when learners practice alone at home. In the field of natural language processing, this type of capability has not received much attention and is technologically challenging. We set out to explore how we can use machine learning to assess answers in a way that facilitates learning.

In this blog, we introduce an important natural language understanding (NLU) capability called Natural Language Assessment (NLA), and discuss how it can be helpful in the context of education. While typical NLU tasks focus on the user’s intent, NLA allows for the assessment of an answer from multiple perspectives. In situations where a user wants to know how good their answer is, NLA can offer an analysis of how close the answer is to what is expected. In situations where there may not be a “correct” answer, NLA can offer subtle insights that include topicality, relevance, verbosity, and beyond. We formulate the scope of NLA, present a practical model for carrying out topicality NLA, and showcase how NLA has been used to help job seekers practice answering interview questions with Google’s new interview prep tool, Interview Warmup.

Overview of Natural Language Assessment (NLA)

The goal of NLA is to evaluate the user’s answer against a set of expectations. Consider the following components for an NLA system interacting with students:

A question presented to the student
Expectations that define what we expect to find in the answer (e.g., a concrete textual answer, a set of topics we expect the answer to cover, conciseness)
An answer provided by the student
An assessment output (e.g., correctness, missing information, too specific or general, stylistic feedback, pronunciation, etc.)
[Optional] A context (e.g., a chapter in a book or an article)

With NLA, both the expectations about the answer and the assessment of the answer can be very broad. This enables teacher-student interactions that are more expressive and subtle. Here are two examples:

A question with a concrete correct answer: Even in situations where there is a clear correct answer, it can be helpful to assess the answer more subtly than simply correct or incorrect. Consider the following:

Context: Harry Potter and the Philosopher’s Stone
Question: “What is Hogwarts?”
Expectation: “Hogwarts is a school of Witchcraft and Wizardry” [expectation is given as text]
Answer: “I am not exactly sure, but I think it is a school.”

The answer may be missing salient details but labeling it as incorrect wouldn’t be entirely true or useful to a user. NLA can offer a more subtle understanding by, for example, identifying that the student’s answer is too general, and also that the student is uncertain.

Illustration of the NLA process from input question, answer and expectation to assessment output

This kind of subtle assessment, along with noting the uncertainty the student expressed, can be important in helping students build skills in conversational settings.
Topicality expectations: There are many situations in which a concrete answer is not expected. For example, if a student is asked an opinion question, there is no concrete textual expectation. Instead, there’s an expectation of relevance and opinionation, and perhaps some level of succinctness and fluency. Consider the following interview practice setup:

Question: “Tell me a little about yourself?”
Expectations: { “Education”, “Experience”, “Interests” } (a set of topics)
Answer: “Let’s see. I grew up in the Salinas valley in California and went to Stanford where I majored in economics but then got excited about technology so next I ….”

In this case, a useful assessment output would map the user’s answer to a subset of the topics covered, possibly along with a markup of which parts of the text relate to which topic. This can be challenging from an NLP perspective as answers can be long, topics can be mixed, and each topic on its own can be multi-faceted.

A Topicality NLA Model

In principle, topicality NLA is a standard multi-class task for which one can readily train a classifier using standard techniques. However, training data for such scenarios is scarce and it would be costly and time consuming to collect for each question and topic. Our solution is to break each topic into granular components that can be identified using large language models (LLMs) with a straightforward generic tuning.

We map each topic to a list of underlying questions and define that if the sentence contains an answer to one of those underlying questions, then it covers that topic. For the topic “Experience” we might choose underlying questions such as:

Where did you work?
What did you study?
…

While for the topic “Interests” we might choose underlying questions such as:

What are you interested in?
What do you enjoy doing?
…

These underlying questions are designed through an iterative manual process. Importantly, since these questions are sufficiently granular, current language models (see details below) can capture their semantics. This allows us to offer a zero-shot setting for the NLA topicality task: once trained (more on the model below), it is easy to add new questions and new topics, or adapt existing topics by modifying their underlying content expectation without the need to collect topic specific data. See below the model’s predictions for the sentence “I’ve worked in retail for 3 years” for the two topics described above:

A diagram of how the model uses underlying questions to predict the topic most likely to be covered by the user’s answer.

Since an underlying question for the topic “Experience” was matched, the sentence would be classified as “Experience”.

Application: Helping Job Seekers Prepare for Interviews

Interview Warmup is a new tool developed in collaboration with job seekers to help them prepare for interviews in fast-growing fields of employment such as IT Support and UX Design. It allows job seekers to practice answering questions selected by industry experts and to become more confident and comfortable with interviewing. As we worked with job seekers to understand their challenges in preparing for interviews and how an interview practice tool could be most useful, it inspired our research and the application of topicality NLA.

We build the topicality NLA model (once for all questions and topics) as follows: we train an encoder-only T5 model (EncT5 architecture) with 350 million parameters on Question-Answers data to predict the compatibility of an <underlying question, answer> pair. We rely on data from SQuAD 2.0 which was processed to produce <question, answer, label> triplets.

In the Interview Warmup tool, users can switch between talking points to see which ones were detected in their answer.

The tool does not grade or judge answers. Instead it enables users to practice and identify ways to improve on their own. After a user replies to an interview question, their answer is parsed sentence-by-sentence with the Topicality NLA model. They can then switch between different talking points to see which ones were detected in their answer. We know that there are many potential pitfalls in signaling to a user that their response is “good”, especially as we only detect a limited set of topics. Instead, we keep the control in the user’s hands and only use ML to help users make their own discoveries about how to improve.

So far, the tool has had great results helping job seekers around the world, including in the US, and we have recently expanded it to Africa. We plan to continue working with job seekers to iterate and make the tool even more helpful to the millions of people searching for new jobs.

A short film showing how Interview Warmup and its NLA capabilities were developed in collaboration with job seekers.

Conclusion

Natural Language Assessment (NLA) is a technologically challenging and interesting research area. It paves the way for new conversational applications that promote learning by enabling the nuanced assessment and analysis of answers from multiple perspectives. Working together with communities, from job seekers and businesses to classroom teachers and students, we can identify situations where NLA has the potential to help people learn, engage, and develop skills across an array of subjects, and we can build applications in a responsible way that empower users to assess their own abilities and discover ways to improve.

Acknowledgements

This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Google Research Israel, Google Creative Lab, and Grow with Google teams among others.

Offsites

Open Images V7 — Now Featuring Point Labels

Post author By
Post date October 25, 2022
No Comments on Open Images V7 — Now Featuring Point Labels

Posted by Rodrigo Benenson, Research Scientist, Google Research

Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. Researchers around the world use Open Images to train and evaluate computer vision models. Since the initial release of Open Images in 2016, which included image-level labels covering 6k categories, we have provided multiple updates to enrich annotations and expand the potential use cases of the dataset. Through several releases, we have added image-level labels for over 20k categories on all images and bounding box annotations, visual relations, instance segmentations, and localized narratives (synchronized voice, mouse trace, and text caption) on a subset of 1.9M images.

Today, we are happy to announce the release of Open Images V7, which expands the Open Images dataset even further with a new annotation type called point-level labels and includes a new all-in-one visualization tool that allows a better exploration of the rich data available.

Point Labels

The main strategy used to collect the new point-level label annotations leveraged suggestions from a machine learning (ML) model and human verification. First, the ML model selected points of interest and asked a yes or no question, e.g., “is this point on a pumpkin?”. Then, human annotators spent an average of 1.1 seconds answering the yes or no questions. We aggregated the answers from different annotators over the same question and assigned a final “yes”, “no”, or “unsure” label to each annotated point.

Illustration of the annotations interface.
(Image by Lenore Edman, under CC BY 2.0 license)

For each annotated image, we provide a collection of points, each with a “yes” or “no” label for a given class. These points provide sparse information that can be used for the semantic segmentation task. We collected a total of 38.6M new point annotations (12.4M with “yes” labels) that cover 5.8 thousand classes and 1.4M images.

By focusing on point labels, we expanded the number of images annotated and categories covered. We also concentrated the efforts of our annotators on efficiently collecting useful information. Compared to our instance segmentation, the new points include 16x more classes and cover more images. The new points also cover 9x more classes than our box annotations. Compared to existing segmentation datasets, like PASCAL VOC, COCO, Cityscapes, LVIS, or ADE20K, our annotations cover more classes and more images than previous work. The new point label annotations are the first type of annotation in Open Images that provides localization information for both things (countable objects, like cars, cats, and catamarans), and stuff categories (uncountable objects like grass, granite, and gravel). Overall, the newly collected data is roughly equivalent to two years of human annotation effort.

Our initial experiments show that this type of sparse data is suitable for both training and evaluating segmentation models. Training a model directly on sparse data allows us to reach comparable quality to training on dense annotations. Similarly, we show that one can directly compute the traditional semantic segmentation intersection-over-union (IoU) metric over sparse data. The ranking across different methods is preserved, and the sparse IoU values are an accurate estimate of its dense version. See our paper for more details.

Below, we show four example images with their point-level labels, illustrating the rich and diverse information these annotations provide. Circles ⭘ are “yes” labels, and squares ☐ are “no” labels.

Four example images with point-level labels.
Images by Richie Diesterheft, John AM Nueva, Sarah Ackerman, and C Thomas, all under CC BY 2.0 license.

New Visualizers

In addition to the new data release, we also expanded the available visualizations of the Open Images annotations. The Open Images website now includes dedicated visualizers to explore the localized narratives annotations, the new point-level annotations, and a new all-in-one view. This new all-in-one view is available for the subset of 1.9M densely annotated images and allows one to explore the rich annotations that Open Images has accumulated over seven releases. On average these images have annotations for 6.7 image-labels (classes), 8.3 boxes, 1.7 relations, 1.5 masks, 0.4 localized narratives and 34.8 point-labels per image.

Below, we show two example images with various annotations in the all-in-one visualizer. The figures show the image-level labels, bounding boxes, box relations, instance masks, localized narrative mouse trace and caption, and point-level labels. The + classes have positive annotations (of any kind), while – classes have only negative annotations (image-level or point-level).

Two example images with various annotations in the all-in-one visualizer.
Images by Jason Paris, and Rubén Vique, all under CC BY 2.0 license.

Conclusion

We hope that this new data release will enable computer vision research to cover ever more diverse and challenging scenarios. As the quality of automated semantic segmentation models improves over common classes, we want to move towards the long tail of visual concepts, and sparse point annotations are a step in that direction. More and more works are exploring how to use such sparse annotations (e.g., as supervision for instance segmentation or semantic segmentation), and Open Images V7 contributes to this research direction. We are looking forward to seeing what you will build next.

Acknowledgements

Thanks to Vittorio Ferrari, Jordi Pont-Tuset, Alina Kuznetsova, Ashlesha Sadras, and the annotators team for their support creating this new data release.

Offsites

Google at ECCV 2022

Posted by Shaina Mehta, Program Manager, Google

Google is proud to be a Platinum Sponsor of the European Conference on Computer Vision (ECCV 2022), a premier forum for the dissemination of research in computer vision and machine learning (ML). This year, ECCV 2022 will be held as a hybrid event, in person in Tel Aviv, Israel with virtual attendance as an option. Google has a strong presence at this year’s conference with over 60 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive research and expanding our partnership with the broader ML research community.

Registered for ECCV 2022? We hope you’ll visit our on-site or virtual booths to learn more about the research we’re presenting at ECCV 2022, including several demos and opportunities to connect with our researchers. Learn more about Google’s research being presented at ECCV 2022 below (Google affiliations in bold).

Organizing Committee

Program Chairs include: Moustapha Cissé

Awards Paper Committee: Todd Zickler

Area Chairs include: Ayan Chakrabarti, Tali Dekel, Alireza Fathi, Vittorio Ferrari, David Fleet, Dilip Krishnan, Michael Rubinstein, Cordelia Schmid, Deqing Sun, Federico Tombari, Jasper Uijlings, Ming-Hsuan Yang, Todd Zickler

Accepted Publications

NeuMesh: Learning Disentangled Neural Mesh-Based Implicit Field for Geometry and Texture Editing
Bangbang Yang, Chong Bao, Junyi Zeng, Hujun Bao, Yinda Zhang, Zhaopeng Cui, Guofeng Zhang

Anti-Neuron Watermarking: Protecting Personal Data Against Unauthorized Neural Networks
Zihang Zou, Boqing Gong, Liqiang Wang

Exploiting Unlabeled Data with Vision and Language Models for Object Detection
Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar B G, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris N. Metaxas

Waymo Open Dataset: Panoramic Video Panoptic Segmentation
Jieru Mei, Alex Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar

PRIF: Primary Ray-Based Implicit Function
Brandon Yushan Feng, Yinda Zhang, Danhang Tang, Ruofei Du, Amitabh Varshney

LoRD: Local 4D Implicit Representation for High-Fidelity Dynamic Human Modeling
Boyan Jiang, Xinlin Ren, Mingsong Dou, Xiangyang Xue, Yanwei Fu, Yinda Zhang

k-Means Mask Transformer (see blog post)
Qihang Yu^*, Siyuan Qiao, Maxwell D Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

MaxViT: Multi-Axis Vision Transformer (see blog post)
Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

E-Graph: Minimal Solution for Rigid Rotation with Extensibility Graphs
Yanyan Li, Federico Tombari

RBP-Pose: Residual Bounding Box Projection for Category-Level Pose Estimation
Ruida Zhang, Yan Di, Zhiqiang Lou, Fabian Manhardt, Federico Tombari, Xiangyang Ji

GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning
Huseyin Coskun, Alireza Zareian, Joshua L Moore, Federico Tombari, Chen Wang

Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin^*

Adaptive Transformers for Robust Few-Shot Cross-Domain Face Anti-spoofing
Hsin-Ping Huang, Deqing Sun, Yaojie Liu, Wen-Sheng Chu, Taihong Xiao, Jinwei Yuan, Hartwig Adam, Ming-Hsuan Yang

DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning
Zifeng Wang^*, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister

BLT: Bidirectional Layout Transformer for Controllable Layout Generation
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, Jiaqi Ma

Learning Visibility for Robust Dense Human Body Estimation
Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, Ming-Hsuan Yang

Are Vision Transformers Robust to Patch Perturbations?
Jindong Gu, Volker Tresp, Yao Qin

PseudoAugment: Learning to Use Unlabeled Data for Data Augmentation in Point Clouds
Zhaoqi Leng, Shuyang Cheng, Ben Caine, Weiyue Wang, Xiao Zhang, Jonathon Shlens, Mingxing Tan, Dragomir Anguelov

Structure and Motion from Casual Videos
Zhoutong Zhang, Forrester Cole, Zhengqi Li, Noah Snavely, Michael Rubinstein, William T. Freeman

PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map
Chenfeng Xu, Tian Li, Chen Tang, Lingfeng Sun, Kurt Keutzer, Masayoshi Tomizuka, Alireza Fathi, Wei Zhan

Novel Class Discovery Without Forgetting
Joseph K J, Sujoy Paul, Gaurav Aggarwal, Soma Biswas, Piyush Rai, Kai Han, Vineeth N Balasubramanian

Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning
Yuxiao Chen, Long Zhao, Jianbo Yuan, Yu Tian, Zhaoyang Xia, Shijie Geng, Ligong Han, Dimitris N. Metaxas

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks
Nan Ding, Xi Chen, Tomer Levinboim, Soravit Changpinyo, Radu Soricut

InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images
Zhengqi Li, Qianqian Wang^*, Noah Snavely, Angjoo Kanazawa^*

Generalizable Patch-Based Neural Rendering (see blog post)
Mohammed Suhail^*, Carlos Esteves, Leonid Sigal, Ameesh Makadia

LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds
Minghua Liu, Yin Zhou, Charles R. Qi, Boqing Gong, Hao Su, Dragomir Anguelov

The Missing Link: Finding Label Relations Across Datasets
Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Learning Instance-Specific Adaptation for Cross-Domain Segmentation
Yuliang Zou, Zizhao Zhang, Chun-Liang Li, Han Zhang, Tomas Pfister, Jia-Bin Huang

Learning Audio-Video Modalities from Image Captions
Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
Medhini Narasimhan^*, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

On Label Granularity and Object Localization
Elijah Cole, Kimberly Wilber, Grant Van Horn, Xuan Yang, Marco Fornoni, Pietro Perona, Serge Belongie, Andrew Howard, Oisin Mac Aodha

Disentangling Architecture and Training for Optical Flow
Deqing Sun, Charles Herrmann, Fitsum Reda, Michael Rubinstein, David J. Fleet, William T. Freeman

NewsStories: Illustrating Articles with Visual Summaries
Reuben Tan, Bryan Plummer, Kate Saenko, J.P. Lewis, Avneesh Sud, Thomas Leung

Improving GANs for Long-Tailed Data Through Group Spectral Regularization
Harsh Rangwani, Naman Jaswani, Tejan Karmali, Varun Jampani, Venkatesh Babu Radhakrishnan

Planes vs. Chairs: Category-Guided 3D Shape Learning Without Any 3D Cues
Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James Rehg

A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays

Learned Monocular Depth Priors in Visual-Inertial Initialization
Yunwen Zhou, Abhishek Kar, Eric L. Turner, Adarsh Kowdle, Chao Guo, Ryan DuToit, Konstantine Tsotsos

How Stable are Transferability Metrics Evaluations?
Andrea Agostinelli, Michal Pandy, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Data-Free Neural Architecture Search via Recursive Label Calibration
Zechun Liu^*, Zhiqiang Shen, Yun Long, Eric Xing, Kwang-Ting Cheng, Chas H. Leichner

Fast and High Quality Image Denoising via Malleable Convolution
Yifan Jiang^*, Bartlomiej Wronski, Ben Mildenhall, Jonathan T. Barron, Zhangyang Wang, Tianfan Xue

Concurrent Subsidiary Supervision for Unsupervised Source-Free Domain Adaptation
Jogendra Nath Kundu, Suvaansh Bhambri, Akshay R Kulkarni, Hiran Sarkar,
Varun Jampani, Venkatesh Babu Radhakrishnan

Learning Online Multi-Sensor Depth Fusion
Erik Sandström, Martin R. Oswald, Suryansh Kumar, Silvan Weder, Fisher Yu, Cristian Sminchisescu, Luc Van Gool

Hierarchical Semantic Regularization of Latent Spaces in StyleGANs
Tejan Karmali, Rishubh Parihar, Susmit Agrawal, Harsh Rangwani, Varun Jampani, Maneesh K Singh, Venkatesh Babu Radhakrishnan

RayTran: 3D Pose Estimation and Shape Reconstruction of Multiple Objects from Videos with Ray-Traced Transformers
Michał J Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari

Neural Video Compression Using GANs for Detail Synthesis and Propagation
Fabian Mentzer, Eirikur Agustsson, Johannes Ballé, David Minnen, Nick Johnston, George Toderici

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, Serge Belongie

Implicit Neural Representations for Image Compression
Yannick Strümpler, Janis Postels, Ren Yang, Luc Van Gool, Federico Tombari

3D Compositional Zero-Shot Learning with DeCompositional Consensus
Muhammad Ferjad Naeem, Evin Pınar Örnek, Yongqin Xian, Luc Van Gool, Federico Tombari

FindIt: Generalized Localization with Natural Language Queries (see blog post)
Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation
Wuyang Chen^*, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou

Improved Masked Image Generation with Token-Critic
Jose Lezama, Huiwen Chang, Lu Jiang, Irfan Essa

Learning Discriminative Shrinkage Deep Networks for Image Deconvolution
Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Efthymios Tzinis^*, Scott Wisdom, Tal Remez, John Hershey

Simple Open-Vocabulary Object Detection with Vision Transformers
Matthias Minderer, Alexey Gritsenko, Austin C Stone, Maxim Neumann, Dirk Weißenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality
Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, Hans Peter Graf

Video Question Answering with Iterative Video-Text Co-tokenization (see blog post)
AJ Piergiovanni, Kairo Morton^*, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Class-Agnostic Object Detection with Multi-modal Transformer
Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang

FILM: Frame Interpolation for Large Motion (see blog post)
Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, Brian Curless

Compositional Human-Scene Interaction Synthesis with Semantic Control
Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, Siyu Tang

Workshops

LatinX in AI
Mentors include: José Lezama
Keynote Speakers include: Andre Araujo

AI for Creative Video Editing and Understanding
Keynote Speakers include: Tali Dekel, Negar Rostamzadeh

Learning With Limited and Imperfect Data (L2ID)
Invited Speakers include: Xiuye Gu
Organizing Committee includes: Sadeep Jayasumana

International Challenge on Compositional and Multimodal Perception (CAMP)
Program Committee includes: Edward Vendrow

Self-Supervised Learning: What is Next?
Invited Speakers include: Mathilde Caron, Arsha Nagrani
Organizers include: Andrew Zisserman

3rd Workshop on Adversarial Robustness In the Real World
Invited Speakers include: Ekin Dogus Cubuk
Organizers include: Xinyun Chen, Alexander Robey, Nataniel Ruiz, Yutong Bai

AV4D: Visual Learning of Sounds in Spaces
Invited Speakers include: John Hershey

Challenge on Mobile Intelligent Photography and Imaging (MIPI)
Invited Speakers include: Peyman Milanfar

Robust Vision Challenge 2022
Organizing Committee includes: Alina Kuznetsova

Computer Vision in the Wild
Challenge Organizers include: Yi-Ting Chen, Ye Xia
Invited Speakers include: Yin Cui, Yongqin Xian, Neil Houlsby

Self-Supervised Learning for Next-Generation Industry-Level Autonomous Driving (SSLAD)
Organizers include: Fisher Yu

Responsible Computer Vision
Organizing Committee includes: Been Kim
Invited Speakers include: Emily Denton

Cross-Modal Human-Robot Interaction
Invited Speakers include: Peter Anderson

ISIC Skin Image Analysis
Organizing Committee includes: Yuan Liu
Steering Committee includes: Yuan Liu, Dale Webster
Invited Speakers include: Yuan Liu

Observing and Understanding Hands in Action
Sponsored by Google

Autonomous Vehicle Vision (AVVision)
Speakers include: Fisher Yu

Visual Perception for Navigation in Human Environments: The JackRabbot Human Body Pose Dataset and Benchmark
Organizers include: Edward Vendrow

Language for 3D Scenes
Invited Speakers include: Jason Baldridge
Organizers include: Leonidas Guibas

Designing and Evaluating Computer Perception Systems (CoPe)
Organizers include: Andrew Zisserman

Learning To Generate 3D Shapes and Scenes
Panelists include: Pete Florence

Advances in Image Manipulation
Program Committee includes: George Toderici, Ming-Hsuan Yang

TiE: Text in Everything
Challenge Organizers include: Shangbang Long, Siyang Qin
Invited Speakers include: Tali Dekel, Aishwarya Agrawal

Instance-Level Recognition
Organizing Committee: Andre Araujo, Bingyi Cao, Tobias Weyand
Invited Speakers include: Mathilde Caron

What Is Motion For?
Organizing Committee: Deqing Sun, Fitsum Reda, Charles Herrmann
Invited Speakers include: Tali Dekel

Neural Geometry and Rendering: Advances and the Common Objects in 3D Challenge
Invited Speakers include: Ben Mildenhall

Visual Object-Oriented Learning Meets Interaction: Discovery, Representations, and Applications
Invited Speakers include: Klaus Greff, Thomas Kipf
Organizing Committee includes: Leonidas Guibas

Vision with Biased or Scarce Data (VBSD)
Program Committee includes: Yizhou Wang

Multiple Object Tracking and Segmentation in Complex Environments
Invited Speakers include: Xingyi Zhou, Fisher Yu

3rd Visual Inductive Priors for Data-Efficient Deep Learning Workshop
Organizing Committee includes: Ekin Dogus Cubuk

DeeperAction: Detailed Video Action Understanding and Anomaly Recognition
Advisors include: Rahul Sukthankar

Sign Language Understanding Workshop and Sign Language Recognition, Translation & Production Challenge
Organizing Committee includes: Andrew Zisserman
Speakers include: Andrew Zisserman

Ego4D: First-Person Multi-Modal Video Understanding
Invited Speakers include: Michal Irani

AI-Enabled Medical Image Analysis: Digital Pathology & Radiology/COVID19
Program Chairs include: Po-Hsuan Cameron Chen
Workshop Partner: Google Health

Visual Object Tracking Challenge (VOT 2022)
Technical Committee includes: Christoph Mayer

Assistive Computer Vision and Robotics
Technical Committee includes: Maja Mataric

Human Body, Hands, and Activities from Egocentric and Multi-View Cameras
Organizers include: Francis Engelmann

Frontiers of Monocular 3D Perception: Implicit x Explicit
Panelists include: Pete Florence

Tutorials

Self-Supervised Representation Learning in Computer Vision
Invited Speakers include: Ting Chen

Neural Volumetric Rendering for Computer Vision
Organizers include: Ben Mildenhall, Pratul Srinivasan, Jon Barron
Presenters include: Ben Mildenhall, Pratul Srinivasan

New Frontiers in Efficient Neural Architecture Search!
Speakers include: Ruochen Wang

^*Work done while at Google. ^↩

Offsites

PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations

Post author By
Post date October 20, 2022
No Comments on PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations

Posted by Wenhao Yu, Research Scientist, Robotics at Google, and Kuang-Huei Lee, Research Engineer, Google Research, Brain team

Evolution strategy (ES) is a family of optimization techniques inspired by the ideas of natural selection: a population of candidate solutions are usually evolved over generations to better adapt to an optimization objective. ES has been applied to a variety of challenging decision making problems, such as legged locomotion, quadcopter control, and even power system control.

Compared to gradient-based reinforcement learning (RL) methods like proximal policy optimization (PPO) and soft actor-critic (SAC), ES has several advantages. First, ES directly explores in the space of controller parameters, while gradient-based methods often explore within a limited action space, which indirectly influences the controller parameters. More direct exploration has been shown to boost learning performance and enable large scale data collection with parallel computation. Second, a major challenge in RL is long-horizon credit assignment, e.g., when a robot accomplishes a task in the end, determining which actions it performed in the past were the most critical and should be assigned a greater reward. Since ES directly considers the total reward, it relieves researchers from needing to explicitly handle credit assignment. In addition, because ES does not rely on gradient information, it can naturally handle highly non-smooth objectives or controller architectures where gradient computation is non-trivial, such as meta–reinforcement learning. However, a major weakness of ES-based algorithms is their difficulty in scaling to problems that require high-dimensional sensory inputs to encode the environment dynamics, such as training robots with complex vision inputs.

In this work, we propose “PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations”, a learning algorithm that combines representation learning and ES to effectively solve high dimensional problems in a scalable way. The core idea is to leverage predictive information, a representation learning objective, to obtain a compact representation of the high-dimensional environment dynamics, and then apply Augmented Random Search (ARS), a popular ES algorithm, to transform the learned compact representation into robot actions. We tested PI-ARS on the challenging problem of visual-locomotion for legged robots. PI-ARS enables fast training of performant vision-based locomotion controllers that can traverse a variety of difficult environments. Furthermore, the controllers trained in simulated environments successfully transfer to a real quadruped robot.

PI-ARS trains reliable visual-locomotion policies that are transferable to the real world.

Predictive Information
A good representation for policy learning should be both compressive, so that ES can focus on solving a much lower dimensional problem than learning from raw observations would entail, and task-critical, so the learned controller has all the necessary information needed to learn the optimal behavior. For robotic control problems with high-dimensional input space, it is critical for the policy to understand the environment, including the dynamic information of both the robot itself and its surrounding objects.

As such, we propose an observation encoder that preserves information from the raw input observations that allows the policy to predict the future states of the environment, thus the name predictive information (PI). More specifically, we optimize the encoder such that the encoded version of what the robot has seen and planned in the past can accurately predict what the robot might see and be rewarded in the future. One mathematical tool to describe such a property is that of mutual information, which measures the amount of information we obtain about one random variable X by observing another random variable Y. In our case, X and Y would be what the robot saw and planned in the past, and what the robot sees and is rewarded in the future. Directly optimizing the mutual information objective is a challenging problem because we usually only have access to samples of the random variables, but not their underlying distributions. In this work we follow a previous approach that uses InfoNCE, a contrastive variational bound on mutual information to optimize the objective.

Left: We use representation learning to encode PI of the environment. Right: We train the representation by replaying trajectories from the replay buffer and maximize the predictability between the observation and motion plan in the past and the observation and reward in the future of the trajectory.

Predictive Information with Augmented Random Search
Next, we combine PI with Augmented Random Search (ARS), an algorithm that has shown excellent optimization performance for challenging decision-making tasks. At each iteration of ARS, it samples a population of perturbed controller parameters, evaluates their performance in the testing environment, and then computes a gradient that moves the controller towards the ones that performed better.

We use the learned compact representation from PI to connect PI and ARS, which we call PI-ARS. More specifically, ARS optimizes a controller that takes as input the learned compact representation PI and predicts appropriate robot commands to achieve the task. By optimizing a controller with smaller input space, it allows ARS to find the optimal solution more efficiently. Meanwhile, we use the data collected during ARS optimization to further improve the learned representation, which is then fed into the ARS controller in the next iteration.

An overview of the PI-ARS data flow. Our algorithm interleaves between two steps: 1) optimizing the PI objective that updates the policy, which is the weights for the neural network that extracts the learned representation; and 2) sampling new trajectories and updating the controller parameters using ARS.

Visual-Locomotion for Legged Robots
We evaluate PI-ARS on the problem of visual-locomotion for legged robots. We chose this problem for two reasons: visual-locomotion is a key bottleneck for legged robots to be applied in real-world applications, and the high-dimensional vision-input to the policy and the complex dynamics in legged robots make it an ideal test-case to demonstrate the effectiveness of the PI-ARS algorithm. A demonstration of our task setup in simulation can be seen below. Policies are first trained in simulated environments, and then transferred to hardware.

An illustration of the visual-locomotion task setup. The robot is equipped with two cameras to observe the environment (illustrated by the transparent pyramids). The observations and robot state are sent to the policy to generate a high-level motion plan, such as feet landing location and desired moving speed. The high-level motion plan is then achieved by a low-level Motion Predictive Control (MPC) controller.

Experiment Results
We first evaluate the PI-ARS algorithm on four challenging simulated tasks:

Uneven stepping stones: The robot needs to walk over uneven terrain while avoiding gaps.
Quincuncial piles: The robot needs to avoid gaps both in front and sideways.
Moving platforms: The robot needs to walk over stepping stones that are randomly moving horizontally or vertically. This task illustrates the flexibility of learning a vision-based policy in comparison to explicitly reconstructing the environment.
Indoor navigation: The robot needs to navigate to a random location while avoiding obstacles in an indoor environment.

As shown below, PI-ARS is able to significantly outperform ARS in all four tasks in terms of the total task reward it can obtain (by 30-50%).

Left: Visualization of PI-ARS policy performance in simulation. Right: Total task reward (i.e., episode return) for PI-ARS (green line) and ARS (red line). The PI-ARS algorithm significantly outperforms ARS on four challenging visual-locomotion tasks.

We further deploy the trained policies to a real Laikago robot on two tasks: random stepping stone and indoor navigation. We demonstrate that our trained policies can successfully handle real-world tasks. Notably, the success rate of the random stepping stone task improved from 40% in the prior work to 100%.

PI-ARS trained policy enables a real Laikago robot to navigate around obstacles.

Conclusion
In this work, we present a new learning algorithm, PI-ARS, that combines gradient-based representation learning with gradient-free evolutionary strategy algorithms to leverage the advantages of both. PI-ARS enjoys the effectiveness, simplicity, and parallelizability of gradient-free algorithms, while relieving a key bottleneck of ES algorithms on handling high-dimensional problems by optimizing a low-dimensional representation. We apply PI-ARS to a set of challenging visual-locomotion tasks, among which PI-ARS significantly outperforms the state of the art. Furthermore, we validate the policy learned by PI-ARS on a real quadruped robot. It enables the robot to walk over randomly-placed stepping stones and navigate in an indoor space with obstacles. Our method opens the possibility of incorporating modern large neural network models and large-scale data into the field of evolutionary strategy for robotics control.

Acknowledgements
We would like to thank our paper co-authors: Ofir Nachum, Tingnan Zhang, Sergio Guadarrama, and Jie Tan. We would also like to thank Ian Fischer and John Canny for valuable feedback.

Offsites

MUSIQ: Assessing Image Aesthetic and Technical Quality with Multi-scale Transformers

Post author By
Post date October 20, 2022
No Comments on MUSIQ: Assessing Image Aesthetic and Technical Quality with Multi-scale Transformers

Posted by Junjie Ke, Senior Software Engineer, and Feng Yang, Senior Staff Software Engineer, Google Research

Understanding the aesthetic and technical quality of images is important for providing a better user visual experience. Image quality assessment (IQA) uses models to build a bridge between an image and a user’s subjective perception of its quality. In the deep learning era, many IQA approaches, such as NIMA, have achieved success by leveraging the power of convolutional neural networks (CNNs). However, CNN-based IQA models are often constrained by the fixed-size input requirement in batch training, i.e., the input images need to be resized or cropped to a fixed size shape. This preprocessing is problematic for IQA because images can have very different aspect ratios and resolutions. Resizing and cropping can impact image composition or introduce distortions, thus changing the quality of the image.

In CNN-based models, images need to be resized or cropped to a fixed shape for batch training. However, such preprocessing can alter the image aspect ratio and composition, thus impacting image quality. Original image used under CC BY 2.0 license.

In “MUSIQ: Multi-scale Image Quality Transformer”, published at ICCV 2021, we propose a patch-based multi-scale image quality transformer (MUSIQ) to bypass the CNN constraints on fixed input size and predict the image quality effectively on native-resolution images. The MUSIQ model supports the processing of full-size image inputs with varying aspect ratios and resolutions and allows multi-scale feature extraction to capture image quality at different granularities. To support positional encoding in the multi-scale representation, we propose a novel hash-based 2D spatial embedding combined with an embedding that captures the image scaling. We apply MUSIQ on four large-scale IQA datasets, demonstrating consistent state-of-the-art results across three technical quality datasets (PaQ-2-PiQ, KonIQ-10k, and SPAQ) and comparable performance to that of state-of-the-art models on the aesthetic quality dataset AVA.

The patch-based MUSIQ model can process the full-size image and extract multi-scale features, which better aligns with a person’s typical visual response.

In the following figure, we show a sample of images, their MUSIQ score, and their mean opinion score (MOS) from multiple human raters in the brackets. The range of the score is from 0 to 100, with 100 being the highest perceived quality. As we can see from the figure, MUSIQ predicts high scores for images with high aesthetic quality and high technical quality, and it predicts low scores for images that are not aesthetically pleasing (low aesthetic quality) or that contain visible distortions (low technical quality).

High quality
High quality	76.10 [74.36]	69.29 [70.92]

Low aesthetics quality
Low aesthetics quality	55.37 [53.18]	32.50 [35.47]

Low technical quality
Low technical quality	14.93 [14.38]	15.24 [11.86]

Predicted MUSIQ score (and ground truth) on images from the KonIQ-10k dataset. Top: MUSIQ predicts high scores for high quality images. Middle: MUSIQ predicts low scores for images with low aesthetic quality, such as images with poor composition or lighting. Bottom: MUSIQ predicts low scores for images with low technical quality, such as images with visible distortion artifacts (e.g., blurry, noisy).

The Multi-scale Image Quality Transformer
MUSIQ tackles the challenge of learning IQA on full-size images. Unlike CNN-models that are often constrained to fixed resolution, MUSIQ can handle inputs with arbitrary aspect ratios and resolutions.

To accomplish this, we first make a multi-scale representation of the input image, containing the native resolution image and its resized variants. To preserve the image composition, we maintain its aspect ratio during resizing. After obtaining the pyramid of images, we then partition the images at different scales into fixed-size patches that are fed into the model.

Illustration of the multi-scale image representation in MUSIQ.

Since patches are from images of varying resolutions, we need to effectively encode the multi-aspect-ratio multi-scale input into a sequence of tokens, capturing both the pixel, spatial, and scale information. To achieve this, we design three encoding components in MUSIQ, including: 1) a patch encoding module to encode patches extracted from the multi-scale representation; 2) a novel hash-based spatial embedding module to encode the 2D spatial position for each patch; and 3) a learnable scale embedding to encode different scales. In this way, we can effectively encode the multi-scale input as a sequence of tokens, serving as the input to the Transformer encoder.

To predict the final image quality score, we use the standard approach of prepending an additional learnable “classification token” (CLS). The CLS token state at the output of the Transformer encoder serves as the final image representation. We then add a fully connected layer on top to predict the IQS. The figure below provides an overview of the MUSIQ model.

Overview of MUSIQ. The multi-scale multi-resolution input will be encoded by three components: the scale embedding (SCE), the hash-based 2D spatial embedding (HSE), and the multi-scale patch embedding (MPE).

Since MUSIQ only changes the input encoding, it is compatible with any Transformer variants. To demonstrate the effectiveness of the proposed method, in our experiments we use the classic Transformer with a relatively lightweight setting so that the model size is comparable to ResNet-50.

Benchmark and Evaluation
To evaluate MUSIQ, we run experiments on multiple large-scale IQA datasets. On each dataset, we report the Spearman’s rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) between our model prediction and the human evaluators’ mean opinion score. SRCC and PLCC are correlation metrics ranging from -1 to 1. Higher PLCC and SRCC means better alignment between model prediction and human evaluation. The graph below shows that MUSIQ outperforms other methods on PaQ-2-PiQ, KonIQ-10k, and SPAQ.

Performance comparison of MUSIQ and previous state-of-the-art (SOTA) methods on four large-scale IQA datasets. On each dataset we compare the Spearman’s rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) of model prediction and ground truth.

Notably, the PaQ-2-PiQ test set is entirely composed of large pictures having at least one dimension exceeding 640 pixels. This is very challenging for traditional deep learning approaches, which require resizing. MUSIQ can outperform previous methods by a large margin on the full-size test set, which verifies its robustness and effectiveness.

It is also worth mentioning that previous CNN-based methods often required sampling as many as 20 crops for each image during testing. This kind of multi-crop ensemble is a way to mitigate the fixed shape constraint in the CNN models. But since each crop is only a sub-view of the whole image, the ensemble is still an approximate approach. Moreover, CNN-based methods both add additional inference cost for every crop and, because they sample different crops, they can introduce randomness in the result. In contrast, because MUSIQ takes the full-size image as input, it can directly learn the best aggregation of information across the full image and it only needs to run the inference once.

To further verify that the MUSIQ model captures different information at different scales, we visualize the attention weights on each image at different scales.

Attention visualization from the output tokens to the multi-scale representation, including the original resolution image and two proportionally resized images. Brighter areas indicate higher attention, which means that those areas are more important for the model output. Images for illustration are taken from the AVA dataset.

We observe that MUSIQ tends to focus on more detailed areas in the full, high-resolution images and on more global areas on the resized ones. For example, for the flower photo above, the model’s attention on the original image is focusing on the pedal details, and the attention shifts to the buds at lower resolutions. This shows that the model learns to capture image quality at different granularities.

Conclusion
We propose a multi-scale image quality transformer (MUSIQ), which can handle full-size image input with varying resolutions and aspect ratios. By transforming the input image to a multi-scale representation with both global and local views, the model can capture the image quality at different granularities. Although MUSIQ is designed for IQA, it can be applied to other scenarios where task labels are sensitive to image resolution and aspect ratio. The MUSIQ model and checkpoints are available at our GitHub repository.

Acknowledgements
This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Qifei Wang, Yilin Wang and Peyman Milanfar.

Offsites

Do Modern ImageNet Classifiers Accurately Predict Perceptual Similarity?

Post author By
Post date October 19, 2022
No Comments on Do Modern ImageNet Classifiers Accurately Predict Perceptual Similarity?

Posted by Manoj Kumar, Research Engineer, and Ekin Dogus Cubuk, Research Scientist, Google Research

The task of determining the similarity between images is an open problem in computer vision and is crucial for evaluating the realism of machine-generated images. Though there are a number of straightforward methods of estimating image similarity (e.g., low-level metrics that measure pixel differences, such as FSIM and SSIM), in many cases, the measured similarity differences do not match the differences perceived by a person. However, more recent work has demonstrated that intermediate representations of neural network classifiers, such as AlexNet, VGG and SqueezeNet trained on ImageNet, exhibit perceptual similarity as an emergent property. That is, Euclidean distances between encoded representations of images by ImageNet-trained models correlate much better with a person’s judgment of differences between images than estimating perceptual similarity directly from image pixels.

Two sets of sample images from the BAPPS dataset. Trained networks agree more with human judgements as compared to low-level metrics (PSNR, SSIM, FSIM). Image source: Zhang et al. (2018).

In “Do better ImageNet classifiers assess perceptual similarity better?” published in Transactions on Machine Learning Research, we contribute an extensive experimental study on the relationship between the accuracy of ImageNet classifiers and their emergent ability to capture perceptual similarity. To evaluate this emergent ability, we follow previous work in measuring the perceptual scores (PS), which is roughly the correlation between human preferences to that of a model for image similarity on the BAPPS dataset. While prior work studied the first generation of ImageNet classifiers, such as AlexNet, SqueezeNet and VGG, we significantly increase the scope of the analysis incorporating modern classifiers, such as ResNets and Vision Transformers (ViTs), across a wide range of hyper-parameters.

Relationship Between Accuracy and Perceptual Similarity
It is well established that features learned via training on ImageNet transfer well to a number of downstream tasks, making ImageNet pre-training a standard recipe. Further, better accuracy on ImageNet usually implies better performance on a diverse set of downstream tasks, such as robustness to common corruptions, out-of-distribution generalization and transfer learning on smaller classification datasets. Contrary to prevailing evidence that suggests models with high validation accuracies on ImageNet are likely to transfer better to other tasks, surprisingly, we find that representations from underfit ImageNet models with modest validation accuracies achieve the best perceptual scores.

Plot of perceptual scores (PS) on the 64 × 64 BAPPS Dataset (y-axis) against the ImageNet 64 × 64 validation accuracies (x-axis). Each blue dot represents an ImageNet classifier. Better ImageNet classifiers achieve better PS up to a certain point (dark blue), beyond which improving the accuracy lowers the PS. The best PS are attained by classifiers with moderate accuracy (20.0–40.0).

<!–

–>

We study the variation of perceptual scores as a function of neural network hyperparameters: width, depth, number of training steps, weight decay, label smoothing and dropout. For each hyperparameter, there exists an optimal accuracy up to which improving accuracy improves PS. This optimum is fairly low and is attained quite early in the hyperparameter sweep. Beyond this point, improved classifier accuracy corresponds to worse PS.

As illustration, we present the variation of PS with respect to two hyperparameters: training steps in ResNets and width in ViTs. The PS of ResNet-50 and ResNet-200 peak very early at the first few epochs of training. After the peak, PS of better classifiers decrease more drastically. ResNets are trained with a learning rate schedule that causes a stepwise increase in accuracy as a function of training steps. Interestingly, after the peak, they also exhibit a step-wise decrease in PS that matches this step-wise accuracy increase.

Early-stopped ResNets attain the best PS across different depths of 6, 50 and 200.

ViTs consist of a stack of transformer blocks applied to the input image. The width of a ViT model is the number of output neurons of a single transformer block. Increasing its width is an effective way to improve its accuracy. Here, we vary the width of two ViT variants, B/8 and L/4 (i.e., Base and Large ViT models with patch sizes 4 and 8 respectively), and evaluate both the accuracy and PS. Similar to our observations with early-stopped ResNets, narrower ViTs with lower accuracies perform better than the default widths. Surprisingly, the optimal width of ViT-B/8 and ViT-L/4 are 6 and 12% of their default widths. For a more comprehensive list of experiments involving other hyperparameters such as width, depth, number of training steps, weight decay, label smoothing and dropout across both ResNets and ViTs, check out our paper.

Narrow ViTs attain the best PS.

Scaling Down Models Improves Perceptual Scores
Our results prescribe a simple strategy to improve an architecture’s PS: scale down the model to reduce its accuracy until it attains the optimal perceptual score. The table below summarizes the improvements in PS obtained by scaling down each model across every hyperparameter. Except for ViT-L/4, early stopping yields the highest improvement in PS, regardless of architecture. In addition, early stopping is the most efficient strategy as there is no need for an expensive grid search.

*Model*	*Default*	*Width*	*Depth*	Weight Decay	Central Crop	Train Steps	*Best*
ResNet-6	69.1	+0.4	–	+0.3	0.0	+0.5	69.6
ResNet-50	68.2	+0.4	–	+0.7	+0.7	+1.5	69.7
ResNet-200	67.6	+0.2	–	+1.3	+1.2	+1.9	69.5
ViT B/8	67.6	+1.1	+1.0	+1.3	+0.9	+1.1	68.9
ViT L/4	67.9	+0.4	+0.4	-0.1	-1.1	+0.5	68.4

Perceptual Score improves by scaling down ImageNet models. Each value denotes the improvement obtained by scaling down a model across a given hyperparameter over the model with default hyperparameters.

Global Perceptual Functions
In prior work, the perceptual similarity function was computed using Euclidean distances across the spatial dimensions of the image. This assumes a direct correspondence between pixels, which may not hold for warped, translated or rotated images. Instead, we adopt two perceptual functions that rely on global representations of images, namely the style-loss function from the Neural Style Transfer work that captures stylistic similarity between two images, and a normalized mean pool distance function. The style-loss function compares the inter-channel cross-correlation matrix between two images while the mean pool function compares the spatially averaged global representations.

Global perceptual functions consistently improve PS across both networks trained with default hyperparameters (top) and ResNet-200 as a function of train epochs (bottom).

We probe a number of hypotheses to explain the relationship between accuracy and PS and come away with a few additional insights. For example, the accuracy of models without commonly used skip-connections also inversely correlate with PS, and layers close to the input on average have lower PS as compared to layers close to the output. For further exploration involving distortion sensitivity, ImageNet class granularity, and spatial frequency sensitivity, check out our paper.

Conclusion
In this paper, we explore the question of whether improving classification accuracy yields better perceptual metrics. We study the relationship between accuracy and PS on ResNets and ViTs across many different hyperparameters and observe that PS exhibits an inverse-U relationship with accuracy, where accuracy correlates with PS up to a certain point, and then exhibits an inverse-correlation. Finally, in our paper, we discuss in detail a number of explanations for the observed relationship between accuracy and PS, involving skip connections, global similarity functions, distortion sensitivity, layerwise perceptual scores, spatial frequency sensitivity and ImageNet class granularity. While the exact explanation for the observed tradeoff between ImageNet accuracy and perceptual similarity is a mystery, we are excited that our paper opens the door for further research in this area.

Acknowledgements
This is joint work with Neil Houlsby and Nal Kalchbrenner. We would additionally like to thank Basil Mustafa, Kevin Swersky, Simon Kornblith, Johannes Balle, Mike Mozer, Mohammad Norouzi and Jascha Sohl-Dickstein for useful discussions.

Offsites

Table Tennis: A Research Platform for Agile Robotics

Post author By
Post date October 18, 2022
No Comments on Table Tennis: A Research Platform for Agile Robotics

Posted by Avi Singh, Research Scientist, and Laura Graesser, Research Engineer, Robotics at Google

Robot learning has been applied to a wide range of challenging real world tasks, including dexterous manipulation, legged locomotion, and grasping. It is less common to see robot learning applied to dynamic, high-acceleration tasks requiring tight-loop human-robot interactions, such as table tennis. There are two complementary properties of the table tennis task that make it interesting for robotic learning research. First, the task requires both speed and precision, which puts significant demands on a learning algorithm. At the same time, the problem is highly-structured (with a fixed, predictable environment) and naturally multi-agent (the robot can play with humans or another robot), making it a desirable testbed to investigate questions about human-robot interaction and reinforcement learning. These properties have led to several research groups developing table tennis research platforms [1, 2, 3, 4].

The Robotics team at Google has built such a platform to study problems that arise from robotic learning in a multi-player, dynamic and interactive setting. In the rest of this post we introduce two projects, Iterative-Sim2Real (to be presented at CoRL 2022) and GoalsEye (IROS 2022), which illustrate the problems we have been investigating so far. Iterative-Sim2Real enables a robot to hold rallies of over 300 hits with a human player, while GoalsEye enables learning goal-conditioned policies that match the precision of amateur humans.

Iterative-Sim2Real policies playing cooperatively with humans (top) and a GoalsEye policy returning balls to different locations (bottom).

Iterative-Sim2Real: Leveraging a Simulator to Play Cooperatively with Humans
In this project, the goal for the robot is cooperative in nature: to carry out a rally with a human for as long as possible. Since it would be tedious and time-consuming to train directly against a human player in the real world, we adopt a simulation-based (i.e., sim-to-real) approach. However, because it is difficult to simulate human behavior accurately, applying sim-to-real learning to tasks that require tight, close-loop interaction with a human participant is difficult.

In Iterative-Sim2Real, (i.e., i-S2R), we present a method for learning human behavior models for human-robot interaction tasks, and instantiate it on our robotic table tennis platform. We have built a system that can achieve rallies of up to 340 hits with an amateur human player (shown below).

A 340-hit rally lasting over 4 minutes.

Learning Human Behavior Models: a Chicken and Egg Problem
The central problem in learning accurate human behavior models for robotics is the following: if we do not have a good-enough robot policy to begin with, then we cannot collect high-quality data on how a person might interact with the robot. But without a human behavior model, we cannot obtain robot policies in the first place. An alternative would be to train a robot policy directly in the real world, but this is often slow, cost-prohibitive, and poses safety-related challenges, which are further exacerbated when people are involved. i-S2R, visualized below, is a solution to this chicken and egg problem. It uses a simple model of human behavior as an approximate starting point and alternates between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined.

i-S2R Methodology.

Results
To evaluate i-S2R, we repeated the training process five times with five different human opponents and compared it with a baseline approach of ordinary sim-to-real plus fine-tuning (S2R+FT). When aggregated across all players, the i-S2R rally length is higher than S2R+FT by about 9% (below on the left). The histogram of rally lengths for i-S2R and S2R+FT (below on the right) shows that a large fraction of the rallies for S2R+FT are shorter (i.e., less than 5), while i-S2R achieves longer rallies more frequently.

Summary of i-S2R results. Boxplot details: The white circle is the mean, the horizontal line is the median, box bounds are the 25th and 75th percentiles.

We also break down the results based on player type: beginner (40% players), intermediate (40% of players) and advanced (20% players). We see that i-S2R significantly outperforms S2R+FT for both beginner and intermediate players (80% of players).

i-S2R Results by player type.

More details on i-S2R can be found on our preprint, website, and also in the following summary video.

GoalsEye: Learning to Return Balls Precisely on a Physical Robot
While we focused on sim-to-real learning in i-S2R, it is sometimes desirable to learn using only real-world data — closing the sim-to-real gap in this case is unnecessary. Imitation learning (IL) provides a simple and stable approach to learning in the real world, but it requires access to demonstrations and cannot exceed the performance of the teacher. Collecting expert human demonstrations of precise goal-targeting in high speed settings is challenging and sometimes impossible (due to limited precision in human movements). While reinforcement learning (RL) is well-suited to such high-speed, high-precision tasks, it faces a difficult exploration problem (especially at the start), and can be very sample inefficient. In GoalsEye, we demonstrate an approach that combines recent behavior cloning techniques [5, 6] to learn a precise goal-targeting policy, starting from a small, weakly-structured, non-targeting dataset.

Here we consider a different table tennis task with an emphasis on precision. We want the robot to return the ball to an arbitrary goal location on the table, e.g. “hit the back left corner” or ”land the ball just over the net on the right side” (see left video below). Further, we wanted to find a method that can be applied directly on our real world table tennis environment with no simulation involved. We found that the synthesis of two existing imitation learning techniques, Learning from Play (LFP) and Goal-Conditioned Supervised Learning (GCSL), scales to this setting. It is safe and sample efficient enough to train a policy on a physical robot which is as accurate as amateur humans at the task of returning balls to specific goals on the table.

GoalsEye policy aiming at a 20cm diameter goal (left). Human player aiming at the same goal (right).

The essential ingredients of success are:

A minimal, but non-goal-directed “bootstrap” dataset of the robot hitting the ball to overcome an initial difficult exploration problem.
Hindsight relabeled goal conditioned behavioral cloning (GCBC) to train a goal-directed policy to reach any goal in the dataset.
Iterative self-supervised goal reaching. The agent improves continuously by setting random goals and attempting to reach them using the current policy. All attempts are relabeled and added into a continuously expanding training set. This self-practice, in which the robot expands the training data by setting and attempting to reach goals, is repeated iteratively.

GoalsEye methodology.

Demonstrations and Self-Improvement Through Practice Are Key
The synthesis of techniques is crucial. The policy’s objective is to return a variety of incoming balls to any location on the opponent’s side of the table. A policy trained on the initial 2,480 demonstrations only accurately reaches within 30 cm of the goal 9% of the time. However, after a policy has self-practiced for ~13,500 attempts, goal-reaching accuracy rises to 43% (below on the right). This improvement is clearly visible as shown in the videos below. Yet if a policy only self-practices, training fails completely in this setting. Interestingly, the number of demonstrations improves the efficiency of subsequent self-practice, albeit with diminishing returns. This indicates that demonstration data and self-practice could be substituted depending on the relative time and cost to gather demonstration data compared with self-practice.

Self-practice substantially improves accuracy. Left: simulated training. Right: real robot training. The demonstration datasets contain ~2,500 episodes, both in simulation and the real world.

Visualizing the benefits of self-practice. Left: policy trained on initial 2,480 demonstrations. Right: policy after an additional 13,500 self-practice attempts.

More details on GoalsEye can be found in the preprint and on our website.

Conclusion and Future Work
We have presented two complementary projects using our robotic table tennis research platform. i-S2R learns RL policies that are able to interact with humans, while GoalsEye demonstrates that learning from real-world unstructured data combined with self-supervised practice is effective for learning goal-conditioned policies in a precise, dynamic setting.

One interesting research direction to pursue on the table tennis platform would be to build a robot “coach” that could adapt its play style according to the skill level of the human participant to keep things challenging and exciting.

Acknowledgements
We thank our co-authors, Saminda Abeyruwan, Alex Bewley, Krzysztof Choromanski, David B. D’Ambrosio, Tianli Ding, Deepali Jain, Corey Lynch, Pannag R. Sanketi, Pierre Sermanet and Anish Shankar. We are also grateful for the support of many members of the Robotics Team who are listed in the acknowledgement sections of the papers.

Offsites

Crossmodal-3600 — Multilingual Reference Captions for Geographically Diverse Images

Post author By
Post date October 15, 2022
No Comments on Crossmodal-3600 — Multilingual Reference Captions for Geographically Diverse Images

Posted by Ashish Thapliyal, Software Engineer, and Jordi Pont-Tuset, Research Scientist, Google Research

Image captioning is the machine learning task of automatically generating a fluent natural language description for a given image. This task is important for improving accessibility for visually impaired users and is a core task in multimodal research encompassing both vision and language modeling.

However, datasets for image captioning are primarily available in English. Beyond that, there are only a few datasets covering a limited number of languages that represent just a small fraction of the world’s population. Further, these datasets feature images that severely under-represent the richness and diversity of cultures from across the globe. These aspects have hindered research on image captioning for a wide variety of languages, and directly hamper the deployment of accessibility solutions for a large potential audience around the world.

Today we present and make publicly available the Crossmodal 3600 (XM3600) image captioning evaluation dataset as a robust benchmark for multilingual image captioning that enables researchers to reliably compare research contributions in this emerging field. XM3600 provides 261,375 human-generated reference captions in 36 languages for a geographically diverse set of 3600 images. We show that the captions are of high quality and the style is consistent across languages.

The Crossmodal 3600 dataset includes reference captions in 36 languages for each of a geographically diverse set of 3600 images. All images used with permission under the CC-BY 2.0 license.

Overview of the Crossmodal 3600 Dataset
Creating large training and evaluation datasets in multiple languages is a resource-intensive endeavor. Recent work has shown that it is feasible to build multilingual image captioning models trained on machine-translated data with English captions as the starting point. However, some of the most reliable automatic metrics for image captioning are much less effective when applied to evaluation sets with translated image captions, resulting in poorer agreement with human evaluations compared to the English case. As such, trustworthy model evaluation at present can only be based on extensive human evaluation. Unfortunately, such evaluations usually cannot be replicated across different research efforts, and therefore do not offer a fast and reliable mechanism to automatically evaluate multiple model parameters and configurations (e.g., model hill climbing) or to compare multiple lines of research.

XM3600 provides 261,375 human-generated reference captions in 36 languages for a geographically diverse set of 3600 images from the Open Images dataset. We measure the quality of generated captions by comparing them to the manually provided captions using the CIDEr metric, which ranges from 0 (unrelated to the reference captions) to 10 (perfectly matching the reference captions). When comparing pairs of models, we observed strong correlations between the differences in the CIDEr scores of the model outputs, and side-by-side human evaluations comparing the model outputs. , making XM3600 is a reliable tool for high-quality automatic comparisons between image captioning models on a wide variety of languages beyond English.

Language Selection
We chose 30 languages beyond English, roughly based on their percentage of web content. In addition, we chose an additional five languages that include under-resourced languages that have many native speakers or major native languages from continents that would not be covered otherwise. Finally, we also included English as a baseline, thus resulting in a total of 36 languages, as listed in the table below.

Arabic	Bengali*	Chinese	Croatian	Cusco Quechua*	Czech
Danish	Dutch	English	Filipino	Finnish	French
German	Greek	Hebrew	Hindi	Hungarian	Indonesian
Italian	Japanese	Korean	Maori*	Norwegian	Persian
Polish	Portuguese	Romanian	Russian	Spanish	Swahili*
Swedish	Telugu*	Thai	Turkish	Ukrainian	Vietnamese

List of languages used in XM3600. *Low-resource languages with many native speakers, or major native languages from continents that would not be covered otherwise.

Image Selection
The images were selected from among those in the Open Images dataset that have location metadata. Since there are many regions where more than one language is spoken, and some areas are not well covered by these images, we designed an algorithm to maximize the correspondence between selected images and the regions where the targeted languages are spoken. The algorithm starts with the selection of images with geo-data corresponding to the languages for which we have the smallest pool (e.g., Persian) and processes them in increasing order of their candidate image pool size. If there aren’t enough images in an area where a language is spoken, then we gradually expand the geographic selection radius to: (i) a country where the language is spoken; (ii) a continent where the language is spoken; and, as last resort, (iii) from anywhere in the world. This strategy succeeded in providing our target number of 100 images from an appropriate region for most of the 36 languages, except for Persian (where 14 continent-level images are used) and Hindi (where all 100 images are at the global level, because the in-region images were assigned to Bengali and Telugu).

English Photo by Chris Sampson	Swahili Photo by Henrik Palm	Telugu Photo by rojypala

Cusco Quechua Photo by McKay Savage	Filipino Photo by Simon Schoeters	Chinese Photo by Stefan Krasowski

Sample images showcasing the geographical diversity of the annotated images. Images used under CC BY 2.0 license.

Caption Generation
In total, all 3600 images (100 images per language) are annotated in all 36 languages, each with an average of two annotations per language, yielding a total of 261,375 captions.

Annotators work in batches of 15 images. The first screen shows all 15 images with their captions in English as generated by a captioning model trained to output a consistent style of the form “<main salient objects> doing <activities> in the <environment>”, often with object attributes, such as a “smiling” person, “red” car, etc. The annotators are asked to rate the caption quality given guidelines for a 4-point scale from “excellent” to “bad”, plus an option for “not_enough_information”. This step forces the annotators to carefully assess caption quality and it primes them to internalize the style of the captions. The following screens show the images again but individually and without the English captions, and the annotators are asked to produce descriptive captions in the target language for each image.

The image batch size of 15 was chosen so that the annotators would internalize the style without remembering the exact captions. Thus, we expect the raters to generate captions based on the image content only and lacking translation artifacts. For example in the example shown below, the Spanish caption mentions “number 42” and the Thai caption mentions “convertibles”, none of which are mentioned in the English captions. The annotators were also provided with a protocol to use when creating the captions, thus achieving style consistency across languages.

Photo by Brian Solis	English	• A vintage sports car in a showroom with many other vintage sports cars
		• The branded classic cars in a row at display

	Spanish	• Automóvil clásico deportivo en exhibición de automóviles de galería — (Classic sports car in gallery car show)
		• Coche pequeño de carreras color plateado con el número 42 en una exhibición de coches — (Small silver racing car with the number 42 at a car show)

	Thai	• รถเปิดประทุนหลายสีจอดเรียงกันในที่จัดแสดง — (Multicolored convertibles line up in the exhibit)
		• รถแข่งวินเทจจอดเรียงกันหลายคันในงานจัดแสดง — (Several vintage racing cars line up at the show.)

Sample captions in three different languages (out of 36 — see full list of captions in Appendix A of the Crossmodal-3600 paper), showcasing the creation of annotations that are consistent in style across languages, while being free of direct-translation artifacts (e.g., the Spanish “number 42” or the Thai “convertibles” would not be possible when directly translating from the English versions). Image used under CC BY 2.0 license.

<!–

>>>>> gd2md-html alert: inline image link here (to images/image7.jpg). Store image on your image server and adjust path/filename/extension if necessary.
(Back to top)(Next alert)
>>>>>

alt_text

Photo by Brian Solis

English

A vintage sports car in a showroom with many other vintage sports cars
The branded classic cars in a row at display

Spanish

Automóvil clásico deportivo en exhibición de automóviles de galería
(Classic sports car in gallery car show)
Coche pequeño de carreras color plateado con el número 42 en una exhibición de coches
(Small silver racing car with the number 42 at a car show)

Thai

รถเปิดประทุนหลายสีจอดเรียงกันในที่จัดแสดง
(Multicolored convertibles line up in the exhibit)
รถแข่งวินเทจจอดเรียงกันหลายคันในงานจัดแสดง
(Several vintage racing cars line up at the show.)

–>

Caption Quality and Statistics
We ran two to five pilot studies per language to troubleshoot the caption generation process and to ensure high quality captions. We then manually evaluated a random subset of captions. First we randomly selected a sample of 600 images. Then, to measure the quality of captions in a particular language, for each image, we selected for evaluation one of the manually generated captions. We found that:

For 25 out of 36 languages, the percentage of captions rated as “Good” or “Excellent” is above 90%, and the rest are all above 70%.
For 26 out of 36 languages, the percentage of captions rated as “Bad” is below 2%, and the rest are all below 5%.

For languages that use spaces to separate words, the number of words per caption can be as low as 5 or 6 for some agglutinative languages like Cusco Quechua and Czech, and as high as 18 for an analytic language like Vietnamese. The number of characters per caption also varies drastically — from mid-20s for Korean to mid-90s for Indonesian — depending on the alphabet and the script of the language.

Empirical Evaluation and Results
We empirically measured the ability of the XM3600 annotations to rank image captioning model variations by training four variations of a multilingual image captioning model and comparing the CIDEr differences of the models’ outputs over the XM3600 dataset for 30+ languages, to side-by-side human evaluations. We observed strong correlations between the CIDEr differences and the human evaluations. These results support the use of the XM3600 references as a means to achieve high-quality automatic comparisons between image captioning models on a wide variety of languages beyond English.

Recent Uses
Recently PaLI used XM3600 to evaluate model performance beyond English for image captioning, image-to-text retrieval and text-to-image retrieval. The key takeaways they found when evaluating on XM3600 were that multilingual captioning greatly benefits from scaling the PaLI models, especially for low-resource languages.

Acknowledgements
We would like to acknowledge the coauthors of this work: Xi Chen and Radu Soricut.