
The C4_200M Synthetic Dataset for Grammatical Error Correction

Grammatical error correction (GEC) attempts to model grammar and other types of writing errors in order to provide grammar and spelling suggestions, improving the quality of written output in documents, emails, blog posts and even informal chats. Over the past 15 years, there has been a substantial improvement in GEC quality, which can in large part be credited to recasting the problem as a “translation” task. When introduced in Google Docs, for example, this approach resulted in a significant increase in the number of accepted grammar correction suggestions.

One of the biggest challenges for GEC models, however, is data sparsity. Unlike other natural language processing (NLP) tasks, such as speech recognition and machine translation, there is very limited training data available for GEC, even for high-resource languages like English. A common remedy for this is to generate synthetic data using a range of techniques, from heuristic-based random word- or character-level corruptions to model-based approaches. However, such methods tend to be simplistic and do not reflect the true distribution of error types from actual users.

In “Synthetic Data Generation for Grammatical Error Correction with Tagged Corruption Models”, presented at the EACL 16th Workshop on Innovative Use of NLP for Building Educational Applications, we introduce tagged corruption models. Inspired by the popular back-translation data synthesis technique for machine translation, this approach enables the precise control of synthetic data generation, ensuring diverse outputs that are more consistent with the distribution of errors seen in practice. We used tagged corruption models to generate a new 200M sentence dataset, which we have released in order to provide researchers with realistic pre-training data for GEC. By integrating this new dataset into our training pipeline, we were able to significantly improve on GEC baselines.

Tagged Corruption Models
The idea behind applying a conventional corruption model to GEC is to begin with a grammatically correct sentence and then to “corrupt” it by adding errors. A corruption model can be easily trained by switching the source and target sentences in existing GEC datasets, a method that previous studies have shown can be very effective for generating improved GEC datasets.

A conventional corruption model generates an ungrammatical sentence (red) given a clean input sentence (green).

The tagged corruption model that we propose builds on this idea by taking a clean sentence as input along with an error type tag that describes the kind of error one wishes to reproduce. It then generates an ungrammatical version of the input sentence that contains the given error type. Choosing different error types for different sentences increases the diversity of corruptions compared to a conventional corruption model.
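As a rough sketch of how training pairs for the two kinds of corruption model could be built from an existing GEC corpus: swap source and target, and for the tagged variant prefix the clean input with an error type tag. The tag names (`DET`, `NOUN:INFL`), the `<TAG>` prefix format, and the helper names are illustrative assumptions, not the format used in the paper.

```python
from typing import List, Tuple

# One GEC example: (ungrammatical source, corrected target, error type tag).
# The tags here are hypothetical labels chosen for illustration.
GecExample = Tuple[str, str, str]


def corruption_pairs(examples: List[GecExample]) -> List[Tuple[str, str]]:
    """Swap source and target so a model learns clean -> corrupted."""
    return [(clean, noisy) for noisy, clean, _tag in examples]


def tagged_corruption_pairs(examples: List[GecExample]) -> List[Tuple[str, str]]:
    """Same swap, but prefix the clean input with the error type tag."""
    return [(f"<{tag}> {clean}", noisy) for noisy, clean, tag in examples]


examples = [
    ("There is sheep in the field.", "There is a sheep in the field.", "DET"),
    ("There are two sheeps.", "There are two sheep.", "NOUN:INFL"),
]

for model_input, model_output in tagged_corruption_pairs(examples):
    print(model_input, "->", model_output)
```

With the tag in the input, the same clean sentence can be paired with different corruptions at generation time simply by changing the prefix.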

Tagged corruption models generate corruptions (red) for the clean input sentence (green) depending on the error type tag. A determiner error may lead to dropping the “a”, whereas a noun-inflection error may produce the incorrect plural “sheeps”.

To use this model for data generation, we first randomly selected 200M clean sentences from the C4 corpus and assigned an error type tag to each sentence such that their relative frequencies matched the error type tag distribution of the small development set BEA-dev. Since BEA-dev is a carefully curated set that covers a wide range of different English proficiency levels, we expect its tag distribution to be representative of writing errors found in the wild. We then used a tagged corruption model to synthesize the source sentence.
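The tag assignment step can be sketched as weighted sampling: estimate relative tag frequencies from a development set, then draw one tag per clean sentence with those weights. This is a minimal sketch under assumptions; the function names and the toy tag list are hypothetical, and the paper may assign tags differently (for example, by exact quota rather than by random draw).

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def tag_distribution(dev_tags: List[str]) -> Dict[str, float]:
    """Relative frequency of each error type tag in a development set."""
    counts = Counter(dev_tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def assign_tags(
    sentences: List[str], distribution: Dict[str, float], seed: int = 0
) -> List[Tuple[str, str]]:
    """Draw one tag per clean sentence, weighted by the dev-set frequencies."""
    rng = random.Random(seed)
    tags = list(distribution)
    weights = [distribution[t] for t in tags]
    sampled = rng.choices(tags, weights=weights, k=len(sentences))
    return list(zip(sampled, sentences))


# Toy stand-in for the tags observed in a development set.
dev_tags = ["DET", "DET", "SPELL", "PUNCT"]
dist = tag_distribution(dev_tags)
clean = ["There is a sheep in the field.", "She reads every day.", "We left early."]
for tag, sentence in assign_tags(clean, dist):
    print(tag, "|", sentence)
```

Over a large number of sentences, the sampled tag frequencies approach the development-set distribution, which is the property the paragraph above relies on.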

Synthetic data generation with tagged corruption models. The clean C4 sentences (green) are paired with the corrupted sentences (red) in the synthetic GEC training corpus. The corrupted sentences are generated using a tagged corruption model by following the error type frequencies in the development set (bar chart).

In our experiments, tagged corruption models outperformed untagged corruption models on two standard development sets (CoNLL-13 and BEA-dev) by more than three F0.5-points (a standard metric in GEC research that combines precision and recall with more weight on precision), advancing the state-of-the-art on the two widely used academic test sets, CoNLL-14 and BEA-test.
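For readers unfamiliar with the metric, F0.5 is the general F-beta score with beta = 0.5. The sketch below uses the standard F-beta formula; the precision and recall values are made up for illustration and are not results from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only, not results from the paper.
p, r = 0.6, 0.3
print(f"F0.5 = {f_beta(p, r):.3f}")       # precision-weighted
print(f"F1   = {f_beta(p, r, 1.0):.3f}")  # balanced, for comparison
```

For this high-precision, low-recall example, F0.5 comes out at 0.500 while F1 is 0.400, which shows how the metric rewards a system that makes fewer but more reliable corrections.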

In addition, the use of tagged corruption models not only yields gains on standard GEC test sets, but also makes it possible to adapt GEC systems to the proficiency levels of users. This could be useful because the error tag distribution for native English writers often differs significantly from the distributions for non-native English speakers. For example, native speakers tend to make more punctuation and spelling mistakes, whereas determiner errors (e.g., missing or superfluous articles, like “a”, “an” or “the”) are more common in text from non-native writers.

Neural sequence models are notoriously data-hungry, but annotated training data for grammatical error correction is scarce. Our new C4_200M corpus is a synthetic dataset containing diverse grammatical errors, which yields state-of-the-art performance when used to pre-train GEC systems. By releasing the dataset we hope to provide GEC researchers with a valuable resource to train strong baseline systems.


torch: Just-in-time compilation (JIT) for R-less model deployment

Using the torch just-in-time (JIT) compiler, it is possible to query a model trained in R from a different language, provided that language can make use of the low-level libtorch library. This post shows how. In addition, we try to untangle a bit of the terminological jumble surrounding the topic.


A Dataset Exploration Case Study with Know Your Data

Data underlies much of machine learning (ML) research and development, helping to structure what a machine learning algorithm learns and how models are evaluated and benchmarked. However, data collection and labeling can be complicated by unconscious biases, data access limitations and privacy concerns, among other challenges. As a result, machine learning datasets can reflect unfair social biases along dimensions of race, gender, age, and more.

Methods of examining datasets that can surface information about how different social groups are represented within are a key component of ensuring development of ML models and datasets is aligned with our AI Principles. Such methods can inform the responsible use of ML datasets and point toward potential mitigations of unfair outcomes. For example, prior research has demonstrated that some object recognition datasets are biased toward images sourced from North America and Western Europe, prompting Google’s Crowdsource effort to balance out image representations in other parts of the world.

Today, we demonstrate some of the functionality of a dataset exploration tool, Know Your Data (KYD), recently introduced at Google I/O, using the COCO Captions dataset as a case study. Using this tool, we find a range of gender and age biases in COCO Captions — biases that can be traced to both dataset collection and annotation practices. KYD is a dataset analysis tool that complements the growing suite of responsible AI tools being developed across Google and the broader research community. Currently, KYD only supports analysis of a small set of image datasets, but we’re working hard to make the tool accessible beyond this set.

Introducing Know Your Data
Know Your Data helps ML research, product and compliance teams understand datasets, with the goal of improving data quality, and thus helping to mitigate fairness and bias issues. KYD offers a range of features that allow users to explore and examine machine learning datasets — users can filter, group, and study correlations based on annotations already present in a given dataset. KYD also presents automatically computed labels from Google’s Cloud Vision API, providing users with a simple way to explore their data based on signals that weren’t originally present in the dataset.

A KYD Case Study
As a case study, we explore some of these features using the COCO Captions dataset, an image dataset that contains five human-generated captions for each of over 300k images. Given the rich annotations provided by free-form text, we focus our analysis on signals already present within the dataset.

Exploring Gender Bias
Previous research has demonstrated undesirable gender biases within computer vision datasets, including pornographic imagery of women and image label correlations that align with harmful gender stereotypes. We use KYD to explore gender biases within COCO Captions by examining gendered correlations within the image captions. We find a gender bias in the depiction of different activities across the images in the dataset, as well as biases relating to how people of different genders are described by annotators.

The first part of our analysis aimed to surface gender biases with respect to different activities depicted in the dataset. We examined images captioned with words describing different activities and analyzed their relation to gendered caption words, such as “man” or “woman”. The KYD Relations tab makes it easy to examine the relation between two different signals in a dataset by visualizing the extent to which two signals co-occur more (or less) than would be expected by chance. Each cell indicates either a positive (blue color) or negative (orange color) correlation between two specific signal values along with the strength of that correlation.

KYD also allows users to filter rows of a relations table based on substring matching. Using this functionality, we initially probed for caption words containing “-ing”, as a simple way to filter by verbs. We immediately saw strong gendered correlations:

Using KYD to analyze the relationship between any word and gendered words. Each cell shows if the two respective words co-occur in the same caption more (up arrow) or less often (down arrow) than pure chance.

Digging further into these correlations, we found that several activities stereotypically associated with women, such as “shopping” and “cooking”, co-occur with images captioned with “women” or “woman” at a higher rate than with images captioned with “men” or “man”. In contrast, captions describing many physically intensive activities, such as “skateboarding”, “surfing”, and “snowboarding”, co-occur with images captioned with “man” or “men” at higher rates.

While individual image captions may not use stereotypical or derogatory language, such as with the example below, if certain gender groups are over (or under) represented within a particular activity across the whole dataset, models developed from the dataset risk learning stereotypical associations. KYD makes it easy to surface, quantify, and make plans to mitigate this risk.

An image with one of the captions: “Two women cooking in a beige and white kitchen.” Image licensed under CC-BY 2.0.

In addition to examining biases with respect to the social groups depicted with different activities, we also explored biases in how annotators described the appearance of people they perceived as male or female. Inspired by media scholars who have examined the “male gaze” embedded in other forms of visual media, we examined the frequency with which individuals perceived as women in COCO are described using adjectives that position them as an object of desire. KYD allowed us to easily examine co-occurrences between words associated with binary gender (e.g. “female/girl/woman” vs. “male/man/boy”) and words associated with evaluating physical attractiveness. Importantly, these are captions written by human annotators, who are making subjective assessments about the gender of people in the image and choosing a descriptor for attractiveness. We see that the words “attractive”, “beautiful”, “pretty”, and “sexy” are overrepresented in describing people perceived as women as compared to those perceived as men, confirming what prior work has said about how gender is viewed in visual media.

A screenshot from KYD showing the relationship between words that describe attractiveness and gendered words. For example, “attractive” and “male/man/boy” co-occur 12 times, but we expect ~60 times by chance (the ratio is 0.2x). On the other hand, “attractive” and “female/woman/girl” co-occur 2.62 times more than chance.
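The ratio in the caption (12 observed against roughly 60 expected gives 0.2x) can be reproduced in spirit with a simple independence baseline: if two words occurred independently, the expected number of captions containing both is count(a) × count(b) / N. The sketch below is our own illustration on toy captions; the function name is hypothetical, and KYD's exact statistic and tokenization are not described in this post.

```python
from typing import List, Tuple


def cooccurrence_ratio(
    captions: List[str], word_a: str, word_b: str
) -> Tuple[int, float, float]:
    """Return (observed, expected, observed / expected) co-occurrence counts.

    `expected` assumes the two words appear independently across captions.
    """
    token_sets = [set(c.lower().split()) for c in captions]
    n = len(token_sets)
    n_a = sum(word_a in t for t in token_sets)
    n_b = sum(word_b in t for t in token_sets)
    observed = sum(word_a in t and word_b in t for t in token_sets)
    expected = n_a * n_b / n if n else 0.0
    ratio = observed / expected if expected else float("nan")
    return observed, expected, ratio


# Toy captions, not taken from COCO Captions.
captions = [
    "a man surfing",
    "a man surfing a wave",
    "a woman cooking",
    "a woman reading",
]
print(cooccurrence_ratio(captions, "man", "surfing"))
print(cooccurrence_ratio(captions, "woman", "surfing"))
```

A ratio above 1 means the pair co-occurs more often than the independence baseline predicts, and a ratio below 1 means less often.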

KYD also allows us to manually inspect images for each relation by clicking on the relation in question. For example, we can see images whose captions include female terms (e.g. “woman”) and the word “beautiful”.

Exploring Age Bias
Adults older than 65 have been shown to be underrepresented in datasets relative to their presence in the general population — a first step toward improving age representation is to allow developers to assess it in their datasets. By looking at caption words describing different activities and analyzing their relation to caption words describing age, KYD helped us to assess the range of example captions depicting older adults. Having example captions of adults in a range of environments and activities is important for a variety of tasks, such as image captioning or pedestrian detection.

The first trend that KYD made clear is how rarely annotators described people as older adults in captions detailing different activities. The relations tab also shows a trend wherein “elderly”, “old”, and “older” tend not to occur with verbs that describe a variety of physical activities that might be important for a system to be able to detect. Important to note is that, relative to “young”, “old” is more often used to describe things other than people, such as belongings or clothing, so these relations are also capturing some uses that don’t describe people.

The relationship between words associated with age and movement from a screenshot of KYD.

The underrepresentation of captions containing the references to older adults that we examined here could be rooted in a relative lack of images depicting older adults as well as in a tendency for annotators to omit older age-related terms when describing people in images. Although “old” and “running” already show a negative relation, manual inspection of the images where the two words do co-occur shows no older people and a number of locomotives. KYD makes it easy to quantitatively and qualitatively inspect relations to identify dataset strengths and areas for improvement.

Understanding the contents of ML datasets is a critical first step to developing suitable strategies to mitigate the downstream impact of unfair dataset bias. The above analysis points towards several potential mitigations. For example, correlations between certain activities and social groups, which can lead trained models to reproduce social stereotypes, can be potentially mitigated by “dataset balancing”, that is, increasing the representation of under-represented group/activity combinations. However, mitigations focused exclusively on dataset balancing are not sufficient, as our analysis of how different genders are described by annotators demonstrated. We found annotators’ subjective judgements of people portrayed in images were reflected within the final dataset, suggesting that a deeper look at methods of image annotation is needed. One solution for data practitioners who are developing image captioning datasets is to consider integrating guidelines that have been developed for writing image descriptions that are sensitive to race, gender, and other identity categories.

The above case studies highlight only some of the KYD features. For example, Cloud Vision API signals are also integrated into KYD and can be used to infer signals that annotators haven’t labeled directly. We encourage the broader ML community to perform their own KYD case studies and share their findings.

KYD complements other dataset analysis tools being developed across the ML community, including Google’s growing Responsible AI toolkit. We look forward to ML practitioners using KYD to better understand their datasets and mitigate potential bias and fairness concerns. If you have feedback on KYD, please write to

The analysis and write-up in this post were conducted with equal contribution by Emily Denton, Mark Díaz, and Alex Hanna. We thank Marie Pellat, Ludovic Peran, Daniel Smilkov, Nikhil Thorat and Tsung-Yi for their contributions to and reviews of this post.


Improved Detection of Elusive Polyps via Machine Learning

With the increasing ability to consistently and accurately process large amounts of data, particularly visual data, computer-aided diagnostic systems are more frequently being used to assist physicians in their work. This, in turn, can lead to meaningful improvements in health care. An example of where this could be especially useful is in the diagnosis and treatment of colorectal cancer (CRC), which is especially deadly and results in over 900K deaths per year, globally. CRC originates in small pre-cancerous lesions in the colon, called polyps, the identification and removal of which is very successful in preventing CRC-related deaths.

The standard procedure used by gastroenterologists (GIs) to detect and remove polyps is the colonoscopy, and about 19 million such procedures are performed annually in the US alone. During a colonoscopy, the gastroenterologist uses a camera-containing probe to check the intestine for pre-cancerous polyps and early signs of cancer, and removes tissue that looks worrisome. However, complicating factors, such as incomplete detection (in which the polyp appears within the field of view, but is missed by the GI, perhaps due to its size or shape) and incomplete exploration (in which the polyp does not appear in the camera’s field of view), can lead to a high fraction of missed polyps. In fact, studies suggest that 22%–28% of polyps are missed during colonoscopies, of which 20%–24% have the potential to become cancerous (adenomas).

Today, we are sharing progress made in using machine learning (ML) to help GIs fight colorectal cancer by making colonoscopies more effective. In “Detection of Elusive Polyps via a Large Scale AI System”, we present an ML model designed to combat the problem of incomplete detection by helping the GI detect polyps that are within the field of view. This work adds to our previously published work that maximizes the coverage of the colon during the colonoscopy by flagging for GI follow-up areas that may have been missed. Using clinical studies, we show that these systems significantly improve polyp detection rates.

Incomplete Exploration
To help the GI detect polyps that are outside the field of view, we previously developed an ML system that reduces the rate of incomplete exploration by estimating the fractions of covered and non-covered regions of a colon during a colonoscopy. This earlier work uses computer vision and geometry in a technique we call colonoscopy coverage deficiency via depth, to compute segment-by-segment coverage for the colon. It does so in two phases: first computing depth maps for each frame of the colonoscopy video, and then using these depth maps to compute the coverage in real time.

The ML system computes a depth image (middle) from a single RGB image (left). Then, based on the computation of depth images for a video sequence, it calculates local coverage (right), and detects where the coverage has been deficient and a second look is required (blue indicates observed segments, whereas red indicates uncovered ones). You can learn more about this work in our previous blog post.

This segment-by-segment work yields the ability to estimate what fraction of the current segment has been covered. The helpfulness of such functionality is clear: during the procedure itself, a physician may be alerted to segments with deficient coverage, and can immediately return to review these areas, potentially reducing the rates of missed polyps due to incomplete exploration.

Incomplete Detection
In our most recent paper, we look into the problem of incomplete detection. We describe an ML model that aids a GI in detecting polyps that are within the field of view, so as to reduce the rate of incomplete detection. We developed a system that is based on convolutional neural networks (CNN) with an architecture that combines temporal logic with a single frame detector, resulting in more accurate detection.

This new system has two principal advantages. The first is that the system improves detection performance by reducing the number of false negatives on elusive polyps, i.e., polyps that are particularly difficult for GIs to detect. The second advantage is the very low false positive rate of the system. This low false positive rate makes these systems more likely to be adopted in the clinic.

Examples of the variety of polyps detected by the ML system.

We trained the system on 3600 procedures (86M video frames) and tested it on 1400 procedures (33M frames). All the videos and metadata were de-identified. The system detected 97% of the polyps (i.e., it yielded 97% sensitivity) at 4.6 false alarms per procedure, which is a substantial improvement over previously published results. Of the false alarms, follow-up review showed that some were, in fact, valid polyp detections, indicating that the system was able to detect polyps that were missed by the performing endoscopist and by those who annotated the data. The performance of the system on these elusive polyps suggests its generalizability in that the system has learned to detect examples that were initially missed by all who viewed the procedure.
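The two headline numbers in this paragraph are simple ratios, sketched below. The counts are invented and chosen only so that the outputs match the rates quoted above; they are not the study's actual counts, and the function names are our own.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual polyps that the system detected."""
    return true_positives / (true_positives + false_negatives)


def false_alarms_per_procedure(false_positives: int, procedures: int) -> float:
    """Average number of false alarms raised in one procedure."""
    return false_positives / procedures


# Hypothetical counts: 97 of 100 polyps detected, 46 false alarms in 10 procedures.
print(f"sensitivity: {sensitivity(97, 3):.2f}")
print(f"false alarms per procedure: {false_alarms_per_procedure(46, 10):.1f}")
```

Reporting the two together matters because either number can be improved trivially at the expense of the other, for instance by raising or lowering the detection threshold.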

We evaluated the system performance on polyps that are in the field of view for less than five seconds, which makes them more difficult for the GI to detect, and for which models typically have much lower sensitivity. In this case, the system attained a sensitivity about three times that achieved in the original procedure. When the polyps were present in the field of view for less than two seconds, the difference was even starker: the system exhibited a 4x improvement in sensitivity.

It is also interesting to note that the system is fairly insensitive to the choice of neural network architecture. We used two architectures: RetinaNet and LSTM-SSD. RetinaNet is a leading technique for object detection on static images (used for video by applying it to frames in a consecutive fashion). It is one of the top performers on a variety of benchmarks, given a fixed computational budget, and is known for balancing speed of computation with accuracy. LSTM-SSD is a true video object detection architecture, which can explicitly account for the temporal character of the video (e.g., temporal consistency of detections, ability to deal with blur and fast motion, etc.). It is known for being robust and very computationally lightweight and can therefore run on less expensive processors. Comparable results were also obtained on the much heavier Faster R-CNN architecture. The fact that results are similar across different architectures implies that one can choose the network meeting the available hardware specifications.

Prospective Clinical Research Study
As part of the research reported in our detection paper we ran a clinical validation on 100 procedures in collaboration with Shaare Zedek Medical Center in Jerusalem, where our system was used in real time to help GIs. The system helped detect an average of one polyp per procedure that would have otherwise been missed by the GI performing the procedure, while not missing any of the polyps detected by the GIs, and with 3.8 false alarms per procedure. The feedback from the GIs was consistently positive.

We are encouraged by the potential helpfulness of this system for improving polyp detection, and we look forward to working together with the doctors in the procedure room to further validate this research.

The research was conducted by teams from Google Health and Google Research, Israel with support from Verily Life Sciences, and in collaboration with Shaare Zedek Medical Center. Verily is advancing this research via a newly established center in Israel, led by Ehud Rivlin. This research was conducted by Danny Veikherman, Tomer Golany, Dan M. Livovsky, Amit Aides, Valentin Dashinsky, Nadav Rabani, David Ben Shimol, Yochai Blau, Liran Katzir, Ilan Shimshoni, Yun Liu, Ori Segol, Eran Goldin, Greg Corrado, Jesse Lachter, Yossi Matias, Ehud Rivlin, and Daniel Freedman. Our appreciation also goes to several institutions and GIs who provided advice along the way and tested our system prototype. We would like to thank all of our team members and collaborators who worked on this project with us, including: Chen Barshai, Nia Stoykova, and many others.


Two New Datasets for Conversational NLP: TimeDial and Disfl-QA

A key challenge in natural language processing (NLP) is building conversational agents that can understand and reason about different language phenomena that are unique to realistic speech. For example, because people do not always premeditate exactly what they are going to say, a natural conversation often includes interruptions to speech, called disfluencies. Such disfluencies can be simple (like interjections, repetitions, restarts, or corrections), which simply break the continuity of a sentence, or more complex semantic disfluencies, in which the underlying meaning of a phrase changes. In addition, understanding a conversation also often requires knowledge of temporal relationships, like whether an event precedes or follows another. However, conversational agents built on today’s NLP models often struggle when confronted with temporal relationships or with disfluencies, and progress on improving their performance has been slow. This is due, in part, to a lack of datasets that involve such interesting conversational and speech phenomena.

To stir interest in this direction within the research community, we are excited to introduce TimeDial, for temporal commonsense reasoning in dialog, and Disfl-QA, which focuses on contextual disfluencies. TimeDial presents a new multiple choice span filling task targeted for temporal understanding, with an annotated test set of over 1.1k dialogs. Disfl-QA is the first dataset containing contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages, with ~12k human annotated disfluent questions. These benchmark datasets are the first of their kind and show a significant gap between human performance and current state-of-the-art NLP models.

While people can effortlessly reason about everyday temporal concepts, such as duration, frequency, or relative ordering of events in a dialog, such tasks can be challenging for conversational agents. For example, current NLP models often make a poor selection when tasked with filling in a blank (as shown below) that assumes a basic level of world knowledge for reasoning, or that requires understanding explicit and implicit inter-dependencies between temporal concepts across conversational turns.

It is easy for a person to judge that “half past one” and “quarter to two” are more plausible options to fill in the blank than “half past three” and “half past nine”. However, performing such temporal reasoning in the context of a dialog is not trivial for NLP models, as it requires appealing to world knowledge (i.e., knowing that the participants are not yet late for the meeting) and understanding the temporal relationship between events (“half past one” is before “three o’clock”, while “half past three” is after it). Indeed, current state-of-the-art models like T5 and BERT end up picking the wrong answers — “half past three” (T5) and “half past nine” (BERT).

The TimeDial benchmark dataset (derived from the DailyDialog multi-turn dialog corpus) measures models’ temporal commonsense reasoning abilities within a dialog context. Each of the ~1.5k dialogs in the dataset is presented in a multiple choice setup, in which one temporal span is masked out and the model is asked to find all correct answers from a list of four options to fill in the blank.

In our experiments we found that while people can easily answer these multiple choice questions (at 97.8% accuracy), state-of-the-art pre-trained language models still struggle on this challenge set. We experiment across three different modeling paradigms: (i) classification over the provided 4 options using BERT, (ii) mask filling for the masked span in the dialog using BERT-MLM, (iii) generative methods using T5. We observe that all the models struggle on this challenge set, with the best variant only scoring 73%.

Model   2-best Accuracy
Human   97.8%
BERT – Classification   50.0%
BERT – Mask Filling   68.5%
T5 – Generation   73.0%
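The post does not spell out how 2-best accuracy is computed. The sketch below assumes one plausible reading: an instance counts as correct only when both of the model's two highest-scoring options are among the correct answers. The function name, score layout, and toy data are our own assumptions.

```python
from typing import List, Set


def two_best_accuracy(scores: List[List[float]], gold: List[Set[int]]) -> float:
    """Share of instances whose two top-scoring options are both correct.

    scores[i][j] is the model's score for option j of instance i;
    gold[i] is the set of indices of the correct options for instance i.
    """
    hits = 0
    for option_scores, correct in zip(scores, gold):
        ranked = sorted(
            range(len(option_scores)), key=lambda j: option_scores[j], reverse=True
        )
        if set(ranked[:2]) <= correct:
            hits += 1
    return hits / len(scores)


# Toy example: two instances, four options each, options 0 and 1 correct.
scores = [
    [0.9, 0.8, 0.1, 0.2],  # top two are options 0 and 1: counted as correct
    [0.9, 0.1, 0.8, 0.2],  # top two are options 0 and 2: counted as wrong
]
gold = [{0, 1}, {0, 1}]
print(two_best_accuracy(scores, gold))
```

Under this reading the metric is stricter than ordinary top-1 accuracy, since ranking a single correct option first is not enough.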

Qualitative error analyses show that the pre-trained language models often rely on shallow, spurious features (particularly text matching), instead of truly doing reasoning over the context. It is likely that building NLP models capable of performing the kind of temporal commonsense reasoning needed for TimeDial requires rethinking how temporal objects are represented within general text representations.

As disfluency is inherently a speech phenomenon, it is most commonly found in text output from speech recognition systems. Understanding such disfluent text is key to building conversational agents that understand human speech. Unfortunately, research in the NLP and speech community has been impeded by the lack of curated datasets containing such disfluencies, and the datasets that are available, like Switchboard, are limited in scale and complexity. As a result, it’s difficult to stress test NLP models in the presence of disfluencies.

Disfluency   Example
Interjection   When is, uh, Easter this year?
Repetition   When is Eas... Easter this year?
Correction   When is Lent, I mean Easter, this year?
Restart   How much, no wait, when is Easter this year?
Different kinds of disfluencies. The reparandum (words intended to be corrected or ignored; in red), interregnum (optional discourse cues; in grey) and repair (the corrected words; in blue).

Disfl-QA is the first dataset containing contextual disfluencies in an information seeking setting, namely question answering over Wikipedia passages from SQuAD. Disfl-QA is a targeted dataset for disfluencies, in which all questions (~12k) contain disfluencies, making for a much larger disfluent test set than prior datasets. Over 90% of the disfluencies in Disfl-QA are corrections or restarts, making it a much more difficult test set for disfluency correction. In addition, compared to earlier disfluency datasets, it contains a wider variety of semantic distractors, i.e., distractors that carry semantic meaning as opposed to simpler speech disfluencies. 

Passage: …The Normans (Norman: Nourmands; French: Normands; Latin: Normanni) were the people who in the 10th and 11th centuries gave their name to Normandy, a region in France. They were descended from Norse (“Norman” comes from “Norseman”) raiders and pirates from Denmark, Iceland and Norway who, under their leader Rollo, …
Q1:   In what country is Normandy located?   France ✓
DQ1:   In what country is Norse found no wait Normandy not Norse?   Denmark X
Q2:   When were the Normans in Normandy?   10th and 11th centuries ✓
DQ2:   From which countries no tell me when were the Normans in Normandy?   Denmark, Iceland and Norway X
A passage and questions (Qi) from SQuAD dataset, along with their disfluent versions (DQi), consisting of semantic distractors (like “Norse” and “from which countries”) and predictions from a T5 model.

Here, the first question (Q1) is seeking an answer about the location of Normandy. In the disfluent version (DQ1) Norse is mentioned before the question is corrected. The presence of this correctional disfluency confuses the QA model, which tends to rely on shallow textual cues from the question for making predictions.

Disfl-QA also includes newer phenomena, such as coreference (expressions referring to the same entity) between the reparandum and the repair.

SQuAD    Disfl-QA
Who does BSkyB have an operating license from?    Who removed [BSkyB’s] operating license, no scratch that, who do [they] have [their] operating license from?

Experiments show that the performance of existing state-of-the-art language model–based question answering systems degrades significantly when tested on Disfl-QA and heuristic disfluencies (presented in the paper) in a zero-shot setting.

Dataset   F1
SQuAD   89.59
Heuristics   65.27 (-24.32)
Disfl-QA   61.64 (-27.95)

We show that data augmentation methods partially recover the loss in performance and also demonstrate the efficacy of using human-annotated training data for fine-tuning. We argue that researchers need large-scale disfluency datasets in order for NLP models to be robust to disfluencies.

Understanding language phenomena that are unique to human speech, like disfluencies and temporal reasoning, among others, is a key ingredient for enabling more natural human–machine communication in the near future. With TimeDial and Disfl-QA, we aim to fill a major research gap by providing these datasets as testbeds for NLP models, in order to evaluate their robustness to ubiquitous phenomena across different tasks. It is our hope that the broader NLP community will devise generalized few-shot or zero-shot approaches to effectively handle these phenomena, without requiring task-specific human-annotated training datasets, constructed specifically for these challenges.

The TimeDial work has been a team effort involving Lianhui Qin, Luheng He, Yejin Choi, Manaal Faruqui and the authors. The Disfl-QA work has been a collaboration involving Jiacheng Xu, Diyi Yang and Manaal Faruqui.


Google at ACL 2021

This week, the 59th annual meeting of the Association for Computational Linguistics (ACL), a premier conference covering a broad spectrum of research areas that are concerned with computational approaches to natural language, is taking place online.

As a leader in natural language processing and understanding, and a Diamond Level sponsor of ACL 2021, Google will showcase the latest research in the field with over 35 publications, and the organization of and participation in a variety of workshops and tutorials.

If you’re registered for ACL 2021, we hope that you’ll visit the Google virtual booth in Gather Town to learn more about the projects and opportunities at Google that go into solving interesting problems for billions of people. You can also learn more about Google’s participation on the ACL 2021 Expo page, and see a full list of Google publications below (Google affiliations in bold).

Organizing Committee
Senior Area Chairs include: Dan Roth, Emily Pitler, Jimmy Lin, Ming-Wei Chang, Sebastian Ruder, Slav Petrov
Area Chairs include: Ankur P. Parikh, Artem Sokolov, Bhuwan Dhingra, Cicero Nogueira dos Santos, Colin Cherry, Dani Yogatama, David Mimno, Hideto Kazawa, Ian Tenney, Jasmijn Bastings, Jun Suzuki, Katja Filippova, Kyle Gorman, Lu Wang, Manaal Faruqui, Natalie Schluter, Peter Liu, Radu Soricut, Sebastian Gehrmann, Shashi Narayan, Tal Linzen, Vinodkumar Prabhakaran, Waleed Ammar

Parameter-Efficient Multi-task Fine-Tuning for Transformers via Shared Hypernetwork
Rabeeh Karimi Mahabadi*, Sebastian Ruder, Mostafa Dehghani, James Henderson

TicketTalk: Toward Human-Level Performance with End-to-End, Transaction-Based Dialog Systems
Bill Byrne, Karthik Krishnamoorthi, Saravanan Ganesh, Mihir Sanjay Kale

Increasing Faithfulness in Knowledge-Grounded Dialogue with Controllable Features
Hannah Rashkin, David Reitter, Gaurav Singh Tomar, Dipanjan Das

Compositional Generalization and Natural Language Variation: Can a Semantic Parsing Approach Handle Both?
Peter Shaw, Ming-Wei Chang, Panupong Pasupat, Kristina Toutanova

Exploiting Language Relatedness for Low Web-Resource Language Model Adaptation: An Indic Languages Study
Yash Khemchandani, Sarvesh Mehtani, Vaidehi Patil, Abhijeet Awasthi, Partha Talukdar, Sunita Sarawagi

Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models
Matthew Finlayson, Aaron Mueller, Sebastian Gehrmann, Stuart Shieber, Tal Linzen*, Yonatan Belinkov

Modeling Fine-Grained Entity Types with Box Embeddings
Yasumasa Onoe, Michael Boratko, Andrew McCallum, Greg Durrett

TextSETTR: Few-Shot Text Style Extraction and Tunable Targeted Restyling
Parker Riley*, Noah Constant, Mandy Guo, Girish Kumar*, David Uthus, Zarana Parekh

Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering
Najoung Kim*, Ellie Pavlick, Burcu Karagol Ayan, Deepak Ramachandran

H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
Zhenhai Zhu, Radu Soricut

Are Pretrained Convolutions Better than Pretrained Transformers?
Yi Tay, Mostafa Dehghani, Jai Gupta, Dara Bahri, Vamsi Aribandi, Zhen Qin, Donald Metzler

Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference
Robert L Logan IV, Andrew McCallum, Sameer Singh, Dan Bikel

PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling
Xiaoxue Zang, Lijuan Liu, Maria Wang, Yang Song*, Hao Zhang, Jindong Chen

Focus Attention: Promoting Faithfulness and Diversity in Summarization
Rahul Aralikatte*, Shashi Narayan, Joshua Maynez, Sascha Rothe, Ryan McDonald*

A Cognitive Regularizer for Language Modeling
Jason Wei, Clara Meister, Ryan Cotterell

Language Model Augmented Relevance Score
Ruibo Liu, Jason Wei, Soroush Vosoughi

Cross-Replication Reliability – An Empirical Approach to Interpreting Inter-rater Reliability
Ka Wong, Praveen Paritosh, Lora Aroyo

TIMEDIAL: Temporal Commonsense Reasoning in Dialog
Lianhui Qin*, Aditya Gupta, Shyam Upadhyay, Luheng He, Yejin Choi, Manaal Faruqui

StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling
Yikang Shen*, Yi Tay, Che Zheng, Dara Bahri, Donald Metzler, Aaron Courville

MOLEMAN: Mention-Only Linking of Entities with a Mention Annotation Network
Nicholas FitzGerald, Jan A. Botha, Daniel Gillick, Daniel M. Bikel, Tom Kwiatkowski, Andrew McCallum

Neural Retrieval for Question Answering with Cross-Attention Supervised Data Augmentation
Yinfei Yang, Ning Jin, Kuo Lin, Mandy Guo, Daniel Cer

ROPE: Reading Order Equivariant Positional Encoding for Graph-Based Document Information Extraction
Chen-Yu Lee, Chun-Liang Li, Chu Wang*, Renshen Wang, Yasuhisa Fujii, Siyang Qin, Ashok Popat, Tomas Pfister

Measuring and Improving BERT’s Mathematical Abilities by Predicting the Order of Reasoning
Piotr Piekos, Henryk Michalewski, Mateusz Malinowski

Improving Compositional Generalization in Classification Tasks via Structure Annotations
Juyong Kim, Pradeep Ravikumar, Joshua Ainslie, Santiago Ontañón

A Simple Recipe for Multilingual Grammatical Error Correction
Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn

nmT5 – Is Parallel Data Still Relevant for Pre-training Massively Multilingual Language Models?
Mihir Kale, Aditya Siddhant, Noah Constant, Melvin Johnson, Rami Al-Rfou, Linting Xue

QA-Driven Zero-Shot Slot Filling with Weak Supervision Pretraining
Xinya Du*, Luheng He, Qi Li, Dian Yu*, Panupong Pasupat, Yuan Zhang

AgreeSum: Agreement-Oriented Multi-Document Summarization
Richard Yuanzhe Pang*, Adam D. Lelkes, Vinh Q. Tran, Cong Yu

Disfl-QA: A Benchmark Dataset for Understanding Disfluencies in Question Answering
Aditya Gupta, Jiacheng Xu*, Shyam Upadhyay, Diyi Yang, Manaal Faruqui

Training ELECTRA Augmented with Multi-word Selection
Jiaming Shen*, Jialu Liu, Tianqi Liu, Cong Yu, Jiawei Han

A Survey of Data Augmentation Approaches for NLP
Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, Eduard Hovy

RealFormer: Transformer Likes Residual Attention
Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie

Scaling Within Document Coreference to Long Texts
Raghuveer Thirukovalluru, Nicholas Monath, Kumar Shridhar, Manzil Zaheer, Mrinmaya Sachan, Andrew McCallum

MergeDistill: Merging Language Models using Pre-trained Distillation
Simran Khanuja, Melvin Johnson, Partha Talukdar

DoT: An Efficient Double Transformer for NLP tasks with Tables
Syrine Krichene, Thomas Müller*, Julian Martin Eisenschlos

How Reliable are Model Diagnostics?
Vamsi Aribandi, Yi Tay, Donald Metzler

Interactive Learning for Natural Language Processing
Organizers include: Filip Radlinski
Invited Panelist: Julia Kreutzer

6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Organizers include: Chris Dyer, Laura Rimell

Third Workshop on Gender Bias for Natural Language Processing
Organizers include: Kellie Webster

Benchmarking: Past, Present and Future
Invited Speaker: Eunsol Choi

SemEval-2021, 15th International Workshop on Semantic Evaluation
Organizers include: Natalie Schluter

Workshop on Online Abuse and Harms
Organizers include: Vinodkumar Prabhakaran

GEM: Natural Language Generation, Evaluation, and Metrics
Organizers include: Sebastian Gehrmann

Workshop on Natural Language Processing for Programming
Invited Speaker: Charles Sutton

IWPT 2021: The 17th International Conference on Parsing Technologies
Organizers include: Weiwei Sun

Recognizing Multimodal Entailment
Instructors include: Cesar Ilharco, Vaiva Imbrasaite, Ricardo Marino, Jannis Bulian, Chen Sun, Afsaneh Shirazi, Lucas Smaira, Cordelia Schmid

*  Work conducted while at Google. 


Mapping Africa’s Buildings with Satellite Imagery

An accurate record of building footprints is important for a range of applications, from population estimation and urban planning to humanitarian response and environmental science. After a disaster, such as a flood or an earthquake, authorities need to estimate how many households have been affected. Ideally there would be up-to-date census information for this, but in practice such records may be out of date or unavailable. Instead, data on the locations and density of buildings can be a valuable alternative source of information.

A good way to collect such data is through satellite imagery, which can map the distribution of buildings across the world, particularly in areas that are isolated or difficult to access. However, detecting buildings with computer vision methods in some environments can be a challenging task. Because satellite imaging involves photographing the earth from several hundred kilometres above the ground, even at high resolution (30–50 cm per pixel), a small building or tent shelter occupies only a few pixels. The task is even more difficult for informal settlements, or rural areas where buildings constructed with natural materials can visually blend into the surroundings. There are also many types of natural and artificial features that can be easily confused with buildings in overhead imagery.

Objects that can confuse computer vision models for building identification (clockwise from top left): pools, rocks, enclosure walls and shipping containers.

In “Continental-Scale Building Detection from High-Resolution Satellite Imagery”, we address these challenges, using new methods for detecting buildings that work in rural and urban settings across different terrains, such as savannah, desert, and forest, as well as informal settlements and refugee facilities. We use this building detection model to create the Open Buildings dataset, a new open-access data resource containing the locations and footprints of 516 million buildings with coverage across most of the African continent. The dataset will support several practical, scientific and humanitarian applications, ranging from disaster response or population mapping to planning services such as new medical facilities or studying human impact on the natural environment.

Model Development
We built a training dataset for the building detection model by manually labelling 1.75 million buildings in 100k images. The figure below shows some examples of how we labelled images in the training data, taking into account confounding characteristics of different areas across the African continent. In rural areas, for example, it was necessary to identify different types of dwelling places and to disambiguate them from natural features, while in urban areas we needed to develop labelling policies for dense and contiguous structures.

(1) Example of a compound containing both dwelling places as well as smaller outbuildings such as grain stores. (2) Example of a round, thatched-roof structure that can be difficult for a model to distinguish from trees, and where it is necessary to use cues from pathways, clearings and shadows to disambiguate. (3) Example of several contiguous buildings for which the boundaries cannot be easily distinguished.

We trained the model to detect buildings in a bottom-up way, first by classifying each pixel as building or non-building, and then grouping these pixels together into individual instances. The detection pipeline was based on the U-Net model, which is commonly used in satellite image analysis. One advantage of U-Net is that it is a relatively compact architecture, and so can be applied to large quantities of imaging data without a heavy compute burden. This is critical, because the final task of applying this to continental-scale satellite imagery means running the model on many billions of image tiles.

Example of segmenting buildings in satellite imagery. Left: Source image; Center: Semantic segmentation, with each pixel assigned a confidence score that it is a building vs. non-building; Right: Instance segmentation, obtained by thresholding and grouping together connected components.
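The thresholding and grouping step described above can be sketched in a few lines. This is a minimal illustration only, not the pipeline used for the dataset: the function name `instances_from_confidence`, the 0.5 threshold and the use of 4-connectivity are all assumptions made for the example.

```python
from collections import deque


def instances_from_confidence(conf, threshold=0.5):
    """Threshold a per-pixel confidence map, then group 4-connected pixels.

    Returns (labels, count): labels[r][c] is 0 for background or an
    instance id starting at 1. Threshold and connectivity are illustrative.
    """
    rows, cols = len(conf), len(conf[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if conf[r][c] < threshold or labels[r][c]:
                continue
            # Start a new instance and flood-fill its connected pixels.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and conf[ny][nx] >= threshold
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


confidence = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.6],
    [0.0, 0.1, 0.2, 0.9],
]
labels, count = instances_from_confidence(confidence)
print(count)
for row in labels:
    print(row)
```

In this toy map the three confident pixels at the top left form one instance and the two on the right edge form another.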

Initial experiments with the basic model had low precision and recall, for example due to the variety of natural and artificial features with building-like appearance. We found a number of methods that improved performance. One was the use of mixup as a regularisation method, where random training images are blended together by taking a weighted average. Though mixup was originally proposed for image classification, we modified it to be used for semantic segmentation. Regularisation is important in general for this building segmentation task, because even with 100k training images, the training data do not capture the full variation of terrain, atmospheric and lighting conditions that the model is presented with at test time, and hence, there is a tendency to overfit. This is mitigated by mixup as well as random augmentation of training images.
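As a rough illustration of the blending idea, the sketch below mixes two training examples with a single weight. It assumes that both the images and their per-pixel building masks are blended with the same weight, and that the weight is drawn from a Beta distribution as in the original mixup proposal; the post does not spell out how mixup was adapted for segmentation, so the function `mixup_pair` and the `alpha` value are hypothetical.

```python
import random


def blend(grid_a, grid_b, lam):
    """Pixel-wise weighted average of two equally sized 2-D grids."""
    return [[lam * a + (1.0 - lam) * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(grid_a, grid_b)]


def mixup_pair(image_a, mask_a, image_b, mask_b, alpha=0.4, rng=random):
    """Blend two (image, mask) examples with one shared mixing weight.

    Assumption: masks are blended like the images, giving soft targets.
    """
    lam = rng.betavariate(alpha, alpha)
    return blend(image_a, image_b, lam), blend(mask_a, mask_b, lam), lam


rng = random.Random(0)
image, mask, lam = mixup_pair(
    [[0.0, 1.0]], [[0, 1]],   # example A: pixel values and building mask
    [[1.0, 1.0]], [[1, 1]],   # example B
    rng=rng,
)
print(lam, image, mask)
```

The blended mask is no longer binary, so a loss that accepts soft per-pixel targets would be needed to train on it.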

Another method that we found to be effective was the use of unsupervised self-training. We prepared a set of 100 million satellite images from across Africa, and filtered these to a subset of 8.7 million images that mostly contained buildings. This dataset was used for self-training using the Noisy Student method, in which the output of the best building detection model from the previous stage is used as a ‘teacher’ to then train a ‘student’ model that makes similar predictions from augmented images. In practice, we found that this reduced false positives and sharpened the detection output. The student model gave higher confidence to buildings and lower confidence to background.

Difference in model output between the student and teacher models for a typical image. In panel (d), red areas are those that the student model finds more likely to be buildings than the teacher model, and blue areas more likely to be background.

One problem that we faced initially was that our model had a tendency to create “blobby” detections, without clearly delineated edges and with a tendency for neighbouring buildings to be merged together. To address this, we applied another idea from the original U-Net paper, which is to use distance weighting to adapt the loss function to emphasise the importance of making correct predictions near boundaries. During training, distance weighting places greater emphasis at the edges by adding weight to the loss — particularly where there are instances that nearly touch. For building detection, this encourages the model to correctly identify the gaps in between buildings, which is important so that many close structures are not merged together. We found that the original U-Net distance weighting formulation was helpful but slow to compute. So, we developed an alternative based on Gaussian convolution of edges, which was both faster and more effective.

Distance weighting schemes to emphasise nearby edges: U-Net (left) and Gaussian convolution of edges (right).
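To make the idea of edge-based loss weighting concrete, here is a small sketch that marks label edges, blurs them with a Gaussian kernel and turns the result into per-pixel loss weights. It is one possible reading of "Gaussian convolution of edges", not the formulation from the technical report: the edge definition, `sigma`, `radius`, `scale` and all function names are assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def edge_map(mask):
    """Mark pixels that have a 4-neighbour with a different label."""
    rows, cols = len(mask), len(mask[0])
    edges = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc] != mask[r][c]:
                    edges[r][c] = 1.0
                    break
    return edges


def blur_rows(grid, kernel):
    """Convolve each row with the kernel, zero-padding at the borders."""
    radius = len(kernel) // 2
    cols = len(grid[0])
    return [[sum(kernel[k + radius] * row[c + k]
                 for k in range(-radius, radius + 1) if 0 <= c + k < cols)
             for c in range(cols)]
            for row in grid]


def transpose(grid):
    return [list(col) for col in zip(*grid)]


def edge_weights(mask, sigma=1.0, radius=2, scale=5.0):
    """Per-pixel loss weights: 1 plus a scaled, blurred edge map."""
    kernel = gaussian_kernel(sigma, radius)
    blurred = blur_rows(edge_map(mask), kernel)               # horizontal pass
    blurred = transpose(blur_rows(transpose(blurred), kernel))  # vertical pass
    return [[1.0 + scale * v for v in row] for row in blurred]


# Two buildings separated by a one-pixel gap (column 2).
mask = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
]
weights = edge_weights(mask)
print([round(w, 2) for w in weights[1]])
```

In this example the gap between the two buildings receives the largest weight, because it is surrounded by edge pixels on both sides, which is the behaviour the paragraph above describes.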

Our technical report has more details on each of these methods.

We evaluated the performance of the model on several different regions across the continent, in different categories: urban, rural, and medium-density. In addition, with the goal of preparing for potential humanitarian applications, we tested the model on regions with displaced persons and refugee settlements. Precision and recall did vary between regions, so achieving consistent performance across the continent is an ongoing challenge.

Precision-recall curves, measured at 0.5 intersection-over-union threshold.
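For readers unfamiliar with the intersection-over-union criterion, the sketch below shows how one point on such a curve could be computed. It represents each building as a set of pixel coordinates and uses a simple greedy one-to-one matching; the helper names and the matching strategy are assumptions for illustration, and a full curve would repeat this while sweeping the confidence threshold.

```python
def iou(pixels_a, pixels_b):
    """Intersection-over-union of two instances given as pixel collections."""
    a, b = set(pixels_a), set(pixels_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def precision_recall(predicted, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to labelled instances."""
    unmatched = list(ground_truth)
    true_positives = 0
    for pred in predicted:
        best = max(unmatched, key=lambda gt: iou(pred, gt), default=None)
        if best is not None and iou(pred, best) >= iou_threshold:
            true_positives += 1
            unmatched.remove(best)  # each label can be matched only once
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


ground_truth = [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(5, 5), (5, 6)}]
predicted = [{(0, 0), (0, 1), (1, 0)}, {(9, 9)}]
print(iou(predicted[0], ground_truth[0]))
print(precision_recall(predicted, ground_truth))
```

Here the first prediction overlaps its label with an IoU of 0.75 and counts as a detection, while the second prediction matches nothing.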

When visually inspecting the detections for low-scoring regions, we noted various causes. In rural areas, label errors were problematic. For example, single buildings within a mostly-empty area can be difficult for labellers to spot. In urban areas, the model had a tendency to split large buildings into separate instances. The model also underperformed in desert terrain, where buildings were hard to distinguish against the background.

We carried out an ablation study to understand which methods contributed most to the final performance, measured in mean average precision (mAP). Distance weighting, mixup and the use of ImageNet pre-training were the biggest factors for the performance of the supervised learning baseline. The ablated models that did not use these methods had a mAP difference of -0.33, -0.12 and -0.07 respectively. Unsupervised self-training gave a further significant boost of +0.06 mAP.

Ablation study of training methods. The first row shows the mAP performance of the best model combined with self-training, and the second row shows the best model with supervised learning only (the baseline). By disabling each training optimization from the baseline in turn, we observe the impact on mAP test performance. Distance weighting has the most significant effect.

Generating the Open Buildings Dataset
To create the final dataset, we applied our best building detection model to satellite imagery across the African continent (8.6 billion image tiles covering 19.4 million km², 64% of the continent), which resulted in the detection of 516M distinct structures.

Each building’s outline was simplified as a polygon and associated with a Plus Code, which is a geographic identifier made up of numbers and letters, akin to a street address, and useful for identifying buildings in areas that don’t have formal addressing systems. We also include confidence scores and guidance on suggested thresholds to achieve particular precision levels.

The sizes of the structures vary as shown below, tending towards small footprints. The inclusion of small structures is important, for example, to support analyses of informal settlements or refugee facilities.

Distribution of building footprint sizes.

The data is freely available and we look forward to hearing how it is used. In the future, we may add new features and regions, depending on usage and feedback.

This work is part of our AI for Social Good efforts and was led by Google Research, Ghana. Thanks to the co-authors of this work: Wojciech Sirko, Sergii Kashubin, Marvin Ritter, Abigail Annkah, Yasser Salah Edine Bouchareb, Yann Dauphin, Daniel Keysers, Maxim Neumann and Moustapha Cisse. We are grateful to Abdoulaye Diack, Sean Askay, Ruth Alcantara and Francisco Moneo for help with coordination. Rob Litzke, Brian Shucker, Yan Mayster and Michelina Pallone provided valuable assistance with geo infrastructure.


Advances in TF-Ranking

In December 2018, we introduced TF-Ranking, an open-source TensorFlow-based library for developing scalable neural learning-to-rank (LTR) models, which are useful in settings where users expect to receive an ordered list of items in response to their query. LTR models — unlike standard classification models that classify one item at a time — receive an entire list of items as an input, and learn an ordering that maximizes the utility of the entire list. While search and recommendation systems are the most common applications of LTR models, since its release, we have seen TF-Ranking being applied in diverse domains beyond search, including e-commerce, SAT solvers, and smart city planning.

The goal of learning-to-rank (LTR) is to learn a function f() that takes as an input a list of items (documents, products, movies, etc.) and outputs the list of items in the optimal order (descending order of relevance). Here, green shade indicates item relevance level, and the red item marked with ‘x’ is non-relevant.
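A small sketch may help make "utility of the entire list" concrete. It sorts items by a model score and then evaluates the resulting order with NDCG, a widely used list-level metric; the post does not name a specific metric here, so the choice of NDCG and the helper names are assumptions.

```python
import math


def rank_by_score(items, scores):
    """Return items ordered by descending score."""
    ordered = sorted(zip(scores, items), key=lambda pair: pair[0], reverse=True)
    return [item for _, item in ordered]


def dcg(relevances):
    """Discounted cumulative gain of relevance labels in ranked order."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(relevances_in_ranked_order):
    """DCG normalised by the DCG of the ideal (descending) order."""
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0


relevance = {"a": 0, "b": 2, "c": 1}       # graded relevance labels
items = ["a", "b", "c"]
scores = [0.1, 0.9, 0.5]                   # hypothetical outputs of f()

ranked = rank_by_score(items, scores)
print(ranked)
print(ndcg([relevance[item] for item in ranked]))
print(ndcg([relevance[item] for item in items]))
```

The metric depends on the order of the whole list rather than on any single item, which is why LTR losses are defined over lists.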

In May 2021, we published a major release of TF-Ranking that enables full support for natively building LTR models using Keras, a high-level API of TensorFlow 2. Our native Keras ranking model has a brand-new workflow design, including a flexible ModelBuilder, a DatasetBuilder to set up training data, and a Pipeline to train the model with the provided dataset. These components make building a customized LTR model easier than ever, and facilitate rapid exploration of new model structures for production and research. If RaggedTensors are your tool of choice, TF-Ranking now works with them as well. In addition, our most recent release, which incorporates the Orbit training library, contains a long list of advances, the culmination of two and a half years of neural LTR research. Below we share a few of the key improvements available in the latest TF-Ranking version.

Workflow to build and train a native Keras ranking model. Blue modules are provided by TF-Ranking, and green modules are customizable.

Learning-to-Rank with TFR-BERT
Recently, pretrained language models like BERT have achieved state-of-the-art performance on various language understanding tasks. To capture the expressiveness of these models, TF-Ranking implements a novel TFR-BERT architecture that couples BERT with the power of LTR to optimize the ordering of list inputs. As an example, consider a query and a list of n documents that one might like to rank in response to this query. Instead of learning an independent BERT representation for each <query, document> pair, LTR models apply a ranking loss to jointly learn a BERT representation that maximizes the utility of the entire ranked list with respect to the ground-truth labels.

The figure below illustrates this process. First, we flatten a list of n documents to rank in response to a query into a list of <query, document> tuples. These tuples are fed into a pre-trained language model (e.g., BERT). The pooled BERT outputs for the entire document list are then jointly fine-tuned with one of the specialized ranking losses available in TF-Ranking. Our experience shows that this TFR-BERT architecture delivers significant improvements in pretrained language model performance, leading to state-of-the-art performance for several popular ranking tasks, especially when multiple pretrained language models are ensembled. Our users can now get started with TFR-BERT using this simple example.

An illustration of the TFR-BERT architecture, in which a joint LTR model over a list of n documents is constructed using BERT representations of individual <query, document> pairs.
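The flatten-score-and-apply-a-list-loss pattern can be sketched without any deep learning library. The example below is a toy stand-in, not TFR-BERT or the TF-Ranking API: `toy_score` is a hypothetical word-overlap scorer used in place of a pooled BERT output, and the loss is a generic softmax cross-entropy over one list, chosen here as a simple example of a listwise loss.

```python
import math


def flatten_pairs(query, documents):
    """Turn one query and n documents into n (query, document) tuples."""
    return [(query, doc) for doc in documents]


def toy_score(query, document):
    # Hypothetical stand-in for a language model plus scoring head:
    # counts the words shared by the query and the document.
    return float(len(set(query.lower().split()) & set(document.lower().split())))


def softmax_listwise_loss(scores, labels):
    """Cross-entropy between softmax(scores) and normalised labels for one list."""
    label_sum = sum(labels)
    if label_sum == 0:
        return 0.0  # no relevant item in the list, nothing to learn from
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return -sum((y / label_sum) * (s - log_norm) for s, y in zip(scores, labels))


query = "normandy location"
documents = ["Normandy is a region in France", "BERT is a language model"]
labels = [1, 0]

pairs = flatten_pairs(query, documents)
scores = [toy_score(q, d) for q, d in pairs]
print(scores)
print(softmax_listwise_loss(scores, labels))
```

Because the loss normalises over all scores in the list, raising the score of a non-relevant document increases the loss even if the relevant document's score is unchanged. That coupling is what makes the training signal listwise rather than per-pair.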

Interpretable Learning-to-Rank
Transparency and interpretability are important factors in deploying LTR models in ranking systems that can be involved in determining the outcomes of processes such as loan eligibility assessment, advertisement targeting, or guiding medical treatment decisions. In such cases, the contribution of each individual feature to the final ranking should be examinable and understandable to ensure transparency, accountability and fairness of the outcomes.

One possible way to achieve this is using generalized additive models (GAMs) — intrinsically interpretable machine learning models that are linearly composed of smooth functions of individual features. However, while GAMs have been extensively studied on regression and classification tasks, it is less clear how to apply them in a ranking setting. For instance, while GAMs can be straightforwardly applied to model each individual item in the list, modeling both item interactions and the context in which these items are ranked is a more challenging research problem. To this end, we have developed a neural ranking GAM — an extension of generalized additive models to ranking problems.

Unlike standard GAMs, a neural ranking GAM can take into account both the features of the ranked items and the context features (e.g., query or user profile) to derive an interpretable, compact model. This ensures that the contributions of both the item-level features and the context features are interpretable. For example, in the figure below, using a neural ranking GAM makes visible how distance, price, and relevance, in the context of a given user device, contribute to the final ranking of the hotel. Neural ranking GAMs are now available as a part of TF-Ranking.

An example of applying neural ranking GAM for local search. For each input feature (e.g., price, distance), a sub-model produces a sub-score that can be examined, providing transparency. Context features (e.g., user device type) can be utilized to derive importance weights of submodels.
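The additive structure described above can be illustrated with a toy scorer. In this sketch each feature has its own sub-model, and the context (device type) produces importance weights that scale the sub-scores. Everything here is hypothetical: in a real neural ranking GAM the sub-models and the weighting function are learned small networks, whereas the hand-written functions and logits below exist only to show how the pieces combine.

```python
import math

# Hypothetical per-feature sub-models (learned networks in practice).
SUB_MODELS = {
    "price": lambda usd: -math.log1p(usd),
    "distance": lambda km: -km,
    "relevance": lambda rel: 2.0 * rel,
}

# Hypothetical context logits, one set per device type.
CONTEXT_LOGITS = {
    "mobile": {"price": 0.0, "distance": 1.0, "relevance": 0.0},
    "desktop": {"price": 1.0, "distance": 0.0, "relevance": 0.0},
}


def importance_weights(device):
    """Softmax over the context logits: one weight per sub-model."""
    logits = CONTEXT_LOGITS[device]
    norm = sum(math.exp(v) for v in logits.values())
    return {name: math.exp(v) / norm for name, v in logits.items()}


def gam_score(item, device):
    """Return the final score and each feature's inspectable contribution."""
    weights = importance_weights(device)
    contributions = {name: weights[name] * SUB_MODELS[name](item[name])
                     for name in SUB_MODELS}
    return sum(contributions.values()), contributions


hotel = {"price": 80.0, "distance": 5.0, "relevance": 0.9}
for device in ("mobile", "desktop"):
    score, parts = gam_score(hotel, device)
    print(device, round(score, 3), {k: round(v, 3) for k, v in parts.items()})
```

Since the final score is a plain sum, each entry in `contributions` can be read directly as that feature's effect on the ranking for the given context. With these made-up logits, distance counts for more on mobile than on desktop.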

Neural Ranking or Gradient Boosting?
While neural models have achieved state-of-the-art performance in multiple domains, specialized gradient boosted decision trees (GBDTs) like LambdaMART remained the baseline to beat in a variety of open LTR datasets. The success of GBDTs in open datasets is due to several reasons. First, because these datasets are relatively small, neural models are prone to overfitting on them. Second, since GBDTs partition their input feature space using decision trees, they are naturally more resilient to variations in numerical scales in ranking data, which often contain features with Zipfian or otherwise skewed distributions. However, GBDTs do have their limitations in more realistic ranking scenarios, which often combine both textual and numerical features. For instance, GBDTs cannot be directly applied to large discrete feature spaces, such as raw document text. They are also, in general, less scalable than neural ranking models.

Therefore, since the TF-Ranking release, our team has significantly deepened the understanding of how best to leverage neural models in ranking with numerical features. This culminated in a Data Augmented Self-Attentive Latent Cross (DASALC) model, described in an ICLR 2021 paper, which is the first to establish parity, and in some cases statistically significant improvements, of neural ranking models over strong LambdaMART baselines on open LTR datasets. This achievement is made possible through a combination of techniques, which include data augmentation, neural feature transformation, self-attention for modeling document interactions, listwise ranking loss, and model ensembling similar to boosting in GBDTs. The architecture of the DASALC model was entirely implemented using the TF-Ranking library.

All in all, we believe that the new Keras-based TF-Ranking version will make it easier to conduct neural LTR research and deploy production-grade ranking systems. We encourage everyone to try out the latest version and follow this introductory example for a hands-on experience. While we are very excited about this new release, our research and development journey is far from over, so we will continue to advance our understanding of learning-to-rank problems and share these advances with our users.

This project was only possible thanks to the current and past members of the TF-Ranking team: Honglei Zhuang, Le Yan, Rama Pasumarthi, Rolf Jagerman, Zhen Qin, Shuguang Han, Sebastian Bruch, Nathan Cordeiro, Marc Najork and Patrick McGregor. We also extend special thanks to our collaborators from the TensorFlow team: Zhenyu Tan, Goldie Gadde, Rick Chao, Yuefeng Zhou, Hongkun Yu, and Jing Li.


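함께 자라기 (Growing Together): Can We Grow Together?

We live in an era in which collaboration matters more and more. Domains and technologies are each becoming ever more specialized and sophisticated, so it is becoming nearly impossible for one person to know all of it. That is why a great team now produces better results than a single genius.


Source: pixabay

Communication skills are also weighed heavily in interviews. You have probably often come across questions such as ‘What did you do when you ran into difficulties collaborating with a teammate?’ Personally, I like the question ‘While working as a team, what have you tried in order to help every team member grow?’ If individual growth is linear, team growth can be seen as exponential.

The author of the book introduced here also hopes that, as you read it, your thinking will move on to questions such as these:

  • Can we really grow together?
  • Can we really grow together every single day?

함께 자라기: The Road to Agile


Source: Aladin, ‘함께 자라기’

The book this time is ≪함께 자라기≫ (Growing Together) by Kim Chang-jun (김창준), who is known for Agile Consulting. It compiles the many pieces on effective learning and collaboration that he had been sharing on his blog, Facebook and elsewhere. One of its distinguishing features is that it looks at growth and collaboration more concretely and analytically, drawing on research and papers.

Let's look at the contents in a little more detail. Chapter 1, ‘Growing’ (자라기), covers a range of topics on the theme of growth.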


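I believe that systems and processes are important. Hiring the right people matters above all, but an organizational system in which those people can fully exercise their abilities is just as important.

An organization should do its utmost to support individuals in further developing and managing their expertise. That is the win-win path. There is something else that matters even more than training people well after hiring them and helping them grow: the system. No matter how outstanding the person you hire, if the organization's system and culture are flawed, that person is likely to be buried; conversely, even someone of ordinary skill can produce outstanding results inside a good system.

  • From ‘What Matters More Than Hiring Well’

Process and system correspond to the B and C levels in Douglas's words below. By improving at a level, or a dimension, higher in this way, an organization can keep developing. In any work, you always need to think about what to focus on and when. For example, at a startup it is important to get A work done quickly, whereas at a large company you should focus on B work, that is, improving the process, so that you can scale faster.

Douglas divides work into three levels: A, B and C work.
A work is doing what the organization is originally supposed to do.
B work is improving A work: improving time and quality in the cycle of building the product.
C work is improving B work: improving the time and quality of the improvement cycle itself. … In short, it means improving the ability to improve.
Douglas says, “The better we get at getting better, the better and faster we will get at getting better.”

  • From ‘The Secret of Compound Interest’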

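Deliberate Practice


Source: 함께 자라기, from ‘Breaking Out of Running in Place’

Deliberate practice is one of the fastest ways to learn, because it is tuned to your own skill level. As in the figure above, the idea is to match ‘task difficulty’ and ‘skill’ at a similar level so that you can become immersed in the work. If a task is too easy, you can make it harder by setting yourself quests; if it is too hard, the book suggests getting help from those around you or lowering the difficulty by approaching the problem structurally.

For practice to be deliberate, my skill and the difficulty of the task need to be similar. This is consistent with Mihaly Csikszentmihalyi's flow theory (what matters is not which activity you do; whatever you do, doing it in a state of flow raised satisfaction), … The part we should pay attention to is region C, where difficulty and skill are roughly matched. Mihaly says that this is where humans experience flow. It is at exactly this point that people show their highest level of concentration, and thanks to that, performance and learning ability can reach their maximum. He also discovered the interesting fact that people experience their highest level of happiness at that point. The linguist Krashen makes a similar point through his input hypothesis. Known as the i+1 theory, it holds that if a learner's current language level is i, language ability progresses meaningfully only when the learner is given input at exactly one level higher, i+1.

  • From ‘The Prerequisite for Deliberate Practice: Appropriate Difficulty’

Next, Chapter 2, ‘Together’ (함께), covers a variety of topics on collaboration.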

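Psychological Safety

Among the characteristics of successful teams, the one said to matter most is ‘psychological safety’. There is even a book, ≪두려움 없는 조직≫ (The Fearless Organization), that discusses this single topic from many angles. In a way it seems obvious, yet establishing psychological safety within a team is correspondingly hard.

Google, being the data-driven company that it is, followed Project Oxygen, which used data to find the characteristics of outstanding managers, with a two-year effort to find the characteristics of outstanding teams. It was named Project Aristotle.

  1. How team members interact with one another and how they view their own work mattered far more than who is on the team (experts, introverts/extroverts, intelligence and so on).
  2. Five characteristics of successful teams were found, and among them the variable with overwhelmingly high predictive power was the team's psychological safety.
  3. Psychological safety could be improved through specially designed activities such as team discussions.
    • From ‘The Secret of Outstanding Teams, as Revealed by Google’

Psychological safety is usually said to be grounded in organizational culture. Within organizational culture, it is tied especially to ‘transparency’. As in the example below, this means disclosing mistakes transparently so that everyone can move in a better direction. Beyond that, when information flows transparently within a company, trust forms between people, and this trust leads directly to psychological safety.

Michael Frese studied error culture in companies. According to him, there are broadly two kinds of error culture: error prevention and error management. Error prevention tries to block the path from action to error. In other words, it demands that people not make mistakes. But in fact this is close to impossible. Even experts are said to make an average of three to five mistakes an hour. … In an error prevention culture, people who make mistakes are blamed and punished, so people hide mistakes, are reluctant to discuss them, and cooperate less when problems arise. They will not learn from mistakes. Conversely, in an error management culture, people help one another recover quickly before a mistake produces a bad outcome, mistakes are disclosed, and an atmosphere forms in which people talk about mistakes and learn from them.
This part is extremely important. Looking at the history of error research, early on it looked only at the technical side, then at the human side (for example, that 80% ultimately comes down to human error), and now it talks about the cultural side. What we call psychological safety is part of this culture.

  • From ‘Two Kinds of Error Culture’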


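Next is pair programming, which developers do a lot. Even though I had done it many times, I never really understood why it was effective, and this was one of the things that clicked for me while reading the book. Even short of pair programming, we sometimes find a good solution on our own while explaining a problem to someone, and I suspect this too is because the act of explaining forces us to abstract, which deepens our own understanding.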

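Pair programming is two people programming together on one computer. The more you think about it, the more ingenious the arrangement is. Having two people raises the level of abstraction through conversation. Having one computer forces verification through concretization. … (inferring by analogy) frequently alternate. And in between, the “aha” bursts out. … If you want to raise the level of abstraction of the code you write, do not agonize over it alone. Cooperate and talk with other people. Draw diagrams together and edit source code together. Humans have an amazing ability to communicate and cooperate with other humans. Conversation is a miracle.

  • From “Conversational Programming”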

Adopting a New Methodology

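Many of you have probably had this experience: while working together, someone introduces a new framework, a methodology such as Agile, or a new tool. Sometimes the adoption goes smoothly, and sometimes you run into objections you did not expect. I do not know what the best approach is, but I do know that you need to talk with your colleagues enough to understand their needs. Rather than explaining why the tool is good, how about finding out what your colleagues think?

And if, through these conversations, you can act as an intermediary, you can go beyond simply trying to introduce a tool, properly understand what the team actually needs, and propose a better approach.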

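It is an illusion to think that spreading new ideas will be easy once you are a team lead. … Some of these people are already collecting objective figures of their own. When I meet them, I ask the following questions: “How well do you understand the other person? How much have you talked with them?” Nine times out of ten, the answer is, “I haven't really talked with them much.” If so, you have to assume that the odds of persuading them will remain low.

  • From “The Subjectivity of Objectivity”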

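The more complex the field, the more the therapist effect will matter compared with the effect of any particular technique. So what should we do? We should find the supershrinks, study them, and develop them. … When it comes to software development methodology and starting a new project, isn't the question of who participates overwhelmingly more important than the question of which methodology we use? What do you think? Here is what I think. For example, a team lead who wants to introduce an Agile methodology should first ask, “What kind of team lead am I?”

  • From “Why New Methodologies Don't Work in Your Organization”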

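Next comes a discussion of whether a team made up entirely of experts is really the most effective. When their fields do not overlap, experts can trust each other's expertise and each produce their best results. But when experts from similar fields work together, they have to move from working individually to collaborating. In that situation there seem to be inevitable moments when productivity drops, because collaboration takes practice.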

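What about all-stars in a company? According to research by Groysberg and colleagues, each time one of these stars is added to a team, the additional gain in team performance shows diminishing marginal returns, and beyond a certain level the effect turns negative (that is, it eats into the performance of the whole team). … This tendency to eat into performance was especially pronounced when the experts' areas of expertise were similar. The study cites the experts' egos as one of the causes.

  • From “Why Expert Teams Fail”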


Finally, Chapter 3 briefly covers Agile. In fact, Chapters 1 and 2 simply avoided the word “Agile”; the topics they covered already fall under it.

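In my own work so far, as in the example below, “customer involvement” has been the most important factor of all. Customer involvement can take many forms: the customer may help right beside you, you may receive feedback through CS, or you may conduct interviews. A keen eye for what customers want is truly rare, so it is important to discover needs through customer involvement and develop quickly.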

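For low-maturity organizations (maturity 4 or below): customer involvement (0.94). There is exactly one statistically significant practice. Customer involvement. And its contribution is 0.94, higher than when we looked at all organizations earlier. It is almost 1. This means that even when maturity is low, doing customer involvement well raises project success by one notch. … Now look at high-maturity organizations. Short iterative development cycles come first. Their contribution is higher than that of customer involvement. This means that in high-maturity organizations, short iterative development cycles may help success more than customer involvement does. It may also mean that short iterative development cycles can compensate to some extent when customer involvement is not going well.

  • From “If Maturity Is Low, Customer Involvement Is a Must”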



Source: 존잡생각 Ep.18, “How to Grow Yourself Quickly at Work – People Scaling”

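While writing this post and thinking about collaboration, I was reminded of something covered on 존잡생각, the YouTube channel of Sendbird CEO 김동선, which I have been watching a lot lately. I think that sentence is the key element when it comes to collaboration. Weaknesses that cause problems should be fixed, but fundamentally the idea is to build on each individual's strengths so that the team's combined result is as large as possible.

I hope we can all grow together in this way, in a direction where the team grows!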


Applying Advanced Speech Enhancement in Cochlear Implants

For the ~466 million people in the world who are deaf or hard of hearing, the lack of easy access to accessibility services can be a barrier to participating in spoken conversations encountered daily. While hearing aids can help alleviate this, simply amplifying sound is insufficient for many. One additional option that may be available is the cochlear implant (CI), which is an electronic device that is surgically inserted into a part of the inner ear, called the cochlea, and stimulates the auditory nerve electrically via external sound processors. While many individuals with these cochlear implants can learn to interpret these electrical stimulations as audible speech, the listening experience can be quite varied and particularly challenging in noisy environments.

Modern cochlear implants drive electrodes with pulsatile signals (i.e., discrete stimulation pulses) that are computed by external sound processors. The main challenge still facing the CI field is how to best process sounds — to convert sounds to pulses on electrodes — in a way that makes them more intelligible to users. Recently, to stimulate progress on this problem, scientists in industry and academia organized a CI Hackathon to open the problem up to a wider range of ideas.

In this post, we share exploratory research demonstrating that a speech enhancement preprocessor — specifically, a noise suppressor — can be used at the input of a CI’s processor to enhance users’ understanding of speech in noisy environments. We also discuss how we built on this work in our entry for the CI Hackathon and how we will continue developing this work.

Improving CIs with Noise Suppression
In 2019, a small internal project demonstrated the benefits of noise suppression at the input of a CI’s processor. In this project, participants listened to 60 pre-recorded and pre-processed audio samples and ranked them by their listening comfort. CI users listened to the audio using their devices’ existing strategy for generating electrical pulses.

Audio without background noise
Audio with background noise
Audio with background noise + noise suppression

Background audio clip from “IMG_0991.MOV” by Kenny MacCarthy, license: CC-BY 2.0.

As shown below, both listening comfort and intelligibility usually increased, sometimes dramatically, when speech with noise (the lightest bar) was processed with noise suppression.

CI users in an early research study have improved listening comfort — qualitatively scored from “very poor” (0.0) to “OK” (0.5) to “very good” (1.0) — and speech intelligibility (i.e., the fraction of words in a sentence correctly transcribed) when trying to listen to noisy audio samples of speech with noise suppression applied.

For the CI Hackathon, we built on the project above, continuing to use a noise suppressor while additionally exploring an approach to compute the pulses.

Overview of the Processing Approach
The hackathon considered a CI with 16 electrodes. Our approach decomposes the audio into 16 overlapping frequency bands, corresponding to the positions of the electrodes in the cochlea. Next, because the dynamic range of sound easily spans multiple orders of magnitude more than what we expect the electrodes to represent, we aggressively compress the dynamic range of the signal by applying “per-channel energy normalization” (PCEN). Finally, the range-compressed signals are used to create the electrodogram (i.e., what the CI displays on the electrodes).
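To make the range-compression step more concrete, here is a minimal sketch of PCEN in its commonly published form: each channel's energy is divided by a smoothed version of itself and then passed through a root compression. The parameter values, the function name `pcen`, and the assumption that band energies arrive as a (frames, channels) array are illustrative choices, not the settings used in the hackathon entry.

```python
import numpy as np

def pcen(energy, s=0.04, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalization of a (frames, channels) energy array.

    All parameter defaults are illustrative assumptions.
    """
    energy = np.asarray(energy, dtype=float)
    smoothed = np.empty_like(energy)
    m = energy[0]  # start the smoother at the first frame
    for t in range(energy.shape[0]):
        # First-order low-pass filter over time, applied per channel.
        m = (1.0 - s) * m + s * energy[t]
        smoothed[t] = m
    # Adaptive gain control: divide by the smoothed energy.
    gain_normalized = energy / (eps + smoothed) ** alpha
    # Root compression, offset so that zero energy maps to zero.
    return (gain_normalized + delta) ** r - delta ** r

# Two steady inputs that differ by six orders of magnitude in energy.
quiet = pcen(np.full((50, 16), 1e-3))
loud = pcen(np.full((50, 16), 1e3))
print(loud[-1, 0] / quiet[-1, 0])  # a ratio close to 1 rather than 1e6
```

With these example settings, steady inputs of very different loudness land in a narrow output range, which is the property that makes the signal easier to map onto the limited range of an electrode.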

In addition, the hackathon required a submission be evaluated in multiple audio categories, including music, which is an important but notoriously difficult category of sounds for CI users to enjoy. However, the speech enhancement network was trained to suppress non-speech sounds, including both noise and music, so we needed to take extra measures to avoid suppressing instrumental music (note that in general, music suppression might be preferred by some users in certain contexts). To do this, we created a “mix” of the original audio with the noise-suppressed audio so that enough of the music would pass through to remain audible. We varied the fraction of original audio in the mix in real time from 0% to 40% (0% if all of the input is estimated to be speech, up to 40% as more of the input is estimated to be non-speech). The speech versus non-speech estimate came from the open-source YAMNet classifier, evaluated on every ~1 second window of audio.
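The mixing rule can be sketched as follows. The post only gives the endpoints (0% original audio for pure speech, up to 40% for non-speech), so the linear mapping in between, the function names, and the choice to scale the enhanced audio by the complementary weight are all assumptions. The speech probability is taken as an input here; in the entry it came from YAMNet, which this sketch does not call.

```python
import numpy as np

MAX_ORIGINAL_FRACTION = 0.4  # from the post: at most 40% original audio

def original_fraction(speech_probability):
    """Map a speech probability in [0, 1] to the fraction of original audio.

    Assumed linear: 0.0 for pure speech, 0.4 for pure non-speech.
    """
    p = float(np.clip(speech_probability, 0.0, 1.0))
    return MAX_ORIGINAL_FRACTION * (1.0 - p)

def mix_window(original, enhanced, speech_probability):
    """Blend one window of original audio with noise-suppressed audio."""
    w = original_fraction(speech_probability)
    return w * np.asarray(original) + (1.0 - w) * np.asarray(enhanced)

# A window judged to be half speech keeps 20% of the original audio.
window = mix_window(np.ones(4), np.zeros(4), speech_probability=0.5)
print(window)
```

In a real-time system the weight would likely need smoothing between windows to avoid audible jumps; that detail is left out here.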

The Conv-TasNet Speech Enhancement Model
To implement a speech enhancement module that suppresses non-speech sounds, such as noise and music, we use the Conv-TasNet model, which can separate different kinds of sounds. To start, the raw audio waveforms are transformed and processed into a form that can be used by a neural network. The model transforms short, 2.5 millisecond frames of input audio with a learnable analysis transform to generate features optimized for sound separation. The network then produces two “masks” from those features: one mask for speech and one mask for noise. These masks indicate the degree to which each feature corresponds to either speech or noise. Separated speech and noise are reconstructed back to the audio domain by multiplying the masks with the analysis features, applying a synthesis transform back to audio-domain frames, and stitching the resulting short frames together. As a final step, the speech and noise estimates are processed by a mixture consistency layer, which improves the quality of the estimated waveforms by ensuring that they sum up to the original input mixture waveform.

Block diagram of the speech enhancement system, which is based on Conv-TasNet.
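The mixture consistency step can be illustrated with a small sketch. It uses the simplest form of the projection, in which the difference between the input mixture and the sum of the estimates is split equally among the sources. Which variant the model actually uses is not stated in the post, so the equal split and the function name are assumptions.

```python
import numpy as np

def mixture_consistency(estimates, mixture):
    """Adjust source estimates (sources, samples) so they sum to the mixture.

    Assumes an equal share of the residual is added to each source.
    """
    estimates = np.asarray(estimates, dtype=float)
    mixture = np.asarray(mixture, dtype=float)
    # Whatever the estimates fail to account for in the input mixture.
    residual = mixture - estimates.sum(axis=0)
    return estimates + residual / estimates.shape[0]

speech = np.array([1.0, 2.0, 3.0])
noise = np.array([0.5, 0.5, 0.5])
mixture = np.array([2.0, 2.0, 4.0])
fixed = mixture_consistency(np.stack([speech, noise]), mixture)
print(fixed.sum(axis=0))  # matches the mixture
```

After the projection, the speech and noise estimates add up exactly to the input waveform, so no energy from the input is lost or invented by the separation.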

The model is both causal and low latency: for each 2.5 milliseconds of input audio, the model produces estimates of separated speech and noise, and thus could be used in real-time. For the hackathon, we chose to use a model variant with 2.9 million parameters. This model size is too large to be practically implemented in a CI today, but it demonstrates what kind of performance would be possible with more capable hardware in the future.

Listening to the Results
As we optimized our models and overall solution, we used the hackathon-provided vocoder (which required a fixed temporal spacing of electrical pulses) to produce audio simulating what CI users might perceive. We then conducted blind A-B listening tests as typical hearing users.

Listening to the vocoder simulations below, the speech in the reconstructed sounds (produced by the vocoder processing the electrodograms) is reasonably intelligible when the input sound doesn’t contain too much background noise. However, there is still room to improve the clarity of the speech. Our submission performed well in the speech-in-noise category and achieved second place overall.

Simulated audio with fixed temporal spacing
Vocoder simulation of what CI users might perceive from audio from an electrodogram with fixed temporal spacing, with background noise and noise suppression applied.

A bottleneck on quality is that the fixed temporal spacing of stimulation pulses sacrifices fine time structure in the audio. Changing the processing to produce pulses timed to peaks in the filtered sound waveforms captures more information about the pitch and structure of sound than is conventionally represented in implant stimulation patterns.
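As a rough illustration of peak-timed pulses, the sketch below finds the positive local maxima of one band-filtered waveform and reports their times and amplitudes. It is a simplified stand-in under stated assumptions: the function name is hypothetical, the band signal is assumed to be already filtered, and real implant constraints such as pulse rate limits and interleaving across electrodes are not modeled.

```python
import numpy as np

def peak_pulse_times(band_signal, sample_rate):
    """Return times (seconds) and amplitudes of positive local maxima."""
    x = np.asarray(band_signal, dtype=float)
    # A sample is a peak if it rises above its left neighbor, does not
    # fall below its right neighbor, and is positive.
    is_peak = (x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]) & (x[1:-1] > 0)
    idx = np.flatnonzero(is_peak) + 1
    return idx / sample_rate, x[idx]

# A 100 Hz tone sampled at 16 kHz for 50 ms has one peak per 10 ms period.
sample_rate = 16000
t = np.arange(800) / sample_rate
times, amplitudes = peak_pulse_times(np.sin(2 * np.pi * 100 * t), sample_rate)
print(times)
```

Because the pulse times follow the waveform's own period, the spacing between pulses carries pitch information that a fixed pulse grid would discard.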

Simulated audio with adaptive spacing and fine time structure
Vocoder simulation, using the same vocoder as above, but on an electrodogram from the modified processing that synchronizes stimulation pulses to peaks of the sound waveform.

It’s important to note that this second vocoder output is overly optimistic about how well it might sound to a real CI user. For instance, the simple vocoder used here does not model how current spread in the cochlea blurs the stimulus, making it harder to resolve different frequencies. But this at least suggests that preserving fine time structure is valuable and that the electrodogram itself is not the bottleneck.

Ideally, all processing approaches would be evaluated by a broad range of CI users, with the electrodograms implemented directly on their CIs rather than relying upon vocoder simulations.

Conclusion and a Call to Collaborate
We are planning to follow up on this experience in two main directions. First, we plan to explore the application of noise suppression to other hearing-accessibility modalities, including hearing aids, transcription, and vibrotactile sensory substitution. Second, we’ll take a deeper dive into the creation of electrodogram patterns for cochlear implants, exploiting fine temporal structure that is not accommodated in the usual CIS (continuous interleaved sampling) patterns that are standard in the industry. According to Louizou: “It remains a puzzle how some single-channel patients can perform so well given the limited spectral information they receive”. Therefore, using fine temporal structure might be a critical step towards achieving an improved CI experience.

Google is committed to building technology with and for people with disabilities. If you are interested in collaborating to improve the state of the art in cochlear implants (or hearing aids), please reach out to

We would like to thank the Cochlear Impact hackathon organizers for giving us this opportunity and partnering with us. The participating team within Google is Samuel J. Yang, Scott Wisdom, Pascal Getreuer, Chet Gnegy, Mihajlo Velimirović, Sagar Savla, and Richard F. Lyon with guidance from Dan Ellis and Manoj Plakal.