Posted by Arsha Nagrani and Paul Hongsuck Seo, Research Scientists, Google Research
Automatic speech recognition (ASR) is a well-established technology that is widely adopted for various applications such as conference calls, streamed video transcription and voice commands. While the challenges for this technology are centered around noisy audio inputs, the visual stream in multimodal videos (e.g., TV, online edited videos) can provide strong cues for improving the robustness of ASR systems — this is called audiovisual ASR (AV-ASR).
Although lip motion can provide strong signals for speech recognition and is the most common area of focus for AV-ASR, the mouth is often not directly visible in videos in the wild (e.g., due to egocentric viewpoints, face coverings, and low resolution). This has motivated an emerging area of research, unconstrained AV-ASR (e.g., AVATAR), which investigates the contribution of entire visual frames rather than just the mouth region.
Building audiovisual datasets for training AV-ASR models, however, is challenging. Datasets such as How2 and VisSpeech have been created from instructional videos online, but they are small in size. The models themselves, in contrast, are typically large and consist of both visual and audio encoders, so they tend to overfit on these small datasets. Meanwhile, a number of recently released large-scale audio-only models are heavily optimized via large-scale training on massive audio-only data obtained from audio books, such as LibriLight and LibriSpeech. These models contain billions of parameters, are readily available, and show strong generalization across domains.
With the above challenges in mind, in “AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR”, we present a simple method for augmenting existing large-scale audio-only models with visual information while performing lightweight domain adaptation. AVFormer injects visual embeddings into a frozen ASR model (similar to how Flamingo injects visual information into large language models for vision-text tasks) using lightweight trainable adaptors that can be trained on a small amount of weakly labeled video data with minimal additional training time and parameters. We also introduce a simple curriculum scheme during training, which we show is crucial to enable the model to jointly process audio and visual information effectively. The resulting AVFormer model achieves state-of-the-art zero-shot performance on three AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (i.e., LibriSpeech).
Unconstrained audiovisual speech recognition. We inject vision into a frozen speech model (BEST-RQ, in grey) for zero-shot audiovisual ASR via lightweight modules to create a parameter- and data-efficient model called AVFormer (blue). The visual context can provide helpful clues for robust speech recognition especially when the audio signal is noisy (the visual loaf of bread helps correct the audio-only mistake “clove” to “loaf” in the generated transcript).
Injecting vision using lightweight modules
Our goal is to add visual understanding capabilities to an existing audio-only ASR model while maintaining its generalization performance to various domains (both AV and audio-only domains).
To achieve this, we augment an existing state-of-the-art ASR model (Best-RQ) with two components: (i) a linear visual projector and (ii) lightweight adapters. The former projects visual features into the audio token embedding space, allowing the model to connect the separately pre-trained visual features with the audio input token representations. The latter minimally modifies the model so it can understand multimodal inputs from videos. We train these additional modules on unlabeled web videos from the HowTo100M dataset, using the outputs of an ASR model as pseudo ground truth, while keeping the rest of the Best-RQ model frozen. Such lightweight modules enable data efficiency and strong generalization of performance.
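As a rough sense of scale for these modules, here is a minimal PyTorch-style sketch; the dimensions, module names, and the exact insertion point of the adapters are illustrative assumptions rather than the actual AVFormer implementation.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Single linear layer mapping frozen visual features into the audio token space (illustrative)."""
    def __init__(self, visual_dim: int = 768, audio_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(visual_dim, audio_dim)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_visual_tokens, visual_dim)
        return self.proj(visual_feats)

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted into each frozen encoder block (illustrative)."""
    def __init__(self, dim: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def build_multimodal_input(audio_tokens: torch.Tensor,
                           visual_feats: torch.Tensor,
                           projector: VisualProjector) -> torch.Tensor:
    """Prepend projected visual tokens to the audio tokens fed to the frozen encoder."""
    return torch.cat([projector(visual_feats), audio_tokens], dim=1)
```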
We evaluated our extended model on AV-ASR benchmarks in a zero-shot setting, where the model is never trained on a manually annotated AV-ASR dataset.
Curriculum learning for vision injection
After the initial evaluation, we discovered empirically that with a naïve single round of joint training, the model struggles to learn both the adapters and the visual projectors in one go. To mitigate this issue, we introduced a two-phase curriculum learning strategy that decouples these two factors — domain adaptation and visual feature integration — and trains the network in a sequential manner. In the first phase, the adapter parameters are optimized without feeding visual tokens at all. Once the adapters are trained, we add the visual tokens and train the visual projection layers alone in the second phase while the trained adapters are kept frozen.
The first stage focuses on audio domain adaptation. By the second phase, the adapters are completely frozen and the visual projector must simply learn to generate visual prompts that project the visual tokens into the audio space. In this way, our curriculum learning strategy allows the model to incorporate visual inputs as well as adapt to new audio domains in AV-ASR benchmarks. We apply each phase just once, as an iterative application of alternating phases leads to performance degradation.
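In code, the curriculum amounts to two optimization phases over disjoint parameter sets while everything else stays frozen. The sketch below assumes hypothetical `model`, `adapter_params`, and `projector_params` handles and is only meant to show the phase split, not the actual training recipe.

```python
import torch

def train_phase(params, data_loader, forward_fn, steps, lr=1e-4):
    """Optimize only `params`; all other weights remain frozen."""
    opt = torch.optim.Adam(params, lr=lr)
    data_iter = iter(data_loader)
    for _ in range(steps):
        batch = next(data_iter)
        loss = forward_fn(batch)   # ASR loss against pseudo ground-truth transcripts
        opt.zero_grad()
        loss.backward()
        opt.step()

# Phase 1: audio domain adaptation -- train adapters only, no visual tokens fed.
# train_phase(adapter_params, howto100m_loader,
#             lambda b: model(audio=b["audio"], visual=None, targets=b["pseudo_text"]),
#             steps=phase1_steps)

# Phase 2: visual feature integration -- train the projector only, adapters frozen.
# train_phase(projector_params, howto100m_loader,
#             lambda b: model(audio=b["audio"], visual=b["frames"], targets=b["pseudo_text"]),
#             steps=phase2_steps)
```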
Overall architecture and training procedure for AVFormer. The architecture consists of a frozen Conformer encoder-decoder model, and a frozen CLIP encoder (frozen layers shown in gray with a lock symbol), in conjunction with two lightweight trainable modules – (i) visual projection layer (orange) and bottleneck adapters (blue) to enable multimodal domain adaptation. We propose a two-phase curriculum learning strategy: the adapters (blue) are first trained without any visual tokens, after which the visual projection layer (orange) is tuned while all the other parts are kept frozen.
The plots below show that without curriculum learning, our AV-ASR model is worse than the audio-only baseline across all datasets, with the gap increasing as more visual tokens are added. In contrast, when the proposed two-phase curriculum is applied, our AV-ASR model performs significantly better than the baseline audio-only model.
Effects of curriculum learning. Red and blue lines are for audiovisual models and are shown on 3 datasets in the zero-shot setting (lower WER % is better). Using the curriculum helps on all 3 datasets (for How2 (a) and Ego4D (c) it is crucial for outperforming audio-only performance). Performance improves up until 4 visual tokens, at which point it saturates.
Results in zero-shot AV-ASR
We compare AVFormer to BEST-RQ, the audio-only version of our model, and to AVATAR, the state of the art in AV-ASR, for zero-shot performance on the three AV-ASR benchmarks: How2, VisSpeech and Ego4D. AVFormer outperforms both on all three, even when AVATAR and BEST-RQ are trained on LibriSpeech and the full set of HowTo100M. This is notable because for BEST-RQ this involves training 600M parameters, while AVFormer trains only 4M parameters and therefore requires only a small fraction of the training dataset (5% of HowTo100M). We also evaluate performance on LibriSpeech, which is audio-only, and AVFormer outperforms both baselines.
Comparison to state-of-the-art methods for zero-shot performance across different AV-ASR datasets. We also show performances on LibriSpeech which is audio-only. Results are reported as WER % (lower is better). AVATAR and BEST-RQ are finetuned end-to-end (all parameters) on HowTo100M whereas AVFormer works effectively even with 5% of the dataset thanks to the small set of finetuned parameters.
Conclusion
We introduce AVFormer, a lightweight method for adapting existing, frozen state-of-the-art ASR models for AV-ASR. Our approach is practical and efficient, and achieves impressive zero-shot performance. As ASR models get larger and larger, tuning the entire parameter set of pre-trained models becomes impractical (even more so for different domains). Our method seamlessly allows both domain transfer and visual input mixing in the same, parameter efficient model.
Acknowledgements
This research was conducted by Paul Hongsuck Seo, Arsha Nagrani and Cordelia Schmid.
When you see a context-relevant advertisement on a web page, it’s most likely content served by a Taboola data pipeline. As the leading content recommendation company in the world, a big challenge for Taboola was the frequent need to scale Apache Spark CPU cluster capacity to address the constantly growing compute and storage requirements.
Data center capacity and hardware costs are always under pressure.
What caused the scaling challenges? Taboola uses a complex data pipeline stretching from user browsers or mobile devices through multiple data centers. Complex deep learning algorithms, databases, infrastructure services (such as Apache Kafka), and thousands of servers are deployed to serve the best-fitting ads to users around the world.
This post describes Taboola’s motivation to move to RAPIDS Accelerator for Apache Spark to optimize processing costs, along with insights on the migration process, challenges, and lessons learned to date.
Challenges of feeding a compute-hungry pipeline
To plan solutions, you must fully understand the magnitude of the problem. When serving ad content, Taboola builds a unique pageview that identifies each user and their interaction with the system. The pageview, a large, wide data structure, is built within a massive CPU cluster using data collected across worldwide data centers. The structure contains over 1500 distinct columns amounting to >1 TB of hourly data, all processed in our Apache Spark CPU cluster.
Many distinct analyzers and SQL queries process incoming pageviews at a rate of 1 TB of raw data per hour, with catch-up runs at 2, 6, 12, and 48 hours. New analyzers are constantly being created, which increases the load on the Apache Spark cluster and drives a growing need for more compute power.
Our main task was to make the complex compute-hungry pipeline more scalable while being cost-effective. To address this challenge, Taboola embarked on an effort to migrate thousands of CPU cores to GPUs, to help us cope with the increasing data load to be processed.
One tool we considered for accelerating Taboola’s Apache Spark environments was the RAPIDS Accelerator for Apache Spark, which runs on GPUs. Attempting to scale on GPUs rather than CPUs was therefore a natural decision.
Key considerations
First, we defined what we needed for testing to successfully migrate thousands of CPU cores to leverage GPU acceleration:
Real-life dataset
Hardware specifications
Minimum X factor
Complex query testing
Real-life dataset
We used real production data from Cyber Monday to test and benchmark a large dataset. The data is 1.5 TB of ZSTD-compressed Parquet files per hour. It has over 1500 columns of all native types including arrays, structures, and nested structures with arrays.
Hardware specifications
For hardware, we started with the following resources:
A 72 CPU core Intel server with three A30 GPUs
A 900-GB local SSD drive, for Apache Spark to store its intermediate files
380 GB RAM
A 10-Gb/s NIC card
Minimum X factor
The primary question when migrating a project from the CPU to the GPU is usually, “What’s the X factor?” For a real-world cluster with multiple GPUs, the question is, “How many GPUs do I need for sustaining a load manageable by X CPU cores?” The answer is your X factor.
We set a minimum bar of an X factor of 3 for the GPU solution to be considered successful in terms of cost. This factor helps guarantee our migration efforts will pay off.
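A minimal sketch of the arithmetic behind the X factor, using one analyzer's numbers from Table 1 below:

```python
def gpu_x_factor(avg_cpu_time: float, avg_gpu_time: float) -> float:
    """X factor: how many times faster the GPU runs the same workload."""
    return avg_cpu_time / avg_gpu_time

# ExperimentAnalysisSession row from Table 1: 207.87 on CPU vs. 23.01 on GPU.
x = gpu_x_factor(207.87, 23.01)   # ~9.0
meets_bar = x >= 3.0              # our minimum bar for the migration to pay off
```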
Complex query testing
We picked 15 production queries from multiple R&D departments, chosen to be representative of the hundreds of queries running in production.
The queries are mostly complex including many SQL operations:
Aggregations
Sorts
Lateral view explode
Distribute by
Window functions
UDFs
Figure 1 shows an example query.
Figure 1. Live sample of a large complex query in production requiring more compute power
Table 1 shows you Taboola’s factors.
| Analyzer name | Avg CPU production time | Avg GPU time | GPU factor |
| --- | --- | --- | --- |
| AdvertiserDimensionByRequest | 586.41 | 31.91 | 18.38 |
| ExperiementAnalysisPage | 3021.6 | 102.92 | 29.36 |
| ExperiementAnalysisPlacement | 680.84 | 47.12 | 14.45 |
| ExperimentAnalysisRequestBase | 6605.44 | 362.68 | 18.21 |
| ExperimentAnalysisSession | 207.87 | 23.01 | 9.03 |
| MediaDatatrendsBase | 222.94 | 9.8 | 22.75 |
| PerformanceMeasurements | 397.17 | 86.22 | 4.61 |
| PublisherPerformance | 965.63 | 108.95 | 8.86 |
| RBoxABTest | 63.04 | 2.4 | 23.88 |
| RevenueByHostHourly | 487.44 | 95.03 | 5.13 |
| SlaUnitAvailableFillRate | 1199.93 | 152.38 | 7.87 |
| SupplyDataTrends | 529.92 | 45.28 | 11.7 |

Table 1. Production speedups
CPU-to-GPU migration goals
We began with a single server (as described earlier) capable of scaling to a multi-GPU and multi-server cluster. The cluster would be managed by Kubernetes, as opposed to the current Mesos cluster. Mesos was going to become obsolete and NVIDIA environments support Kubernetes.
Any changes in the software and hardware environment would have to be transparent to Taboola’s code, and queries run on GPUs had to produce exactly the same results as on CPUs. The team was aware of this challenge, as production stability was a critical goal.
Lastly, we wanted the GPU to outperform the CPUs by a minimum factor of 3. We benchmarked several GPUs including NVIDIA P100, NVIDIA V100, NVIDIA A100, and NVIDIA A30. We learned that the A30 GPU gave us the best price-performance fit.
First experiment with RAPIDS Accelerator
We ran SQL queries using RAPIDS Accelerator with somewhat disappointing results. Some of the less complex queries, mostly with lateral view explode, gave a 3x to 5x factor over the CPU. Some showed much lower factors, while other queries crashed.
After asking questions in the RAPIDS GitHub repo, we tried relevant Apache Spark and RAPIDS Accelerator parameters and began seeing better results:
spark.sql.files.maxPartitionBytes: The CPU uses a default value of 128 MB. This is too low for the GPU; we are using 1–2 GB.
spark.sql.shuffle.partitions: We found the default of 200 to be sufficient in most cases.
spark.rapids.sql.concurrentGpuTasks: Determines the number of tasks that can run concurrently on the GPU. At least two concurrent tasks appears to work best.
Tuning these parameters helped the queries run more smoothly and perform better in some cases. The NVIDIA Accelerated Spark Analysis tool can automatically generate tuning recommendations.
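For reference, these settings map to standard Spark and RAPIDS Accelerator configuration keys; the sketch below uses illustrative values and assumes the key names from the public documentation, so verify them against the RAPIDS release you run.

```python
from pyspark.sql import SparkSession

# Illustrative tuning values only; tune per workload.
spark = (
    SparkSession.builder
    .appName("rapids-poc")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # enable the RAPIDS Accelerator
    .config("spark.sql.files.maxPartitionBytes", "2g")       # larger partitions suit the GPU
    .config("spark.sql.shuffle.partitions", "200")           # the default was sufficient here
    .config("spark.rapids.sql.concurrentGpuTasks", "2")      # at least two concurrent GPU tasks
    .getOrCreate()
)
```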
Challenge #1 – Parquet parsing overhead
We experienced bottlenecks on some of the less performant SQL queries, with most of them wasting time parsing Parquet footer data on the CPU. The Parquet data has more than 1500 columns. It was apparent that regular Java code for parsing the footer was not adequate for such a large footer.
Figure 2 shows a small section of NVIDIA profiler output for a 9-second Apache Spark task where the GPU was mostly idle (only working for 330 ms).
Figure 2. Inefficient Parquet parsing trace
Figure 3 shows a flame graph of a query suffering from this behavior. The purple bars indicate time spent inside the org.apache.parquet.hadoop.ParquetFileReader class. Almost 50% of the query time was spent on parsing Parquet’s footer while the GPU was idle.
Figure 3. Trace showing inefficient Parquet parsing in purple bars
As such, we set off to test an alternative solution. When parsing the footer, Parquet code would iterate over footer metadata serially for each row group. We adjusted Parquet parameters to decrease the number of row groups in each file. This gave us about 10–15% improvement. We found this improvement to be insufficient.
Each time metadata was read, all 1500 columns of metadata were read and parsed serially, even when a query requested only 50–100 columns. We wanted to index the footer metadata so that, instead of reading all 1500 column entries serially, we could access just the columns we needed directly. We managed to make this happen by changing the Parquet-mr public code in C++ and Java. Although we got nice performance results, this was too cumbersome and complex.
Fortunately, the NVIDIA RAPIDS team had a much better idea: optimized Parquet footer parsing. They replaced the Java code with Arrow’s C++ implementation. We now have spark.rapids.sql.format.parquet.reader.footer.type set to NATIVE by default for our GPU implementation. The bottleneck was solved, and queries no longer left the GPU idle because of footer-parsing overhead on the CPU.
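Assuming the full key carries the usual spark.rapids. prefix, pinning the native footer reader explicitly (reusing the session from the earlier sketch) would look something like this:

```python
# NATIVE is the default in current releases; set explicitly here only for clarity.
spark.conf.set("spark.rapids.sql.format.parquet.reader.footer.type", "NATIVE")
```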
Challenge #2 – Network bottleneck
A weak network card caused the next bottleneck. While the 10-Gb/s Ethernet card sustained the CPU load, it failed to do so for the GPU load.
The solution was an optimal network card for GPU loads. Replacing the 10-Gb/s Ethernet card with a 25-Gb/s card removed this bottleneck.
Challenge #3 – Disk I/O bottleneck
Even after removing the two bottlenecks, queries still ran slow. On the Apache Spark user interface, we saw a clear indication as to what was happening.
| Metric | Min | 25th percentile | Median | 75th percentile | Max |
| --- | --- | --- | --- | --- | --- |
| Duration | 0.4 s | 0.6 s | 0.8 s | 1 s | 1.2 min |
| GC Time | 0.0 ms | 0.0 ms | 0.0 ms | 90.0 ms | 0.5 s |
| Shuffle Read Size/Records | 21.4 MB/1000 | 22.3 MB/1000 | 22.5 MB/1000 | 22.7 MB/1000 | 27.3 MB/1000 |
| Shuffle Write Size/Records | 17.5 MB/1000 | 17.9 MB/1000 | 18 MB/1000 | 18.1 MB/1000 | 18.1 MB/1000 |
| Scheduler Delay | 3.0 ms | 5.0 ms | 5.0 ms | 7.0 ms | 3 s |
| Peak Execution Memory | 64 MB | 64 MB | 64 MB | 64 MB | 64 MB |
| Shuffle Write Time | 9.0 ms | 13.0 ms | 18.0 ms | 21.0 ms | 59 s |

Table 2. Statistics of disk I/O bottlenecks
In Table 2, in the Max column, the task’s duration is 1.2 minutes, while the shuffle write time is 59 seconds. A lot of time was wasted on shuffle work while the GPU was idle.
Figures 4 and 5 show the appropriate event timeline graph. The orange part represents shuffle times, read or write. The green parts are compute time. We were wasting a lot of time reading or writing shuffle files.
Figure 4. Event timeline graph of Disk I/O bottlenecks as segments
Figure 5. Event timeline bar chart of disk I/O bottlenecks
Our shuffle files can get up to 500 GB and greater in some queries. We obviously can’t keep this huge amount of data in the server’s RAM, so shuffle files are stored in the local SSD drive.
After a quick investigation with our teams, we figured out that the SSD drive was configured to use RAID-1. Each temporary shuffle file was saved twice to the disk. This wasted a lot of time and switching to RAID-0 somewhat improved the situation.
The GPU put more pressure on the SSD drive than the CPU such that we had to replace the SSD with an NVMe drive. The solution was to switch to a 6-TB NVMe drive. We removed one of the GPUs to create a slot for the NVMe drive. Afterward, we had no shuffle read or write performance issues. We also learned that one NVMe drive could sustain the workload of two A30 GPUs.
Moving to Kubernetes
Because our Mesos cluster would become obsolete, the move from a standalone POC machine to Kubernetes was imperative. This involved a fair amount of configuration work and other minor changes, but it was straightforward to implement.
The basic idea is that the Apache Spark driver would sit on a non-GPU machine and each K8s Pod would be associated with a single GPU. For more information, see Getting Started with RAPIDS and Kubernetes.
The following code example shows some of the major relevant Kubernetes configurations.
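A representative sketch of that setup, expressed as PySpark session configuration, is shown below; the API server address, container image, and discovery-script path are placeholders rather than Taboola's production values.

```python
from pyspark.sql import SparkSession

# One GPU per executor pod; the driver runs on a non-GPU machine.
spark = (
    SparkSession.builder
    .appName("rapids-on-k8s")
    .master("k8s://https://kubernetes.example.com:6443")                 # placeholder API server
    .config("spark.kubernetes.container.image", "registry.example.com/spark-rapids:latest")
    .config("spark.executor.instances", "4")
    .config("spark.executor.resource.gpu.amount", "1")                   # one GPU per executor
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.executor.resource.gpu.vendor", "nvidia.com")
    .config("spark.executor.resource.gpu.discoveryScript",
            "/opt/sparkRapidsPlugin/getGpusResources.sh")                # placeholder path
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .getOrCreate()
)
```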
Because the goal was to use GPUs to accelerate workloads more cost-effectively than CPUs, we first had to understand just how fast the GPUs were. We set up a test system with two A30 GPUs and streamed production data to it in parallel with our large CPU-core production environment.
Figure 6 shows two of our heaviest queries running on the production CPU cluster and also on the server with two A30 GPUs.
Figure 6. Heavy queries and resulting hourly processing time for CPU and GPU
The yellow and green lines denote the hourly total time across all task numbers of the two queries running on the CPU cluster. The blue and orange lines denote the same queries running on the GPU server. GPU factors are 20x and higher.
Figure 7 shows the two queries running on the GPU shown in Figure 6. Observe that they behave similarly to the CPU in that peaks and valleys roughly align at different times of the day.
Figure 7. Hourly GPU total time across all tasks
Figure 8 shows the factor for all the queries we migrated to the GPU compared with their counterparts on the CPU. The GPU run is missing the biggest query, which we are still working on migrating; it will likely add 200 hours per day to the GPU total time.
Figure 8. Daily aggregation comparison of CPU vs GPU times
Lessons learned
Looking ahead to our next migration steps, Taboola would like to move queries from other R&D departments to the GPU, resulting in a greater number of GPUs in production. This means that QA must monitor the system more closely while in production.
Getting acquainted with RAPIDS Accelerator for Apache Spark was an amazing “joy ride” effort. From handling Parquet files to pushing GPUs to their limit, we became better equipped at handling a large data pipeline along with managing data center capacity and hardware costs. Identifying and coping with hardware limitations with this method proved to be rewarding.
Here are Taboola’s top takeaways for those considering CPU-to-GPU migration:
Tuning parameters in complex environments that have multiple variables is never straightforward. Automate this task to whatever degree possible. It’s probably a good idea to use the NVIDIA Accelerated Spark Analysis tool to help address this challenge, as it can easily suggest optimized parameters.
Look beyond CPUs and GPUs for solutions to bottlenecks. No amount of GPU horsepower can resolve issues that are fundamentally network, disk, bandwidth, or configuration and parsing related.
Multiple GPUs do not hinder performance, but GPUs are so powerful that you may have good performance with fewer GPUs than originally thought. It’s best to test for this to lower costs.
We achieved our 20x factor through RAPIDS Accelerator and NVIDIA GPUs. The biggest lesson learned was that we needed to better understand what was happening in our existing environment before benefiting from GPU acceleration. One A30 GPU sustained the same production load for some workloads as the ~200-CPU core test cluster.
For more information about achieving performance multiples over Apache Spark CPU-based environments, see GPU-Accelerated Apache Spark.
Acknowledgments
Our huge effort was more successful with the support, assistance, and patience of two great groups of people. At Taboola: Andrey Gourine, Gilad Zamoscinski, Igor Berman, Keren Corsia, Lior Chaga, and Michael Taranov. On the NVIDIA RAPIDS team: Alessandro Bellina, Hao Zhu, Karthikeyan Rajendran, Robert Evans, and Sameer Raheja.
Posted by Ziniu Hu, Student Researcher, and Alireza Fathi, Research Scientist, Google Research, Perception Team
Large-scale models, such as T5, GPT-3, PaLM, Flamingo and PaLI, have demonstrated the ability to store substantial amounts of knowledge when scaled to tens of billions of parameters and trained on large text and image datasets. These models achieve state-of-the-art results on downstream tasks, such as image captioning, visual question answering and open vocabulary recognition. Despite such achievements, these models require a massive volume of data for training and end up with a tremendous number of parameters (billions in many cases), resulting in significant computational requirements. Moreover, the data used to train these models can become outdated, requiring re-training every time the world’s knowledge is updated. For example, a model trained just two years ago might yield outdated information about the current president of the United States.
In the fields of natural language processing (RETRO, REALM) and computer vision (KAT), researchers have attempted to address these challenges using retrieval-augmented models. Typically, these models use a backbone that is able to process a single modality at a time, e.g., only text or only images, to encode and retrieve information from a knowledge corpus. However, these retrieval-augmented models are unable to leverage all available modalities in a query and knowledge corpora, and may not find the information that is most helpful for generating the model’s output.
To address these issues, in “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory”, to appear at CVPR 2023, we introduce a visual-language model that learns to utilize a multi-source multi-modal “memory” to answer knowledge-intensive queries. REVEAL employs neural representation learning to encode and convert diverse knowledge sources into a memory structure consisting of key-value pairs. The keys serve as indices for the memory items, while the corresponding values store pertinent information about those items. During training, REVEAL learns the key embeddings, value tokens, and the ability to retrieve information from this memory to address knowledge-intensive queries. This approach allows the model parameters to focus on reasoning about the query, rather than being dedicated to memorization.
We augment a visual-language model with the ability to retrieve multiple knowledge entries from a diverse set of knowledge sources, which helps generation.
Memory construction from multimodal knowledge corpora
Our approach is similar to REALM in that we precompute key and value embeddings of knowledge items from different sources and index them in a unified knowledge memory, where each knowledge item is encoded into a key-value pair. Each key is a d-dimensional embedding vector, while each value is a sequence of token embeddings representing the knowledge item in more detail. In contrast to previous work, REVEAL leverages a diverse set of multimodal knowledge corpora, including the WikiData knowledge graph, Wikipedia passages and images, web image-text pairs and visual question answering data. Each knowledge item could be text, an image, a combination of both (e.g., pages in Wikipedia) or a relationship or attribute from a knowledge graph (e.g., Barack Obama is 6’ 2” tall). During training, we continuously re-compute the memory key and value embeddings as the model parameters get updated. We update the memory asynchronously at every thousand training steps.
Scaling memory using compression
A naïve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and the top-k retrieved memory values by concatenating all their tokens together and feeding them into a transformer encoder-decoder pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) the transformer encoder has a quadratic complexity with respect to the total number of tokens times k for self-attention. Therefore, we propose to use the Perceiver architecture to encode and compress knowledge items. The Perceiver model uses a transformer decoder to compress the full token sequence into an arbitrary length. This lets us retrieve top-k memory entries for k as large as a hundred.
The following figure illustrates the procedure of constructing the memory key-value pairs. Each knowledge item is processed through a multi-modal visual-language encoder, resulting in a sequence of image and text tokens. The key head then transforms these tokens into a compact embedding vector. The value head (perceiver) condenses these tokens into fewer ones, retaining the pertinent information about the knowledge item within them.
We encode the knowledge entries from different corpora into unified key and value embedding pairs, where the keys are used to index the memory and values contain information about the entries.
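The following is a schematic sketch of that key/value construction; the module names, dimensions, and mean-pooling for the key head are illustrative assumptions rather than the exact REVEAL architecture.

```python
import torch
import torch.nn as nn

class MemoryEncoder(nn.Module):
    """Turn one knowledge item's token embeddings into a key vector and compressed value tokens."""
    def __init__(self, dim: int = 768, key_dim: int = 256, num_value_tokens: int = 32):
        super().__init__()
        self.key_head = nn.Linear(dim, key_dim)
        # Perceiver-style compression: a fixed set of learned latent queries
        # cross-attends to the full multimodal token sequence.
        self.latents = nn.Parameter(torch.randn(num_value_tokens, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, item_tokens: torch.Tensor):
        # item_tokens: (batch, seq_len, dim) from the multimodal visual-language encoder
        key = self.key_head(item_tokens.mean(dim=1))                    # (batch, key_dim)
        latents = self.latents.unsqueeze(0).expand(item_tokens.size(0), -1, -1)
        values, _ = self.cross_attn(latents, item_tokens, item_tokens)  # (batch, num_value_tokens, dim)
        return key, values
```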
Large-scale pre-training on image-text pairs
To train the REVEAL model, we begin with a large-scale corpus of three billion image alt-text caption pairs collected from the public web, introduced in LiT. Since the dataset is noisy, we add a filter to remove data points with captions shorter than 50 characters, which yields roughly 1.3 billion image-caption pairs. We then take these pairs, combined with the text generation objective used in SimVLM, to train REVEAL. Given an image-text example, we randomly sample a prefix containing the first few tokens of the text. We feed the text prefix and image to the model as input, with the objective of generating the rest of the text as output. The training goal is to condition on the prefix and autoregressively generate the remaining text sequence.
To train all components of the REVEAL model end-to-end, we need to warm start the model to a good state (setting initial values to model parameters). Otherwise, if we were to start with random weights (cold-start), the retriever would often return irrelevant memory items that would never generate useful training signals. To avoid this cold-start problem, we construct an initial retrieval dataset with pseudo–ground-truth knowledge to give the pre-training a reasonable head start.
We create a modified version of the WIT dataset for this purpose. Each image-caption pair in WIT also comes with a corresponding Wikipedia passage (words surrounding the text). We put together the surrounding passage with the query image and use it as the pseudo ground-truth knowledge that corresponds to the input query. The passage provides rich information about the image and caption, which is useful for initializing the model.
To prevent the model from relying on low-level image features for retrieval, we apply random data augmentation to the input query image. Given this modified dataset that contains pseudo-retrieval ground-truth, we train the query and memory key embeddings to warm start the model.
REVEAL workflow
The overall workflow of REVEAL consists of four primary steps. First, REVEAL encodes a multimodal input into a sequence of token embeddings along with a condensed query embedding. Then, the model translates each multi-source knowledge entry into unified pairs of key and value embeddings, with the key being utilized for memory indexing and the value encompassing the entire information about the entry. Next, REVEAL retrieves the top-k most related knowledge pieces from multiple knowledge sources, returns the pre-processed value embeddings stored in memory, and re-encodes the values. Finally, REVEAL fuses the top-k knowledge pieces through an attentive knowledge fusion layer by injecting the retrieval score (dot product between query and key embeddings) as a prior during attention calculation. This structure is instrumental in enabling the memory, encoder, retriever and the generator to be concurrently trained in an end-to-end fashion.
Overall workflow of REVEAL.
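A simplified sketch of the retrieval and fusion steps follows; the retrieval score is injected as an additive prior on the attention logits, with shapes and normalization chosen for clarity rather than fidelity to the exact REVEAL fusion layer.

```python
import torch
import torch.nn.functional as F

def retrieve_topk(query: torch.Tensor, memory_keys: torch.Tensor, k: int = 10):
    """query: (d,), memory_keys: (N, d). Returns top-k memory indices and retrieval scores."""
    scores = memory_keys @ query            # dot-product retrieval scores
    top_scores, top_idx = scores.topk(k)
    return top_idx, top_scores

def fuse_with_score_prior(query_tokens: torch.Tensor,
                          value_tokens: torch.Tensor,
                          retrieval_scores: torch.Tensor) -> torch.Tensor:
    """
    query_tokens: (q, d); value_tokens: (k, m, d); retrieval_scores: (k,).
    Attend from query tokens over all retrieved value tokens, adding each entry's
    retrieval score as a prior on its attention logits.
    """
    k, m, d = value_tokens.shape
    flat_values = value_tokens.reshape(k * m, d)
    logits = query_tokens @ flat_values.T / d ** 0.5        # (q, k*m)
    prior = retrieval_scores.repeat_interleave(m)           # repeat each score for its m tokens
    attn = F.softmax(logits + prior, dim=-1)
    return attn @ flat_values                               # (q, d) fused knowledge
```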
Results
We evaluate REVEAL on knowledge-based visual question answering tasks using OK-VQA and A-OKVQA datasets. We fine-tune our pre-trained model on the VQA tasks using the same generative objective where the model takes in an image-question pair as input and generates the text answer as output. We demonstrate that REVEAL achieves better results on the A-OKVQA dataset than earlier attempts that incorporate a fixed knowledge or the works that utilize large language models (e.g., GPT-3) as an implicit source of knowledge.
Visual question answering results on A-OKVQA. REVEAL achieves higher accuracy in comparison to previous works including ViLBERT, LXMERT, ClipCap, KRISP and GPV-2.
We also evaluate REVEAL on the image captioning benchmarks using MSCOCO and NoCaps dataset. We directly fine-tune REVEAL on the MSCOCO training split via the cross-entropy generative objective. We measure our performance on the MSCOCO test split and NoCaps evaluation set using the CIDEr metric, which is based on the idea that good captions should be similar to reference captions in terms of word choice, grammar, meaning, and content. Our results on MSCOCO caption and NoCaps datasets are shown below.
Image Captioning results on MSCOCO and NoCaps using the CIDEr metric. REVEAL achieves a higher score in comparison to Flamingo, VinVL, SimVLM and CoCa.
Below we show a couple of qualitative examples of how REVEAL retrieves relevant documents to answer visual questions.
REVEAL can use knowledge from different sources to correctly answer the question.
Conclusion
We present an end-to-end retrieval-augmented visual language (REVEAL) model, which contains a knowledge retriever that learns to utilize a diverse set of knowledge sources with different modalities. We train REVEAL on a massive image-text corpus with four diverse knowledge corpora, and achieve state-of-the-art results on knowledge-intensive visual question answering and image caption tasks. In the future we would like to explore the ability of this model for attribution, and apply it to a broader class of multimodal tasks.
Acknowledgements
This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross and Alireza Fathi.
Neuralangelo, a new AI model by NVIDIA Research for 3D reconstruction using neural networks, turns 2D video clips into detailed 3D structures, generating lifelike virtual replicas of buildings, sculptures and other real-world objects. Like Michelangelo sculpting stunning, lifelike visions from blocks of marble, Neuralangelo generates 3D structures with intricate details and textures.
The season of hot sun and longer days is here, so stay inside this summer with 20 games joining GeForce NOW in June. Or stream across devices by the pool, from grandma’s house or in the car; whichever way, GeForce NOW has you covered. Titles from the Age of Empires series are the next…
Wireless technology has evolved rapidly, and 5G deployments have made good progress around the world. Until recently, wireless RAN was deployed using closed-box appliance solutions from traditional RAN vendors. This closed-box approach is not scalable, underuses the infrastructure, and does not deliver optimal RAN TCO; it has become clear that such solutions are not effective for the 5G era.
As a result, the telecoms industry has come together to promote and build virtualized and cloud-native RAN solutions on commercial-off-the-shelf (COTS) hardware platforms with open and standard interfaces. This enables a larger ecosystem and flexible solutions on general-purpose server platforms, leveraging the virtues of virtualization and cloud-native technologies.
There are many positives of such an approach: lower cost, larger ecosystem and vendor choices, faster innovation cycles, automation, and scalability. However, one area of concern is that the open RAN architecture may result in a larger attack surface and may cause new security risks.
As a technology leader in accelerated computing platforms, NVIDIA has been working closely with the standards community (3GPP and O-RAN Alliance), partners, and customers to define and deliver a robust set of security capabilities for vRAN platforms.
Our vision is to drive faster innovation at the confluence of cloud, AI, and 5G to enable the applications of tomorrow. We will ensure that the platforms at the foundation of such innovations are built inherently with the utmost security principles in mind.
Tackling the security challenges of open RAN architecture
The introduction of new standard interfaces in open RAN architecture, along with the decoupling of hardware and software, expands the threat surface of RAN systems. Some examples:
New open interfaces for disaggregated RAN, such as Open Fronthaul (OFH), A1, or E2.
Near-real-time RIC and vendor-provided xApps that could be used to exploit the RAN system.
Decoupling of hardware from software increases threats to the trust chain.
Management interfaces such as OFH M-plane, O1, or O2 may introduce new vulnerabilities.
The strict latency requirements on the RAN must be considered when you implement security features, such as accelerated cryptographic operations, on the OFH interface. The increased reliance on open-source software also increases the open RAN dependence on secure development practices within open-source communities. Finally, the dramatic growth in the number of IoT devices requires all RAN deployments to protect themselves against the increasing likelihood of attacks by compromised devices.
CSRIC Council
To promote the security, reliability, and resiliency of the United States’ communications systems, the Federal Communications Commission (FCC) established the Communications Security, Reliability, and Interoperability (CSRIC) Council VIII.
The council published a detailed report on the challenges to the development of secure open RAN technology, and a series of recommendations for the industry on how to overcome them. The report also recommends that the open RAN industry adopts security requirements standardized by O-RAN Alliance Working Group 11. The next sections discuss these recommendations and requirements.
CSRIC Council architectural recommendations
The key set of security-centric architectural recommendations for the open RAN industry stemming from FCC CSRIC VIII’s report are as follows:
Digital signing of production software should apply to open RAN workloads, including network functions and applications.
Ethernet-based fronthaul networks should be segmented to isolate fronthaul traffic from other traffic flows.
Port-based authentication should be used to enable the authorization of network elements attached to the FH network.
Secure protocols providing mutual authentication should be used when deploying radio units (RUs) with Ethernet-based fronthaul in US production networks.
IEEE 802.1X port-based network access control should be implemented for all network elements that connect to the FH network deployed in hybrid mode.
Open RAN implementations should be based on the principles of zero trust architecture (ZTA).
Open RAN software should be deployed on secure server hardware. The credentials and keys used by the open RAN software should be encrypted and stored securely.
Open RAN architectures should implement defenses to prevent adversarial machine learning (AML) attacks. The industry should work within the O-RAN Alliance to drive security specifications that mitigate AML attacks.
MNOs should use secure boot based on a hardware root of trust (RoT), with credentials stored securely (for example, in a hardware security module (HSM)), and software signing to establish an end-to-end chain of trust.
O-RAN WG11 defines robust security requirements
The O-RAN Alliance Security Working Group (WG11) is responsible for security guidelines that span across the entire O-RAN architecture. The security analysis and specifications are being developed in close coordination with other O-RAN Working Groups, as well as regulators, and standards development organizations. It complements the security recommendations published by FCC CSRIC VIII discussed earlier.
The fundamental security principles (SP) of open RAN systems, as per the O-RAN WG11 specification, are based on 16 pillars (Table 1).
| Security principles | Requirements | NVIDIA features for O-RAN WG11 support |
| --- | --- | --- |
| 1. SP-AUTH: Mutual authentication | Detect fake base stations and unauthorized users or applications. | Supports access lists and access control list (ACL)-based filtering as well as per-port authentication. |
|  | Continuous integration and continuous development (CI/CD); software security auditing; security event logging and analysis in real time. | Supported by the partner ecosystem. |
| 13. SP-ISO: Robust isolation | Intra-domain host isolation. | Supports complete isolation from the host. |
| 14. SP-PHY: Physical security | Physically secure environment for sensitive data storage, sensitive function execution, and execution of boot and update processes. | Has hardened and ruggedized versions. |
| 15. SP-CLD: Secure cloud computing and virtualization | Trust in the end-to-end stack from hardware and firmware to virtualized software. | Supported by the partner ecosystem. |
| 16. SP-ROB: Robustness | Robustness of software and hardware resources. | Supported as part of our software development, validation, QA, and release practices. |

Table 1. Security principles of open RAN systems, key requirements, and how NVIDIA delivers these requirements
Security is built into NVIDIA platforms
A key NVIDIA goal is to provide robust security capabilities that span all aspects of security:
Zero-trust platform security for data in execution on the platform.
Network security to protect all data traversing across the air interface, fronthaul, and backhaul.
Storage security for all data stored on the platform and protection in all management interfaces.
Figure 1. Comprehensive E2E security for O-RAN-based virtualized RAN and the 5G core
Table 2. Delivering comprehensive E2E security for O-RAN-based virtualized RAN and the 5G Core

Fronthaul: MACsec; port-based authentication and network access control.
gNB (DU/CU), CRAN: IPsec; HW root of trust; secure boot; infra/tenant isolation; DDoS protection; firewall; physical security countermeasures.
Transport (midhaul/backhaul): overlay networks (VxLAN, EVPN); full transport security (IPsec, SSL/TLS); network ACLs; IDS/IPS.
5G Edge/Core: HW root of trust; secure boot; infra/tenant isolation; DDoS protection; NG-Firewall; cloud SDN (OVN/OVS); GTP-U tunnel classification, acceleration, and security.
O&M: Kubernetes security (Kubernetes API authentication, pod security, role-based access control (RBAC)); all software PKI-authenticated; zero trust architecture (ZTA).
Multi-tenancy/AI: MIG (Multi-Instance GPU) for workload isolation, guaranteed latency, and performance; predictive service assurance (AI to identify, capture, and act on threats).
To achieve this, we rely on the industry’s best security practices. NVIDIA ConnectX SmartNICs and NVIDIA BlueField data processing units (DPUs) were developed with far-edge, near-edge, and cloud security in mind. They implement all the requirements necessary for the edge and cloud providers and security vendors to shape their solutions based on NVIDIA platform features.
The ConnectX SmartNIC includes engines that offload and accelerate MACsec, IPsec (as well as other encryption-based solutions), TLS, rule-based filtering, and precision time-stamping, all at line-rate speeds.
The DPU adds more to the list with a completely isolated platform (a server within a server), which includes:
Secure BMC
Secure boot
Root of trust
Deep packet inspection (DPI)
Additional engines for custom crypto operations and data plane pipeline processing
These features enable you to deploy a network that is encrypted, secured, and completely isolated from the hosts. The DPU helps create a secure cloud RAN architecture, delivering already-screened packets directly to the GPU without involving the host.
We have implemented O-RAN WG11 requirements in our hardware and software platforms, with NVIDIA Aerial 5G vRAN running on converged accelerators that combine the BlueField DPU and the NVIDIA A100 GPU. The NVIDIA Aerial software implements a full inline offload of RAN Layer 1 with the key security capabilities outlined above.
Summary
At NVIDIA, security is top of mind as we transform and virtualize the RAN with open and standards-based architecture. NVIDIA also supports key security capabilities for 5G transport network, 5G Core, orchestration and management layers, and edge AI applications.
When planning your 5G security for open RAN or vRAN, speak to our experts and consider using NVIDIA architectures.
Rapid digital transformation has led to an explosion of sensitive data being generated across the enterprise. That data has to be stored and processed in data centers on-premises, in the cloud, or at the edge. Examples of activities that generate sensitive and personally identifiable information (PII) include credit card transactions, medical imaging or other diagnostic tests, insurance claims, and loan applications.
This wealth of data presents an opportunity for enterprises to extract actionable insights, unlock new revenue streams, and improve the customer experience. Harnessing the power of AI enables a competitive edge in today’s data-driven business landscape.
However, the complex and evolving nature of global data protection and privacy laws can pose significant barriers to organizations seeking to derive value from AI:
General Data Protection Regulation (GDPR) in Europe
Health Insurance Portability and Accountability Act (HIPAA) and Gramm-Leach-Bliley Act (GLBA) in the United States
Customers in healthcare, financial services, and the public sector must adhere to a multitude of regulatory frameworks and also risk incurring severe financial losses associated with data breaches.
In addition to data, the AI models themselves are valuable intellectual property (IP). They are the result of significant resources invested by the model owners in building, training, and optimizing. When not adequately protected in use, AI models face the risk of exposing sensitive customer data, being manipulated, or being reverse-engineered. This can lead to incorrect results, loss of intellectual property, erosion of customer trust, and potential legal repercussions.
Data and AI IP are typically safeguarded through encryption and secure protocols when at rest (storage) or in transit over a network (transmission). But during use, such as when they are processed and executed, they become vulnerable to potential breaches due to unauthorized access or runtime attacks.
Figure 2. Security gap in the end-to-end data lifecycle without confidential computing
Preventing unauthorized access and data breaches
Confidential computing addresses this gap of protecting data and applications in use by performing computations within a secure and isolated environment within a computer’s processor, also known as a trusted execution environment (TEE).
The TEE acts like a locked box that safeguards the data and code within the processor from unauthorized access or tampering and proves that no one can view or manipulate it. This provides an added layer of security for organizations that must process sensitive data or IP.
Figure 3. Confidential computing addresses the security gap
The TEE blocks access to the data and code from the hypervisor, the host OS, infrastructure owners such as cloud providers, and anyone with physical access to the servers. Confidential computing reduces the surface area of attacks from internal and external threats. It secures data and IP at the lowest layer of the computing stack and provides the technical assurance that the hardware and the firmware used for computing are trustworthy.
Today’s CPU-only confidential computing solutions are not sufficient for AI workloads that demand acceleration for faster time-to-solution, better user experience, and real-time response times. Extending the TEE of CPUs to NVIDIA GPUs can significantly enhance the performance of confidential computing for AI, enabling faster and more efficient processing of sensitive data while maintaining strong security measures.
Secure AI on NVIDIA H100
Confidential computing is a built-in hardware-based security feature introduced in the NVIDIA H100 Tensor Core GPU that enables customers in regulated industries like healthcare, finance, and the public sector to protect the confidentiality and integrity of sensitive data and AI models in use.
Figure 4. Confidential computing on NVIDIA H100 GPUs
With security from the lowest level of the computing stack down to the GPU architecture itself, you can build and deploy AI applications using NVIDIA H100 GPUs on-premises, in the cloud, or at the edge. No unauthorized entities can view or modify the data and AI application during execution. This protects both sensitive customer data and AI intellectual property.
Figure 5. Full isolation of virtual machines with confidential computing
For a quick overview of the confidential computing feature in NVIDIA H100, see the following video.
Video 1. What is NVIDIA Confidential Computing?
Unlocking secure AI: Use cases empowered by confidential computing on NVIDIA H100
Accelerated confidential computing with NVIDIA H100 GPUs offers enterprises a solution that is performant, versatile, scalable, and secure for AI workloads. It unlocks new possibilities to innovate with AI while maintaining security, privacy, and regulatory compliance.
Confidential AI training
Confidential AI inference
AI IP protection for ISVs and enterprises
Confidential federated learning
Confidential AI training
In industries like healthcare, financial services, and the public sector, the data used for AI model training is sensitive and regulated. This includes PII, personal health information (PHI), and confidential proprietary data, all of which must be protected from unauthorized internal or external access during the training process.
With confidential computing on NVIDIA H100 GPUs, you get the computational power required to accelerate the time to train and the technical assurance that the confidentiality and integrity of your data and AI models are protected.
Figure 6. Protection for training data and AI models being trained
For AI training workloads done on-premises within your data center, confidential computing can protect the training data and AI models from viewing or modification by malicious insiders or any inter-organizational unauthorized personnel. When you are training AI models in a hosted or shared infrastructure like the public cloud, access to the data and AI models is blocked from the host OS and hypervisor. This includes server administrators who typically have access to the physical servers managed by the platform provider.
Confidential AI inference
When trained, AI models are integrated within enterprise or end-user applications and deployed on production IT systems—on-premises, in the cloud, or at the edge—to infer things about new user data.
End-user inputs provided to the deployed AI model can often be private or confidential information, which must be protected for privacy or regulatory compliance reasons and to prevent any data leaks or breaches.
The AI models themselves are valuable IP developed by the owner of the AI-enabled products or services. They are at risk of being viewed, modified, or stolen during inference computations, resulting in incorrect results and loss of business value.
Figure 7. Protection for the deployed AI model, AI-enabled apps, and user input data
Deploying AI-enabled applications on NVIDIA H100 GPUs with confidential computing provides the technical assurance that both the customer input data and AI models are protected from being viewed or modified during inference. This provides an added layer of trust for end users to adopt and use the AI-enabled service and also assures enterprises that their valuable AI models are protected during use.
AI IP protection for ISVs and enterprises
Independent software vendors (ISVs) invest heavily in developing proprietary AI models for a variety of application-specific or industry-specific use cases. Examples include fraud detection and risk management in financial services or disease diagnosis and personalized treatment planning in healthcare.
ISVs must protect their IP from tampering or stealing when it is deployed in customer data centers on-premises, in remote locations at the edge, or within a customer’s public cloud tenancy. In addition, customers need the assurance that the data they provide as input to the ISV application cannot be viewed or tampered with during use.
Figure 8. Secure deployment of AI IP
Confidential computing on NVIDIA H100 GPUs enables ISVs to scale customer deployments from cloud to edge while protecting their valuable IP from unauthorized access or modifications, even from someone with physical access to the deployment infrastructure. ISVs can also provide customers with the technical assurance that the application can’t view or modify their data, increasing trust and reducing the risk for customers using the third-party ISV application.
Confidential federated learning
Building and improving AI models for use cases like fraud detection, medical imaging, and drug development requires diverse, carefully labeled datasets for training. This demands collaboration between multiple data owners without compromising the confidentiality and integrity of the individual data sources.
Confidential computing on NVIDIA H100 GPUs unlocks secure multi-party computing use cases like confidential federated learning. Federated learning enables multiple organizations to work together to train or evaluate AI models without having to share each group’s proprietary datasets.
Figure 9. An added layer of protection for multi-party AI workloads like federated learning
Confidential federated learning with NVIDIA H100 provides an added layer of security that ensures that both data and the local AI models are protected from unauthorized access at each participating site.
When deployed at the federated servers, it also protects the global AI model during aggregation and provides an additional layer of technical assurance that the aggregated model is protected from unauthorized access or modification.
This helps drive the advancement of medical research, expedite drug development, mitigate insurance fraud, and trace money laundering globally while maintaining security, privacy, and regulatory compliance for all parties involved.
NVIDIA platforms for accelerated confidential computing on-premises
Getting started with confidential computing on NVIDIA H100 GPUs requires a CPU that supports a virtual machine (VM)–based TEE technology, such as AMD SEV-SNP and Intel TDX. Extending the VM-based TEE from the supported CPU to the H100 GPU enables all the VM memory to be encrypted and the application running does not require any code changes.
Top OEM partners are now shipping accelerated platforms for confidential computing, powered by NVIDIA H100 Tensor Core GPUs. These confidential computing–compatible systems combine the NVIDIA H100 PCIe Tensor Core GPUs with AMD Milan or AMD Genoa CPUs that support the AMD SEV-SNP technology.
The following partners are delivering the first wave of NVIDIA platforms for enterprises to secure their data, AI models, and applications in use in data centers on-premises:
ASRock Rack
ASUS
Cisco
Dell Technologies
GIGABYTE
Hewlett Packard Enterprise
Lenovo
Supermicro
Tyan
The NVIDIA H100 GPU ships with a VBIOS (firmware) that supports all confidential computing features in the first production release. The NVIDIA confidential computing software stack, to be released this summer, will support a single GPU first and then multi-GPU and Multi-Instance GPU in subsequent releases.
Posted by Petros Maniatis and Daniel Tarlow, Research Scientists, Google
Software isn’t created in one dramatic step. It improves bit by bit, one little step at a time — editing, running unit tests, fixing build errors, addressing code reviews, editing some more, appeasing linters, and fixing more errors — until finally it becomes good enough to merge into a code repository. Software engineering isn’t an isolated process, but a dialogue among human developers, code reviewers, bug reporters, software architects and tools, such as compilers, unit tests, linters and static analyzers.
Today we describe DIDACT (Dynamic Integrated Developer ACTivity), which is a methodology for training large machine learning (ML) models for software development. The novelty of DIDACT is that it uses the process of software development as the source of training data for the model, rather than just the polished end state of that process, the finished code. By exposing the model to the contexts that developers see as they work, paired with the actions they take in response, the model learns about the dynamics of software development and is more aligned with how developers spend their time. We leverage instrumentation of Google’s software development to scale up the quantity and diversity of developer-activity data beyond previous works. Results are extremely promising along two dimensions: usefulness to professional software developers, and as a potential basis for imbuing ML models with general software development skills.
DIDACT is a multi-task model trained on development activities that include editing, debugging, repair, and code review.
We built and deployed three DIDACT tools internally: Comment Resolution (which we recently announced), Build Repair, and Tip Prediction, each integrated at a different stage of the development workflow. All three tools received enthusiastic feedback from thousands of internal developers. We see this as the ultimate test of usefulness: do professional developers, who are often experts on the code base and who have carefully honed workflows, leverage the tools to improve their productivity?
Perhaps most excitingly, we demonstrate how DIDACT is a first step towards a general-purpose developer-assistance agent. We show that the trained model can be used in a variety of surprising ways, via prompting with prefixes of developer activities, and by chaining together multiple predictions to roll out longer activity trajectories. We believe DIDACT paves a promising path towards developing agents that can generally assist across the software development process.
A treasure trove of data about the software engineering process
Google’s software engineering toolchains store every operation related to code as a log of interactions among tools and developers, and have done so for decades. In principle, one could use this record to replay in detail the key episodes in the “software engineering video” of how Google’s codebase came to be, step-by-step — one code edit, compilation, comment, variable rename, etc., at a time.
Google code lives in a monorepo, a single repository of code for all tools and systems. A software developer typically experiments with code changes in a local copy-on-write workspace managed by a system called Clients in the Cloud (CitC). When the developer is ready to package a set of code changes together for a specific purpose (e.g., fixing a bug), they create a changelist (CL) in Critique, Google’s code-review system. As with other types of code-review systems, the developer engages in a dialog with a peer reviewer about functionality and style. The developer edits their CL to address reviewer comments as the dialog progresses. Eventually, the reviewer declares “LGTM!” (“looks good to me”), and the CL is merged into the code repository.
Of course, in addition to a dialog with the code reviewer, the developer also maintains a “dialog” of sorts with a plethora of other software engineering tools, such as the compiler, the testing framework, linters, static analyzers, fuzzers, etc.
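As a concrete (and entirely hypothetical) illustration, one entry in such a log might capture something like the following; the field names are invented for illustration, not Google's actual schema.

```python
# Entirely hypothetical sketch of one logged developer-activity event.
# Field names are invented for illustration, not Google's actual log schema.
activity_event = {
    "timestamp": "2023-05-02T14:31:07Z",
    "developer": "anonymized-id-1234",
    "workspace": "citc-workspace-5678",   # copy-on-write workspace (CitC)
    "changelist": 987654,                 # CL under review in Critique
    "tool": "compiler",                   # or "critique", "linter", "test", ...
    "input": {
        "file": "server/main.cc",
        "event": "build_error",
        "message": "use of undeclared identifier 'reqest'",
    },
    "response": {
        "kind": "edit",
        "diff": "-  Handle(reqest);\n+  Handle(request);",
    },
}
```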
An illustration of the intricate web of activities involved in developing software: small actions by the developer, interactions with a code reviewer, and invocations of tools such as compilers.
A multi-task model for software engineering
DIDACT utilizes interactions among engineers and tools to power ML models that assist Google developers, by suggesting or enhancing actions developers take — in context — while pursuing their software-engineering tasks. To do that, we have defined a number of tasks about individual developer activities: repairing a broken build, predicting a code-review comment, addressing a code-review comment, renaming a variable, editing a file, etc. We use a common formalism for each activity: it takes some State (a code file), some Intent (annotations specific to the activity, such as code-review comments or compiler errors), and produces an Action (the operation taken to address the task). Actions are expressed in what amounts to a mini programming language that can be extended for newly added activities; it covers things like editing, adding comments, renaming variables, marking up code with errors, etc. We call this language DevScript.
The DIDACT model is prompted with a task, code snippets, and annotations related to that task, and produces development actions, e.g., edits or comments.
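To make the formalism concrete, here is a rough Python sketch; the type names, fields, and the `generate` call are assumptions for illustration, not DIDACT's actual interfaces.

```python
# Rough sketch of the state-intent-action formalism described above.
# Type and field names are hypothetical, not DIDACT's actual interfaces.
from dataclasses import dataclass

@dataclass
class State:
    path: str          # which file the developer is working on
    contents: str      # current contents of that file

@dataclass
class Intent:
    task: str          # e.g. "build_repair", "comment_resolution", "rename"
    annotations: str   # task-specific context: compiler error, review comment, ...

@dataclass
class Action:
    devscript: str     # the predicted action, expressed in a DevScript-like DSL

def predict_action(model, state: State, intent: Intent) -> Action:
    """Prompt the model with the state and intent; return its output as an action."""
    prompt = f"[{intent.task}]\n{intent.annotations}\n---\n{state.contents}"
    return Action(devscript=model.generate(prompt))  # `generate` is an assumed interface
```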
This state-intent-action formalism enables us to capture many different tasks in a general way. What’s more, DevScript is a concise way to express complex actions, without the need to output the whole state (the original code) as it would be after the action takes place; this makes the model more efficient and more interpretable. For example, a rename might touch a file in dozens of places, but a model can predict a single rename action.
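To make the point about conciseness concrete, here is an invented comparison; the DevScript-style syntax shown is hypothetical, since the actual language is not published.

```python
# Hypothetical, DevScript-flavored compact action (syntax invented for
# illustration): one rename action instead of re-emitting the whole file.
compact_action = 'rename(symbol="old_name", to="new_name")'

# The naive alternative is to generate the entire post-edit file, with every
# occurrence already rewritten: many more tokens and much harder to inspect.
def apply_rename(file_contents: str, old: str, new: str) -> str:
    # A real tool would rename only true references to the symbol, not
    # arbitrary substrings; this sketch ignores that distinction.
    return file_contents.replace(old, new)
```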
An ML peer programmer
DIDACT does a good job on individual assistive tasks. For example, below we show DIDACT doing code clean-up after functionality is mostly done. It looks at the code along with some final comments by the code reviewer (marked with “human” in the animation), and predicts edits to address those comments (rendered as a diff).
Given an initial snippet of code and the comments that a code reviewer attached to that snippet, the Pre-Submit Cleanup task of DIDACT produces edits (insertions and deletions of text) that address those comments.
The multimodal nature of DIDACT also gives rise to some surprising capabilities, reminiscent of behaviors emerging with scale. One such capability is history augmentation, which can be enabled via prompting. Knowing what the developer did recently enables the model to make a better guess about what the developer should do next.
A powerful task exemplifying this capability is history-augmented code completion. In the figure below, the developer adds a new function parameter (1), and moves the cursor into the documentation (2). Conditioned on the history of developer edits and the cursor position, the model completes the line (3) by correctly predicting the docstring entry for the new parameter.
An illustration of history-augmented code completion in action.
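Concretely, the completion scenario might look like the following invented snippet; the function and docstring are made up for illustration.

```python
# Invented example of history-augmented code completion.
# Edit history: the developer just added the parameter `timeout_s` (step 1)
# and moved the cursor into the docstring (step 2).
def fetch_rows(table: str, limit: int, timeout_s: float):
    """Fetch up to `limit` rows from `table`.

    Args:
      table: Name of the table to read.
      limit: Maximum number of rows to return.
      timeout_s: Seconds to wait before giving up.
    """
    ...

# Step 3: conditioned on that history, the model predicts the new docstring
# line "timeout_s: Seconds to wait before giving up." A completion model
# without the edit history has no signal that this entry is missing.
```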
In an even more powerful history-augmented task, edit prediction, the model can choose where to edit next in a fashion that is historically consistent. If the developer deletes a function parameter (1), the model can use history to correctly predict an update to the docstring (2) that removes the deleted parameter (without the human developer manually placing the cursor there) and to update a statement in the function (3) in a syntactically (and — arguably — semantically) correct way. With history, the model can unambiguously decide how to continue the “editing video” correctly. Without history, the model wouldn’t know whether the missing function parameter is intentional (because the developer is in the process of a longer edit to remove it) or accidental (in which case the model should re-add it to fix the problem).
An illustration of edit prediction, over multiple chained iterations.
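The same scenario, sketched with invented code, might end up as follows after the model's predicted edits have been applied; the function names are hypothetical.

```python
# Invented example of history-aware edit prediction, shown after the
# predicted edits have been applied.
# History, step (1): the developer deleted the parameter `timeout_s`.
# Prediction, step (2): remove its now-stale docstring entry.
# Prediction, step (3): drop the corresponding argument from the call below.
def fetch_rows(table: str, limit: int):
    """Fetch up to `limit` rows from `table`.

    Args:
      table: Name of the table to read.
      limit: Maximum number of rows to return.
    """
    return _read(table, limit)

def _read(table: str, limit: int):
    # Placeholder backend so the sketch is self-contained.
    return [(table, i) for i in range(limit)]

# Without the deletion in its history, a model cannot tell whether the missing
# parameter is intentional (continue the removal) or accidental (re-add it);
# with history, the choice is unambiguous.
```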
The model can go even further. For example, we started with a blank file and asked the model to successively predict what edits would come next until it had written a full code file. The astonishing part is that the model developed code in a step-by-step way that would seem natural to a developer: it first created a fully working skeleton with imports, flags, and a basic main function. It then incrementally added new functionality, like reading from a file and writing results, and added functionality to filter out some lines based on a user-provided regular expression, which required changes across the file, like adding new flags.
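This "write a whole file from scratch" behavior is essentially repeated edit prediction. A minimal rollout loop might look like the sketch below; the `predict_edit` and `apply_edit` callables and the stop condition are assumptions, not DIDACT's published API.

```python
# Minimal sketch of rolling out an "editing video" by chaining predictions.
# `predict_edit` and `apply_edit` are assumed callables, not DIDACT's API:
#   predict_edit(state, history) -> an edit, or None once the file looks complete
#   apply_edit(state, edit)      -> the file contents after applying the edit
def rollout(predict_edit, apply_edit, max_steps: int = 50) -> str:
    contents = ""        # start from a blank file
    history = []         # edits made so far, fed back to the model as context
    for _ in range(max_steps):
        edit = predict_edit(contents, history)
        if edit is None:
            break
        contents = apply_edit(contents, edit)
        history.append(edit)
    return contents
```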
Conclusion
DIDACT turns Google’s software development process into training demonstrations for ML developer assistants, and uses those demonstrations to train models that construct code in a step-by-step fashion, interactively with tools and code reviewers. These innovations are already powering tools enjoyed by Google developers every day. The DIDACT approach complements the great strides taken by large language models at Google and elsewhere, towards technologies that ease toil, improve productivity, and enhance the quality of work of software engineers.
Acknowledgements
This work is the result of a multi-year collaboration among Google Research, Google Core Systems and Experiences, and DeepMind. We would like to acknowledge our colleagues Jacob Austin, Pascal Lamblin, Pierre-Antoine Manzagol, and Daniel Zheng, who join us as the key drivers of this project. This work could not have happened without the significant and sustained contributions of our partners at Alphabet (Peter Choy, Henryk Michalewski, Subhodeep Moitra, Malgorzata Salawa, Vaibhav Tulsyan, and Manushree Vijayvergiya), as well as the many people who collected data, identified tasks, built products, strategized, evangelized, and helped us execute on the many facets of this agenda (Ankur Agarwal, Paige Bailey, Marc Brockschmidt, Rodrigo Damazio Bovendorp, Satish Chandra, Savinee Dancs, Matt Frazier, Alexander Frömmgen, Nimesh Ghelani, Chris Gorgolewski, Chenjie Gu, Vincent Hellendoorn, Franjo Ivančić, Marko Ivanković, Emily Johnston, Luka Kalinovcic, Lera Kharatyan, Jessica Ko, Markus Kusano, Kathy Nix, Sara Qu, Marc Rasi, Marcus Revaj, Ballie Sandhu, Michael Sloan, Tom Small, Gabriela Surita, Maxim Tabachnyk, David Tattersall, Sara Toth, Kevin Villela, Sara Wiltberger, and Donald Duo Zhao) and our extremely supportive leadership (Martín Abadi, Joelle Barral, Jeff Dean, Madhura Dudhgaonkar, Douglas Eck, Zoubin Ghahramani, Hugo Larochelle, Chandu Thekkath, and Niranjan Tulpule). Thank you!