DataBloom - Part 225

Visual captions: Using large language models to augment video conferences with dynamic visuals

Post date June 6, 2023

Posted by Ruofei Du, Research Scientist, and Alex Olwal, Senior Staff Research Scientist, Google Augmented Reality

Recent advances in video conferencing have significantly improved remote video communication through features like live captioning and noise cancellation. However, there are various situations where dynamic visual augmentation would be useful to better convey complex and nuanced information. For example, when discussing what to order at a Japanese restaurant, your friends could share visuals that would help you feel more confident about ordering the “Sukiyaki”. Or when talking about your recent family trip to San Francisco, you may want to show a photo from your personal album.

In “Visual Captions: Augmenting Verbal Communication With On-the-fly Visuals”, presented at ACM CHI 2023, we introduce a system that uses verbal cues to augment synchronous video communication with real-time visuals. We fine-tuned a large language model to proactively suggest relevant visuals in open-vocabulary conversations using a dataset we curated for this purpose. We open sourced Visual Captions as part of the ARChat project, which is designed for rapid prototyping of augmented communication with real-time transcription.

Visual Captions facilitates verbal communication with real-time visuals. The system is even robust against typical mistakes that may often appear in real-time speech-to-text transcription. For example, out of context, the transcription model misunderstood the word “pier” as “pair”, but Visual Captions still recommends images of the Santa Monica Pier.

Design space for augmenting verbal communication with dynamic visuals

We invited 10 internal participants with varied technical and non-technical backgrounds, including software engineers, researchers, UX designers, visual artists, and students, to discuss their particular needs and desires for a potential real-time visual augmentation service. In two sessions, we introduced low-fidelity prototypes of the envisioned system, followed by video demos of existing text-to-image systems. These discussions informed a design space with eight dimensions for visual augmentation of real-time conversations, labeled below as D1 to D8.

Visual augmentations could be synchronous or asynchronous with the conversation (D1: Temporal), could be used for both expressing and understanding speech content (D2: Subject), and could be applied using a wide range of different visual content, visual types, and visual sources (D3: Visual). Such visual augmentation might vary depending on the scale of the meetings (D4: Scale) and whether a meeting is in co-located or remote settings (D5: Space). These factors also influence whether the visuals should be displayed privately, shared between participants, or public to everyone (D6: Privacy). Participants also identified different ways in which they would like to interact with the system while having conversations (D7: Initiation). For example, people proposed different levels of “proactivity”, which indicates the degree to which users would like the model to take the initiative. Finally, participants envisioned different methods of interaction, for example, using speech or gestures for input (D8: Interaction).

Design space for augmenting verbal communication with dynamic visuals.

Informed by this initial feedback, we designed Visual Captions to focus on generating synchronous visuals of semantically relevant visual content, type, and source. While participants in these initial exploratory sessions were participating in one-to-one remote conversations, deployment of Visual Captions in the wild will often be in one-to-many (e.g., an individual giving a presentation to an audience) and many-to-many scenarios (e.g., a discussion among multiple people in a meeting).

Because the visual that best complements a conversation depends strongly on the context of the discussion, we needed a training set specific to this purpose. So, we collected a dataset of 1595 quadruples of language (1), visual content (2), type (3), and source (4) across a variety of contexts, including daily conversations, lectures, and travel guides. For example, “I would love to see it!” corresponds to visual content of “face smiling”, a visual type of “emoji”, and visual source of “public search”. “Did she tell you about our trip to Mexico?” corresponds to visual content of “a photo from the trip to Mexico”, a visual type of “photo”, and visual source of “personal album”. We publicly released this VC1.5K dataset for the research community.

Visual intent prediction model

To predict what visuals could supplement a conversation, we trained a visual intent prediction model based on a large language model using the VC1.5K dataset. For training, we parsed each visual intent into the format of “<Visual Type> of <Visual Content> from <Visual Source>”.

{"prompt": "<Previous Two Sentences> →",
  "completion":
"<Visual Type 1> of <Visual Content 1> from <Visual Source 1>;
 <Visual Type 2> of <Visual Content 2> from <Visual Source 2>;
  ... 𝑛"}

Using this format, this system can handle open-vocabulary conversations and contextually predict visual content, visual source, and visual type. Anecdotally, we found that it outperforms keyword-based approaches, which fail to handle open-vocabulary examples like “Your aunt Amy will be visiting this Saturday,” and cannot suggest relevant visual types or visual sources.
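To make the completion format concrete, here is a minimal C++ sketch that splits a completion string into (type, content, source) triples. The `VisualIntent` struct and the `parseCompletion` function are hypothetical names introduced for illustration; they are not part of the released Visual Captions code, and the parsing rules (first " of ", last " from ") are an assumption about how such strings could be read back.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One predicted intent: "<Visual Type> of <Visual Content> from <Visual Source>".
// Hypothetical type, for illustration only.
struct VisualIntent {
    std::string type;
    std::string content;
    std::string source;
};

// Remove leading and trailing whitespace.
static std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Parse a ';'-separated completion. Segments that do not match the
// "<type> of <content> from <source>" pattern are skipped.
std::vector<VisualIntent> parseCompletion(const std::string& completion) {
    std::vector<VisualIntent> intents;
    std::size_t start = 0;
    while (start <= completion.size()) {
        std::size_t end = completion.find(';', start);
        if (end == std::string::npos) end = completion.size();
        const std::string segment = trim(completion.substr(start, end - start));
        start = end + 1;

        // Type is the text before the first " of "; source is the text after
        // the last " from ", so the content itself may contain "from".
        const std::size_t ofPos = segment.find(" of ");
        const std::size_t fromPos = segment.rfind(" from ");
        if (ofPos == std::string::npos || fromPos == std::string::npos ||
            fromPos < ofPos + 4) {
            continue;
        }
        intents.push_back({trim(segment.substr(0, ofPos)),
                           trim(segment.substr(ofPos + 4, fromPos - (ofPos + 4))),
                           trim(segment.substr(fromPos + 6))});
    }
    return intents;
}
```

Splitting on the last " from " keeps content such as “a photo from the trip to Mexico” intact, which a naive first-match split would cut in half.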

Examples of visual intent predictions by our model.

We used 1276 (80%) examples from the VC1.5K dataset for fine-tuning the large language model and the remaining 319 (20%) examples as test data. We measured the performance of the fine-tuned model with the token accuracy metric, i.e., the percentage of tokens in a batch that were correctly predicted by the model. During training, our model reached a training token accuracy of 97% and a validation token accuracy of 87%.
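As a rough illustration of the numbers above, the following C++ sketch computes token accuracy as the fraction of positions where the predicted token matches the reference, and reproduces the 80/20 split arithmetic. The function names are hypothetical, and representing tokens as integer ids is an assumption; the post does not describe the tokenizer or the training framework.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of positions where the predicted token id equals the reference id.
// Returns 0.0 for an empty batch.
double tokenAccuracy(const std::vector<int>& predicted,
                     const std::vector<int>& reference) {
    assert(predicted.size() == reference.size());
    if (reference.empty()) return 0.0;
    std::size_t correct = 0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        if (predicted[i] == reference[i]) ++correct;
    }
    return static_cast<double>(correct) / static_cast<double>(reference.size());
}

// Number of examples in the fine-tuning split, rounded to the nearest example.
std::size_t trainSplitSize(std::size_t total, double trainFraction) {
    return static_cast<std::size_t>(static_cast<double>(total) * trainFraction + 0.5);
}
```

With 1595 examples and a fraction of 0.8, `trainSplitSize` gives 1276, leaving 319 examples for testing.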

Performance

To evaluate the utility of the trained Visual Captions model, we invited 89 participants to perform 846 tasks. They were asked to provide feedback on a scale of “1 — Strongly Disagree” to “7 — Strongly Agree” for six qualitative statements. Most participants preferred to have the visual during a conversation (Q1, 83% ≥ 5–Somewhat Agree). Moreover, they considered the displayed visuals to be useful and informative (Q2, 82% ≥ 5–Somewhat Agree), high-quality (Q3, 82% ≥ 5–Somewhat Agree), and relevant to the original speech (Q4, 84% ≥ 5–Somewhat Agree). Participants also found the predicted visual type (Q5, 87% ≥ 5–Somewhat Agree) and visual source (Q6, 86% ≥ 5–Somewhat Agree) to be accurate given the context of the corresponding conversation.

Technical evaluation results of the visual prediction model rated by study participants.

With this fine-tuned visual intent prediction model, we developed Visual Captions on the ARChat platform, which can add new interactive widgets directly on the camera streams of video conferencing platforms, such as Google Meet. As shown in the system workflow below, Visual Captions automatically captures the user’s speech, retrieves the last sentences, feeds them into the visual intent prediction model every 100 ms, retrieves relevant visuals, and then suggests visuals in real time.

System workflow of Visual Captions.

Visual Captions provides three levels of proactivity when suggesting visuals:

Auto-display (high-proactivity): The system autonomously searches and displays visuals publicly to all meeting participants. No user interaction required.
Auto-suggest (medium-proactivity): The suggested visuals are shown in a private scrolling view. A user then clicks a visual to display it publicly. In this mode, the system is proactively recommending visuals, but the user decides when and what to display.
On-demand-suggest (low-proactivity): The system will only suggest visuals if a user presses the spacebar.
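The three modes can be read as a small decision table. The C++ sketch below is one possible encoding, not the ARChat implementation: the enum names, the `UserInput` struct, and the `decide` function are all hypothetical, and the on-demand branch assumes that pressing the spacebar only produces suggestions.

```cpp
#include <cassert>

// The three proactivity levels described above.
enum class Proactivity { AutoDisplay, AutoSuggest, OnDemandSuggest };

// What the system does with a freshly predicted visual.
enum class Action { None, Suggest, DisplayPublicly };

// Hypothetical summary of what the user did since the last prediction.
struct UserInput {
    bool pressedSpacebar;
    bool clickedSuggestion;
};

Action decide(Proactivity mode, const UserInput& input) {
    switch (mode) {
        case Proactivity::AutoDisplay:
            // High proactivity: no user interaction required.
            return Action::DisplayPublicly;
        case Proactivity::AutoSuggest:
            // Medium proactivity: suggest privately; a click makes it public.
            return input.clickedSuggestion ? Action::DisplayPublicly
                                           : Action::Suggest;
        case Proactivity::OnDemandSuggest:
            // Low proactivity: suggest only when the spacebar is pressed.
            return input.pressedSpacebar ? Action::Suggest : Action::None;
    }
    return Action::None;
}
```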

Quantitative and qualitative evaluation: User studies

We evaluated Visual Captions in both a controlled lab study (n = 26) and in-the-wild deployment studies (n = 10). Participants found that real-time visuals facilitated live conversations by helping explain unfamiliar concepts, resolve language ambiguities, and make conversations more engaging. Participants also reported different preferences for interacting with the system in-situ, and that varying levels of proactivity were preferred in different social scenarios.

Participants’ Task Load Index and Likert scale ratings (from 1 – Strongly Disagree to 7 – Strongly Agree) of four conversations without Visual Captions (“No VC”) and the three Visual Captions modes: auto-display, auto-suggest, and on-demand suggest.

Conclusions and future directions

This work proposes a system for real-time visual augmentation of verbal communication, called Visual Captions, that was trained using a dataset of 1595 visual intents collected from 246 participants, covering 15 topic categories. We publicly release the training dataset, VC1.5K, to the research community to support further research in this space. We have also deployed Visual Captions in ARChat, which facilitates video conferences in Google Meet by transcribing meetings and augmenting the camera video streams.

Visual Captions represents a significant step towards enhancing verbal communication with on-the-fly visuals. By understanding the importance of visual cues in everyday conversations, we can create more effective communication tools and improve how people connect.

Acknowledgements

This work is a collaboration across multiple teams at Google. Key contributors to the project include Xingyu “Bruce” Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Peggy Chi, Alex Olwal, and Ruofei Du.

We would like to extend our thanks to those on the ARChat team who provided assistance, including Jason Mayes, Max Spear, Na Li, Jun Zhang, Jing Jin, Yuan Ren, Adarsh Kowdle, Ping Yu, Darcy Philippon, and Ezgi Oztelcan. We would also like to thank the many people with whom we’ve had insightful discussions and those who provided feedback on the manuscript, including Eric Turner, Yinda Zhang, Feitong Tan, Danhang Tang, and Shahram Izadi. We would also like to thank our CHI reviewers for their insightful feedback.

Misc

Unlocking Speech AI Technology for Global Language Users: Top Q&As

Post date June 6, 2023


Voice-enabled technology is becoming ubiquitous. But many are being left behind by an anglocentric and demographically biased algorithmic world. Mozilla Common Voice (MCV) and NVIDIA are collaborating to change that by partnering on a public crowdsourced multilingual speech corpus—now the largest of its kind in the world—and open-source pretrained models. It is now easier than ever before to develop automatic speech recognition (ASR) technology that works for speakers of many languages.

This post summarizes the top questions asked during Unlocking Speech AI Technology for Global Language Users, a recorded talk from the Speech AI Summit 2022 featuring EM Lewis-Jong of Mozilla Common Voice and Caroline de Brito Gottlieb of NVIDIA.

Do multilingual NVIDIA NeMo open-source models exist?

Caroline de Brito Gottlieb: To make Speech AI more accessible and serve a global community, we first need to understand how the world uses language. Monolingualism is an anomaly worldwide, so researchers at NVIDIA are focused on creating state-of-the-art AI for multilingual contexts.

Through NeMo, NVIDIA has released its first model for multilingual and code-switched/code-mixed speech recognition, which can transcribe audio samples into English, Latin/North American Spanish, as well as both English and Spanish used in the same sentence—a phenomenon called code-switching, or code-mixing. NVIDIA will soon have a multilingual model on NeMo for Indic languages as well.

The switching or mixing of codes is very common in multilingual communities and communities speaking multiple dialects or varieties of the same language. This poses unique challenges for existing speech AI solutions. However, the open-source NeMo model is an important step toward AI that accurately reflects and supports how global communities actually use speech in real-world contexts.

Do datasets extend beyond “language” to include domain-specific vocabulary? For example, finance and healthcare datasets may differ.

EM Lewis-Jong: Domains represented within the corpora on MCV have been historically driven by communities who choose to create datasets through the platform. That means different languages have varied domains represented in their datasets—some might be heavy on news and media, whereas others might contain more educational text. If you want to enhance domain-specific coverage in a Common Voice dataset, simply go through the process of adding text into the platform through GitHub or the Sentence Collector tool. All domains are welcome.

MCV is actively rebuilding and expanding the Sentence Collector tool to make it easier to ingest large volumes of text, and tag them appropriately. Expect to see these changes in April 2023. Also, the team has been collaborating closely with NVIDIA and other data partners to ensure the metadata schema is as interoperable as possible. Domain tagging the Common Voice corpora is a big part of that.

Caroline de Brito Gottlieb: Accounting for domain-specific language is a critical challenge, in particular when applying AI solutions across industries. That is why NVIDIA Riva offers multiple techniques, such as word boosting and vocabulary extension, for customizing ASR models to improve the recognition of specific words.

Our team primarily thinks of domain as a matter of vocabulary and terminology. This alone is a big challenge, given the different levels of specialized terminology and acronyms like GPU, FTP, and more. But it is also important to collect domain-specific data beyond just individual words to capture grammatical or structural differences; for example, the way negation is expressed in clinical practice guidelines. Designing and curating domain-specific datasets is an active area of collaboration between Common Voice and NVIDIA, and we’re excited to see progress in domain-specific ASR for languages beyond English.

How do you differentiate varied versions of Spanish, English, Portuguese, and other languages across geographies?

EM Lewis-Jong: Historically, MCV didn’t have a great system for differentiating between varied versions of a language. Communities chose between creating an entirely new dataset (organized by language), or they could use the accent field. In 2021, MCV did an intensive research exercise and discovered the following:

Limited community awareness about variants: New communities without much context weren’t always sure about how to categorize themselves. Once they’d made the decision about whether to become a new language dataset or remain as an accent, it was difficult to change their minds.
Dataset fragmentation: Diverse communities, such as those with large diaspora populations, may feel they need to split up entirely and set up a whole new language. This fragments the dataset and confuses contributors.
Identity and experience: Some language communities and contributors make use of accent tags, but can feel marginalized and undermined by this. Talking about language is talking about power, and some people want to have the ability to identify their speech beyond ‘accent’ in ways that respect and represent them.
Linguistic and orthographic diversity: Some communities felt there was no suitable arrangement for them, as their spoken language had multiple writing systems. Currently, MCV assumes a 1:1 relationship between spoken word and written word.

For these reasons, the team enabled a new category on the platform called Variant. This is intended to help communities systematically differentiate within languages, and especially to support large languages with a diverse range of speakers.

Where possible, MCV uses BCP 47 codes for tagging. BCP 47 is a flexible system that enables communities to pull out key information such as region, dialect, and orthography.
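To show what a BCP 47 tag carries, here is a deliberately simplified C++ sketch that splits a tag into language, script, region, and variant subtags using the subtag shapes from the standard (2 to 3 letters for language, 4 letters for script, 2 letters or 3 digits for region, 5 to 8 alphanumerics or a digit plus 3 alphanumerics for a variant). It ignores extended language subtags, extensions, private-use subtags, and registry validation, and it is not how Common Voice processes tags; `LanguageTag` and `parseLanguageTag` are hypothetical names.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct LanguageTag {
    std::string language;               // e.g. "sw"
    std::string script;                 // e.g. "Hant" (optional)
    std::string region;                 // e.g. "CD" or "419" (optional)
    std::vector<std::string> variants;  // e.g. "rozaj" (optional)
};

static bool isAlphaStr(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) if (!std::isalpha(c)) return false;
    return true;
}

static bool isDigitStr(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) if (!std::isdigit(c)) return false;
    return true;
}

static bool isAlnumStr(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) if (!std::isalnum(c)) return false;
    return true;
}

// Simplified parser for language[-script][-region][-variant...].
// Returns false if the tag does not fit that shape.
bool parseLanguageTag(const std::string& tag, LanguageTag& out) {
    // Split the tag on '-'.
    std::vector<std::string> parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t end = tag.find('-', start);
        if (end == std::string::npos) {
            parts.push_back(tag.substr(start));
            break;
        }
        parts.push_back(tag.substr(start, end - start));
        start = end + 1;
    }

    out = LanguageTag{};
    std::size_t i = 0;
    if (!isAlphaStr(parts[0]) || parts[0].size() < 2 || parts[0].size() > 3) {
        return false;
    }
    out.language = parts[i++];
    if (i < parts.size() && parts[i].size() == 4 && isAlphaStr(parts[i])) {
        out.script = parts[i++];
    }
    if (i < parts.size() &&
        ((parts[i].size() == 2 && isAlphaStr(parts[i])) ||
         (parts[i].size() == 3 && isDigitStr(parts[i])))) {
        out.region = parts[i++];
    }
    for (; i < parts.size(); ++i) {
        const std::string& p = parts[i];
        const bool longVariant = p.size() >= 5 && p.size() <= 8 && isAlnumStr(p);
        const bool shortVariant = p.size() == 4 && isAlnumStr(p) &&
                                  std::isdigit(static_cast<unsigned char>(p[0]));
        if (!longVariant && !shortVariant) return false;
        out.variants.push_back(p);
    }
    return true;
}
```

For instance, the tag `sw-CD` yields language `sw` with region `CD`, and `zh-Hant-TW` yields language `zh`, script `Hant`, and region `TW`.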

For example, the Kiswahili community might like to differentiate between Congolese Swahili and Chimwiini. Historically on the platform, this would be framed as an ‘accent’ difference—despite the fact that the variants have different vocabulary and grammar and would not be easily mutually intelligible. In other words, speakers might struggle to understand one another.

Communities are now free to choose whether and how they make use of the variant tag. MCV is rolling this out to language communities in phases. The team produced new definitions around language, variant, and accent to act as helpful guidelines for communities. These are living definitions that will evolve with the MCV community. For more information, check out How We’re Making Common Voice Even More Linguistically Inclusive.

What are some examples of successfully deployed use cases?

EM Lewis-Jong: MCV is used by researchers, engineers, and data scientists at most of the world’s largest tech companies, as well as by academics, startups, and civil society. It is downloaded hundreds of thousands of times a year.

Some recent use cases the team is very excited about include the Kinyarwanda Mbaza chatbot, which provides COVID-19 guidance; Thai language health tracking wearables for the visually impaired; financial planning apps in Kiswahili like ChamaChat; and agricultural health guidance for farmers in Kenya like LivHealth.

Caroline de Brito Gottlieb: NeMo, which uses MCV among other datasets, is also widely deployed. Tarteel AI is an AI-focused, faith-based startup focusing on religious and educational tech. The Tarteel team leveraged NVIDIA Riva and NeMo AI tooling to achieve a state-of-the-art word error rate (WER) of 4% on Arabic transcription by fine-tuning an English ASR model on Arabic language data. This enabled Tarteel to develop the world’s first Quranic Arabic ASR, providing technology to support a community of 1.8 billion Muslims across the world in improving their Quran recitation through real-time feedback.
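For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, insertions, and deletions) between the reference transcript and the ASR output, divided by the number of reference words. The C++ sketch below computes it with a standard dynamic-programming edit distance. It is a generic illustration with hypothetical function names, not the evaluation code used by Tarteel, Riva, or NeMo, and it assumes whitespace tokenization with no text normalization.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split a transcript into whitespace-separated words.
static std::vector<std::string> splitWords(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string word;
    while (in >> word) words.push_back(word);
    return words;
}

// WER = word-level edit distance / number of reference words.
double wordErrorRate(const std::string& reference, const std::string& hypothesis) {
    const std::vector<std::string> ref = splitWords(reference);
    const std::vector<std::string> hyp = splitWords(hypothesis);
    assert(!ref.empty());

    // Two-row Levenshtein distance over words.
    std::vector<std::size_t> prev(hyp.size() + 1), curr(hyp.size() + 1);
    for (std::size_t j = 0; j <= hyp.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= ref.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= hyp.size(); ++j) {
            const std::size_t substitution =
                prev[j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            const std::size_t deletion = prev[j] + 1;
            const std::size_t insertion = curr[j - 1] + 1;
            curr[j] = std::min({substitution, deletion, insertion});
        }
        std::swap(prev, curr);
    }
    return static_cast<double>(prev[hyp.size()]) / static_cast<double>(ref.size());
}
```

One wrong word in a four-word reference gives a WER of 0.25, so a WER of 4% corresponds to roughly one word error per 25 reference words.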

In January 2023, Riva released an out-of-the-box Arabic ASR model that can be seamlessly customized for specific dialects, accents, and domains. Another use case on Singaporean English, or Singlish, is presented in Easy Speech AI Customization for Local Singaporean Voice.

How does Mozilla collect the diversity attributes of the Common Voice data set for a language, such as age and sex?

EM Lewis-Jong: MCV enables users to self-identify and associate their clips with relevant information: variant (if your language has them), accent (an important diversity attribute), sex, and age. This year MCV will expand these options for some demographic categories, in particular sex, to be more inclusive.

This information will be associated with your clips, and then securely and safely pseudonymised before the dataset is released. You can tell MCV about your linguistic features in the usual contribution flow; however, for sensitive demographic attributes, you must create an account.

What type of ASR model is best to use when fine-tuning a particular language?

Caroline de Brito Gottlieb: NeMo is a toolkit with pretrained models that enables you to fine-tune for your own language and specific use case. State-of-the-art pretrained NeMo models are freely available on NGC, the NVIDIA hub for GPU-optimized software, and HuggingFace. Check out the extensive tutorials that can all be run on Google Colab, and a full suite of example scripts supporting multi-GPU/multi-node training.

In addition to the languages already offered in NeMo ASR, community members have used NeMo to obtain state-of-the-art results for new languages, dialects, variants, and accents by fine-tuning NeMo base models. Much of that work has used NVIDIA pretrained English language ASR models, but I encourage you to try fine-tuning on a NeMo model for a language most related to the one you are working on. You can start by looking up the family and genealogical classification of a language in Glottolog.

My native language, Yoruba, is not on MCV. What can be done to include it along with its different dialects?

EM Lewis-Jong: Anyone can add a new language to MCV. Reach out about adding your language.

There are two stages to the process: translating the site and collecting sentences.

Translating the site involves a Mozilla tool called Pontoon for translations. Pontoon has lots of languages, but if it doesn’t have yours you can request for your language to be added. Then, to make the language available on the Common Voice project, request the new language on GitHub. Get more details about site translation and how to use Pontoon.

Collecting sentences involves adding small numbers of sentences, or performing bulk imports using GitHub. Remember that sentences need to be CC0 (or public domain), or you can write your own. Learn more about sentence collection and using the Sentence Collector.

Does data augmentation factor into the need for more diversity?

Caroline de Brito Gottlieb: Speech AI models need to be robust for diverse environmental factors and contextual variations, especially as the team scales up to more languages, communities, and therefore, contexts. However, authentic data is not always available to represent this diversity.

Data augmentation is a powerful tool to enhance the size and variety of datasets by simulating speech data characteristics. When applied to training data, the resulting expanded or diversified dataset can help models generalize better to new scenarios and unseen data.

Applying data augmentation techniques to the datasets used for testing makes it possible to understand the model’s performance across a wider variety of speech data contexts. NeMo offers various data augmentation techniques, such as noise perturbation, speed perturbation, and time stretch augmentation, which can be applied to training and testing data.
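As a concrete picture of noise perturbation, the C++ sketch below mixes a noise signal into a speech signal at a requested signal-to-noise ratio. This is a generic textbook formulation, not NeMo's implementation or API; the function names are hypothetical, and the sketch assumes both signals are mono float samples at the same sample rate, with the noise looped if it is shorter than the speech.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean power (average of squared samples) of a signal; 0 for an empty signal.
static double meanPower(const std::vector<float>& x) {
    if (x.empty()) return 0.0;
    double sum = 0.0;
    for (float v : x) sum += static_cast<double>(v) * v;
    return sum / static_cast<double>(x.size());
}

// Return speech + g * noise, with g chosen so that the ratio of speech power
// to scaled noise power equals snrDb. Noise power is measured over the whole
// noise clip, so the achieved SNR is approximate when the clip is looped or
// only partly used.
std::vector<float> addNoiseAtSnr(const std::vector<float>& speech,
                                 const std::vector<float>& noise,
                                 double snrDb) {
    assert(!noise.empty());
    const double speechPower = meanPower(speech);
    const double noisePower = meanPower(noise);
    assert(noisePower > 0.0);

    // SNR_dB = 10 * log10(Ps / (g^2 * Pn))  =>  g = sqrt(Ps / (Pn * 10^(SNR_dB / 10)))
    const double gain =
        std::sqrt(speechPower / (noisePower * std::pow(10.0, snrDb / 10.0)));

    std::vector<float> mixed(speech.size());
    for (std::size_t i = 0; i < speech.size(); ++i) {
        mixed[i] = speech[i] + static_cast<float>(gain * noise[i % noise.size()]);
    }
    return mixed;
}
```

Sweeping `snrDb` over a range of values is a simple way to build either noisier training data or a harder test set from the same recordings.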

Do the datasets in MCV support different accents, such as speaking German with a French accent?

EM Lewis-Jong: There are as many accents as there are speakers, and all are welcome. As of December 2021, you can easily add multiple accents in your profile page.

Accents are not limited by what others have chosen. You can stipulate your accent on your own terms, making it easier for contributors to quickly identify their speech in a natural way.

For example, if you’re a French speaker originally from Germany, who learned French in a Cote D’Ivoire context, you can add accents like ‘German’ and ‘Cote D’Ivoire’ to your French clip submissions.

Summary

To create a healthier AI ecosystem, communities need to be meaningfully engaged in the data creation process. In addition, open-sourcing speech datasets and ASR models enables innovation for everyone.

If you would like to contribute to the public crowdsourced multilingual speech corpus, check out NVIDIA NeMo on GitHub and Mozilla Common Voice to get involved.

Misc

Fish-Farming Startup Casts AI to Make Aquaculture More Efficient, Sustainable

Post date June 6, 2023

As a marine biology student, Josef Melchner always dreamed of spending his days cruising the oceans to find dolphins, whales and fish — but also “wanted to do something practical, something that would benefit the world,” he said. When it came time to choose a career, he dove head first into aquaculture. He’s now CEO…

Misc

Technical Artist Builds Great Woolly Mammoth With NVIDIA Omniverse USD Composer This Week ‘In the NVIDIA Studio’

Post date June 6, 2023

Keerthan Sathya, a senior technical artist specializing in 3D, emerged trium-elephant In the NVIDIA Studio this week with the incredibly detailed, expertly constructed, jaw-droppingly beautiful animation Tiny Mammoth.

Misc

CUDA 12.1 Supports Large Kernel Parameters

Post date June 5, 2023


CUDA kernel function parameters are passed to the device through constant memory and have been limited to 4,096 bytes. CUDA 12.1 increases this parameter limit from 4,096 bytes to 32,764 bytes on NVIDIA Volta and later device architectures.

Previously, passing kernel arguments exceeding 4,096 bytes required working around the kernel parameter limit by copying excess arguments into constant memory with cudaMemcpyToSymbol or cudaMemcpyToSymbolAsync, as shown in the snippet below.

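#define TOTAL_PARAMS        (8000) // ints
#define KERNEL_PARAM_LIMIT  (1024) // ints
#define CONST_COPIED_PARAMS (TOTAL_PARAMS - KERNEL_PARAM_LIMIT)

__constant__ int excess_params[CONST_COPIED_PARAMS];

typedef struct {
    int param[KERNEL_PARAM_LIMIT];
} param_t;

__global__ void kernelDefault(__grid_constant__ const param_t p,...) {
    // access the first KERNEL_PARAM_LIMIT parameters from p
    // access the excess parameters from excess_params in constant memory
}

int main() {
    param_t p;
    static int copied_params[CONST_COPIED_PARAMS];  // host copy of the excess parameters
    // copy the excess parameters to constant memory before the launch
    cudaMemcpyToSymbol(excess_params, copied_params,
                       CONST_COPIED_PARAMS * sizeof(int),
                       0, cudaMemcpyHostToDevice);
    // GRID_DIM and BLOCK_DIM stand in for the elided launch configuration
    kernelDefault<<<GRID_DIM, BLOCK_DIM>>>(p,...);
    cudaDeviceSynchronize();
}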

This approach limits usability because you must explicitly manage both the constant memory allocation and the copy. The copy operation also adds significant latency, degrading the performance of latency-bound kernels that accept more than 4,096 bytes of parameters.

Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above, resulting in the simplified implementation shown in the second snippet below.

#define TOTAL_PARAMS (8000) // ints

typedef struct {
    int param[TOTAL_PARAMS];
} param_large_t;

__global__ void kernelLargeParam(__grid_constant__ const param_large_t p,...) {
    // access all parameters from p
}

int main() {
    param_large_t p_large;
    // GRID_DIM and BLOCK_DIM stand in for the elided launch configuration
    kernelLargeParam<<<GRID_DIM, BLOCK_DIM>>>(p_large,...);
    cudaDeviceSynchronize();
}

Note that in both preceding examples, kernel parameters are annotated with the __grid_constant__ qualifier to indicate they are read-only.
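The size arithmetic behind the two snippets can be checked at compile time. The following host-only C++ sketch reuses the struct definitions from the post and adds `static_assert` checks against the 4,096-byte and 32,764-byte limits; the constant names and the `fitsInKernelParams` helper are hypothetical. It only accounts for the struct itself, so any additional kernel arguments (the elided `...`) would also count toward the limit.

```cpp
#include <cassert>
#include <cstddef>

#define TOTAL_PARAMS       (8000) // ints
#define KERNEL_PARAM_LIMIT (1024) // ints

typedef struct {
    int param[KERNEL_PARAM_LIMIT];
} param_t;

typedef struct {
    int param[TOTAL_PARAMS];
} param_large_t;

// Limits quoted in the post (names are hypothetical).
constexpr std::size_t kOldParamLimitBytes = 4096;   // before CUDA 12.1
constexpr std::size_t kNewParamLimitBytes = 32764;  // CUDA 12.1, Volta and later

// True when an argument block of the given size fits within the limit.
constexpr bool fitsInKernelParams(std::size_t bytes, std::size_t limitBytes) {
    return bytes <= limitBytes;
}

static_assert(sizeof(int) == 4, "this sketch assumes 4-byte ints");
// 1,024 ints = 4,096 bytes: exactly at the old limit.
static_assert(fitsInKernelParams(sizeof(param_t), kOldParamLimitBytes),
              "param_t should fit within the old limit");
// 8,000 ints = 32,000 bytes: too large before CUDA 12.1, fine afterwards.
static_assert(!fitsInKernelParams(sizeof(param_large_t), kOldParamLimitBytes),
              "param_large_t exceeds the old limit");
static_assert(fitsInKernelParams(sizeof(param_large_t), kNewParamLimitBytes),
              "param_large_t should fit within the new limit");
```

Placing a check like this next to a large parameter struct turns an oversized argument list into a compile-time error with a readable message.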

Toolkit and driver compatibility

Note that CUDA Toolkit 12.1 and an R530 or later driver are required to compile, launch, and debug kernels with large kernel parameters. CUDA will issue the CUDA_ERROR_NOT_SUPPORTED error if the launch is attempted on an older driver.

Supported architectures

The higher parameter limit is available on NVIDIA Volta and later architectures. The parameter limit remains at 4,096 bytes on architectures below NVIDIA Volta.
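One way to combine the driver and architecture requirements is a small helper that picks the applicable limit at run time. The C++ sketch below is host-only and takes the two inputs as plain integers; in a real program they could come from `cudaDriverGetVersion` and from `cudaDeviceGetAttribute` with `cudaDevAttrComputeCapabilityMajor`. The function and constant names are hypothetical, and the sketch assumes that an R530 driver reports a version of at least 12010 (CUDA encodes versions as 1000 × major + 10 × minor) and that Volta corresponds to compute capability 7.x.

```cpp
#include <cassert>
#include <cstddef>

// CUDA 12.1 in the 1000 * major + 10 * minor encoding (assumed to match R530).
constexpr int kCuda121Version = 12010;
// Compute capability major version of NVIDIA Volta.
constexpr int kVoltaComputeMajor = 7;

// Kernel parameter limit in bytes for a given driver version and device.
// Both conditions from the post must hold to get the larger limit.
constexpr std::size_t kernelParamLimitBytes(int driverVersion, int computeMajor) {
    return (driverVersion >= kCuda121Version && computeMajor >= kVoltaComputeMajor)
               ? 32764
               : 4096;
}
```

An application that must also run on older drivers or pre-Volta GPUs could use such a check to choose between the large-parameter kernel and the constant-memory fallback.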

Link compatibility across CUDA Toolkit revisions

When linking device objects, if at least one device object contains a kernel with the higher parameter limit, you must recompile all objects from your device sources with CUDA Toolkit 12.1 before linking them together. Failure to do so will result in a linker error.

As an example, consider the scenario when two device objects, a.o and b.o, are linked together. If a.o or b.o contains at least one kernel with the higher parameter limit, then you must recompile the respective sources and link the resulting objects together.

Performance savings with large kernel parameters

Figure 1 compares the performance of the two code snippets (provided above) on a single NVIDIA H100 system measured over 1,000 iterations. In this example, avoiding constant copies resulted in 28% overall savings in application runtime. For the same snippets, Figure 2 shows a 9% improvement in kernel execution time, as measured with NVIDIA Nsight Systems.

Figure 1. Application performance improvement with large kernel parameters on NVIDIA H100

For both images, the gray bar shows execution time for a kernel where 1,024 integers are passed as kernel parameters and remaining integers are passed using constant memory (code snippet 1). The green bar shows execution time for a kernel where 8,000 integers are passed as kernel parameters (code snippet 2). Both kernels accumulate 8,000 integers.

Figure 2. Kernel execution time improvement with large kernel parameters on NVIDIA H100

Note that if you omit the __grid_constant__ qualifier from the kernel parameter and perform a subsequent write operation to it from the kernel, an automatic copy to thread-local memory is triggered. This may offset any performance gains.

Figure 3 shows the kernel execution time improvement profiled using Nsight Systems on QUDA. QUDA is an HPC library used for performing calculations in lattice quantum chromodynamics.

The reference kernel in this example performs a batched matrix multiply X * A + Y, where A, X, and Y are matrices. Kernel parameters store the coefficients of A. Prior to CUDA 12.1, when the coefficients exceeded the parameter limit of 4,096 bytes, they were explicitly copied over to constant memory, greatly increasing the kernel latency. With that copy removed, a significant performance improvement can be observed (Figure 3).

Figure 3. Kernel execution time improvement in QUDA with large kernel parameters

Summary

CUDA 12.1 offers you the option of passing up to 32,764 bytes using kernel parameters, which can be exploited to simplify applications as well as gain performance improvements. To see the full code sample referenced in this post, visit NVIDIA/cuda-samples on GitHub.

Misc

Accelerating the Accelerator: Scientist Speeds CERN’s HPC With GPUs, AI

Post date June 5, 2023

Editor’s note: This is part of a series profiling researchers advancing science with high performance computing. Maria Girone is expanding the world’s largest network of scientific computers with accelerated computing and AI. Since 2002, the Ph.D. in particle physics has worked on a grid of systems across 170 sites in more than 40 countries that…

Misc

Microsoft Bing Speeds Ad Delivery With NVIDIA Triton

Post date June 5, 2023

Jiusheng Chen’s team just got accelerated. They’re delivering personalized ads to users of Microsoft Bing with 7x throughput at reduced cost, thanks to NVIDIA Triton Inference Server running on NVIDIA A100 Tensor Core GPUs. It’s an amazing achievement for the principal software engineering manager and his crew.

Tuning a Complex System

Bing’s ad service uses…

Misc

Take AI Learning to the Edge with NVIDIA Jetson

Post date June 5, 2023


The NVIDIA Jetson Orin Nano and Jetson AGX Orin Developer Kits are now available at a discount for qualified students, educators, and researchers.

Since its initial release almost 10 years ago, the NVIDIA Jetson platform has set the global standard for embedded computing and edge AI. These high-performance, low-power modules and developer kits for deep learning and computer vision give developers a small yet powerful platform to bring their robotics and edge AI vision to life. It’s the perfect tool for learning and teaching AI.

With 80x the performance of the original Jetson Nano, the Jetson Orin Nano Developer Kit enables you to run any kind of modern AI model, including transformer and advanced robotics models. It also has 5.4x the CUDA compute, 6.6x the CPU performance, and 50x the performance per watt compared to the Nano.

The NVIDIA Jetson AGX Orin Developer Kit and all Jetson Orin modules share one SoC architecture. This enables the developer kit to emulate any of the modules and makes it easy for you to start developing your next product. Compact size, lots of connectors, and up to 275 TOPS of AI performance make this developer kit perfect for prototyping advanced AI-powered robots and other edge AI devices.

The following videos show how easy it is to get started with the Jetson Orin Developer Kits.

Video 1. Getting Started with the Jetson Orin Nano Developer Kit

Video 2. Getting Started with the Jetson AGX Orin Developer Kit

Students making the most of Jetson developer kits

When you’ve got a Jetson-powered developer kit, imagine the possibilities for your next edge AI application.

For example, a group of mathematics, computer science, and biology researchers from the University of Marburg in Germany devised a way to quickly identify 80 species of birds and monitor local biodiversity based solely on the sounds those birds make. They used audio recordings captured with portable devices connected to the NVIDIA Jetson Nano Developer Kit.

Two pictures of the Bird@Edge system with colored overlays and component labels. — *Figure 1. Bird@Edge system built with the Jetson Nano*

Last fall, students from Southern Methodist University in Dallas built what they called a baby supercomputer using 16 NVIDIA Jetson Nano modules, four power supplies, more than 60 handmade wires, a network switch, and some cooling fans. Compared to a normal-sized supercomputer, which can be huge, this mini supercomputer fits neatly on a desk, enabling students to easily use it and learn hands-on.

Picture of the student sitting next to a baby supercomputer on a desk. — *Figure 2. SMU student with a baby supercomputer built using 16 Jetson Nano modules*

Special student and educator pricing for Jetson Orin Developer Kits

If you are affiliated with an educational institution, you may be eligible for a discount on Jetson Orin Developer Kits. You must have a valid accredited university or education-related email address. You will then be provided with a one-time use code to order the Jetson Orin Nano Developer Kit or the Jetson AGX Orin Developer Kit. At this time, there is a limit of one of each kit per person.

Qualified students and educators can get the Jetson Orin Nano Developer Kit for $399 (USD) and the Jetson AGX Orin Developer Kit for $1,699 (USD). For more information, see Jetson for Education (login required). Requests for multiple units will be evaluated on a case-by-case basis.

The Ultimate School Supply: NVIDIA Jetson Nano

Picture looks down on a student desk setup with a Jetson Nano Developer Kit. — *Figure 3. A Jetson Nano Developer Kit helps students get started*

Don’t need all the power and performance from Jetson Orin? The NVIDIA Jetson Nano Developer Kit is the perfect companion for the upcoming back-to-school season. Powered by NVIDIA GPU-accelerated computing, the Jetson Nano brings your ideas to life, whether you’re a student, educator, or DIY enthusiast. From building intelligent robots to developing groundbreaking AI applications, the possibilities are endless.

The Jetson Nano Developer Kit is available for $149 (USD).

Need help getting started?

Want to learn more about how you can use the Jetson Nano Developer Kit? Consider signing up for a free course from the NVIDIA Deep Learning Institute: Getting Started with AI on Jetson Nano. You may also want to look at the two Jetson AI certification programs available. For more inspiration, see the Jetson Community Projects page. Be sure to share your hard work in the Jetson forums so other developers can learn from your successes.

Misc

Harnessing the Power of NVIDIA AI Enterprise on Azure Machine Learning

Post date June 2, 2023

AI is transforming industries, automating processes, and opening new opportunities for innovation in the rapidly evolving technological landscape. As more businesses recognize the value of incorporating AI into their operations, they face the challenge of implementing these technologies efficiently, effectively, and reliably.

Enter NVIDIA AI Enterprise, a comprehensive software suite designed to help organizations implement enterprise-ready AI, machine learning (ML), and data analytics at scale with security, reliability, API stability, and enterprise-grade support.

What is NVIDIA AI Enterprise?

Deploying AI solutions can be complex, requiring specialized hardware and software, as well as expert knowledge to develop and maintain these systems. NVIDIA AI Enterprise addresses these challenges by providing a complete ecosystem of tools, libraries, frameworks, and support services tailored for enterprise environments.

NVIDIA AI Enterprise is built on top of the NVIDIA CUDA-X AI software stack, whose GPU-accelerated computing capabilities enable enterprises to run AI workloads more efficiently, cost-effectively, and at scale.

The suite includes:

VMI: A preconfigured virtual machine image that includes the necessary drivers and software to support GPU-accelerated AI workloads in the major clouds.
AI frameworks: Software that can run in a VMI (such as PyTorch, TensorFlow, RAPIDS, NVIDIA Triton with TensorRT and ONNX support, and more) that serves as the basis for AI development and deployment.
Pretrained models: Models that can be used as-is, or fine-tuned on enterprise-relevant data.
AI workflows: Prepackaged reference examples that illustrate how AI frameworks and pretrained models can be leveraged to build AI solutions to solve common business problems. These workflows provide guidance around fine-tuning pretrained models and AI model creation to build on NVIDIA frameworks. The pipelines to create applications are highlighted, as well as opinions on how to deploy customized applications and integrate them with various components typically found in enterprise environments, such as software for orchestration and management, storage, security, and networking. Available AI workflows include:

Intelligent virtual assistant: Engaging around-the-clock contact center assistance for lower operational costs.
Audio transcription: World-class, accurate transcripts based on GPU-optimized models.
Digital fingerprinting threat detection: Cybersecurity threat detection and alert prioritization to identify and act faster.
Next item prediction: Personalized product recommendations for increased customer engagement and retention.
Route optimization: Vehicle and robot routing optimization to reduce travel times and fuel costs.

Supported software with release branches

One of the main benefits of using the software available in NVIDIA AI Enterprise is that it is supported by NVIDIA with security and stability as guiding principles. NVIDIA AI Enterprise includes three release branches to cater to varying requirements across industries and use cases:

Latest Release Branch: Geared towards those needing top-of-the-tree software optimizations, this branch will have a monthly release cadence, ensuring users have access to the latest features and improvements. CVE patches, along with bug fixes, will also be included in roll-forward releases.
Production Release Branch: Designed for environments that prioritize API stability, this branch will receive monthly CVE patches and bug fixes, with two new branches introduced each year, each having a 9-month lifespan. To ensure seamless transitions and support, there will be a 3-month overlap period between two consecutive production branches. Production branches will be available in the second half of 2023.
Long-Term Release Branch: Tailored for highly regulated industries where long-term support is paramount, this branch will receive quarterly CVE patches and bug fixes and offers up to 3 years of support for a particular release. Complementing this long-term stability is a 6-month overlap period to ensure smooth transitions between versions, thus providing the longevity and consistency needed for these highly regulated industries.

Diagram depicting the three release branches of NVIDIA AI Enterprise: Latest Release Branch, Production Release Branch, and Long-Term Release Branch — *Figure 1. The three release branches of NVIDIA AI Enterprise serve varying requirements across industries and use cases*
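
The Production Release Branch figures fit together arithmetically: two new branches a year means a new branch about every 6 months, so a 9-month lifespan leaves 3 months in which consecutive branches are both supported. The short Python sketch below works through that arithmetic. The function name and the assumption of evenly spaced releases are illustrative only and are not part of any NVIDIA tooling.

```python
def overlap_months(branches_per_year: int, lifespan_months: int) -> float:
    """Months during which two consecutive branches are both supported.

    Assumes branches are released at evenly spaced intervals, which the
    article does not state explicitly.
    """
    interval = 12 / branches_per_year  # months between branch releases
    return max(0.0, lifespan_months - interval)


# Production Release Branch: two branches per year, 9-month lifespan.
print(overlap_months(branches_per_year=2, lifespan_months=9))  # 3.0
```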

How to use NVIDIA AI Enterprise with Microsoft Azure Machine Learning

Microsoft Azure Machine Learning is a platform for AI development in the cloud and on premises, including a service for training, experimentation, deploying, and monitoring models, as well as designing and constructing prompt flows for large language models. An open platform, Azure Machine Learning supports all popular machine learning frameworks and toolkits, including those from NVIDIA AI Enterprise.

This collaboration optimizes the experience of running NVIDIA AI software by integrating it with the Azure Machine Learning training and inference platform. Users no longer need to spend time setting up training environments, installing packages, writing training code, logging training metrics, and deploying models. With this integration, users will be able to leverage the power of NVIDIA enterprise-ready software, complementing Azure Machine Learning’s high performance and secure infrastructure, to build production-ready AI workflows.

To get started today, follow these steps:

1. Sign in to Microsoft Azure and launch Azure Machine Learning Studio.

2. View and access all prebuilt NVIDIA AI Enterprise Components, Environments, and Models from the NVIDIA AI Enterprise Preview Registry (Figure 2).

Figure 2. NVIDIA AI Enterprise Preview Registry on Azure Machine Learning

3. Use these assets from within a workspace to create ML pipelines within the designer through simple drag and drop (Figure 3).

Figure 3. Pipelines in Azure Machine Learning using NVIDIA AI Enterprise components

Find NVIDIA AI Enterprise sample assets in the Azure Machine Learning registry. Visit NVIDIA_AI_Enterprise_AzureML on GitHub to find code for the preview assets.

Use case: Body pose estimation

Using the various elements within the NVIDIA AI Enterprise Preview Registry is easy. This example showcases a computer vision task that uses NVIDIA DeepStream for body pose estimation. NVIDIA TAO Toolkit provides the basis for the body pose model and the ability to refine it with new data.

Figure 4 shows a video analytics pipeline example running the NVIDIA DeepStream sample app for body pose estimation. It runs on a GPU cluster and can be easily adapted to leverage updated models and videos, unlocking the power of the Azure Machine Learning platform.

Figure 4. NVIDIA TAO Toolkit and NVIDIA DeepStream for body pose estimation with Azure Machine Learning

The example includes two URI-based data assets created for storing the inputs for the DeepStream sample app command component. The data assets leverage a pretrained model, which is readily available in the NVIDIA AI Enterprise Registry. They also include additional calibration and label information.

The DeepStream body pose command component is configured to use Microsoft Azure blob storage. This component monitors the input directory for any new video files that require inference. When a video file appears, the component picks it up and performs body pose inference. The output video includes bounding boxes and tracking lines and is stored in an output directory.
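
The watch-and-process behavior described above can be sketched as a single directory scan that is called repeatedly. This is a minimal illustration in plain Python, not the DeepStream component itself: the directory layout, the file extensions, and the `run_inference` callback are all hypothetical placeholders.

```python
from pathlib import Path
from typing import Callable, List, Set

VIDEO_SUFFIXES = {".mp4", ".h264"}  # assumed extensions, for illustration only


def process_new_videos(
    input_dir: Path,
    output_dir: Path,
    seen: Set[str],
    run_inference: Callable[[Path, Path], None],
) -> List[str]:
    """Run inference once on every video file not seen in earlier scans."""
    processed = []
    output_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(input_dir.iterdir()):
        if path.suffix.lower() not in VIDEO_SUFFIXES or path.name in seen:
            continue
        # Hypothetical stand-in for the body pose inference step.
        run_inference(path, output_dir / path.name)
        seen.add(path.name)
        processed.append(path.name)
    return processed
```

Calling `process_new_videos` in a loop with a short sleep between scans, while reusing the same `seen` set, would approximate continuous monitoring of the input directory.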

Additional samples available within the registry include:

bodyposenet
citysemsegformer
dashcamnet
emotionnet
fpenet
gazenet
gesturenet
lprnet
peoplenet
peoplenet_transformer
peoplesemsegnet
reidentificationnet
retail_object_detection
retail_object_recognition
trafficcamnet

Each of these samples can be improved with a TAO Toolkit-based training pipeline, which performs transfer learning. The model output changes to fit a specific use case. You can find TAO Toolkit computer vision sample workflows on NGC.

Get started with NVIDIA AI Enterprise on Azure Machine Learning

NVIDIA AI Enterprise and Azure Machine Learning together create a powerful combination of GPU-accelerated computing and a comprehensive cloud-based machine learning platform, enabling businesses to develop and deploy AI models more efficiently. This synergy enables enterprises to harness the flexibility of cloud resources while leveraging the performance advantages of NVIDIA GPUs and software.

To get started with NVIDIA AI Enterprise on Azure Machine Learning, sign up for a Tech Preview. This will give you access to all of the prebuilt Components, Environments, and Models from the NVIDIA AI Enterprise Preview Registry on Azure Machine Learning.

Offsites

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

Post date June 2, 2023

Posted by Arsha Nagrani and Paul Hongsuck Seo, Research Scientists, Google Research

Automatic speech recognition (ASR) is a well-established technology that is widely adopted for various applications such as conference calls, streamed video transcription and voice commands. While the challenges for this technology are centered around noisy audio inputs, the visual stream in multimodal videos (e.g., TV, online edited videos) can provide strong cues for improving the robustness of ASR systems — this is called audiovisual ASR (AV-ASR).

Although lip motion can provide strong signals for speech recognition and is the most common area of focus for AV-ASR, the mouth is often not directly visible in videos in the wild (e.g., due to egocentric viewpoints, face coverings, and low resolution). An emerging area of research is therefore unconstrained AV-ASR (e.g., AVATAR), which investigates the contribution of entire visual frames, and not just the mouth region.

Building audiovisual datasets for training AV-ASR models, however, is challenging. Datasets such as How2 and VisSpeech have been created from instructional videos online, but they are small in size. In contrast, the models themselves are typically large and consist of both visual and audio encoders, and so they tend to overfit on these small datasets. Nonetheless, there have been a number of recently released large-scale audio-only models that are heavily optimized via large-scale training on massive audio-only data obtained from audio books, such as LibriLight and LibriSpeech. These models contain billions of parameters, are readily available, and show strong generalization across domains.

With the above challenges in mind, in “AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR”, we present a simple method for augmenting existing large-scale audio-only models with visual information while also performing lightweight domain adaptation. AVFormer injects visual embeddings into a frozen ASR model (similar to how Flamingo injects visual information into large language models for vision-text tasks) using lightweight trainable adapters that can be trained on a small amount of weakly labeled video data with minimum additional training time and parameters. We also introduce a simple curriculum scheme during training, which we show is crucial to enable the model to jointly process audio and visual information effectively. The resulting AVFormer model achieves state-of-the-art zero-shot performance on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (i.e., LibriSpeech).

Unconstrained audiovisual speech recognition. We inject vision into a frozen speech model (BEST-RQ, in grey) for zero-shot audiovisual ASR via lightweight modules to create a parameter- and data-efficient model called AVFormer (blue). The visual context can provide helpful clues for robust speech recognition especially when the audio signal is noisy (the visual loaf of bread helps correct the audio-only mistake “clove” to “loaf” in the generated transcript).

Injecting vision using lightweight modules

Our goal is to add visual understanding capabilities to an existing audio-only ASR model while maintaining its generalization performance to various domains (both AV and audio-only domains).

To achieve this, we augment an existing state-of-the-art ASR model (BEST-RQ) with the following two components: (i) a linear visual projector and (ii) lightweight adapters. The former projects visual features into the audio token embedding space. This process allows the model to properly connect separately pre-trained visual features and audio input token representations. The latter then minimally modifies the model to add understanding of multimodal inputs from videos. We then train these additional modules on unlabeled web videos from the HowTo100M dataset, along with the outputs of an ASR model as pseudo ground truth, while keeping the rest of the BEST-RQ model frozen. Such lightweight modules enable data-efficiency and strong generalization of performance.
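
To make the two modules concrete, here is a minimal sketch in plain Python. It assumes the common bottleneck-adapter form (down-projection, nonlinearity, up-projection, residual connection), prepends the projected visual token to the audio tokens, and uses tiny made-up dimensions. The exact layer shapes, activation, token placement, and initialization used in AVFormer may differ.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per output dimension


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w (out_dim x in_dim) by vector x (in_dim)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def project_visual(w_proj: Matrix, visual_feature: Vector) -> Vector:
    """Linear visual projector: map a visual feature into the audio token space."""
    return matvec(w_proj, visual_feature)


def bottleneck_adapter(w_down: Matrix, w_up: Matrix, hidden: Vector) -> Vector:
    """Residual bottleneck adapter: hidden + W_up * relu(W_down * hidden)."""
    bottleneck = [max(0.0, v) for v in matvec(w_down, hidden)]
    update = matvec(w_up, bottleneck)
    return [h + u for h, u in zip(hidden, update)]


# Toy sizes (assumptions): 3-dim visual feature, 2-dim audio token space.
w_proj = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 1.0]]
visual_token = project_visual(w_proj, [0.5, 1.0, 2.0])  # [0.5, 3.0]

# Assumed combination scheme: the visual token is prepended to the audio tokens.
audio_tokens = [[0.1, 0.2], [0.3, 0.4]]
sequence = [visual_token] + audio_tokens

# With a zero up-projection the adapter is the identity, so the frozen
# model's behavior is unchanged before any adapter training.
w_down = [[1.0, -1.0]]       # 2 -> 1 bottleneck
w_up_zero = [[0.0], [0.0]]   # 1 -> 2
assert bottleneck_adapter(w_down, w_up_zero, [0.3, 0.4]) == [0.3, 0.4]
```

Because only `w_proj`, `w_down`, and `w_up` would be trained in this picture, the number of trainable parameters stays small relative to the frozen backbone.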

We evaluated our extended model on AV-ASR benchmarks in a zero-shot setting, where the model is never trained on a manually annotated AV-ASR dataset.

Curriculum learning for vision injection

After the initial evaluation, we discovered empirically that with a naïve single round of joint training, the model struggles to learn both the adapters and the visual projectors in one go. To mitigate this issue, we introduced a two-phase curriculum learning strategy that decouples these two factors — domain adaptation and visual feature integration — and trains the network in a sequential manner. In the first phase, the adapter parameters are optimized without feeding visual tokens at all. Once the adapters are trained, we add the visual tokens and train the visual projection layers alone in the second phase while the trained adapters are kept frozen.

The first stage focuses on audio domain adaptation. By the second phase, the adapters are completely frozen and the visual projector must simply learn to generate visual prompts that project the visual tokens into the audio space. In this way, our curriculum learning strategy allows the model to incorporate visual inputs as well as adapt to new audio domains in AV-ASR benchmarks. We apply each phase just once, as an iterative application of alternating phases leads to performance degradation.
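
The two-phase schedule can be summarized by which parameter groups are trainable in each phase and whether visual tokens are fed to the model. The sketch below encodes only that bookkeeping. The group names and the `PhaseConfig` structure are illustrative assumptions, not the authors' training code.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PhaseConfig:
    use_visual_tokens: bool
    trainable: Dict[str, bool]  # parameter group -> receives gradient updates


def curriculum_phase(phase: int) -> PhaseConfig:
    """Return the training setup for phase 1 or phase 2 of the curriculum."""
    if phase == 1:
        # Phase 1: audio domain adaptation; adapters train, no visual tokens.
        return PhaseConfig(
            use_visual_tokens=False,
            trainable={"asr_backbone": False, "adapters": True, "visual_projector": False},
        )
    if phase == 2:
        # Phase 2: adapters are frozen; only the visual projector trains.
        return PhaseConfig(
            use_visual_tokens=True,
            trainable={"asr_backbone": False, "adapters": False, "visual_projector": True},
        )
    raise ValueError("the curriculum has exactly two phases, each applied once")
```

In both phases the ASR backbone stays frozen; the phases differ only in which lightweight module is updated and whether visual tokens are present.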

Overall architecture and training procedure for AVFormer. The architecture consists of a frozen Conformer encoder-decoder model and a frozen CLIP encoder (frozen layers shown in gray with a lock symbol), in conjunction with two lightweight trainable modules: (i) a visual projection layer (orange) and (ii) bottleneck adapters (blue) to enable multimodal domain adaptation. We propose a two-phase curriculum learning strategy: the adapters (blue) are first trained without any visual tokens, after which the visual projection layer (orange) is tuned while all the other parts are kept frozen.

The plots below show that without curriculum learning, our AV-ASR model is worse than the audio-only baseline across all datasets, with the gap increasing as more visual tokens are added. In contrast, when the proposed two-phase curriculum is applied, our AV-ASR model performs significantly better than the baseline audio-only model.

Effects of curriculum learning. Red and blue lines are for audiovisual models and are shown on 3 datasets in the zero-shot setting (lower WER % is better). Using the curriculum helps on all 3 datasets (for How2 (a) and Ego4D (c) it is crucial for outperforming audio-only performance). Performance improves up until 4 visual tokens, at which point it saturates.

Results in zero-shot AV-ASR

We compare AVFormer to BEST-RQ, the audio version of our model, and AVATAR, the state of the art in AV-ASR, for zero-shot performance on the three AV-ASR benchmarks: How2, VisSpeech and Ego4D. AVFormer outperforms AVATAR and BEST-RQ on all three, even when both baselines are trained on LibriSpeech and the full set of HowTo100M. This is notable because for BEST-RQ, this involves training 600M parameters, while AVFormer only trains 4M parameters and therefore requires only a small fraction of the training dataset (5% of HowTo100M). Moreover, we also evaluate performance on LibriSpeech, which is audio-only, and AVFormer outperforms both baselines.

Comparison to state-of-the-art methods for zero-shot performance across different AV-ASR datasets. We also show performances on LibriSpeech which is audio-only. Results are reported as WER % (lower is better). AVATAR and BEST-RQ are finetuned end-to-end (all parameters) on HowTo100M whereas AVFormer works effectively even with 5% of the dataset thanks to the small set of finetuned parameters.
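
For readers unfamiliar with the metric, WER is the word-level edit distance between the generated transcript and the reference transcript, divided by the number of reference words. The self-contained sketch below computes it with standard dynamic programming. The example sentences are made up for illustration, echoing the “clove”/“loaf” confusion mentioned earlier, and are not taken from the benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(substitution, prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five gives a WER of 20%.
print(word_error_rate("slice the loaf of bread", "slice the clove of bread"))  # 0.2
```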

Conclusion

We introduce AVFormer, a lightweight method for adapting existing, frozen state-of-the-art ASR models for AV-ASR. Our approach is practical and efficient, and achieves impressive zero-shot performance. As ASR models get larger and larger, tuning the entire parameter set of pre-trained models becomes impractical (even more so for different domains). Our method seamlessly allows both domain transfer and visual input mixing in the same parameter-efficient model.

Acknowledgements

This research was conducted by Paul Hongsuck Seo, Arsha Nagrani and Cordelia Schmid.