Categories
Offsites

Google at Interspeech 2022

This week, the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022) is being held in Incheon, South Korea, representing one of the world’s most extensive conferences on research and technology of spoken language understanding and processing. Over 2,000 experts in speech-related research fields gather to take part in oral presentations and poster sessions and to collaborate with streamed events across the globe.

We are excited to be a Diamond Sponsor of INTERSPEECH 2022, where we will be showcasing nearly 50 research publications and supporting a number of workshops, special sessions and tutorials. We welcome in-person attendees to drop by the Google booth to meet our researchers and participate in Q&As and demonstrations of some of our latest speech technologies, which help to improve accessibility and provide convenience in communication for billions of users. In addition, online attendees are encouraged to visit our virtual booth in GatherTown where you can get up-to-date information on research and opportunities at Google. You can also learn more about the Google research being presented at INTERSPEECH 2022 below (Google affiliations in bold).

Organizing Committee

Industry Liaisons include: Bhuvana Ramabahdran

Area Chairs include: John Hershey, Heiga Zen, Shrikanth Narayanan, Bastiaan Kleijn

ISCA Fellows

Include: Tara Sainath, Heiga Zen

Publications

Production Federated Keyword Spotting via Distillation, Filtering, and Joint Federated-Centralized Training
Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays

Leveraging Unsupervised and Weakly-Supervised Data to Improve Direct Speech-to-Speech Translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

UserLibri: A Dataset for ASR Personalization Using Only Text
Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

SNRi Target Training for Joint Speech Enhancement and Recognition
Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

Turn-Taking Prediction for Natural Conversational Speech
Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

Streaming Intended Query Detection Using E2E Modeling for Continued Conversation
Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

Improving Distortion Robustness of Self-Supervised Speech Processing Tasks with Domain Adaptation
Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee

XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays

Detecting Unintended Memorization in Language-Model-Fused ASR
W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews

AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

End-to-End Multi-talker Audio-Visual ASR Using an Active Speaker Attention Module
Richard Rose, Olivier Siohan

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Unsupervised Data Selection via Discrete Speech Representation for ASR
Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani

Non-parallel Voice Conversion for ASR Augmentation
Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno

Ultra-Low-Bitrate Speech Coding with Pre-trained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani

Improving Deliberation by Text-Only and Semi-supervised Training
Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

CycleGAN-Based Unpaired Speech Dereverberation
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

TRILLsson: Distilled Universal Paralinguistic Speech Representations (see blog post)
Joel Shor, Subhashini Venugopalan

Learning Neural Audio Features Without Supervision
Sarthak Yadav, Neil Zeghidour

SpeechPainter: Text-Conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi

SpecGrad: Diffusion Probabilistic Model-Based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey

Analysis of Self-Attention Head Diversity for Conformer-Based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

Improving Rare Word Recognition with LM-Aware MWER Training
Wang Weiran, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

MAESTRO: Matched Speech Text Representations Through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen

Pseudo Label is Better Than Human Label
Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

On the Optimal Interpolation Weights for Hybrid Autoregressive Transducer Model
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran

Streaming Align-Refine for Non-autoregressive Deliberation
Wang Weiran, Ke Hu, Tara Sainath

Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin*, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara N Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

4-Bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Visually-Aware Acoustic Event Detection Using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

A Conformer-Based Waveform-Domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein

Reducing Domain Mismatch in Self-Supervised Speech Pre-training
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano

On-the-Fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim

A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang*, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

XTREME-S: Evaluating Cross-Lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

Towards Disentangled Speech Representations
Cal Peyser, Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation
Tom O’Malley, Arun Narayanan, Quan Wang

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen*, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark

A Scalable Model Specialization Framework for Training and Inference Using Submodels and Its Application to Speech Model Personalization
Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno

Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

Workshops, Tutorials & Special Sessions

The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22)
Organizers include: Arsha Nagrani

Self-Supervised Representation Learning for Speech Processing
Organizers include: Tara Sainath

Learning from Weak Labels
Organizers include: Ankit Shah

RNN Transducers for Named Entity Recognition with Constraints on Alignment for Understanding Medical Conversations
Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey

Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids
Authors: Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Authors: Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines

Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays

Trustworthy Speech Processing
Organizers include: Shrikanth Narayanan



*Work done while at Google.  

Categories
Misc

Text Normalization and Inverse Text Normalization with NVIDIA NeMo

Text normalization (TN) converts text from written form into its verbalized form, and it is an essential preprocessing step before text-to-speech (TTS). TN…

Text normalization (TN) converts text from written form into its verbalized form, and it is an essential preprocessing step before text-to-speech (TTS). TN ensures that TTS can handle all input texts without skipping unknown symbols. For example, “$123” is converted to “one hundred and twenty-three dollars.”

To learn more about the technology behind this post, tune into the author’s presentation at INTERSPEECH 2022 on Monday, September 19, 11:00–13:00 (KST), Virtual Poster: Speech Synthesis: Linguistic processing, paradigms, and other topics II and at 21:00–23:00 (KST), Other Topics in Speech Recognition.

Inverse text normalization (ITN) is a part of the automatic speech recognition (ASR) post-processing pipeline. ITN converts ASR model output into its written form to improve text readability. For example, the ITN module replaces “one hundred and twenty-three dollars” transcribed by an ASR model with “$123.”

ITN not only improves readability but also boosts the performance of downstream tasks such as neural machine translation or named entity recognition, as such tasks use written text during training.

ITN is the post-processing step after ASR, while TN is the preprocessing step before TTS. ITN converts ASR output “on may third we paid one hundred and twenty three dollars” to a written form “on may 3 we paid $123,” while TN reverts the process and outputs the original spoken text.
Figure 1. TN and ITN in the conversational AI pipeline

TN and ITN tasks face several challenges:

  • Labeled data is scarce and difficult to collect.
  • There is a low tolerance for unrecoverable errors, as TN and ITN errors cascade down to subsequent models. TN and ITN errors that alter the input semantics are called unrecoverable.

TN and ITN systems support a wide variety of semiotic classes, that is, words or tokens where the spoken form differs from the written form, requiring normalization. Examples are dates, decimals, cardinals, measures, and so on.

Many state-of-the-art TN systems in production are still rule-based using weighted finite state transducers (WFST). WFSTs are a form of finite-state machines used to graph relations between regular languages (or regular expressions). For this post, they can be defined by two major properties:

  • Mappings between accepted input and output expressions for text substitution
  • Path weighting to direct graph traversal

In case of ambiguity, the path with the smallest sum of weights is chosen. In Figure 2, “twenty-three” is transduced to “23″ instead of “20 3.”

In the diagram, the shortest path is selected to output “23” instead of “20 3.”
Figure 2. WFST lattice for input “twenty-three”

Currently, NVIDIA NeMo offers the following option for TN and ITN systems:

  1. Context-independent WFST-based TN and ITN grammars
  2. Context-aware WFST-based grammars + neural LM for TN
  3. Audio-based TN for speech datasets creation
  4. Neural TN and ITN

WFST-based grammar (systems 1, 2, and 3)

The NeMo Text Processing package is a Python framework that relies on the Python package Pynini to write and compile normalization grammars. For more information about the latest supported languages, see Language Support Matrix. For more information about how to extend or add your language grammar, see Grammar customization.

Pynini is a toolkit built on top of OpenFst, and it supports the export of the grammars into an OpenFST Archive File (FAR) (Figure 3). The FAR file can be used in a C++ production framework, which is based on Sparrowhawk.

NeMo TN and ITN uses WFST grammars based on Pynini for development, then exports them in .FAR files, and deploys them in the Sparrowhawk (C++) framework.
Figure 3. Schematic diagram of NeMo inverse text normalization development and deployment

Our initial version of TN/ITN system #1 does not take context into account, as that would make the rules significantly more complex, which requires extensive linguistic knowledge and deteriorates latency. If an input is ambiguous, for example, “1/4” in “The train leaves on 1/4” compared to “1/4 of a cup,” system #1 chooses the normalization deterministically without considering the context.

The system extends system #1 and incorporates context during normalization. The system outputs multiple normalization options in case of contextual ambiguity, which is rescored using a pretrained language model using Masked Language Model Scoring (Figure 4).

Given input “The train leaves on 1/4”, WFST grammars generate all possible normalization options, “The train leaves on one quarter,” “The train leaves on January fourth,” “The train leaves on one/four,” and “The train leaves on one divided by four.” Then, options with weights higher than the threshold values are disregarded. Here, the option “The train leaves on one/four” is dropped.  Finally, the LM scores the remaining options and selects “The train leaves on January fourth” as the best matching one.
Figure 4. WFST+LM shallow fusion pipeline
  1. WFST generates all possible normalization forms and assigns weights to each option.
  2. Pruning normalization options with weights higher than the threshold value “401.2″. In this example, we dropped “one/four”. It has a higher weight as it was not fully normalized.
  3. LM rescoring picks the best among the remaining options.

This approach is similar to shallow fusion for ASR and combines the benefits of the rule-based and neural system. The WFST still limits unrecoverable errors while the neural language model resolves contextual ambiguity without the need for extensive rules or hard-to-get data. For more information, see Text normalization.

Dataset Number of sentences Det WFST Duplex WFST + LM
EngConf 231 68.83 55.41 94.37
GoogleTN 7551 97.29 99.07 97.79
LibriTTS 7677 98.65 90.40 99.01
Table 1. Sentence accuracies on EngConf dataset using different language models for LM rescoring

Table 1 compares the WFST+LM approach in terms of sentence accuracy with the previous system #1 (DetWFST) and a purely neural-based system (Duplex) on three datasets. Later in this post, we provide more details on system #4.

Overall, the WFST+LM model is most effective, particularly on EngConf, a self-collected dataset with ambiguous examples.

Figure 5 shows how susceptible the three methods are to errors. While the neural method is most affected by unrecoverable errors, such as hallucinations or omissions, WFST+LM is least affected by those and class ambiguity.

The following Duplex, Det WFST, and WFST+LM error patterns are showcased: “Number error” (Duplex is affected and input “10001” got altered to “one hundred”, the rest of the models are not affected), “Unknown format” (all models are affected), “Hallucination” (Duplex changes “Mrs.” to “m r e”, the rest of the models are not affected), “Omission” (given input “10 1”, Duplex returns “one one”, i.e. omits “zero”, the rest of the models are not affected), “Class ambiguity” (DetWFST produces a wrong form “leaves on one quarter” for input “leaves on 1/4”, the rest of the models are less affected by such error), “Smart URL splitting” (DetWFST produces “w e A r e s c dot com” for input “WeAreSC.com”, the rest of the models are less affected by such error).
Figure 5. Error patterns for context-free WFST, Duplex, and WFST+LM systems

Audio-based TN (system 3)

Text normalization also comes in handy during the creation of new speech datasets. For instance, “six two seven” and “six twenty-seven” are both valid normalization options of ”627”. However, you must select the option that best reflects what is actually said in the corresponding audio. Audio-based text normalization provides such functionality (Figure 6).

Given input “627”, audio-based TN outputs all possible normalization options, for example, “six hundred twenty seven,” “six twenty seven,” “six two seven,” and so on Then character error rate (CER) is calculated to compare the ASR transcript of the corresponding audio with each normalized option. The option with the lowest CER is selected as the final output.
Figure 6. Example of audio-based normalization resolution

Neural TN and ITN model (system 4)

One significant advantage of neural systems compared to rule-based systems is they are easy to scale if training data for a new language exists. Rule-based systems require much effort to create and may work slowly on some inputs due to combinatorial bursts.

As an alternative to the WFST solution, NeMo hosts a seq2seq Duplex model for TN/ITN and a tagger-based neural model for ITN.

Duplex TN and ITN

Duplex TN and ITN is a neural-based system that can do both TN and ITN. At a high level, the system consists of two components:

  • DuplexTaggerModel:  A transformer-based tagger for identifying semiotic spans in the input (for example, spans about times, dates, or monetary amounts). [NEED LINK]
  • DuplexDecoderModel: A transformer-based seq2seq model for decoding the semiotic spans into their appropriate forms (for example, spoken forms for TN and written forms for ITN).

The term duplex refers to the fact that this system can be trained to do both TN and ITN. However, you can also specifically train the system for only one of the tasks.

Thutmose tagger

The Duplex model is a sequence-to-sequence model. Unfortunately, such neural models are prone to hallucinations that could lead to unrecoverable errors.

The Thutmose Tagger model regards ITN as a tagging task and mitigates hallucination issues (Figures 7 and 8). Thutmose is a single-pass token classifier model that assigns a replacement fragment to every input token or marks it for deletion or copying without changes.

NeMo provides a method of dataset preparation, based on granular alignment of ITN examples. The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets.

Tables 2 and 3 summarize evaluation results for two metrics:

  • Sentence accuracy: An automatic metric that matches each prediction with multiple possible variants of the reference. All errors are divided into two groups: digit error and other error. Digit error occurs when at least one digit differs from the closest reference variant. Other error means a non-digit error is present in the prediction, for example, a punctuation or letter mismatch.
  • Word error rate (WER): An automatic metric commonly used in ASR.
Table 2. Performance metrics (percentage) on English
Test set Metric  Duplex model  Thutmose (BERT)  Thutmose (d-BERT)
Default Sent. acc.  97.31  97.43  97.36
Digit error  0.35  0.31  0.38
Other error  2.34  2.26  2.26
WER  2.9  3.7  3.74
Hard Sent. acc.  85.34  85.17  84.71
Digit error  3.12  3.13  3.06
Other error  11.54  11.70  12.23
WER  9.34  9.02  9.10

d-BERT stands for distilBERT.
Default is the default Google Text Normalization test set.
Hard is a test set with sampling of at least 1,000 examples for each semiotic class.

Table 3. Performance metrics (percentage) on Russian
Test set Metric  Duplex model  Thutmose (BERT)  Thutmose (d-BERT)
Default Sent. acc.  92.34  93.45  92.72
Digit error  0.51  0.43  0.52
Other error  7.15  6.11  6.75
WER  3.63  2.94  3.67
Hard Sent. acc.  81.02  84.03  81.75
Digit error  3.24  3.08  3.77
Other error  15.74  12.90  14.48
WER  11.76  7.07  8.05

One-to-one correspondence between tags and input words improves the interpretability of the model’s predictions, simplifies debugging, and enables post-processing corrections. The model is simpler than sequence-to-sequence models and easier to optimize in production settings.

Thutmose tagger inference pipeline. The model takes as input a sequence of spoken-domain words, passes them through a BERT encoder and a classification head. It assigns a tag to each input word. After a simple post-processing step, the final written-domain output is generated.
Figure 7. ITN as tagging: inference example

The sequence of input words is processed by the BERT-based token classifier, giving the output tag sequence. Simple deterministic post-processing gives the final output.

The following Thutmose and Duplex error patterns are showcased: “Duplication due to alignment mistakes is common error pattern for Thutmose, for example “million million.” Duplex error patterns include hallucinations and overconfident choice of more frequent phrase even when it is not supported by the input, for example it predicts “air canada 777” instead of “air canada 773.”
Figure 8. Examples of errors: (left) Thutmose tagger, (right) Duplex model

Conclusion

Text normalization and inverse text normalization are crucial for conversational systems and considerably affect users’ experience. This post introduced a novel way of handling TN task by combining the benefits of WFST and pretrained language models and a new neural tagging-based approach for tackling ITN task.

For more information, including code examples, tutorials, and documentation for the TN/ITN solutions discussed in this post, see the NVIDIA/NeMo GitHub repo.

Categories
Misc

Dynamic Scale Weighting Through Multiscale Speaker Diarization

Speaker diarization is the process of segmenting audio recordings by speaker labels and aims to answer the question “Who spoke when?”. It makes a clear…

Speaker diarization is the process of segmenting audio recordings by speaker labels and aims to answer the question “Who spoke when?”. It makes a clear distinction when it is compared with speech recognition.

To learn more about the technology behind this post, tune into the author’s presentation at INTERSPEECH 2022 on Thursday, September 22, 13:30-15:30 (KST), On-Site Poster: Speaker Recognition and Diarization.

Before you perform speaker diarization, you know “what is spoken” but you don’t know “who spoke it”. Therefore, speaker diarization is an essential feature for a speech recognition system that enriches the transcription with speaker labels. That is, conversational speech recordings can never be considered to be fully transcribed without a speaker diarization process because transcriptions without speaker labels cannot inform you who is speaking to whom.

Diagram shows that a box named “Automatic Speech Recognition” produces transcribed words “hey how are you quite busy” but those words are all in the same gray color. After the speech signal waveform goes through a Speaker Diarization, “hey”,“quite”, “busy” are colored in green and “how”, “are”, “you” are colored in blue.
Figure 1. Speaker diarization is the task of partitioning audio recordings into speaker-homogeneous regions

Speaker diarization must produce accurate timestamps as speaker turns can be extremely short in conversational settings. We often use short back-channel words such as “yes”, “uh-huh,” or “oh.” These words are challenging for machines to transcribe and identify the speaker. 

While segmenting audio recordings in terms of speaker identity, speaker diarization requires fine-grained decisions on relatively short segments, ranging from a few tenths of a second to several seconds. Making accurate, fine-grained decisions on such short audio segments is challenging because it is less likely to capture reliable speaker traits.

In this post, we discuss how this problem can be addressed by introducing a new technique called the multi-scale approach and multiscale diarization decoder (MSDD) to handle multi-scale inputs.

Mechanism of multi-scale segmentation

Extracting long audio segments is desirable in terms of the quality of speaker characteristics. However, the length of audio segments also limits the granularity, which leads to a coarse unit length for speaker label decisions. Speaker diarization systems are challenged by a trade-off between temporal resolution and the fidelity of the speaker representation, as shown by the curve shown in Figure 2.

During the speaker feature extraction process in the speaker diarization pipeline, the temporal resolution is inevitably sacrificed by taking a long speech segment to obtain high-quality speaker representation vectors. In plain and simple language, if you try to be accurate on voice characteristics, then you have to look into a longer span of time.

At the same time, if you look into a longer span of time, you have to make a decision on a fairly long span of time. This leads to coarse decisions (temporal resolution is low). Think about the fact that even human listeners cannot accurately tell who is speaking if only half a second of recorded speech is given.

In most diarization systems, an audio segment length ranges from 1.5~3.0 seconds becausesuch numbers make a good compromise between the quality of speaker characteristics and temporal resolution. This type of segmentation method is known as a single-scale approach.

Even with an overlap technique, the single-scale segmentation limits the temporal resolution to 0.75~1.5 seconds, which leaves room for improvement in terms of temporal accuracy.

Having a coarse temporal resolution not only deteriorates the performance of diarization but also decreases speaker counting accuracy since short speech segments are not captured properly. More importantly, such coarse temporal resolution in the speaker timestamps makes the matching between the decoded ASR text and speaker diarization result more error-prone.  

To tackle the problem, we proposed a multi-scale approach, which is a way to cope with such a trade-off by extracting speaker features from multiple segment lengths and then combining the results from multiple scales. The multi-scale technique achieves state-of-the-art accuracy on the most popular speaker diarization benchmark datasets. It is already part of the open-source conversational AI toolkit NVIDIA NeMo.

Figure 2 shows the key technical solutions of multi-scale speaker diarization.

On the left, multiple bars in different lengths are drawn below an example picture of speech signal waveform. On the right, a curve showing trade-off between two quantities “Fidelity of speaker representations” and “Temporal resolution”. A circle named “Multiscale approach” is drawn above the trade-off curve showing that “Multiscale approach” can get high-level of both quantities at the same time.
Figure 2. Corresponding trade-off curve on temporal resolution and fidelity of speaker representation

The multi-scale approach is fulfilled by employing multi-scale segmentation and extracting speaker embeddings from each scale. On the left side of Figure 2, four different scales in a multi-scale segmentation approach are performed.

During the segment affinity calculation process, all the information from the longest scale to the shortest scale is combined, yet a decision is made only for the shortest segment range. When combining the features from each scale, the weight of each scale largely affects the speaker diarization performance.

Multiscale diarization pipeline with neural models

Because scale weights largely determine the accuracy of the speaker diarization system, the scale weights should be set to have the maximized speaker diarization performance.

We came up with a novel multi-scale diarization system called multiscale diarization decoder (MSDD) that dynamically determines the importance of each scale at each time-step.

Speaker diarization systems rely on the speaker characteristics captured by audio feature vectors called speaker embeddings. The speaker embedding vectors are extracted by a neural model to generate a dense floating point number vector from a given audio signal.

MSDD takes the multiple speaker embedding vectors from multiple scales and then estimates desirable scale weights. Based on the estimated scale weights, speaker labels are generated. The proposed system weighs more on the large scale if the input signals are considered to have more accurate information on certain scales.

Figure 3 shows the data flow of the proposed multiscale speaker diarization system. Multi-scale segments are extracted from audio input, and corresponding speaker embedding vectors for multi-scale audio input are generated by using the speaker embedding extractor (TitaNet).

Data-flow starts from Audio input, then goes to Embedding Extractor, Clustering Initialization. Then, the signal is split into boxes named Multi-scale Cosine Similarity and Scale Weight Calculation, then merged again at a box named Sequence Model. Lastly, the last box outputs Speaker Labels.
Figure 3. Data-flow of the proposed multi-scale speaker diarization system

The extracted multi-scale embeddings are processed by clustering algorithm to provide an initializing clustering result to the MSDD module. The MSDD module uses cluster-average speaker embedding vectors to compare these with input speaker embedding sequences. The scale weights for each step are sestimated to weigh the importance of each scale.

Finally, the sequence model is trained to output speaker label probabilities for each speaker.

MSDD mechanism

Diagram of input speech embeddings vectors in green, and clustered speaker embeddings are colored in blue for speaker 1 and red for speaker 2. All these green, blue and red speaker embedding vectors are fed into a neural network model which has a couple of 1-D filter layers followed by linear layer and softmax layer.
Figure 4. Scale weights calculated from a 1-D CNN in MSDD

In Figure 4, the 1-D filter captures the context from the input embeddings and cluster average embeddings.

Diagram shows how context vector is calculated. The speaker embedding vectors from the input signal is in green and cosine similarity values are calculated for both input-speaker1 (blue) and input-speaker2 (red) pairs. These cosine similarity values are then multiplied by scale-weights then becomes a context vector which is drawn at the top.
Figure 5. Context vector for MSDD

In Figure 5, cosine similarity values from each speaker and each scale are weighted by the scale weights to form a weighted cosine similarity vector.

The neural network model MSDD is trained to take advantage of a multi-scale approach by dynamically calculating the weight of each scale. MSDD takes the initial clustering results and compares the extracted speaker embeddings with the cluster-average speaker representation vectors.

Most importantly, the weight of each scale at each time step is determined through a scale weighting mechanism where the scale weights are calculated from a 1-D convolutional neural networks (CNNs) applied to the multi-scale speaker embedding inputs and the cluster average embeddings (Figure 3).

The estimated scale weights are applied to cosine similarity values calculated for each speaker and each scale. Figure 5 shows the process of calculating the context vector by applying the estimated scale weights on cosine similarity calculated (Figure 4) between cluster-average speaker embedding and input speaker embeddings.

Finally, each context vector for each step is fed to a multi-layer LSTM model that generates per-speaker speaker existence probability. Figure 6 shows how speaker label sequences are estimated by the LSTM model and context vector input.

A picture showing layers of neural networks. A context vector is fed to a linear layer than it goes through two layers of LSTMs and then goes through another layer to finally generate sigmoid values.
Figure 6.  Sequence modeling using LSTM

Figure 6, sequence modeling using LSTM takes the context vector input and generates speaker labels. The output of MSDD is the probability values of speaker existence at each timestep for two speakers.

The proposed speaker diarization system is designed to support the following features:

  • Flexible number of speakers
  • Overlap-aware diarization
  • Pretrained speaker embedding model

Flexible number of speakers

MSDD employs pairwise inference to diarize conversation with arbitrary numbers of speakers. For example, if there are four speakers, six pairs are extracted, and inference results from MSDD are averaged to obtain results for each of the four speakers.

Overlap-aware diarization

MSDD independently estimates the probability of two speaker labels of two speakers at each step (Figure 6). This enables overlap detection where two speakers are speaking at the same time.

Pretrained speaker embedding model

MSDD is based on the pretrained embedding extractor (TitaNet) model. By using a pretrained speaker model, you can use the neural network weights learned from a relatively large amount of single-speaker speech data.

In addition, MSDD is designed to be optimized with a pretrained speaker to fine-tune the entire speaker diarization system on a domain-specific diarization dataset.

Experimental results and quantitative benefits

The proposed MSDD system has several quantitative benefits: superior temporal resolution and improved accuracy.

Superior temporal resolution

While the single-scale clustering diarizer shows the best performance at a 1.5-second segment length where the unit decision length is 0.75 seconds (half-overlap), the proposed multi-scale approach has a unit decision length of 0.25 seconds. The temporal resolution can be even more enhanced by using a shorter shift length that requires more steps and resources.

Figure 2 shows the concept of the multi-scale approach and the unit decision length of 0.5 seconds. Merely applying 0.5-second segment length to a single-scale diarizer significantly drops the diarization performance due to the degraded fidelity of speaker features.

Improved accuracy

Diarization error rate (DER) is calculated by comparing hypothesis timestamps and ground-truth timestamps. Figure 7 shows the quantified performance of the multi-scale diarization approach over the state-of-the-art and single-scale clustering methods.

Bar plots showing diarization error rate for three different datasets. Left, “Landini et al”, shows 4.4% for CallHome and 2.2 % for AMI-MH-test. Middle, “Single-scale approach” shows 5.3%, 1.5%, 1.8% for CallHome, CH109, AMI-MH-test, respectively. Farthest right, “Multi-scale Approach” shows 4.0%, 0.6%, 1.1%  for CallHome, CH109, AMI-MH-test, respectively.
Figure 7. Quantitative evaluation of the previous state-of-the-art result (Landini et al. 2022), single-scale clustering method (prior work), and multi-scale approach (proposed) on three different datasets

The proposed MSDD approach can reduce DER up to 60% on two-speaker datasets when compared to the single-scale clustering diarizer. 

Conclusion

The proposed system has the following benefits:

  • This is the first neural network architecture that applies a multi-scale weighting concept with sequence model (LSTM) based speaker label estimation.
  • The weighing scheme is integrated in a single inference session and does not require fusion of multiple diarization results as in other speaker diarization systems.
  • The proposed multi-scale diarization system enables overlap-aware diarization which cannot be achieved with traditional clustering-based diarization systems.
  • Because the decoder is based on a clustering-based initialization, the diarization system can deal with a flexible number of speakers. This indicates that you can train the proposed model on two-speaker datasets and then use it for diarizing two or more speakers.
  • While having all previously mentioned benefits, the proposed approach shows a superior diarization performance compared to the previously published results.

There are two future areas of research regarding the proposed system:

  • We plan to implement a streaming version of the proposed system by implementing diarization decoder based on short-term window-based clustering.
  • The end-to-end optimization from speaker embedding extractor to diarization decoder can be investigated to improve the speaker diarization performance.

For more information, see Multiscale Speaker Diarization with Dynamic Scale Weighting or see the Interspeech 2022 session.

Categories
Offsites

Robust Online Allocation with Dual Mirror Descent

The emergence of digital technologies has transformed decision making across commercial sectors such as airlines, online retailing, and internet advertising. Today, real-time decisions need to be repeatedly made in highly uncertain and rapidly changing environments. Moreover, organizations usually have limited resources, which need to be efficiently allocated across decisions. Such problems are referred to as online allocation problems with resource constraints, and applications abound. Some examples include:

  • Bidding with Budget Constraints: Advertisers increasingly purchase ad slots using auction-based marketplaces such as search engines and ad exchanges. A typical advertiser can participate in a large number of auctions in a given month. Because the supply in these marketplaces is uncertain, advertisers set budgets to control their total spend. Therefore, advertisers need to determine how to optimally place bids while limiting total spend and maximizing conversions.
  • Dynamic Ad Allocation: Publishers can monetize their websites by signing deals with advertisers guaranteeing a number of impressions or by auctioning off slots in the open market. To make this choice, publishers need to trade off, in real-time, the short-term revenue from selling slots in the open market and the long-term benefits of delivering good quality spots to reservation ads.
  • Airline Revenue Management: Planes have a limited number of seats that need to be filled up as much as possible before a flight’s departure. But demand for flights changes over time and airlines would like to sell airline tickets to the customers who are willing to pay the most. Thus, airlines have increasingly adopted sophisticated automated systems to manage the pricing and availability of airline tickets.
  • Personalized Retailing with Limited Inventories: Online retailers can use real-time data to personalize their offerings to customers who visit their store. Because product inventory is limited and cannot be easily replenished, retailers need to dynamically decide which products to offer and at what price to maximize their revenue while satisfying their inventory constraints.

The common feature of these problems is the presence of resource constraints (budgets, contractual obligations, seats, or inventory, respectively in the examples above) and the need to make dynamic decisions in environments with uncertainty. Resource constraints are challenging because they link decisions across time — e.g., in the bidding problem, bidding too high early can leave advertisers with no budget, and thus missed opportunities later. Conversely, bidding too conservatively can result in a low number of conversions or clicks.

Two central resource allocation problems faced by advertisers and publishers in internet advertising markets.

In this post, we discuss state-of-the-art algorithms that can help maximize goals in dynamic, resource-constrained environments. In particular, we have recently developed a new class of algorithms for online allocation problems, called dual mirror descent, that are simple, robust, and flexible. Our papers have appeared in Operations Research, ICML’20, and ICML’21, and we have ongoing work to continue progress in this space. Compared to existing approaches, dual mirror descent is faster as it does not require solving auxiliary optimization problems, is more flexible because it can handle many applications across different sectors with minimal modifications, and is more robust as it enjoys remarkable performance under different environments.

Online Allocation Problems
In an online allocation problem, a decision maker has a limited amount of total resources (B) and receives a certain number of requests over time (T). At any point in time (t), the decision maker receives a reward function (ft) and resource consumption function (bt), and takes an action (xt). The reward and resource consumption functions change over time and the objective is to maximize the total reward within the resource constraints. If all the requests were known in advance, then an optimal allocation could be obtained by solving an offline optimization problem for how to maximize the reward function over time within the resource constraints1.

The optimal offline allocation cannot be implemented in practice because it requires knowing future requests. However, this is still useful for framing the goal of online allocation problems: to design an algorithm whose performance is as close to optimal as possible without knowing future requests.

Achieving the Best of Many Worlds with Dual Mirror Descent
A simple, yet powerful idea to handle resource constraints is introducing “prices” for the resources, which enables accounting for the opportunity cost of consuming resources when making decisions. For example, selling a seat on a plane today means it can’t be sold tomorrow. These prices are useful as an internal accounting system of the algorithm. They serve the purpose of coordinating decisions at different moments in time and allow decomposing a complex problem with resource constraints into simpler subproblems: one per time period with no resource constraints. For example, in a bidding problem, the prices capture an advertiser’s opportunity cost of consuming one unit of budget and allow the advertiser to handle each auction as an independent bidding problem.

This reframes the online allocation problem as a problem of pricing resources to enable optimal decision making. The key innovation of our algorithm is using machine learning to predict optimal prices in an online fashion: we choose prices dynamically using mirror descent, a popular optimization algorithm for training machine learning predictive models. Because prices for resources are referred to as “dual variables” in the field of optimization, we call the resulting algorithm dual mirror descent.

The algorithm works sequentially by assuming uniform resource consumption over time is optimal and updating the dual variables after each action. It starts at a moment in time (t) by taking an action (xt) that maximizes the reward minus the opportunity cost of consuming resources (shown in the top gray box below). The action (e.g., how much to bid or which ad to show) is implemented if there are enough resources available. Then, the algorithm computes the error in the resource consumption (gt), which is the difference between uniform consumption over time and the actual resource consumption (below in the third gray box). A new dual variable for the next time period is computed using mirror descent based on the error, which then informs the next action. Mirror descent seeks to make the error as close as possible to zero, improving the accuracy of its estimate of the dual variable, so that resources are consumed uniformly over time. While the assumption of uniform resource consumption may be surprising, it helps avoid missing good opportunities and often aligns with commercial goals so is effective. Mirror descent also allows a variety of update rules; more details are in the paper.

An overview of the dual mirror descent algorithm.

By design, dual mirror descent has a self-correcting feature that prevents depleting resources too early or waiting too long to consume resources and missing good opportunities. When a request consumes more or less resources than the target, the corresponding dual variable is increased or decreased. When resources are then priced higher or lower, future actions are chosen to consume resources more conservatively or aggressively.

This algorithm is easy to implement, fast, and enjoys remarkable performance under different environments. These are some salient features of our algorithm:

  • Existing methods require periodically solving large auxiliary optimization problems using past data. In contrast, this algorithm does not need to solve any auxiliary optimization problem and has a very simple rule to update the dual variables, which, in many cases, can be run in linear time complexity. Thus, it is appealing for many real-time applications that require fast decisions.
  • There are minimal requirements on the structure of the problem. Such flexibility allows dual mirror descent to handle many applications across different sectors with minimal modifications. Moreover, our algorithms are flexible since they accommodate different objectives, constraints, or regularizers. By incorporating regularizers, decision makers can include important objectives beyond economic efficiency, such as fairness.
  • Existing algorithms for online allocation problems are tailored for either adversarial or stochastic input data. Algorithms for adversarial inputs are robust as they make almost no assumptions on the structure of the data but, in turn, obtain performance guarantees that are too pessimistic in practice. On the other hand, algorithms for stochastic inputs enjoy better performance guarantees by exploiting statistical patterns in the data but can perform poorly when the model is misspecified. Dual mirror descent, however, attains performance close to optimal in both stochastic and adversarial input models while being oblivious to the structure of the input model. Compared to existing work on simultaneous approximation algorithms, our method is more general, applies to a wide range of problems, and requires no forecasts. Below is a comparison of our algorithm to other state-of-the-art methods. Results are based on synthetic data for an ad allocation problem.
Performance of dual mirror descent, a training based method, and an adversarial method relative to the optimal offline solution. Lower values indicate performance closer to the optimal offline allocation. Results are generated using synthetic experiments based on public data for an ad allocation problem.

Conclusion
In this post we introduced dual mirror descent, an algorithm for online allocation problems that is simple, robust, and flexible. It is particularly notable that after a long line of work in online allocation algorithms, dual mirror descent provides a way to analyze a wider range of algorithms with superior robustness priorities compared to previous techniques. Dual mirror descent has a wide range of applications across several commercial sectors and has been used over time at Google to help advertisers capture more value through better algorithmic decision making. We are also exploring further work related to mirror descent and its connections to PI controllers.

Acknowledgements
We would like to thank our co-authors Haihao Lu and Balu Sivan, and Kshipra Bhawalkar for their exceptional support and contributions. We would also like to thank our collaborators in the ad quality team and market algorithm research.


1Formalized in the equation below: 

Categories
Misc

Meet the Omnivore: Christopher Scott Constructs Architectural Designs, Virtual Environments With NVIDIA Omniverse

Growing up in a military family, Christopher Scott moved more than 30 times, which instilled in him “the ability to be comfortable with, and even motivated by, new environments,” he said.

The post Meet the Omnivore: Christopher Scott Constructs Architectural Designs, Virtual Environments With NVIDIA Omniverse appeared first on NVIDIA Blog.

Categories
Offsites

PaLI: Scaling Language-Image Learning in 100+ Languages

Advanced language models (e.g., GPT, GLaM, PaLM and T5) have demonstrated diverse capabilities and achieved impressive results across tasks and languages by scaling up their number of parameters. Vision-language (VL) models can benefit from similar scaling to address many tasks, such as image captioning, visual question answering (VQA), object recognition, and in-context optical-character-recognition (OCR). Increasing the success rates for these practical tasks is important for everyday interactions and applications. Furthermore, for a truly universal system, vision-language models should be able to operate in many languages, not just one.

In “PaLI: A Jointly-Scaled Multilingual Language-Image Model”, we introduce a unified language-image model trained to perform many tasks and in over 100 languages. These tasks span vision, language, and multimodal image and language applications, such as visual question answering, image captioning, object detection, image classification, OCR, text reasoning, and others. Furthermore, we use a collection of public images that includes automatically collected annotations in 109 languages, which we call the WebLI dataset. The PaLI model pre-trained on WebLI achieves state-of-the-art performance on challenging image and language benchmarks, such as COCO-Captions, CC3M, nocaps, TextCaps, VQAv2, OK-VQA, TextVQA and others. It also outperforms prior models’ multilingual visual captioning and visual question answering benchmarks.

Overview
One goal of this project is to examine how language and vision models interact at scale and specifically the scalability of language-image models. We explore both per-modality scaling and the resulting cross-modal interactions of scaling. We train our largest model to 17 billion (17B) parameters, where the visual component is scaled up to 4B parameters and the language model to 13B. 

The PaLI model architecture is simple, reusable and scalable. It consists of a Transformer encoder that processes the input text, and an auto-regressive Transformer decoder that generates the output text. To process images, the input to the Transformer encoder also includes “visual words” that represent an image processed by a Vision Transformer (ViT). A key component of the PaLI model is reuse, in which we seed the model with weights from previously-trained uni-modal vision and language models, such as mT5-XXL and large ViTs. This reuse not only enables the transfer of capabilities from uni-modal training, but also saves computational cost.

The PaLI model addresses a wide range of tasks in the language-image, language-only and image-only domain using the same API (e.g., visual-question answering, image captioning, scene-text understanding, etc.). The model is trained to support over 100 languages and tuned to perform multilingually for multiple language-image tasks.

Dataset: Language-Image Understanding in 100+ Languages
Scaling studies for deep learning show that larger models require larger datasets to train effectively. To unlock the potential of language-image pretraining, we construct WebLI, a multilingual language-image dataset built from images and text available on the public web.

WebLI scales up the text language from English-only datasets to 109 languages, which enables us to perform downstream tasks in many languages. The data collection process is similar to that employed by other datasets, e.g. ALIGN and LiT, and enabled us to scale the WebLI dataset to 10 billion images and 12 billion alt-texts.

In addition to annotation with web text, we apply the Cloud Vision API to perform OCR on the images, leading to 29 billion image-OCR pairs. We perform near-deduplication of the images against the train, validation and test splits of 68 common vision and vision-language datasets, to avoid leaking data from downstream evaluation tasks, as is standard in the literature. To further improve the data quality, we score image and alt-text pairs based on their cross-modal similarity, and tune the threshold to keep only 10% of the images, for a total of 1 billion images used for training PaLI.

Sampled images from WebLI associated with multilingual alt-text and OCR. The second image is by jopradier (original), used under the CC BY-NC-SA 2.0 license. Remaining images are also used with permission.
Statistics of recognized languages from alt-text and OCR in WebLI.
Image-text pair counts of WebLI and other large-scale vision-language datasets, CLIP, ALIGN and LiT.

Training Large Language-Image Models
Vision-language tasks require different capabilities and sometimes have diverging goals. Some tasks inherently require localization of objects to solve the task accurately, whereas some other tasks might need a more global view. Similarly, different tasks might require either long or compact answers. To address all of these objectives, we leverage the richness of the WebLI pre-training data and introduce a mixture of pre-training tasks, which prepare the model for a variety of downstream applications. To accomplish the goal of solving a wide variety of tasks, we enable knowledge-sharing between multiple image and language tasks by casting all tasks into a single generalized API (input: image + text; output: text), which is also shared with the pretraining setup. The objectives used for pre-training are cast into the same API as a weighted mixture aimed at both maintaining the ability of the reused model components and training the model to perform new tasks (e.g., split-captioning for image description, OCR prediction for scene-text comprehension, VQG and VQA prediction).

The model is trained in JAX with Flax using the open-sourced T5X and Flaxformer framework. For the visual component, we introduce and train a large ViT architecture, named ViT-e, with 4B parameters using the open-sourced BigVision framework. ViT-e follows the same recipe as the ViT-G architecture (which has 2B parameters). For the language component, we concatenate the dense token embeddings with the patch embeddings produced by the visual component, together as the input to the multimodal encoder-decoder, which is initialized from mT5-XXL. During the training of PaLI, the weights of this visual component are frozen, and only the weights of the multimodal encoder-decoder are updated.

Results
We compare PaLI on common vision-language benchmarks that are varied and challenging. The PaLI model achieves state-of-the-art results on these tasks, even outperforming very large models in the literature. For example, it outperforms the Flamingo model, which is several times larger (80B parameters), on several VQA and image-captioning tasks, and it also sustains performance on challenging language-only and vision-only tasks, which were not the main training objective.

PaLI (17B parameters) outperforms the state-of-the-art approaches (including SimVLM, CoCa, GIT2, Flamingo, BEiT3) on multiple vision-and-language tasks. In this plot we show the absolute score differences compared with the previous best model to highlight the relative improvements of PaLI. Comparison is on the official test splits when available. CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.

<!–

PaLI (17B parameters) outperforms the state-of-the-art approaches (including SimVLM, CoCa, GIT2, Flamingo, BEiT3) on multiple vision-and-language tasks. In this plot we show the absolute score differences compared with the previous best model to highlight the relative improvements of PaLI. Comparison is on the official test splits when available. CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.

–>

Model Scaling Results
We examine how the image and language model components interact with each other with regards to model scaling and where the model yields the most gains. We conclude that scaling both components jointly results in the best performance, and specifically, scaling the visual component, which requires relatively few parameters, is most essential. Scaling is also critical for better performance across multilingual tasks.

Scaling both the language and the visual components of the PaLI model contribute to improved performance. The plot shows the score differences compared to the PaLI-3B model: CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.
Multilingual captioning greatly benefits from scaling the PaLI models. We evaluate PaLI on a 35-language benchmark Crossmodal-3600. Here we present the average score over all 35 languages and the individual score for seven diverse languages.

Model Introspection: Model Fairness, Biases, and Other Potential Issues
To avoid creating or reinforcing unfair bias within large language and image models, important first steps are to (1) be transparent about the data that were used and how the model used those data, and (2) test for model fairness and conduct responsible data analyses. To address (1), our paper includes a data card and model card. To address (2), the paper includes results of demographic analyses of the dataset. We consider this a first step and know that it will be important to continue to measure and mitigate potential biases as we apply our model to new tasks, in alignment with our AI Principles.

Conclusion
We presented PaLI, a scalable multi-modal and multilingual model designed for solving a variety of vision-language tasks. We demonstrate improved performance across visual-, language- and vision-language tasks. Our work illustrates the importance of scale in both the visual and language parts of the model and the interplay between the two. We see that accomplishing vision and language tasks, especially in multiple languages, actually requires large scale models and data, and will potentially benefit from further scaling. We hope this work inspires further research in multi-modal and multilingual models.

Acknowledgements
We thank all the authors who conducted this research Soravit (Beer) Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari,Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut. We also thank Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, Jeremiah Harmsen, Zoubin Ghahramani, Erica Moreira, Victor Gomes, Sarah Laszlo, Kathy Meier-Hellstern, Susanna Ricco, Rich Lee, Austin Tarango, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, and Maysam Moussalem for their suggestions, improvements and support. We thank Tom Small for providing visualizations for the blogpost.

Categories
Misc

Top HPC Sessions at GTC 2022

Learn about new CUDA features, digital twins for weather and climate, quantum circuit simulations, and much more with these GTC 2022 sessions.  

Learn about new CUDA features, digital twins for weather and climate, quantum circuit simulations, and much more with these GTC 2022 sessions.  

Categories
Misc

Deploying AI Models at Scale with an Operating System for Smart Hospitals

There is an abundance of market-approved medical AI software that can be used to improve patient care and hospital operations, but we have not yet seen these…

There is an abundance of market-approved medical AI software that can be used to improve patient care and hospital operations, but we have not yet seen these technologies create the large-scale transformation in healthcare that was expected.

Adopting cutting-edge technologies is not a trivial exercise for healthcare institutions. It requires a balance of legal, clinical, and technical risks against the promise of improved patient outcomes and operational efficiency.

Traditionally, the challenges around the adoption of such technologies fell into one of three buckets: people, platforms, and policy. The challenges around a platform for AI adoption are particularly unique given the nature of deep learning technology and the current state of the medical AI ecosystem.

Most deep learning applications have a narrow field of scope. If they stray beyond their domain, they can exhibit unpredictable and unintuitive behavior. This means that to achieve large-scale transformation in medicine, we need thousands of AI applications.

Each of these AI models in production will be communicating information with live clinical systems and making all kinds of inferences that must then be managed. This has the potential of creating an “AI jungle,” with a huge amount of technical debt in an environment where there hasn’t been substantial investment in people to manage such risks.

Another challenge for deploying AI at scale is the lack of interoperability of AI models. Deployment and data integration don’t scale within or across institutions. This lack of interoperability exists within information systems and semantics and between organizations. The result is a high barrier of entry for data scientists and start-ups who don’t have the capacity or domain knowledge to make an impact.

Finally, in recognition of the immaturity of the medical AI economy relative to other subdomains in MedTech, evidence generation must be at the heart of the design of a platform for AI adoption, as many of the AI applications on the market today still require extensive research and analysis of their performance. This is true not only from the perspective of monitoring but also to measure impact on health outcomes.

AIDE: An enterprise approach to AI deployment in healthcare systems

An AI platform that solves these challenges must solve them at the enterprise level to fully capture the benefits of the ‘virtuous cycle of AI’ and to fully mitigate risks around deployment of artificial intelligence.

This ensures the lowered costs of deployment by plugging into a platform that is already integrated to the clinical information systems within healthcare facilities. It also lowers costs around staff and support by creating an opportunity for a single team to service the entire institution, empowered with an enterprise-wide view for managing risks and continuous improvement.

AIDE, developed by the UK Government-funded AI Centre for Value Based Healthcare, is a new operating system for the hospital that allows healthcare providers to deploy AI models safely, effectively, and efficiently. It provides a harmonized hardware and software layer that facilitates the deployment and use of any AI application.

A visual representation of clinical data being analyzed by AI that outputs clinical recommendations
Figure 1. AIDE can receive a live stream of clinical data, allowing clinicians to access near real-time AI analysis within seconds

There are numerous technical risks involved in deploying a large number of models. AIDE mitigates these by providing an administration view that reports every inference of every deployed model as well as performance trend analysis to enable real-time intervention in the case of poor performance.

AIDE also solves the challenge of interoperability by packaging and deploying containerized applications and communicating with the rest of the hospital through standard protocols such as DICOM, HL7, and FHIR.

Clinicians can also review AI inference results with AIDE before they are sent to the patient’s electronic health record (EHR). This clinical review stage can collect useful data around failure instances, which can be fed back to the developer and close the feedback loop.

An open-source standard for healthcare AI with MONAI Deploy

When considering the wide scale adoption of AI, it is important to first consider, as an analogous example, the discovery of X-rays and the subsequent transformation of healthcare through the development of radiology.

After the discovery by Dr Wilhelm Roentgen in 1895 and the famous X-ray of his wife Bertha’s hand, the first uses of X-ray technology were for industrial applications, such as welding inspection, and consumer applications, such as shoe-fitting, rather than medical applications.

Today, most patient experiences involve medical imaging for diagnosis, prognosis, treatment monitoring and more. Rural medical centers can acquire images in the middle of the night and have them reported within an hour by a specialist in another part of the world.

That kind of transformation was only made possible almost 100 years after the invention of the x-ray when the American College of Radiology and National Electrical Manufacturers Association published a standard for the encoding and transfer of medical images, named “Digital Imaging and Communications in Medicine.”

With the birth of the standard in the early 1990s, a transformational journey had begun that would change what would be possible in the art of medicine, leading to advances in oncology, neurology, and many other medical specialties.

Similarly, with deep learning, industrial and consumer applications have raced ahead while medical applications have had limited adoption and even less transformational impact.

That is why the key innovation in AIDE, as an enterprise AI platform, is that it is built on top of the open-source MONAI Deploy architecture.

MONAI Deploy was built to bridge the gap from research innovation to validation and clinical production environments. It gives developers and researchers a default standard, called MONAI Deploy Application Package (MAP), that easily integrates into health IT standards such as DICOM. It also integrates into deployment options across a variety of data center, cloud, and edge environments, making it easy for you to adopt new medical AI applications.

The MONAI Deploy Working Group has defined an open architecture and standard APIs for developing, packaging, testing, deploying, and running medical AI applications in clinical production.

The high-level architecture includes the following components:

  • MONAI Application Package (MAP): Defines how applications can be packaged and distributed.
  • MONAI Informatics Gateway: Communicatesthe clinical information systems and medical devices, such as MRI scanners, over DICOM, FHIR, and HL7 standards.
  • MONAI Workflow Manager: Orchestrates clinical-inspired workflows, composed of AI tasks.

The system has been designed to allow pluggable execution of tasks by different inference engines. The MONAI community is going to keep moving in that direction as well. 

The MONAI Deploy architecture has been co-designed by an international community of hardware, software, academic, and healthcare partners for the mutual aim of standardizing the medical AI lifecycle. This is much in the same way the ACR and NEMA did with medical images three decades ago.

A new era of data-driven medicine

This new layer of informatics, built on top of existing clinical information systems and medical devices, will help usher in the new era of data-driven medicine. For more information about AIDE, see AI Centre for Value Based Healthcare Platforms.

Categories
Misc

GFN Thursday Delivers Seven New Games This Week

TGIGFNT: thank goodness it’s GFN Thursday. Start your weekend early with seven new games joining the GeForce NOW library of over 1,400 titles. Whether it’s streaming on an older-than-the-dinosaurs PC, a Mac that normally couldn’t dream of playing PC titles, or mobile devices – it’s all possible to play your way thanks to GeForce NOW. Read article >

The post GFN Thursday Delivers Seven New Games This Week appeared first on NVIDIA Blog.

Categories
Misc

Upcoming Event: Deep Learning Framework Sessions at GTC 2022

Join us for these featured GTC 2022 sessions to learn about optimizing PyTorch models, accelerating graph neural networks, improving GPU performance with…

Join us for these featured GTC 2022 sessions to learn about optimizing PyTorch models, accelerating graph neural networks, improving GPU performance with automated code generation, and more.