Categories
Misc

NVIDIA Jetson Orin Nano Sets New Standard for Entry-Level Edge AI and Robotics With 80x Performance Leap

Canon, John Deere, Microsoft Azure, Teradyne, TK Elevator Join Over 1,000 Customers Adopting Jetson Orin Family Within Six Months of Launch SANTA CLARA, Calif., Sept. 20, 2022 (GLOBE NEWSWIRE) …

Categories
Misc

Enhancing AI Transparency and Ethical Considerations with Model Card++

An AI model card is a document that details how machine learning (ML) models work. Model cards provide detailed information about the ML model’s metadata including the datasets that it is based on, performance measures that it was trained on, and the deep learning training methodology itself. This post walks you through the current practice for AI model cards and how NVIDIA is planning to advance them with Model Card++, the enhanced next-generation AI model card. 

In their 2019 paper, Model Cards for Model Reporting, a group of data scientists, including Margaret Mitchell, Timnit Gebru, and Lucy Vasserman, sought to create a documentation standard for AI models. Their primary motivation was to promote transparency and accountability in the AI model development process by disclosing essential information about an AI model.

This information includes who developed the model, intended use cases and out-of-scope applications, expected users, how the model performs with different demographic groups, information about the data used to train and verify the model, limitations, and ethical considerations.

Until the development of the first AI model card, little information was shared about a particular AI model to help determine whether the model was suitable for a particular organization’s purpose. 

This becomes problematic if the output of a model could have an adverse impact on a particular group of people. For example, the 2019 university-led study, Discrimination through Optimization: How Facebook’s Ad Delivery Can Lead to Skewed Outcomes, revealed that algorithms for delivering ads on social media resulted in discriminatory ad delivery despite the use of neutral parameters for targeting the ads.

The adoption of model cards helps the development and improvement of models by allowing developers to compare their results to those of similar models. Model cards highlight performance problems for those who plan to deploy a model. 

At the same time, model cards also educate policymakers who are drafting regulations and legislation governing AI models and systems. Although not required, developing model cards is a best practice that encourages developers to engage with the people who will ultimately be impacted by the model’s output.

The importance of model cards

While model cards are designed to encourage model transparency and trustworthiness, they are also used by stakeholders to improve developer understanding and standardize the decision-making processes.

Model cards are structured and organized in a schema. They concisely report information on different factors like demographics, environmental conditions, quantitative evaluation metrics, and, where provided, ethical considerations. 

Model cards can also record model version, type, date, license restrictions, information about the publishing organization, and other qualitative information. Model cards are designed to educate and allow an informed comparison of measures and benchmarks. Figure 1 shows the NGC Model Card for StyleGAN3.

Figure 1. A StyleGAN3 model card shows information about its architecture, training, and dataset
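
As a rough illustration of what organizing a model card in a schema can look like, here is a minimal Python sketch. The class, the field names, and the placeholder values are assumptions made for this example; they are not the NGC Catalog schema or any published standard.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class ModelCard:
    """Hypothetical model card record; field names are illustrative only."""
    name: str
    version: str
    model_type: str
    date: str
    license: str
    publisher: str
    intended_use: str = ""
    evaluation_metrics: dict = field(default_factory=dict)
    ethical_considerations: str = ""

    def missing_fields(self):
        """Return the names of fields left empty, so gaps are easy to spot."""
        return [key for key, value in asdict(self).items() if value in ("", {}, None)]


# Placeholder values, not taken from any real model card.
card = ModelCard(
    name="example-model",
    version="1.0",
    model_type="example-type",
    date="2022-01-01",
    license="example-license",
    publisher="Example Org",
    evaluation_metrics={"example_metric": 0.9},
)

print(card.missing_fields())
```

A fixed set of fields is what makes cards comparable and searchable: two cards built from the same schema can be filtered on license or compared on the same metric, and an empty field is visible rather than silently absent.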

Model cards are like open-source fact sheets. Unless you developed the model yourself, you would likely know little about it without a model card. Model cards provide the most comprehensive view of a model’s details and of the considerations individuals should take into account when applying it.

For instance, a smartphone might have a face detection system that allows the user to unlock it based on recognition. Without model cards, model developers might not realize how a model will behave until it is deployed. This is what happened when Dr. Joy Buolamwini tried to use a face detection system as part of her graduate work at MIT.

AI model card accessibility

AI model cards should not just be built for developers; companies should also build model cards that are accessible to and readable by nontechnical individuals and technical experts alike. 

Model cards are not restricted to a given industry or domain. They can be used for computer vision, speech, recommender systems, and other AI workflows. In addition to having active use in higher education and research and high performance computing spaces, model cards have utility across multiple industries including automotive, healthcare, and robotics applications. Model cards can: 

  • Teach students and help them understand real-world use cases 
  • Inform policymakers and clarify intended use for non-model developers 
  • Educate those interested in seeking the benefits of AI  

Mobilizing AI through model cards is a decisive and transparent step that companies can take toward the advancement of trustworthy AI.  

Improving and enhancing model cards   

We conducted market research to inform the improvements to existing model cards. While 90% of respondents in the developer sample agree that model cards are important and 70% would recommend them as-is, there is room for improvement to drive their adoption, use, and impact. 

Based on our research, existing model cards should be enhanced in two primary areas: accessibility and content quality. Model card users need model cards to be easily accessible and understandable. 

Accessibility 

Discovery is one element of model card accessibility that needs improvement. In releasing models, AI developers should be able to find and then promote model cards alongside their work. This is true of models introduced in research papers as well as models deployed for commercial use. 

Secondly, model cards need to be located where interested individuals can reference them. One of the ways NVIDIA promotes model cards is through the NGC Catalog. Models and model cards are located side-by-side in this same repository.   

Content quality

After a model card has been located, the next challenge for the user is understanding the information contained in it. This is particularly critical in the model evaluation stage before selection. Not understanding the information contained in the model card leads to the same outcome as not knowing that the information exists; either way, model users cannot make informed decisions. 

To address this, NVIDIA encourages using a consistent organizational structure, simple format, and clear language for model cards. Adding filterable and searchable fields is also recommended. When individuals can find the information contained in model cards, they are more likely to understand the software. According to our research, respondents liked and relied on the information contained in the model card when it was easily accessible and understandable.           

In fact, performance and licensing information were the two most important areas respondents wanted to see in model cards. Figure 2 shows how the StyleGAN3 model card devotes separate sections to performance and licensing.

Figure 2. A StyleGAN3 model card includes sections on performance and licensing

After performance and licensing information, respondents felt that the section on ethical considerations was the most important category of information to include in model selection criteria. Within ethical considerations, respondents shared that they wanted more information about the datasets used to train and validate the model—particularly details regarding unwanted bias—as well as information about safety and security.   

Overview of Model Card++ 

Model Card++ is the improved NGC Catalog Model Card prototype that NVIDIA has developed over the last 9 months. In addition to the typical information given in the Overview section of a model card in the NGC Catalog, Model Card++ incorporates:

  • The Plus Plus Promise (also known as the ++ Promise or the Triple P), describing the NVIDIA software development approach and the standards we hold ourselves to in all model development 
  • Subsections detailing model-specific information concerning bias, explainability, privacy, safety, and security     

Figure 4 shows the ++ Promise, which will be embedded in every Model Card++.

Figure 4. The Model Card++ Promise describes the steps NVIDIA is taking to demonstrate trustworthiness in AI model development

The ++ Promise describes the steps that NVIDIA is taking to demonstrate the trustworthiness of our work embedded in design. The subcards outline: 

  • Steps taken to mitigate unwanted bias
  • Decision logic and example domains
  • Provenance of training datasets and what type of data was collected and how 
  • Development controls used and known restrictions     

This list is not exhaustive, but it demonstrates our intent by design and our commitment to standards and protections that value individuals, the data, and the NVIDIA contribution to AI. This applies to every model, across domains and use cases.

Figure 5 shows an example of the Explainability subcard. Each Model Card++ will include a dedicated section of fields and responses for each of the subsections. What is shown in the response section for each field is not meant to represent a real-world model, but to illustrate what will be provided based on current understanding and the latest research.

Figure 5. An example of the Explainability subcard, one of four subsections Model Card++ incorporates into the NGC model card

The Explainability subcard gives information about example domains for an AI model, intended users, decision logic, and compliance review. NVIDIA model cards aim to present AI models using clear, consistent, and concise language.  

NVIDIA will start rolling out Model Card++ by the end of the year, with all commercial models using it by the end of 2023.   

How we built Model Card++

Model Card++ is the next generation of AI model card. It is the result of a disciplined, cross-functional approach in partnership with engineering, product, research, product security, and legal teams. Building on existing NGC model cards, we reviewed model cards from other organizations and templates, including GitHub, to find out what other information could be provided consistently. 

We worked with engineering to pilot what could be consistently provided in addition to information that is currently provided. We discovered that although our model cards have a section for ethical considerations, there is more that can be provided, like measures we took to mitigate unwanted bias. We also found we could describe dataset provenance and traceability, dataset storage, and quality validation.

We look to provide more details about dataset demographic make-up, performance metrics for different demographic groups, and specific mitigation efforts we have taken to address unwanted bias. We also worked with an algorithmic bias consultant to develop a process for assessing unwanted bias that is compliant with data privacy laws and coupled that with our latest market research.

As we built Model Card++, we also corroborated our work with market research by surveying developers who used our models and those from across the industry. We validated the desired information and structured it with our user design experience team to present it in a clear and organized format. We are excited about introducing Model Card++ to the world and hope to continue leading efforts that encourage inclusive AI for all.  

Get the latest updates about Model Card++ at the September GTC 2022 session, Ingredients of Trust: Moving towards Model Card++.

Categories
Misc

Accelerating NVIDIA HPC Software with SVE on AWS Graviton3

The latest NVIDIA HPC SDK update expands portability and now supports the Arm-based AWS Graviton3 processor. In this post, you learn how to enable Scalable Vector Extension (SVE) auto-vectorization with the NVIDIA compilers to maximize the performance of HPC applications running on the AWS Graviton3 CPU.

NVIDIA HPC SDK

The NVIDIA HPC SDK includes the proven compilers, libraries, and software tools essential to maximizing developer productivity and building HPC applications for GPUs, CPUs, or the cloud.

NVIDIA HPC compilers enable cross-platform C, C++, and Fortran programming for NVIDIA GPUs and multicore Arm, OpenPOWER, or x86-64 CPUs. These are ideal for HPC modeling and simulation applications written in C, C++, or Fortran with OpenMP, OpenACC, and CUDA.

For example, SPEC CPU® 2017 benchmark scores are estimated to increase by up to 17% on AWS Graviton3 when compiled with the NVIDIA HPC compilers instead of GCC 12.1.

                     Speedup (est.)   Ratio (est.)         Seconds (est.)
                                      NVHPC   GCC 12.1     NVHPC   GCC 12.1
64 Copy FPRate       1.04             263     254          501     519
64 Thread FPSpeed    1.17             188     161          73.6    85.9
Table 1. SPEC CPU 2017 estimates
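
The speedup column can be checked directly from the other columns of Table 1. The short Python sketch below recomputes it from the estimated ratios (higher is better) and from the estimated run times (lower is better); the dictionary layout and function names are assumptions made for this example.

```python
# Estimated SPEC CPU 2017 figures copied from Table 1.
results = {
    "64 Copy FPRate": {
        "ratio": {"NVHPC": 263, "GCC 12.1": 254},
        "seconds": {"NVHPC": 501, "GCC 12.1": 519},
    },
    "64 Thread FPSpeed": {
        "ratio": {"NVHPC": 188, "GCC 12.1": 161},
        "seconds": {"NVHPC": 73.6, "GCC 12.1": 85.9},
    },
}


def speedup_from_ratio(entry):
    """Speedup as NVHPC ratio divided by GCC ratio (higher ratio is better)."""
    return entry["ratio"]["NVHPC"] / entry["ratio"]["GCC 12.1"]


def speedup_from_seconds(entry):
    """Speedup as GCC time divided by NVHPC time (lower time is better)."""
    return entry["seconds"]["GCC 12.1"] / entry["seconds"]["NVHPC"]


for name, entry in results.items():
    print(name, round(speedup_from_ratio(entry), 2), round(speedup_from_seconds(entry), 2))
```

Both calculations round to the same values as the table, 1.04 for FPRate and 1.17 for FPSpeed, so the 17% figure corresponds to the 64-thread FPSpeed estimate.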

The compilers are also fully interoperable with the optimized NVIDIA math libraries, communication libraries, and performance tuning and debugging tools. These accelerated math libraries maximize performance on common HPC algorithms, and the optimized communications libraries enable standards-based scalable systems programming.

The integrated performance profiling and debugging tools simplify porting and optimization of HPC applications, and the containerization tools enable easy deployment on-premises or in the cloud.

Arm and AWS Graviton3

AWS Graviton3 launched in May 2022 as the Arm-based CPU from AWS. The Arm architecture has a legacy of power efficiency and support for high memory bandwidth that makes it ideal for cloud and data center computing. Amazon reports:

The Amazon EC2 C7g instances, powered by the latest generation AWS Graviton3 processors, provide the best price performance in Amazon EC2 for compute-intensive workloads. C7g instances are ideal for HPC, batch processing, electronic design automation (EDA), gaming, video encoding, scientific modeling, distributed analytics, CPU-based machine learning (ML) inference, and ad-serving. They offer up to 25% better performance over the sixth generation AWS Graviton2-based C6g instances.

Compared to AWS Graviton2, ANSYS benchmarked 35% better performance on AWS Graviton3. Formula 1 simulations are also 40% faster. Arm-based CPUs have been delivering significant innovations and performance enhancements since the launch of the Arm Neoverse product line, when the Neoverse N1 core exceeded performance expectations by 30%.

In keeping with the history of Arm enabling support for new computing technologies well ahead of the competition, AWS Graviton3 features DDR5 memory and SVE, the Scalable Vector Extension to the Arm architecture.

Amazon EC2 C7g instances are the first in the cloud to feature DDR5 memory, which provides 50% higher memory bandwidth compared to DDR4 memory to enable high-speed access to data in memory. The best way to take full advantage of all that memory bandwidth is to use the latest in vectorization technologies: Arm SVE.

SVE architecture

In addition to being the first cloud-hosted CPU to offer DDR5, AWS Graviton3 is also the first in the cloud to feature SVE.

SVE was first introduced in the Fujitsu A64FX CPU, which powers the RIKEN Fugaku supercomputer. When Fugaku launched, it shattered all contemporary HPC CPU benchmarks and placed confidently at the top of the TOP500 supercomputers list for two years.

SVE and high-bandwidth memory are the key design features of the A64FX that make it ideal for HPC, and both these features are present in the AWS Graviton3 processor.

SVE is a next-generation SIMD extension to the Arm architecture. It enables flexible vector length implementations with a range of possible values in CPU implementations. The vector length can vary from a minimum of 128 bits to a maximum of 2,048 bits, in 128-bit increments.

For example, the Fujitsu A64FX implements SVE at 512 bits, while AWS Graviton3 implements it at 256 bits. Unlike other SIMD architectures, the same assembly code runs on both CPUs, even though the hardware vector bit-width is different. This is called vector-length agnostic (VLA) programming.
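
To make the VLA idea concrete, here is a small Python simulation, not real SVE code. It assumes 64-bit elements and models each loop iteration as one vector operation guarded by a per-lane predicate; the function names are made up for this sketch. The same loop body produces the same result whether the simulated hardware vector is 256 bits (as on Graviton3) or 512 bits (as on A64FX).

```python
def sve_lanes(vector_bits, element_bits=64):
    """Number of elements that fit in one SVE vector of the given width."""
    if not 128 <= vector_bits <= 2048 or vector_bits % 128:
        raise ValueError("SVE vector length must be a multiple of 128 bits, from 128 to 2048")
    return vector_bits // element_bits


def vla_add(a, b, vector_bits):
    """Add two equal-length lists one simulated vector at a time, for any legal width."""
    lanes = sve_lanes(vector_bits)
    out = [0.0] * len(a)
    for start in range(0, len(a), lanes):
        # Predicate: True for lanes that still fall inside the arrays.
        # This handles the loop tail without a separate scalar cleanup loop.
        active = [start + lane < len(a) for lane in range(lanes)]
        for lane, on in enumerate(active):
            if on:
                out[start + lane] = a[start + lane] + b[start + lane]
    return out


a = [float(i) for i in range(10)]
b = [1.0] * 10

print(sve_lanes(256), sve_lanes(512))          # lanes per vector for 64-bit elements
print(vla_add(a, b, 256) == vla_add(a, b, 512))
```

The only thing that changes with the vector width is how many iterations the loop takes; the code itself never hard-codes the width, which is the property that lets one binary run on both CPUs.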

VLA code is highly portable and can enable compilers to generate better assembly code. But, if a compiler knows the target CPU’s hardware vector bit-width, it can enable further optimizations for that specific architecture. This is vector length–specific (VLS) programming.

SVE uses the same assembly language for both VLA and VLS. The only difference is that the compiler is free to make additional assertions about data layout, loop trip counts, and other relevant features while generating the code. This results in highly optimized, target-specific code that takes full advantage of the CPU.

SVE also introduces a powerful range of advanced features ideal for HPC and ML applications:

  • Gather-load and scatter-store instructions allow operations on arrays-of-structures and other noncontiguous data to vectorize.
  • Speculative vectorization enables the SIMD acceleration of string manipulation functions and loops that contain control flow.
  • Horizontal and serialized vector operations facilitate data reductions and help optimize loops processing large datasets.

SVE is neither an extension of nor a replacement for the NEON instruction set, which is also available in AWS Graviton3. SVE is redesigned for better data parallelism for HPC and ML.

Maximizing Graviton3 performance with NVIDIA HPC compilers

Compiler auto-vectorization is one of the easiest ways to take advantage of SVE, and the NVIDIA HPC compilers add support for SVE auto-vectorization in the 22.7 release.

To maximize performance, the compiler performs analysis to determine which SIMD instructions to generate. SVE auto-vectorization uses target-specific information to generate highly optimized vector length–specific (VLS) code based on the vector bit-width of the CPU core.

To enable SVE auto-vectorization, specify the appropriate -tp architecture flag for the target CPU: -tp=neoverse-v1. Not specifying a -tp option assumes that the application will be executed on the same system on which it was compiled.

Applications compiled with the NVIDIA HPC compilers on Graviton3 automatically take full advantage of the CPU’s 256-bit SVE SIMD units. Graviton3 is also backward compatible with the -tp=neoverse-n1 option but only runs vector code on its 128-bit NEON SIMD units.
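
The target selection described above can be summarized in a short Python helper that assembles a compiler command line. This is only a sketch: the helper name, the source file name `saxpy.c`, and the choice of `-O3` are assumptions for illustration, and the mapping records just the two targets discussed in this post.

```python
# Targets named in this post and the SIMD units they use on Graviton3.
TP_TARGETS = {
    "neoverse-v1": "256-bit SVE",
    "neoverse-n1": "128-bit NEON",
}


def nvc_command(source, target=None, output="a.out"):
    """Build an nvc command line; leaving target unset omits -tp,
    so the compiler assumes the build host is also the run host."""
    cmd = ["nvc", "-O3"]
    if target is not None:
        if target not in TP_TARGETS:
            raise ValueError(f"target not covered by this sketch: {target}")
        cmd.append(f"-tp={target}")
    cmd += ["-o", output, source]
    return cmd


# saxpy.c is a hypothetical source file name.
print(" ".join(nvc_command("saxpy.c", "neoverse-v1")))
print(TP_TARGETS["neoverse-v1"])
```

Passing the printed command to a shell on a system with the HPC SDK installed would request SVE code generation for the Neoverse V1 core, while `neoverse-n1` would restrict vector code to NEON.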

Getting started with the NVIDIA HPC SDK

The NVIDIA HPC SDK provides a comprehensive and proven software stack. It enables HPC developers to create and optimize application performance on high-performance systems such as the NVIDIA platform and AWS Graviton3.

By providing a wide range of programming models, libraries, and development tools, the SDK lets developers efficiently build applications for the specialized hardware that enables state-of-the-art performance in systems such as NVIDIA GPUs and SVE-enabled processors like AWS Graviton3.

Categories
Offsites

Google at Interspeech 2022

This week, the 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH 2022) is being held in Incheon, South Korea, representing one of the world’s most extensive conferences on research and technology of spoken language understanding and processing. Over 2,000 experts in speech-related research fields gather to take part in oral presentations and poster sessions and to collaborate with streamed events across the globe.

We are excited to be a Diamond Sponsor of INTERSPEECH 2022, where we will be showcasing nearly 50 research publications and supporting a number of workshops, special sessions and tutorials. We welcome in-person attendees to drop by the Google booth to meet our researchers and participate in Q&As and demonstrations of some of our latest speech technologies, which help to improve accessibility and provide convenience in communication for billions of users. In addition, online attendees are encouraged to visit our virtual booth in GatherTown where you can get up-to-date information on research and opportunities at Google. You can also learn more about the Google research being presented at INTERSPEECH 2022 below.

Organizing Committee

Industry Liaisons include: Bhuvana Ramabhadran

Area Chairs include: John Hershey, Heiga Zen, Shrikanth Narayanan, Bastiaan Kleijn

ISCA Fellows

Include: Tara Sainath, Heiga Zen

Publications

Production Federated Keyword Spotting via Distillation, Filtering, and Joint Federated-Centralized Training
Andrew Hard, Kurt Partridge, Neng Chen, Sean Augenstein, Aishanee Shah, Hyun Jin Park, Alex Park, Sara Ng, Jessica Nguyen, Ignacio Lopez Moreno, Rajiv Mathews, Françoise Beaufays

Leveraging Unsupervised and Weakly-Supervised Data to Improve Direct Speech-to-Speech Translation
Ye Jia, Yifan Ding, Ankur Bapna, Colin Cherry, Yu Zhang, Alexis Conneau, Nobu Morioka

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition
W. Ronny Huang, Cal Peyser, Tara N. Sainath, Ruoming Pang, Trevor Strohman, Shankar Kumar

UserLibri: A Dataset for ASR Personalization Using Only Text
Theresa Breiner, Swaroop Ramaswamy, Ehsan Variani, Shefali Garg, Rajiv Mathews, Khe Chai Sim, Kilol Gupta, Mingqing Chen, Lara McConnaughey

SNRi Target Training for Joint Speech Enhancement and Recognition
Yuma Koizumi, Shigeki Karita, Arun Narayanan, Sankaran Panchapagesan, Michiel Bacchiani

Turn-Taking Prediction for Natural Conversational Speech
Shuo-Yiin Chang, Bo Li, Tara Sainath, Chao Zhang, Trevor Strohman, Qiao Liang, Yanzhang He

Streaming Intended Query Detection Using E2E Modeling for Continued Conversation
Shuo-Yiin Chang, Guru Prakash, Zelin Wu, Tara Sainath, Bo Li, Qiao Liang, Adam Stambler, Shyam Upadhyay, Manaal Faruqui, Trevor Strohman

Improving Distortion Robustness of Self-Supervised Speech Processing Tasks with Domain Adaptation
Kuan Po Huang, Yu-Kuan Fu, Yu Zhang, Hung-yi Lee

XLS-R: Self-Supervised Cross-Lingual Speech Representation Learning at Scale
Arun Babu, Changhan Wang, Andros Tjandra, Kushal Lakhotia, Qiantong Xu, Naman Goyal, Kritika Singh, Patrick von Platen, Yatharth Saraf, Juan Pino, Alexei Baevski, Alexis Conneau, Michael Auli

Extracting Targeted Training Data from ASR Models, and How to Mitigate It
Ehsan Amid, Om Thakkar, Arun Narayanan, Rajiv Mathews, Françoise Beaufays

Detecting Unintended Memorization in Language-Model-Fused ASR
W. Ronny Huang, Steve Chien, Om Thakkar, Rajiv Mathews

AVATAR: Unconstrained Audiovisual Speech Recognition
Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

End-to-End Multi-talker Audio-Visual ASR Using an Active Speaker Attention Module
Richard Rose, Olivier Siohan

Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-person Video
Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Unsupervised Data Selection via Discrete Speech Representation for ASR
Zhiyun Lu, Yongqiang Wang, Yu Zhang, Wei Han, Zhehuai Chen, Parisa Haghani

Non-parallel Voice Conversion for ASR Augmentation
Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran, Fadi Biadsy, Jesse Emond, Yinghui Huang, Pedro J. Moreno

Ultra-Low-Bitrate Speech Coding with Pre-trained Transformers
Ali Siahkoohi, Michael Chinen, Tom Denton, W. Bastiaan Kleijn, Jan Skoglund

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification
Chao Zhang, Bo Li, Tara Sainath, Trevor Strohman, Sepand Mavandadi, Shuo-Yiin Chang, Parisa Haghani

Improving Deliberation by Text-Only and Semi-supervised Training
Ke Hu, Tara N. Sainath, Yanzhang He, Rohit Prabhavalkar, Trevor Strohman, Sepand Mavandadi, Weiran Wang

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR
W. Ronny Huang, Shuo-Yiin Chang, David Rybach, Rohit Prabhavalkar, Tara N. Sainath, Cyril Allauzen, Cal Peyser, Zhiyun Lu

CycleGAN-Based Unpaired Speech Dereverberation
Hannah Muckenhirn, Aleksandr Safin, Hakan Erdogan, Felix de Chaumont Quitry, Marco Tagliasacchi, Scott Wisdom, John R. Hershey

TRILLsson: Distilled Universal Paralinguistic Speech Representations (see blog post)
Joel Shor, Subhashini Venugopalan

Learning Neural Audio Features Without Supervision
Sarthak Yadav, Neil Zeghidour

SpeechPainter: Text-Conditioned Speech Inpainting
Zalan Borsos, Matthew Sharifi, Marco Tagliasacchi

SpecGrad: Diffusion Probabilistic Model-Based Neural Vocoder with Adaptive Noise Spectral Shaping
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani

Distance-Based Sound Separation
Katharine Patterson, Kevin Wilson, Scott Wisdom, John R. Hershey

Analysis of Self-Attention Head Diversity for Conformer-Based Automatic Speech Recognition
Kartik Audhkhasi, Yinghui Huang, Bhuvana Ramabhadran, Pedro J. Moreno

Improving Rare Word Recognition with LM-Aware MWER Training
Weiran Wang, Tongzhou Chen, Tara Sainath, Ehsan Variani, Rohit Prabhavalkar, W. Ronny Huang, Bhuvana Ramabhadran, Neeraj Gaur, Sepand Mavandadi, Cal Peyser, Trevor Strohman, Yanzhang He, David Rybach

MAESTRO: Matched Speech Text Representations Through Modality Matching
Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro J. Moreno, Ankur Bapna, Heiga Zen

Pseudo Label is Better Than Human Label
Dongseong Hwang, Khe Chai Sim, Zhouyuan Huo, Trevor Strohman

On the Optimal Interpolation Weights for Hybrid Autoregressive Transducer Model
Ehsan Variani, Michael Riley, David Rybach, Cyril Allauzen, Tongzhou Chen, Bhuvana Ramabhadran

Streaming Align-Refine for Non-autoregressive Deliberation
Weiran Wang, Ke Hu, Tara Sainath

Federated Pruning: Improving Neural Network Efficiency with Federated Learning
Rongmei Lin*, Yonghui Xiao, Tien-Ju Yang, Ding Zhao, Li Xiong, Giovanni Motta, Françoise Beaufays

A Unified Cascaded Encoder ASR Model for Dynamic Model Sizes
Shaojin Ding, Weiran Wang, Ding Zhao, Tara N. Sainath, Yanzhang He, Robert David, Rami Botros, Xin Wang, Rina Panigrahy, Qiao Liang, Dongseong Hwang, Ian McGraw, Rohit Prabhavalkar, Trevor Strohman

4-Bit Conformer with Native Quantization Aware Training for Speech Recognition
Shaojin Ding, Phoenix Meadowlark, Yanzhang He, Lukasz Lew, Shivani Agrawal, Oleg Rybakov

Visually-Aware Acoustic Event Detection Using Heterogeneous Graphs
Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha

A Conformer-Based Waveform-Domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy
Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein

Reducing Domain Mismatch in Self-Supervised Speech Pre-training
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Yu Zhang, Nicolás Serrano

On-the-Fly ASR Corrections with Audio Exemplars
Golan Pundak, Tsendsuren Munkhdalai, Khe Chai Sim

A Language Agnostic Multilingual Streaming On-Device ASR System
Bo Li, Tara Sainath, Ruoming Pang*, Shuo-Yiin Chang, Qiumin Xu, Trevor Strohman, Vince Chen, Qiao Liang, Heguang Liu, Yanzhang He, Parisa Haghani, Sameer Bidichandani

XTREME-S: Evaluating Cross-Lingual Speech Representations
Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson

Towards Disentangled Speech Representations
Cal Peyser, W. Ronny Huang, Andrew Rosenberg, Tara Sainath, Michael Picheny, Kyunghyun Cho

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition
Shaojin Ding, Rajeev Rikhye, Qiao Liang, Yanzhang He, Quan Wang, Arun Narayanan, Tom O’Malley, Ian McGraw

A Universally-Deployable ASR Frontend for Joint Acoustic Echo Cancellation, Speech Enhancement, and Voice Separation
Tom O’Malley, Arun Narayanan, Quan Wang

Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks
Lev Finkelstein, Heiga Zen, Norman Casagrande, Chun-an Chan, Ye Jia, Tom Kenter, Alex Petelin, Jonathan Shen*, Vincent Wan, Yu Zhang, Yonghui Wu, Robert Clark

A Scalable Model Specialization Framework for Training and Inference Using Submodels and Its Application to Speech Model Personalization
Fadi Biadsy, Youzheng Chen, Xia Zhang, Oleg Rybakov, Andrew Rosenberg, Pedro Moreno

Text-Driven Separation of Arbitrary Sounds
Kevin Kilgour, Beat Gfeller, Qingqing Huang, Aren Jansen, Scott Wisdom, Marco Tagliasacchi

Workshops, Tutorials & Special Sessions

The VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22)
Organizers include: Arsha Nagrani

Self-Supervised Representation Learning for Speech Processing
Organizers include: Tara Sainath

Learning from Weak Labels
Organizers include: Ankit Shah

RNN Transducers for Named Entity Recognition with Constraints on Alignment for Understanding Medical Conversations
Authors: Hagen Soltau, Izhak Shafran, Mingqiu Wang, Laurent El Shafey

Listening with Googlears: Low-Latency Neural Multiframe Beamforming and Equalization for Hearing Aids
Authors: Samuel Yang, Scott Wisdom, Chet Gnegy, Richard F. Lyon, Sagar Savla

Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset
Authors: Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines

Incremental Layer-Wise Self-Supervised Learning for Efficient Unsupervised Speech Domain Adaptation On Device
Authors: Zhouyuan Huo, Dongseong Hwang, Khe Chai Sim, Shefali Garg, Ananya Misra, Nikhil Siddhartha, Trevor Strohman, Françoise Beaufays

Trustworthy Speech Processing
Organizers include: Shrikanth Narayanan



*Work done while at Google.  

Categories
Misc

Text Normalization and Inverse Text Normalization with NVIDIA NeMo

Text normalization (TN) converts text from written form into its verbalized form, and it is an essential preprocessing step before text-to-speech (TTS). TN ensures that TTS can handle all input texts without skipping unknown symbols. For example, “$123” is converted to “one hundred and twenty-three dollars.”

To learn more about the technology behind this post, tune into the author’s presentation at INTERSPEECH 2022 on Monday, September 19, 11:00–13:00 (KST), Virtual Poster: Speech Synthesis: Linguistic processing, paradigms, and other topics II and at 21:00–23:00 (KST), Other Topics in Speech Recognition.

Inverse text normalization (ITN) is a part of the automatic speech recognition (ASR) post-processing pipeline. ITN converts ASR model output into its written form to improve text readability. For example, the ITN module replaces “one hundred and twenty-three dollars” transcribed by an ASR model with “$123.”

ITN not only improves readability but also boosts the performance of downstream tasks such as neural machine translation or named entity recognition, as such tasks use written text during training.

ITN is the post-processing step after ASR, while TN is the preprocessing step before TTS. ITN converts ASR output “on may third we paid one hundred and twenty three dollars” to a written form “on may 3 we paid $123,” while TN reverses the process and outputs the original spoken text.
Figure 1. TN and ITN in the conversational AI pipeline

TN and ITN tasks face several challenges:

  • Labeled data is scarce and difficult to collect.
  • There is a low tolerance for unrecoverable errors, as TN and ITN errors cascade down to subsequent models. TN and ITN errors that alter the input semantics are called unrecoverable.

TN and ITN systems support a wide variety of semiotic classes, that is, words or tokens where the spoken form differs from the written form, requiring normalization. Examples are dates, decimals, cardinals, measures, and so on.

Many state-of-the-art TN systems in production are still rule-based using weighted finite state transducers (WFST). WFSTs are a form of finite-state machines used to graph relations between regular languages (or regular expressions). For this post, they can be defined by two major properties:

  • Mappings between accepted input and output expressions for text substitution
  • Path weighting to direct graph traversal

In case of ambiguity, the path with the smallest sum of weights is chosen. In Figure 2, “twenty-three” is transduced to “23” instead of “20 3.”

In the diagram, the shortest path is selected to output “23” instead of “20 3.”
Figure 2. WFST lattice for input “twenty-three”
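
The path selection described above can be illustrated with a minimal sketch in plain Python. It does not use Pynini or OpenFst; the lattice states, arcs, and weights are hypothetical and chosen only so that the single-arc path for “23” is cheaper than the two-cardinal path for “20 3.”

```python
import heapq

# Toy lattice for the input "twenty-three". Each arc is
# (next_state, output_string, weight). States and weights are hypothetical.
LATTICE = {
    0: [(1, "20", 1.0), (3, "23", 1.5)],  # "twenty" alone, or the full "twenty-three"
    1: [(2, " ", 0.5)],                   # separator between two cardinals
    2: [(3, "3", 1.0)],                   # "three"
    3: [],                                # final state
}
FINAL_STATE = 3


def shortest_path(lattice, start, final):
    """Return (total_weight, output) for the path with the smallest sum of weights."""
    queue = [(0.0, start, "")]
    visited = set()
    while queue:
        weight, state, output = heapq.heappop(queue)
        if state == final:
            return weight, output
        if state in visited:
            continue
        visited.add(state)
        for next_state, out, arc_weight in lattice[state]:
            heapq.heappush(queue, (weight + arc_weight, next_state, output + out))
    raise ValueError("no path to the final state")


best_weight, best_output = shortest_path(LATTICE, 0, FINAL_STATE)
print(best_output, best_weight)
```

With these assumed weights, the path that emits “20 3” costs 2.5 in total, so the single-arc path is preferred.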

Currently, NVIDIA NeMo offers the following options for TN and ITN systems:

  1. Context-independent WFST-based TN and ITN grammars
  2. Context-aware WFST-based grammars + neural LM for TN
  3. Audio-based TN for speech datasets creation
  4. Neural TN and ITN

WFST-based grammar (systems 1, 2, and 3)

The NeMo Text Processing package is a Python framework that relies on the Python package Pynini to write and compile normalization grammars. For more information about the latest supported languages, see Language Support Matrix. For more information about how to extend or add your language grammar, see Grammar customization.

Pynini is a toolkit built on top of OpenFst, and it supports the export of the grammars into an OpenFST Archive File (FAR) (Figure 3). The FAR file can be used in a C++ production framework, which is based on Sparrowhawk.

NeMo TN and ITN uses WFST grammars based on Pynini for development, then exports them in .FAR files, and deploys them in the Sparrowhawk (C++) framework.
Figure 3. Schematic diagram of NeMo inverse text normalization development and deployment

Our initial version of the TN/ITN system, system #1, does not take context into account. Context-dependent rules would be significantly more complex, require extensive linguistic knowledge, and increase latency. If an input is ambiguous, for example, “1/4” in “The train leaves on 1/4” compared to “1/4 of a cup,” system #1 chooses the normalization deterministically without considering the context.

System #2 extends system #1 and incorporates context during normalization. In case of contextual ambiguity, it outputs multiple normalization options, which are rescored by a pretrained language model using Masked Language Model Scoring (Figure 4).

Given input “The train leaves on 1/4”, WFST grammars generate all possible normalization options, “The train leaves on one quarter,” “The train leaves on January fourth,” “The train leaves on one/four,” and “The train leaves on one divided by four.” Then, options with weights higher than the threshold values are disregarded. Here, the option “The train leaves on one/four” is dropped.  Finally, the LM scores the remaining options and selects “The train leaves on January fourth” as the best matching one.
Figure 4. WFST+LM shallow fusion pipeline
  1. WFST generates all possible normalization forms and assigns weights to each option.
  2. Normalization options with weights higher than the threshold value “401.2” are pruned. In this example, “one/four” is dropped. It has a higher weight because it was not fully normalized.
  3. LM rescoring picks the best among the remaining options.

This approach is similar to shallow fusion for ASR and combines the benefits of the rule-based and neural system. The WFST still limits unrecoverable errors while the neural language model resolves contextual ambiguity without the need for extensive rules or hard-to-get data. For more information, see Text normalization.
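
The three steps of the WFST+LM pipeline can be sketched as follows. This is an illustration only: the WFST weights are made up (only the threshold 401.2 comes from the example above), and `toy_lm_score` is a hypothetical stand-in for scoring with a pretrained masked language model.

```python
# Step 1 (assumed already done): hypothetical WFST output as
# (normalization option, WFST weight) pairs. The weights are invented.
WFST_OPTIONS = [
    ("The train leaves on one quarter", 101.0),
    ("The train leaves on January fourth", 101.0),
    ("The train leaves on one/four", 402.0),
    ("The train leaves on one divided by four", 101.5),
]
WEIGHT_THRESHOLD = 401.2


def prune(options, threshold):
    """Step 2: drop options whose WFST weight is higher than the threshold."""
    return [text for text, weight in options if weight <= threshold]


def toy_lm_score(sentence):
    """Stand-in for LM scoring (higher is better). A real system would query
    a pretrained language model instead of this hand-written table."""
    words = sentence.split()
    word_after_on = words[words.index("on") + 1]
    return {"January": 2.0, "one": 0.5}.get(word_after_on, 0.0)


def rescore(candidates, lm_score):
    """Step 3: pick the candidate that the language model scores highest."""
    return max(candidates, key=lm_score)


kept = prune(WFST_OPTIONS, WEIGHT_THRESHOLD)
best = rescore(kept, toy_lm_score)
print(best)
```

The split into `prune` and `rescore` mirrors the division of labor in the text: the grammar bounds the set of possible outputs, and the language model only chooses among them.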

Dataset | Number of sentences | Det WFST | Duplex | WFST + LM
EngConf | 231 | 68.83 | 55.41 | 94.37
GoogleTN | 7551 | 97.29 | 99.07 | 97.79
LibriTTS | 7677 | 98.65 | 90.40 | 99.01
Table 1. Sentence accuracies of the Det WFST, Duplex, and WFST + LM systems on three datasets

Table 1 compares the WFST+LM approach in terms of sentence accuracy with the previous system #1 (DetWFST) and a purely neural-based system (Duplex) on three datasets. Later in this post, we provide more details on system #4.

Overall, the WFST+LM model is most effective, particularly on EngConf, a self-collected dataset with ambiguous examples.

Figure 5 shows how susceptible the three methods are to errors. While the neural method is most affected by unrecoverable errors, such as hallucinations or omissions, WFST+LM is least affected by those and class ambiguity.

The following Duplex, Det WFST, and WFST+LM error patterns are showcased: “Number error” (Duplex is affected and input “10001” got altered to “one hundred”, the rest of the models are not affected), “Unknown format” (all models are affected), “Hallucination” (Duplex changes “Mrs.” to “m r e”, the rest of the models are not affected), “Omission” (given input “10 1”, Duplex returns “one one”, i.e. omits “zero”, the rest of the models are not affected), “Class ambiguity” (DetWFST produces a wrong form “leaves on one quarter” for input “leaves on 1/4”, the rest of the models are less affected by such error), “Smart URL splitting” (DetWFST produces “w e A r e s c dot com” for input “WeAreSC.com”, the rest of the models are less affected by such error).
Figure 5. Error patterns for context-free WFST, Duplex, and WFST+LM systems

Audio-based TN (system 3)

Text normalization also comes in handy during the creation of new speech datasets. For instance, “six two seven” and “six twenty-seven” are both valid normalization options of “627”. However, you must select the option that best reflects what is actually said in the corresponding audio. Audio-based text normalization provides such functionality (Figure 6).

Given input “627”, audio-based TN outputs all possible normalization options, for example, “six hundred twenty seven,” “six twenty seven,” “six two seven,” and so on. Then, the character error rate (CER) is calculated to compare the ASR transcript of the corresponding audio with each normalized option. The option with the lowest CER is selected as the final output.
Figure 6. Example of audio-based normalization resolution
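
The selection step can be sketched with a character-level edit distance. This is a minimal illustration, not the NeMo implementation: the ASR transcript is a made-up example containing one recognition error, and treating the normalization option as the CER reference is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed character by character."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of the hypothesis against the reference."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)


def select_normalization(options, asr_transcript):
    """Return the option with the lowest CER against the ASR transcript."""
    return min(options, key=lambda option: cer(asr_transcript, option))


OPTIONS = ["six hundred twenty seven", "six twenty seven", "six two seven"]
ASR_TRANSCRIPT = "six twenty sevan"  # hypothetical transcript with one ASR error

selected = select_normalization(OPTIONS, ASR_TRANSCRIPT)
print(selected)
```

Because the comparison is done at the character level, a small recognition error in the transcript does not prevent the closest option from being chosen.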

Neural TN and ITN model (system 4)

One significant advantage of neural systems compared to rule-based systems is that they are easy to scale if training data for a new language exists. Rule-based systems require much effort to create and may run slowly on some inputs due to combinatorial explosion.

As an alternative to the WFST solution, NeMo hosts a seq2seq Duplex model for TN/ITN and a tagger-based neural model for ITN.

Duplex TN and ITN

Duplex TN and ITN is a neural-based system that can do both TN and ITN. At a high level, the system consists of two components:

  • DuplexTaggerModel: A transformer-based tagger for identifying semiotic spans in the input (for example, spans about times, dates, or monetary amounts).
  • DuplexDecoderModel: A transformer-based seq2seq model for decoding the semiotic spans into their appropriate forms (for example, spoken forms for TN and written forms for ITN).

The term duplex refers to the fact that this system can be trained to do both TN and ITN. However, you can also specifically train the system for only one of the tasks.

Thutmose tagger

The Duplex model is a sequence-to-sequence model. Unfortunately, such neural models are prone to hallucinations that could lead to unrecoverable errors.

The Thutmose Tagger model regards ITN as a tagging task and mitigates hallucination issues (Figures 7 and 8). Thutmose is a single-pass token classifier model that assigns a replacement fragment to every input token or marks it for deletion or copying without changes.
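
To make the tagging view concrete, here is a minimal sketch of how a per-token tag sequence could be turned into written text. The tag names `<SELF>` and `<DELETE>`, the hand-written tags, and the rule that glues neighboring replacement fragments together are assumptions for illustration; in the real system the tags are predicted by the token classifier.

```python
SELF = "<SELF>"      # copy the input token unchanged (assumed tag name)
DELETE = "<DELETE>"  # drop the input token (assumed tag name)


def apply_tags(tokens, tags):
    """Build the written form from exactly one tag per input token.

    Any tag other than SELF or DELETE is a replacement fragment. Replacement
    fragments separated only by deleted tokens are glued together without a
    space, which is a simplification made for this sketch.
    """
    if len(tokens) != len(tags):
        raise ValueError("expected exactly one tag per token")
    pieces = []
    previous_was_replacement = False
    for token, tag in zip(tokens, tags):
        if tag == DELETE:
            continue  # deleted tokens do not interrupt a run of fragments
        if tag == SELF:
            pieces.append(token)
            previous_was_replacement = False
        elif previous_was_replacement:
            pieces[-1] += tag
        else:
            pieces.append(tag)
            previous_was_replacement = True
    return " ".join(pieces)


tokens = "on may third we paid one hundred and twenty three dollars".split()
tags = [SELF, SELF, "3", SELF, SELF, "$1", DELETE, DELETE, "2", "3", DELETE]

written = apply_tags(tokens, tags)
print(written)
```

Because every output fragment is tied to one input token, the model cannot emit text that has no counterpart in the input, which is why this formulation limits hallucinations.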

NeMo provides a method of dataset preparation, based on granular alignment of ITN examples. The model is trained on the Google Text Normalization dataset and achieves state-of-the-art sentence accuracy on both English and Russian test sets.

Tables 2 and 3 summarize evaluation results for two metrics:

  • Sentence accuracy: An automatic metric that matches each prediction with multiple possible variants of the reference. All errors are divided into two groups: digit error and other error. Digit error occurs when at least one digit differs from the closest reference variant. Other error means a non-digit error is present in the prediction, for example, a punctuation or letter mismatch.
  • Word error rate (WER): An automatic metric commonly used in ASR.
Table 2. Performance metrics (percentage) on English
Test set | Metric | Duplex model | Thutmose (BERT) | Thutmose (d-BERT)
Default | Sent. acc. | 97.31 | 97.43 | 97.36
Default | Digit error | 0.35 | 0.31 | 0.38
Default | Other error | 2.34 | 2.26 | 2.26
Default | WER | 2.9 | 3.7 | 3.74
Hard | Sent. acc. | 85.34 | 85.17 | 84.71
Hard | Digit error | 3.12 | 3.13 | 3.06
Hard | Other error | 11.54 | 11.70 | 12.23
Hard | WER | 9.34 | 9.02 | 9.10

d-BERT stands for distilBERT.
Default is the default Google Text Normalization test set.
Hard is a test set with sampling of at least 1,000 examples for each semiotic class.

Table 3. Performance metrics (percentage) on Russian
Test set | Metric | Duplex model | Thutmose (BERT) | Thutmose (d-BERT)
Default | Sent. acc. | 92.34 | 93.45 | 92.72
Default | Digit error | 0.51 | 0.43 | 0.52
Default | Other error | 7.15 | 6.11 | 6.75
Default | WER | 3.63 | 2.94 | 3.67
Hard | Sent. acc. | 81.02 | 84.03 | 81.75
Hard | Digit error | 3.24 | 3.08 | 3.77
Hard | Other error | 15.74 | 12.90 | 14.48
Hard | WER | 11.76 | 7.07 | 8.05

One-to-one correspondence between tags and input words improves the interpretability of the model’s predictions, simplifies debugging, and enables post-processing corrections. The model is simpler than sequence-to-sequence models and easier to optimize in production settings.

Thutmose tagger inference pipeline. The model takes as input a sequence of spoken-domain words, passes them through a BERT encoder and a classification head. It assigns a tag to each input word. After a simple post-processing step, the final written-domain output is generated.
Figure 7. ITN as tagging: inference example

The sequence of input words is processed by the BERT-based token classifier, giving the output tag sequence. Simple deterministic post-processing gives the final output.

The following Thutmose and Duplex error patterns are showcased: Duplication due to alignment mistakes is a common error pattern for Thutmose, for example, “million million.” Duplex error patterns include hallucinations and an overconfident choice of a more frequent phrase even when it is not supported by the input, for example, it predicts “air canada 777” instead of “air canada 773.”
Figure 8. Examples of errors: (left) Thutmose tagger, (right) Duplex model

Conclusion

Text normalization and inverse text normalization are crucial for conversational systems and considerably affect users’ experience. This post introduced a novel way of handling the TN task by combining the benefits of WFST and pretrained language models, and a new neural tagging-based approach for tackling the ITN task.

For more information, including code examples, tutorials, and documentation for the TN/ITN solutions discussed in this post, see the NVIDIA/NeMo GitHub repo.

Categories
Misc

Dynamic Scale Weighting Through Multiscale Speaker Diarization

Speaker diarization is the process of segmenting audio recordings by speaker labels and aims to answer the question “Who spoke when?”. It is clearly distinct from speech recognition.

To learn more about the technology behind this post, tune into the author’s presentation at INTERSPEECH 2022 on Thursday, September 22, 13:30-15:30 (KST), On-Site Poster: Speaker Recognition and Diarization.

Before you perform speaker diarization, you know “what is spoken” but you don’t know “who spoke it”. Therefore, speaker diarization is an essential feature for a speech recognition system that enriches the transcription with speaker labels. That is, conversational speech recordings can never be considered to be fully transcribed without a speaker diarization process because transcriptions without speaker labels cannot inform you who is speaking to whom.

Diagram shows that a box named “Automatic Speech Recognition” produces transcribed words “hey how are you quite busy” but those words are all in the same gray color. After the speech signal waveform goes through Speaker Diarization, “hey”, “quite”, “busy” are colored in green and “how”, “are”, “you” are colored in blue.
Figure 1. Speaker diarization is the task of partitioning audio recordings into speaker-homogeneous regions

Speaker diarization must produce accurate timestamps as speaker turns can be extremely short in conversational settings. We often use short back-channel words such as “yes,” “uh-huh,” or “oh.” It is challenging for machines to transcribe these words and identify who spoke them.

While segmenting audio recordings in terms of speaker identity, speaker diarization requires fine-grained decisions on relatively short segments, ranging from a few tenths of a second to several seconds. Making accurate, fine-grained decisions on such short audio segments is challenging because short segments are less likely to capture reliable speaker traits.

In this post, we discuss how this problem can be addressed by introducing a new technique called the multi-scale approach, along with the multiscale diarization decoder (MSDD) that handles multi-scale inputs.

Mechanism of multi-scale segmentation

Extracting long audio segments is desirable in terms of the quality of speaker characteristics. However, the length of audio segments also limits the granularity, which leads to a coarse unit length for speaker label decisions. Speaker diarization systems are challenged by a trade-off between temporal resolution and the fidelity of the speaker representation, as shown by the curve in Figure 2.

During the speaker feature extraction process in the speaker diarization pipeline, the temporal resolution is inevitably sacrificed by taking a long speech segment to obtain high-quality speaker representation vectors. In plain and simple language, if you try to be accurate on voice characteristics, then you have to look into a longer span of time.

At the same time, if you look into a longer span of time, you have to make a decision on a fairly long span of time. This leads to coarse decisions (temporal resolution is low). Think about the fact that even human listeners cannot accurately tell who is speaking if only half a second of recorded speech is given.

In most diarization systems, the audio segment length ranges from 1.5 to 3.0 seconds because such lengths offer a good compromise between the quality of speaker characteristics and temporal resolution. This type of segmentation method is known as a single-scale approach.

Even with an overlap technique, the single-scale segmentation limits the temporal resolution to 0.75~1.5 seconds, which leaves room for improvement in terms of temporal accuracy.

Having a coarse temporal resolution not only deteriorates the performance of diarization but also decreases speaker counting accuracy since short speech segments are not captured properly. More importantly, such coarse temporal resolution in the speaker timestamps makes the matching between the decoded ASR text and speaker diarization result more error-prone.  

To tackle the problem, we proposed a multi-scale approach, which is a way to cope with such a trade-off by extracting speaker features from multiple segment lengths and then combining the results from multiple scales. The multi-scale technique achieves state-of-the-art accuracy on the most popular speaker diarization benchmark datasets. It is already part of the open-source conversational AI toolkit NVIDIA NeMo.

Figure 2 shows the key technical solutions of multi-scale speaker diarization.

On the left, multiple bars in different lengths are drawn below an example picture of speech signal waveform. On the right, a curve showing trade-off between two quantities “Fidelity of speaker representations” and “Temporal resolution”. A circle named “Multiscale approach” is drawn above the trade-off curve showing that “Multiscale approach” can get high-level of both quantities at the same time.
Figure 2. Corresponding trade-off curve on temporal resolution and fidelity of speaker representation

The multi-scale approach is implemented by performing multi-scale segmentation and extracting speaker embeddings from each scale. The left side of Figure 2 shows a multi-scale segmentation with four different scales.

During the segment affinity calculation process, all the information from the longest scale to the shortest scale is combined, yet a decision is made only for the shortest segment range. When combining the features from each scale, the weight of each scale largely affects the speaker diarization performance.
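
The segmentation step can be sketched as follows. The three window lengths, the half-overlap shift, and the rule that pairs each shortest-scale segment with the longer segment whose center is closest are assumptions made for illustration; the source does not specify these details.

```python
def make_segments(duration_ms, window_ms, shift_ms):
    """Return (start, end) pairs in milliseconds for one scale."""
    segments = []
    start = 0
    while start + window_ms <= duration_ms:
        segments.append((start, start + window_ms))
        start += shift_ms
    return segments


def center(segment):
    return (segment[0] + segment[1]) / 2


def map_to_base_scale(base_segments, scale_segments):
    """For each base (shortest) segment, return the index of the segment at
    another scale whose center is closest to the base segment's center."""
    return [
        min(range(len(scale_segments)),
            key=lambda i: abs(center(scale_segments[i]) - center(base)))
        for base in base_segments
    ]


DURATION_MS = 3000
# Hypothetical scales as (window, shift) in milliseconds, with half-overlap.
SCALES = [(1500, 750), (1000, 500), (500, 250)]

segments_per_scale = [make_segments(DURATION_MS, w, s) for w, s in SCALES]
base_segments = segments_per_scale[-1]  # shortest scale sets the decision unit
mapping_to_longest = map_to_base_scale(base_segments, segments_per_scale[0])
print(len(base_segments), mapping_to_longest)
```

A decision is made once per base segment, that is, every 0.25 seconds in this example, while the longer segments supply more reliable speaker information for the same time region.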

Multiscale diarization pipeline with neural models

Because scale weights largely determine the accuracy of the speaker diarization system, they should be set to maximize speaker diarization performance.

We came up with a novel multi-scale diarization system called multiscale diarization decoder (MSDD) that dynamically determines the importance of each scale at each time-step.

Speaker diarization systems rely on the speaker characteristics captured by audio feature vectors called speaker embeddings. The speaker embedding vectors are extracted by a neural model to generate a dense floating point number vector from a given audio signal.

MSDD takes the multiple speaker embedding vectors from multiple scales and then estimates desirable scale weights. Based on the estimated scale weights, speaker labels are generated. The proposed system places more weight on the scales that are considered to carry more accurate information for the given input signal.

Figure 3 shows the data flow of the proposed multiscale speaker diarization system. Multi-scale segments are extracted from audio input, and corresponding speaker embedding vectors for multi-scale audio input are generated by using the speaker embedding extractor (TitaNet).

Data-flow starts from Audio input, then goes to Embedding Extractor, Clustering Initialization. Then, the signal is split into boxes named Multi-scale Cosine Similarity and Scale Weight Calculation, then merged again at a box named Sequence Model. Lastly, the last box outputs Speaker Labels.
Figure 3. Data-flow of the proposed multi-scale speaker diarization system

The extracted multi-scale embeddings are processed by a clustering algorithm to provide an initial clustering result to the MSDD module. The MSDD module compares the cluster-average speaker embedding vectors with the input speaker embedding sequences. The scale weights for each step are estimated to weigh the importance of each scale.

Finally, the sequence model is trained to output speaker label probabilities for each speaker.

MSDD mechanism

Diagram of input speech embeddings vectors in green, and clustered speaker embeddings are colored in blue for speaker 1 and red for speaker 2. All these green, blue and red speaker embedding vectors are fed into a neural network model which has a couple of 1-D filter layers followed by linear layer and softmax layer.
Figure 4. Scale weights calculated from a 1-D CNN in MSDD

In Figure 4, the 1-D filter captures the context from the input embeddings and cluster average embeddings.

Diagram shows how context vector is calculated. The speaker embedding vectors from the input signal is in green and cosine similarity values are calculated for both input-speaker1 (blue) and input-speaker2 (red) pairs. These cosine similarity values are then multiplied by scale-weights then becomes a context vector which is drawn at the top.
Figure 5. Context vector for MSDD

In Figure 5, cosine similarity values from each speaker and each scale are weighted by the scale weights to form a weighted cosine similarity vector.

The neural network model MSDD is trained to take advantage of a multi-scale approach by dynamically calculating the weight of each scale. MSDD takes the initial clustering results and compares the extracted speaker embeddings with the cluster-average speaker representation vectors.

Most importantly, the weight of each scale at each time step is determined through a scale weighting mechanism where the scale weights are calculated by 1-D convolutional neural networks (CNNs) applied to the multi-scale speaker embedding inputs and the cluster-average embeddings (Figure 4).

The estimated scale weights are applied to the cosine similarity values calculated for each speaker and each scale. Figure 5 shows the process of calculating the context vector by applying the estimated scale weights (Figure 4) to the cosine similarity values calculated between the cluster-average speaker embeddings and the input speaker embeddings.
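
The context vector computation for a single time step can be sketched in plain Python. The embeddings are tiny made-up vectors, and the scale weights come from a softmax over hypothetical logits that stand in for the output of the 1-D CNN. Sharing one set of scale weights across both speakers is also an assumption of this sketch.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(input_embs, cluster_avg_embs, scale_weights):
    """input_embs[k] is the input embedding at scale k.
    cluster_avg_embs[s][k] is the cluster-average embedding of speaker s at scale k.
    Returns the weighted cosine similarities, speaker by speaker."""
    context = []
    for speaker_embs in cluster_avg_embs:
        for k, weight in enumerate(scale_weights):
            similarity = cosine_similarity(input_embs[k], speaker_embs[k])
            context.append(weight * similarity)
    return context


# Two scales, two speakers, 2-D embeddings (all values are made up).
input_embs = [[1.0, 0.0], [1.0, 0.0]]
cluster_avg_embs = [
    [[1.0, 0.0], [1.0, 0.0]],  # speaker 1: identical to the input
    [[0.0, 1.0], [0.0, 1.0]],  # speaker 2: orthogonal to the input
]
scale_weights = softmax([0.0, 0.0])  # stand-in for the CNN-estimated weights

context = context_vector(input_embs, cluster_avg_embs, scale_weights)
print(context)
```

In this toy case, the entries for speaker 1 are large and those for speaker 2 are zero, which is the kind of signal the downstream sequence model uses to decide which speaker is active.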

Finally, each context vector for each step is fed to a multi-layer LSTM model that generates per-speaker speaker existence probability. Figure 6 shows how speaker label sequences are estimated by the LSTM model and context vector input.

A picture showing layers of neural networks. A context vector is fed to a linear layer, then it goes through two layers of LSTMs and then goes through another layer to finally generate sigmoid values.
Figure 6.  Sequence modeling using LSTM

In Figure 6, sequence modeling using LSTM takes the context vector input and generates speaker labels. The output of MSDD is the probability values of speaker existence at each timestep for two speakers.

The proposed speaker diarization system is designed to support the following features:

  • Flexible number of speakers
  • Overlap-aware diarization
  • Pretrained speaker embedding model

Flexible number of speakers

MSDD employs pairwise inference to diarize conversation with arbitrary numbers of speakers. For example, if there are four speakers, six pairs are extracted, and inference results from MSDD are averaged to obtain results for each of the four speakers.
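
The pairwise scheme can be sketched for a single time step as follows. The function `toy_pair_model` is a hypothetical stand-in for running MSDD on one speaker pair; it simply reports that speaker 0 is active.

```python
from itertools import combinations


def toy_pair_model(speaker_a, speaker_b):
    """Hypothetical stand-in for MSDD inference on one speaker pair.
    Returns existence probabilities for (speaker_a, speaker_b)."""
    def probability(speaker):
        return 0.9 if speaker == 0 else 0.1
    return probability(speaker_a), probability(speaker_b)


def pairwise_average(num_speakers, pair_model):
    """Run the two-speaker model on every pair and average per speaker."""
    pairs = list(combinations(range(num_speakers), 2))
    sums = [0.0] * num_speakers
    counts = [0] * num_speakers
    for speaker_a, speaker_b in pairs:
        prob_a, prob_b = pair_model(speaker_a, speaker_b)
        sums[speaker_a] += prob_a
        sums[speaker_b] += prob_b
        counts[speaker_a] += 1
        counts[speaker_b] += 1
    averaged = [total / count for total, count in zip(sums, counts)]
    return pairs, averaged


pairs, averaged = pairwise_average(4, toy_pair_model)
print(len(pairs), averaged)
```

With four speakers, each speaker appears in three of the six pairs, so each averaged value combines three pairwise estimates.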

Overlap-aware diarization

MSDD independently estimates the existence probability of each of the two speakers at each step (Figure 6). This enables overlap detection, where two speakers are speaking at the same time.
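
Because the two probabilities are estimated independently, overlap can be read off by thresholding each one separately. The sketch below uses made-up probabilities and an assumed threshold of 0.5; the source does not state which threshold is used.

```python
def active_speakers(probabilities, threshold=0.5):
    """Return the indices of speakers whose existence probability reaches
    the threshold. The default threshold is an assumption."""
    return [i for i, p in enumerate(probabilities) if p >= threshold]


# Hypothetical per-step outputs for (speaker 1, speaker 2).
steps = [(0.9, 0.1), (0.8, 0.7), (0.2, 0.95)]

labels = [active_speakers(step) for step in steps]
overlap = [len(step_labels) > 1 for step_labels in labels]
print(labels, overlap)
```

A clustering-only system assigns exactly one label per segment, so it cannot produce the second step's result, where both speakers are marked active.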

Pretrained speaker embedding model

MSDD is based on the pretrained embedding extractor (TitaNet) model. By using a pretrained speaker model, you can use the neural network weights learned from a relatively large amount of single-speaker speech data.

In addition, MSDD is designed to be optimized together with the pretrained speaker model, so that the entire speaker diarization system can be fine-tuned on a domain-specific diarization dataset.

Experimental results and quantitative benefits

The proposed MSDD system has several quantitative benefits: superior temporal resolution and improved accuracy.

Superior temporal resolution

While the single-scale clustering diarizer shows the best performance at a 1.5-second segment length where the unit decision length is 0.75 seconds (half-overlap), the proposed multi-scale approach has a unit decision length of 0.25 seconds. The temporal resolution can be further enhanced by using a shorter shift length, at the cost of more steps and resources.

Figure 2 shows the concept of the multi-scale approach and the unit decision length of 0.5 seconds. Merely applying 0.5-second segment length to a single-scale diarizer significantly drops the diarization performance due to the degraded fidelity of speaker features.

Improved accuracy

Diarization error rate (DER) is calculated by comparing hypothesis timestamps and ground-truth timestamps. Figure 7 shows the quantified performance of the multi-scale diarization approach over the state-of-the-art and single-scale clustering methods.
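
As a rough illustration of the metric, the sketch below computes a frame-level DER. It uses a common simplified definition, not necessarily the exact scoring setup behind Figure 7: time is split into equal frames, hypothesis labels are assumed to be already mapped to reference labels, and no collar is applied.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER for equal-length frames.

    reference and hypothesis are lists of sets of speaker labels, one set per
    frame. Per frame, the error is max(#ref, #hyp) minus the number of
    correctly matched speakers, which covers missed speech, false alarms,
    and speaker confusion. The total error is divided by the total number of
    reference speaker-frames.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")
    errors = 0
    reference_total = 0
    for ref, hyp in zip(reference, hypothesis):
        errors += max(len(ref), len(hyp)) - len(ref & hyp)
        reference_total += len(ref)
    return errors / reference_total


# Made-up example with five frames.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]

der = diarization_error_rate(reference, hypothesis)
print(der)
```

In this example, frame 2 is a speaker confusion, frame 3 misses one of two overlapping speakers, and frame 5 is a false alarm.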

Bar plots showing diarization error rate for three different datasets. Left, “Landini et al”, shows 4.4% for CallHome and 2.2 % for AMI-MH-test. Middle, “Single-scale approach” shows 5.3%, 1.5%, 1.8% for CallHome, CH109, AMI-MH-test, respectively. Farthest right, “Multi-scale Approach” shows 4.0%, 0.6%, 1.1%  for CallHome, CH109, AMI-MH-test, respectively.
Figure 7. Quantitative evaluation of the previous state-of-the-art result (Landini et al. 2022), single-scale clustering method (prior work), and multi-scale approach (proposed) on three different datasets

The proposed MSDD approach can reduce DER by up to 60% on two-speaker datasets when compared to the single-scale clustering diarizer. 
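
The relative reduction can be checked against the DER values listed for Figure 7. The sketch below only redoes that arithmetic; the dictionary name and layout are chosen for illustration.

```python
# DER values (%) from Figure 7 as (single-scale, multi-scale).
FIGURE_7_DER = {
    "CallHome": (5.3, 4.0),
    "CH109": (1.5, 0.6),
    "AMI-MH-test": (1.8, 1.1),
}


def relative_reduction(baseline, proposed):
    """Relative DER reduction of the proposed system over the baseline."""
    return (baseline - proposed) / baseline


reductions = {
    name: round(100 * relative_reduction(single, multi), 1)
    for name, (single, multi) in FIGURE_7_DER.items()
}
largest = max(reductions, key=reductions.get)
print(reductions, largest)
```

The largest relative reduction among the three datasets is on CH109, where DER drops from 1.5% to 0.6%.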

Conclusion

The proposed system has the following benefits:

  • This is the first neural network architecture that applies a multi-scale weighting concept with sequence model (LSTM) based speaker label estimation.
  • The weighting scheme is integrated in a single inference session and does not require fusion of multiple diarization results as in other speaker diarization systems.
  • The proposed multi-scale diarization system enables overlap-aware diarization which cannot be achieved with traditional clustering-based diarization systems.
  • Because the decoder is based on a clustering-based initialization, the diarization system can deal with a flexible number of speakers. This indicates that you can train the proposed model on two-speaker datasets and then use it for diarizing two or more speakers.
  • While having all previously mentioned benefits, the proposed approach shows a superior diarization performance compared to the previously published results.

There are two future areas of research regarding the proposed system:

  • We plan to implement a streaming version of the proposed system by implementing a diarization decoder based on short-term window-based clustering.
  • The end-to-end optimization from speaker embedding extractor to diarization decoder can be investigated to improve the speaker diarization performance.

For more information, see Multiscale Speaker Diarization with Dynamic Scale Weighting or see the Interspeech 2022 session.

Categories
Offsites

Robust Online Allocation with Dual Mirror Descent

The emergence of digital technologies has transformed decision making across commercial sectors such as airlines, online retailing, and internet advertising. Today, real-time decisions need to be repeatedly made in highly uncertain and rapidly changing environments. Moreover, organizations usually have limited resources, which need to be efficiently allocated across decisions. Such problems are referred to as online allocation problems with resource constraints, and applications abound. Some examples include:

  • Bidding with Budget Constraints: Advertisers increasingly purchase ad slots using auction-based marketplaces such as search engines and ad exchanges. A typical advertiser can participate in a large number of auctions in a given month. Because the supply in these marketplaces is uncertain, advertisers set budgets to control their total spend. Therefore, advertisers need to determine how to optimally place bids while limiting total spend and maximizing conversions.
  • Dynamic Ad Allocation: Publishers can monetize their websites by signing deals with advertisers guaranteeing a number of impressions or by auctioning off slots in the open market. To make this choice, publishers need to trade off, in real-time, the short-term revenue from selling slots in the open market and the long-term benefits of delivering good quality spots to reservation ads.
  • Airline Revenue Management: Planes have a limited number of seats that need to be filled up as much as possible before a flight’s departure. But demand for flights changes over time and airlines would like to sell airline tickets to the customers who are willing to pay the most. Thus, airlines have increasingly adopted sophisticated automated systems to manage the pricing and availability of airline tickets.
  • Personalized Retailing with Limited Inventories: Online retailers can use real-time data to personalize their offerings to customers who visit their store. Because product inventory is limited and cannot be easily replenished, retailers need to dynamically decide which products to offer and at what price to maximize their revenue while satisfying their inventory constraints.

The common feature of these problems is the presence of resource constraints (budgets, contractual obligations, seats, or inventory, respectively in the examples above) and the need to make dynamic decisions in environments with uncertainty. Resource constraints are challenging because they link decisions across time — e.g., in the bidding problem, bidding too high early can leave advertisers with no budget, and thus missed opportunities later. Conversely, bidding too conservatively can result in a low number of conversions or clicks.

Two central resource allocation problems faced by advertisers and publishers in internet advertising markets.

In this post, we discuss state-of-the-art algorithms that can help maximize goals in dynamic, resource-constrained environments. In particular, we have recently developed a new class of algorithms for online allocation problems, called dual mirror descent, that are simple, robust, and flexible. Our papers have appeared in Operations Research, ICML’20, and ICML’21, and we have ongoing work to continue progress in this space. Compared to existing approaches, dual mirror descent is faster as it does not require solving auxiliary optimization problems, is more flexible because it can handle many applications across different sectors with minimal modifications, and is more robust as it enjoys remarkable performance under different environments.

Online Allocation Problems
In an online allocation problem, a decision maker has a limited amount of total resources (B) and receives a certain number of requests over time (T). At any point in time (t), the decision maker receives a reward function (f_t) and resource consumption function (b_t), and takes an action (x_t). The reward and resource consumption functions change over time and the objective is to maximize the total reward within the resource constraints. If all the requests were known in advance, then an optimal allocation could be obtained by solving an offline optimization problem for how to maximize the reward function over time within the resource constraints¹.

The optimal offline allocation cannot be implemented in practice because it requires knowing future requests. However, this is still useful for framing the goal of online allocation problems: to design an algorithm whose performance is as close to optimal as possible without knowing future requests.

Achieving the Best of Many Worlds with Dual Mirror Descent
A simple, yet powerful idea to handle resource constraints is introducing “prices” for the resources, which enables accounting for the opportunity cost of consuming resources when making decisions. For example, selling a seat on a plane today means it can’t be sold tomorrow. These prices are useful as an internal accounting system of the algorithm. They serve the purpose of coordinating decisions at different moments in time and allow decomposing a complex problem with resource constraints into simpler subproblems: one per time period with no resource constraints. For example, in a bidding problem, the prices capture an advertiser’s opportunity cost of consuming one unit of budget and allow the advertiser to handle each auction as an independent bidding problem.

This reframes the online allocation problem as a problem of pricing resources to enable optimal decision making. The key innovation of our algorithm is using machine learning to predict optimal prices in an online fashion: we choose prices dynamically using mirror descent, a popular optimization algorithm for training machine learning predictive models. Because prices for resources are referred to as “dual variables” in the field of optimization, we call the resulting algorithm dual mirror descent.

The algorithm works sequentially by assuming uniform resource consumption over time is optimal and updating the dual variables after each action. It starts at a moment in time (t) by taking an action (x_t) that maximizes the reward minus the opportunity cost of consuming resources (shown in the top gray box below). The action (e.g., how much to bid or which ad to show) is implemented if there are enough resources available. Then, the algorithm computes the error in the resource consumption (g_t), which is the difference between uniform consumption over time and the actual resource consumption (below in the third gray box). A new dual variable for the next time period is computed using mirror descent based on the error, which then informs the next action. Mirror descent seeks to make the error as close as possible to zero, improving the accuracy of its estimate of the dual variable, so that resources are consumed uniformly over time. While the assumption of uniform resource consumption may be surprising, it helps avoid missing good opportunities and often aligns with commercial goals, so it is effective in practice. Mirror descent also allows a variety of update rules; more details are in the paper.

An overview of the dual mirror descent algorithm.

By design, dual mirror descent has a self-correcting feature that prevents depleting resources too early or waiting too long to consume resources and missing good opportunities. When a request consumes more resources than the target, the corresponding dual variable is increased; when it consumes fewer, the dual variable is decreased. Because resources are then priced higher or lower, respectively, future actions are chosen to consume resources more conservatively or more aggressively.
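
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from the papers: it assumes a single resource, a binary accept/reject action, linear rewards and costs, and the Euclidean variant of mirror descent (a projected subgradient step). The function name, the `step_size` parameter, and the choice to compute the error from the implemented action are all assumptions made for the example.

```python
def dual_mirror_descent(requests, budget, step_size, mu0=0.0):
    """Sketch of dual mirror descent for one resource and accept/reject actions.

    requests: list of (reward, cost) pairs, revealed one at a time.
    Returns (actions, total_reward, remaining_budget, final_price).
    """
    if not requests:
        return [], 0.0, budget, mu0
    rho = budget / len(requests)  # target: uniform consumption per period
    mu = mu0                      # dual variable, i.e. the resource "price"
    remaining = budget
    total_reward = 0.0
    actions = []
    for reward, cost in requests:
        # Pick the action in {0, 1} maximizing reward minus opportunity cost.
        x = 1 if reward - mu * cost > 0 else 0
        # Implement the action only if enough resources are left.
        if x == 1 and cost > remaining:
            x = 0
        remaining -= cost * x
        total_reward += reward * x
        actions.append(x)
        # Error: uniform target minus actual consumption (assumption: the
        # implemented action is used, following the wording of the post).
        g = rho - cost * x
        # Euclidean mirror descent step, projected onto mu >= 0.
        # Overspending (g < 0) raises the price; underspending lowers it.
        mu = max(0.0, mu - step_size * g)
    return actions, total_reward, remaining, mu


if __name__ == "__main__":
    demo = [(1.0, 1.0)] * 4
    print(dual_mirror_descent(demo, budget=2, step_size=0.5))
```

Each period costs a constant amount of work, which reflects the point that no auxiliary optimization problem has to be solved. Other reference functions in the mirror descent step would yield different update rules, for example multiplicative ones.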

This algorithm is easy to implement, fast, and enjoys remarkable performance under different environments. These are some salient features of our algorithm:

  • Existing methods require periodically solving large auxiliary optimization problems using past data. In contrast, this algorithm does not need to solve any auxiliary optimization problem and has a very simple rule to update the dual variables, which, in many cases, can be run in linear time. Thus, it is appealing for many real-time applications that require fast decisions.
  • There are minimal requirements on the structure of the problem. Such flexibility allows dual mirror descent to handle many applications across different sectors with minimal modifications. Moreover, our algorithms are flexible since they accommodate different objectives, constraints, or regularizers. By incorporating regularizers, decision makers can include important objectives beyond economic efficiency, such as fairness.
  • Existing algorithms for online allocation problems are tailored for either adversarial or stochastic input data. Algorithms for adversarial inputs are robust as they make almost no assumptions on the structure of the data but, in turn, obtain performance guarantees that are too pessimistic in practice. On the other hand, algorithms for stochastic inputs enjoy better performance guarantees by exploiting statistical patterns in the data but can perform poorly when the model is misspecified. Dual mirror descent, however, attains performance close to optimal in both stochastic and adversarial input models while being oblivious to the structure of the input model. Compared to existing work on simultaneous approximation algorithms, our method is more general, applies to a wide range of problems, and requires no forecasts. Below is a comparison of our algorithm to other state-of-the-art methods. Results are based on synthetic data for an ad allocation problem.
Performance of dual mirror descent, a training-based method, and an adversarial method relative to the optimal offline solution. Lower values indicate performance closer to the optimal offline allocation. Results are generated using synthetic experiments based on public data for an ad allocation problem.

Conclusion
In this post we introduced dual mirror descent, an algorithm for online allocation problems that is simple, robust, and flexible. It is particularly notable that, after a long line of work on online allocation algorithms, dual mirror descent provides a way to analyze a wider range of algorithms with superior robustness properties compared to previous techniques. Dual mirror descent has a wide range of applications across several commercial sectors and has been used over time at Google to help advertisers capture more value through better algorithmic decision making. We are also exploring further work related to mirror descent and its connections to PI controllers.

Acknowledgements
We would like to thank our co-authors Haihao Lu and Balu Sivan, and Kshipra Bhawalkar for their exceptional support and contributions. We would also like to thank our collaborators in the ad quality team and market algorithm research.


1Formalized in the equation below: 
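
A sketch of that formalization, written from the definitions in the post (the action set $\mathcal{X}$ and the name $\mathrm{OPT}$ are assumed notation, not taken from the original):

```latex
\[
\mathrm{OPT} \;=\; \max_{x_1, \dots, x_T \in \mathcal{X}} \; \sum_{t=1}^{T} f_t(x_t)
\quad \text{subject to} \quad \sum_{t=1}^{T} b_t(x_t) \le B .
\]
```

Here $f_t$ and $b_t$ are the reward and resource consumption functions of request $t$, $x_t$ is the action taken, $T$ is the number of requests, and $B$ is the total amount of resources.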

Categories
Misc

Meet the Omnivore: Christopher Scott Constructs Architectural Designs, Virtual Environments With NVIDIA Omniverse

Growing up in a military family, Christopher Scott moved more than 30 times, which instilled in him “the ability to be comfortable with, and even motivated by, new environments,” he said.

The post Meet the Omnivore: Christopher Scott Constructs Architectural Designs, Virtual Environments With NVIDIA Omniverse appeared first on NVIDIA Blog.

Categories
Offsites

PaLI: Scaling Language-Image Learning in 100+ Languages

Advanced language models (e.g., GPT, GLaM, PaLM and T5) have demonstrated diverse capabilities and achieved impressive results across tasks and languages by scaling up their number of parameters. Vision-language (VL) models can benefit from similar scaling to address many tasks, such as image captioning, visual question answering (VQA), object recognition, and in-context optical-character-recognition (OCR). Increasing the success rates for these practical tasks is important for everyday interactions and applications. Furthermore, for a truly universal system, vision-language models should be able to operate in many languages, not just one.

In “PaLI: A Jointly-Scaled Multilingual Language-Image Model”, we introduce a unified language-image model trained to perform many tasks and in over 100 languages. These tasks span vision, language, and multimodal image and language applications, such as visual question answering, image captioning, object detection, image classification, OCR, text reasoning, and others. Furthermore, we use a collection of public images that includes automatically collected annotations in 109 languages, which we call the WebLI dataset. The PaLI model pre-trained on WebLI achieves state-of-the-art performance on challenging image and language benchmarks, such as COCO-Captions, CC3M, nocaps, TextCaps, VQAv2, OK-VQA, TextVQA and others. It also outperforms prior models on multilingual visual captioning and visual question answering benchmarks.

Overview
One goal of this project is to examine how language and vision models interact at scale and specifically the scalability of language-image models. We explore both per-modality scaling and the resulting cross-modal interactions of scaling. We train our largest model to 17 billion (17B) parameters, where the visual component is scaled up to 4B parameters and the language model to 13B. 

The PaLI model architecture is simple, reusable and scalable. It consists of a Transformer encoder that processes the input text, and an auto-regressive Transformer decoder that generates the output text. To process images, the input to the Transformer encoder also includes “visual words” that represent an image processed by a Vision Transformer (ViT). A key component of the PaLI model is reuse, in which we seed the model with weights from previously-trained uni-modal vision and language models, such as mT5-XXL and large ViTs. This reuse not only enables the transfer of capabilities from uni-modal training, but also saves computational cost.

The PaLI model addresses a wide range of tasks in the language-image, language-only and image-only domain using the same API (e.g., visual-question answering, image captioning, scene-text understanding, etc.). The model is trained to support over 100 languages and tuned to perform multilingually for multiple language-image tasks.

Dataset: Language-Image Understanding in 100+ Languages
Scaling studies for deep learning show that larger models require larger datasets to train effectively. To unlock the potential of language-image pretraining, we construct WebLI, a multilingual language-image dataset built from images and text available on the public web.

WebLI scales up the text language from English-only datasets to 109 languages, which enables us to perform downstream tasks in many languages. The data collection process is similar to that employed by other datasets, e.g. ALIGN and LiT, and enabled us to scale the WebLI dataset to 10 billion images and 12 billion alt-texts.

In addition to annotation with web text, we apply the Cloud Vision API to perform OCR on the images, leading to 29 billion image-OCR pairs. We perform near-deduplication of the images against the train, validation and test splits of 68 common vision and vision-language datasets, to avoid leaking data from downstream evaluation tasks, as is standard in the literature. To further improve the data quality, we score image and alt-text pairs based on their cross-modal similarity, and tune the threshold to keep only 10% of the images, for a total of 1 billion images used for training PaLI.
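
The similarity-based filtering step can be illustrated with a small sketch. It assumes each image and alt-text has already been mapped to an embedding vector and that cosine similarity is the cross-modal score; the post does not specify the scoring model or the threshold procedure, so the function names and the top-fraction thresholding rule below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keep_top_fraction(pairs, keep_fraction=0.10):
    """Keep the image/alt-text pairs whose similarity is in the top fraction.

    pairs: list of (image_embedding, text_embedding) tuples.
    Returns (kept_indices, threshold). Ties at the threshold are all kept.
    """
    if not pairs:
        return [], None
    scores = [cosine(img, txt) for img, txt in pairs]
    n_keep = max(1, int(len(pairs) * keep_fraction))
    ranked = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    # The threshold is the score of the last pair that makes the cut.
    threshold = scores[ranked[n_keep - 1]]
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return kept, threshold
```

Applied with `keep_fraction=0.10`, a rule of this kind corresponds to the reduction from 10 billion images to roughly 1 billion described above.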

Sampled images from WebLI associated with multilingual alt-text and OCR. The second image is by jopradier (original), used under the CC BY-NC-SA 2.0 license. Remaining images are also used with permission.
Statistics of recognized languages from alt-text and OCR in WebLI.
Image-text pair counts of WebLI and other large-scale vision-language datasets, CLIP, ALIGN and LiT.

Training Large Language-Image Models
Vision-language tasks require different capabilities and sometimes have diverging goals. Some tasks inherently require localization of objects to solve the task accurately, whereas some other tasks might need a more global view. Similarly, different tasks might require either long or compact answers. To address all of these objectives, we leverage the richness of the WebLI pre-training data and introduce a mixture of pre-training tasks, which prepare the model for a variety of downstream applications. To accomplish the goal of solving a wide variety of tasks, we enable knowledge-sharing between multiple image and language tasks by casting all tasks into a single generalized API (input: image + text; output: text), which is also shared with the pretraining setup. The objectives used for pre-training are cast into the same API as a weighted mixture aimed at both maintaining the ability of the reused model components and training the model to perform new tasks (e.g., split-captioning for image description, OCR prediction for scene-text comprehension, VQG and VQA prediction).

The model is trained in JAX with Flax using the open-sourced T5X and Flaxformer frameworks. For the visual component, we introduce and train a large ViT architecture, named ViT-e, with 4B parameters using the open-sourced BigVision framework. ViT-e follows the same recipe as the ViT-G architecture (which has 2B parameters). For the language component, we concatenate the dense token embeddings with the patch embeddings produced by the visual component, together as the input to the multimodal encoder-decoder, which is initialized from mT5-XXL. During the training of PaLI, the weights of this visual component are frozen, and only the weights of the multimodal encoder-decoder are updated.
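
The concatenation of visual and text inputs can be pictured with a toy sketch using plain Python lists in place of tensors. It assumes the patch embeddings have already been projected to the same width as the token embeddings and that the visual tokens come first in the sequence; neither detail is stated in the post, and the function name is hypothetical.

```python
def build_encoder_input(patch_embeddings, token_embeddings):
    """Concatenate visual and text embeddings along the sequence axis.

    Both arguments are lists of equal-width vectors (lists of floats).
    Assumption: visual "words" are placed before the text tokens.
    """
    widths = {len(v) for v in patch_embeddings + token_embeddings}
    if len(widths) > 1:
        raise ValueError("patch and token embeddings must share one width")
    return patch_embeddings + token_embeddings


if __name__ == "__main__":
    patches = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # 3 visual tokens
    tokens = [[1.0, 0.0], [0.0, 1.0]]               # 2 text tokens
    sequence = build_encoder_input(patches, tokens)
    print(len(sequence))  # sequence length = patches + tokens
```

The encoder then attends over one sequence whose length is the number of patches plus the number of text tokens, which is what lets a single encoder-decoder handle image and text jointly.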

Results
We compare PaLI on common vision-language benchmarks that are varied and challenging. The PaLI model achieves state-of-the-art results on these tasks, even outperforming very large models in the literature. For example, it outperforms the Flamingo model, which is several times larger (80B parameters), on several VQA and image-captioning tasks, and it also sustains performance on challenging language-only and vision-only tasks, which were not the main training objective.

PaLI (17B parameters) outperforms the state-of-the-art approaches (including SimVLM, CoCa, GIT2, Flamingo, BEiT3) on multiple vision-and-language tasks. In this plot we show the absolute score differences compared with the previous best model to highlight the relative improvements of PaLI. Comparison is on the official test splits when available. CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.

Model Scaling Results
We examine how the image and language model components interact with each other with regard to model scaling and where the model yields the most gains. We conclude that scaling both components jointly results in the best performance, and specifically, scaling the visual component, which requires relatively few parameters, is most essential. Scaling is also critical for better performance across multilingual tasks.

Scaling both the language and the visual components of the PaLI model contributes to improved performance. The plot shows the score differences compared to the PaLI-3B model: CIDEr score is used for evaluation of the image captioning tasks, whereas VQA tasks are evaluated by VQA Accuracy.
Multilingual captioning greatly benefits from scaling the PaLI models. We evaluate PaLI on a 35-language benchmark Crossmodal-3600. Here we present the average score over all 35 languages and the individual score for seven diverse languages.

Model Introspection: Model Fairness, Biases, and Other Potential Issues
To avoid creating or reinforcing unfair bias within large language and image models, important first steps are to (1) be transparent about the data that were used and how the model used those data, and (2) test for model fairness and conduct responsible data analyses. To address (1), our paper includes a data card and model card. To address (2), the paper includes results of demographic analyses of the dataset. We consider this a first step and know that it will be important to continue to measure and mitigate potential biases as we apply our model to new tasks, in alignment with our AI Principles.

Conclusion
We presented PaLI, a scalable multi-modal and multilingual model designed for solving a variety of vision-language tasks. We demonstrate improved performance across visual-, language- and vision-language tasks. Our work illustrates the importance of scale in both the visual and language parts of the model and the interplay between the two. We see that accomplishing vision and language tasks, especially in multiple languages, actually requires large scale models and data, and will potentially benefit from further scaling. We hope this work inspires further research in multi-modal and multilingual models.

Acknowledgements
We thank all the authors who conducted this research: Soravit (Beer) Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut. We also thank Claire Cui, Slav Petrov, Tania Bedrax-Weiss, Joelle Barral, Tom Duerig, Paul Natsev, Fernando Pereira, Jeff Dean, Jeremiah Harmsen, Zoubin Ghahramani, Erica Moreira, Victor Gomes, Sarah Laszlo, Kathy Meier-Hellstern, Susanna Ricco, Rich Lee, Austin Tarango, Emily Denton, Bo Pang, Wei Li, Jihyung Kil, Tomer Levinboim, Julien Amelot, Zhenhai Zhu, Xiangning Chen, Liang Chen, Filip Pavetic, Daniel Keysers, Matthias Minderer, Josip Djolonga, Ibrahim Alabdulmohsin, Mostafa Dehghani, Yi Tay, Elizabeth Adkison, James Cockerille, Eric Ni, Anna Davies, and Maysam Moussalem for their suggestions, improvements and support. We thank Tom Small for providing visualizations for the blogpost.

Categories
Misc

Top HPC Sessions at GTC 2022

Learn about new CUDA features, digital twins for weather and climate, quantum circuit simulations, and much more with these GTC 2022 sessions.  
