
Predicting Credit Defaults Using Time-Series Models with Recurrent Neural Networks and XGBoost

Today’s machine learning (ML) solutions are complex and rarely use just a single model. Training models effectively requires large, diverse datasets, which may call for multiple models to predict well. Deploying complex multi-model ML solutions in production is also challenging: compatibility issues between frameworks, for example, can delay insights.

A framework-agnostic solution that easily serves various combinations of deep neural nets and tree-based models would simplify deployment and help scale ML solutions as they take on multiple layers.

In this post, I discuss how to leverage the versatility of NVIDIA software to handle different types of models and integrate them into your application. I demonstrate how NVIDIA RAPIDS supports data preparation and ML training for large datasets, and how NVIDIA Triton Inference Server seamlessly serves both deep neural nets built with PyTorch and tree-based models built with XGBoost to predict credit defaults.

Using the American Express Default Prediction competition as an example, I explain how this multi-model solution can be deployed on either a GPU or a CPU, with GPU deployment delivering significantly faster inference times. The solution placed in the top 10 among 4,874 teams in the Kaggle American Express Default Prediction competition.

Future credit default predictions

Credit default prediction is central to managing risk in a consumer lending business. American Express, the largest payment card issuer in the world, provided an industrial-scale dataset that includes time-series behavioral data and anonymized customer profile information. This dataset is highly representative of real-world scenarios: it is large, contains both numerical and categorical columns, and presents a time-series problem.

The key to solving this business problem successfully is to uncover the temporal patterns within the data.

Why focus on trees and neural nets?

Tree-based models and deep neural networks are widely considered the most popular choices among ML practitioners.

Tree-based models, such as XGBoost, are mostly used for tabular datasets because they can handle noisy, redundant features and make it easy to interpret and understand the logic behind the predictions.

Deep neural networks, on the other hand, excel at learning long-term temporal dependencies and sequential patterns in data. They also can automatically extract features from raw data. Recently, deep neural networks have been widely used to generate high-quality new data, by exploiting their ability to learn the distribution from the existing data.

Next, I describe how my team used these techniques in our American Express Default Prediction solution.

Essential tools for data preparation and deployment

Preparing, training, and deploying an effective ML model involves many steps. RAPIDS and Triton Inference Server each support key phases of this process.

RAPIDS is a suite of open-source software libraries and APIs designed to accelerate data science workflows on GPUs. It includes a variety of tools and libraries for data preprocessing, ML, and visualization. In this case, it supports data preprocessing and exploratory data analysis at the beginning of the workflow.

On the deployment side, NVIDIA Triton is a high-performance, multi-model inference server that supports both GPU and CPU and enables easy deployment of models from a variety of frameworks, such as TensorFlow, PyTorch, and ONNX.

NVIDIA Triton supports tree-based models such as XGBoost, LightGBM, and more through its Forest Inference Library (FIL) backend. This made it a great fit for the models in our solution.

Problem overview

The aim is to predict if a customer will default on their credit card balance in the future, using their past monthly customer profile data. The binary target variable, default or no default, is determined by whether a customer pays back their outstanding credit card balance within 120 days of the statement date.

Figure 1 shows an overview of the problem and dataset, highlighting the key aspects of credit default prediction and the characteristics of the dataset. The test dataset is massive with 900K customers, 11M rows, and 191 columns, including both numerical and categorical features.

The goal was to create a model that predicts this binary target variable from the other variables in the dataset, with inference time as a mission-critical constraint. Before modeling, this large dataset requires significant feature engineering, making it an ideal candidate for data preparation with RAPIDS cuDF. The size of the dataset also poses challenges for real-time inference, which the high-performance NVIDIA Triton server addresses.

Diagram shows a customer with blocks for monthly profiles, an ML model block, and a block for the predicted outcome: default or no default.
Figure 1. Problem overview: American Express Default Prediction competition

Approach

We broke the model development process into a series of steps:

  • Dataset preparation
  • Feature engineering
  • Dataset exploration
  • Autoregressive recurrent neural network (RNN) model
  • Dataset performance

Dataset preparation

The given American Express data is a time series consisting of multiple profiles of a customer sorted by the customer ID and timestamp:

  • Each row in the dataset represents a customer’s profile for 1 month.
  • Each customer has 13 consecutive rows in the data, which represent their profiles in 13 consecutive months.
  • There are 214 columns in the data.
  • Columns are anonymized, except for customer_id and month, and fall into the following general categories: Delinquency, Spend, Payment, Balance, Risk

Table 1 summarizes the number of columns in each category. There is no information on what each column means other than its category. These anonymized columns are mostly floating-point numbers.

Category       customer_id   month   Delinquency   Spend   Payment   Balance   Risk
# of columns   1             1       106           25      10        43        28
Table 1. Number of columns in each category

For the training data, the ground truths (default or not) are stored in a separate table with one row per customer and two columns: customer_id and default.

Feature engineering with RAPIDS cuDF

The project began with feature engineering to prepare the time-series dataset for the model. Time-series data notoriously requires massaging and becomes unwieldy for CPU-powered data science solutions. To run the data preparation efficiently for this phase, we harnessed the power of GPUs using RAPIDS cuDF.

We focused on slimming the dataset down to the most important data points by reducing each customer’s profile to the last month in the set. The last month has the highest relevance in predicting future default events, and keeping it alone reduces the number of rows to 900K. This is quickly done with drop_duplicates(keep='last') in cuDF.

RAPIDS cuDF can further accelerate feature engineering by creating differential and aggregate features for each customer_id value. While RAPIDS cuDF was used to engineer additional features in the preceding notebook, I set those aside to keep this single-GPU walkthrough simple.
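The following is a minimal cuDF sketch of this preparation step. It assumes the data is stored in a Parquet file and that the feature columns are numeric; the file name and the choice of aggregations are illustrative, not the exact competition pipeline.

import cudf

# Load the raw time series; the file name is an assumption for illustration.
df = cudf.read_parquet("train_data.parquet")
df = df.sort_values(["customer_id", "month"])

# Keep only each customer's most recent monthly profile (~900K rows).
last = df.drop_duplicates(subset="customer_id", keep="last")

# Aggregate features that summarize each customer's 13-month history.
num_cols = [c for c in df.columns if c not in ("customer_id", "month")]
agg = df.groupby("customer_id")[num_cols].agg(["mean", "std", "min", "max"])
agg.columns = ["_".join(col) for col in agg.columns]

# Join the last profiles with the aggregates to form the feature matrix.
features = last.merge(agg.reset_index(), on="customer_id", how="left")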

Diagram shows blocks for monthly customer profiles. Customer profiles point to the RAPIDS cuDF block, which conducts feature engineering on the time-series data. Customer profiles also point to the PyTorch autoregressive RNN block, which generates the future profiles. Engineered features are combined with the generated future profiles and make the final prediction, Default or not default.
Figure 2. Solution overview

Exploring the dataset

Each customer profile’s features were measured through month 13, while default was determined at month 31. Given this 17-month gap in the dataset, the team generated new customer profiles for the missing months (months 14 to 30) to improve the model’s predictions. The team was inspired to implement the autoregressive RNN technique after exploring the data visually.

For more information about identifying the autoregressive RNN technique, see the Amex EDA Evolvement of Numeric Features Over Time notebook.

When exploring the dataset, the team visualized the data in a chart. Figure 3 plots the trends of a subset of columns over time. The x-axis is the month, and the y-axis is the column’s value; each subplot is titled with the column name.

For example, the top-left subplot shows how column D_53, the 53rd column of the delinquency category, varies over the months. The red dashed lines are the averaged values of column D_53 for positive samples, where Default=True, and the green solid lines are the averaged values for negative samples, where Default=False.

For most columns, there are obvious temporal patterns, so a model can be trained to extrapolate and predict future values of the columns. Generating additional values with such a model helps improve the downstream model’s predictive power.

Multiple charts, each chart showing one feature’s trend and comparing performing and default.
Figure 3. Trends of columns over time (source: Amex EDA evolvement of numeric features over time)

Another key consideration when enhancing the dataset is that the patterns can differ from column to column: linear trends, nonlinear trends, wiggles, and more are all observable. The data generation model must be versatile and flexible enough to learn and generalize all these different patterns.

Generating new profiles with the autoregressive RNN model

Based on these data characteristics, our team proposed an autoregressive RNN model to learn all these patterns simultaneously. Autoregressive means that the output of the current time step is the input of the next time step. Figure 4 shows how autoregressive generation works.

GIF showing how autoregressive RNN works. The input is a sequence. It goes through several RNN layers to predict the next item of the sequence. This predicted item is appended to the input sequence and it repeats the process again. In this way, the autoregressive RNN generates a new sequence one item at a time.
Figure 4. Animation of autoregressive generation (source: WaveNet)

The input of the RNN model is the customer’s profile for the current month, including all 214 columns. The output of the RNN model is the predicted customer profile for the next month. The autoregressive RNN used in this approach is trained in a self-supervised manner, meaning it only employs customer profiles for training and does not require the “default or not” target column.
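As an illustration, here is a minimal PyTorch sketch of such an autoregressive RNN. The GRU architecture, hidden size, and MSE loss are assumptions for the sketch, not the exact competition model.

import torch
import torch.nn as nn

class AutoregressiveRNN(nn.Module):
    def __init__(self, n_features=214, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):
        # x: (batch, months, n_features); predict each profile one month ahead.
        out, _ = self.rnn(x)
        return self.head(out)

    @torch.no_grad()
    def generate(self, x, steps):
        # Feed each predicted month back in as the next input (autoregression).
        seq, h, preds = x, None, []
        for _ in range(steps):
            out, h = self.rnn(seq, h)
            nxt = self.head(out[:, -1:, :])
            preds.append(nxt)
            seq = nxt  # the hidden state h carries the earlier history
        return torch.cat(preds, dim=1)

# Self-supervised training: month t predicts month t+1; no default labels needed.
model = AutoregressiveRNN()
x = torch.randn(32, 13, 214)  # toy batch of 13-month customer histories
loss = nn.functional.mse_loss(model(x[:, :-1]), x[:, 1:])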

This self-supervised training to enhance datasets enables you to use a large amount of unlabeled data—a significant advantage in real-world applications as labeled data is often difficult and expensive to obtain.

Performance of the new dataset

The RNN can accurately predict future profiles. To quantify this, the root mean squared error (RMSE) between the ground-truth profiles and the predicted profiles is used, comparing the RNN against a simple baseline that assumes future profiles are the same as the last observed profile.

                          Autoregressive RNN   Last observed profile baseline
RMSE of all 214 columns   0.019                0.030
Table 2. Comparing RMSE values for the autoregressive RNN and the baseline (smaller is better)

In Table 2, the RNN reduced RMSE from 0.030 to 0.019 (roughly a 37% improvement). This is a significant enhancement to the original dataset.

As Figure 2 showed, the last step in producing the training dataset is to combine the last observed profiles and the generated profiles into one matrix. This is done with a join in RAPIDS cuDF, and the result is fed to the downstream XGBoost classifier to predict defaults. The generated profiles greatly enhance the performance of the model.

Table 3 shows that, by combining the most recent profiles with the generated future profiles, the XGBoost classifier predicts future defaults more accurately, improving the competition metric by 0.0033. This is a significant improvement for default detection problems; such an improvement could move a solution up by hundreds of places in the American Express Default Prediction competition!

                                                 XGBoost trained on   XGBoost trained on last profiles plus
                                                 last profiles        autoregressive RNN-generated profiles
American Express metric for predicting default   0.7797               0.7830
Table 3. Evaluating default predictions by comparing XGBoost with and without autoregressive RNN-generated features (larger is better)
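As a sketch, training the downstream classifier on the combined feature matrix might look like the following; the hyperparameters are illustrative assumptions, not the competition configuration.

import xgboost as xgb

# features: last observed profiles joined with RNN-generated future profiles;
# labels: the binary default target (both placeholders from the earlier steps).
dtrain = xgb.DMatrix(features, label=labels)

params = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "tree_method": "gpu_hist",  # GPU-accelerated histogram tree construction
    "max_depth": 7,
    "learning_rate": 0.05,
}
booster = xgb.train(params, dtrain, num_boost_round=1000)
booster.save_model("xgb_default_model.json")  # loadable by Triton's FIL backend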

Deploy models to Triton Inference Server

Figure 5 shows that, during inference, NVIDIA Triton Inference Server enables the hosting of both the autoregressive RNN model implemented in PyTorch and the tree-based models implemented in XGBoost, on either CPU or GPU.

First, save the pretrained PyTorch RNN models and XGBoost models in separate folders with the correct folder hierarchy. Next, write configuration files for each model. The PyTorch model processes the batch of input data to generate the future profiles, after which the concatenated profiles are input into the XGBoost model for further inference.
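For illustration, querying such a pipeline from Python might look like the following sketch; the model, input, and output names are assumptions that depend on your configuration files.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# A batch of customer histories: (batch, 13 months, 214 columns), FP32.
batch = np.random.rand(64, 13, 214).astype(np.float32)

inp = httpclient.InferInput("INPUT0", list(batch.shape), "FP32")  # assumed name
inp.set_data_from_numpy(batch)

# The pipeline runs the RNN, concatenates profiles, then calls XGBoost.
result = client.infer(model_name="rnn_xgb_ensemble", inputs=[inp])
default_probs = result.as_numpy("OUTPUT0")  # assumed output name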

With this rnn-xgb pipeline, 115K customer profiles can be inferred in under 6 seconds on a single GPU.

Block diagram shows the Triton Inference Server pipeline. Input data goes through the autoregressive RNN model to predict future profiles, which are concatenated with the input. The combined data goes through the XGBoost model to get the final prediction.
Figure 5. Autoregressive RNN and XGBoost models with NVIDIA Triton Inference Server

The inference time is blazing fast even with this complicated model pipeline, which includes the autoregressive RNN over 13 time steps and XGBoost for classification. Running the NVIDIA Triton Inference Server pipeline on 11.3 million American Express customer profiles takes only 45 seconds on a single NVIDIA V100 GPU.

Summary

The proposed solution for credit default prediction shown in this post successfully leverages the power of both deep neural nets and tree models to improve the accuracy of predictions by supplementing data.

Data processing of the time-series data was easier and faster with RAPIDS cuDF. The deployment of the models was made seamless with Triton Inference Server, which can host both deep neural nets and tree models on either CPU or GPU. This makes it a powerful tool for real-time inference.

This demo also highlights the potential of applying the high-performance computing power of GPUs and Triton Inference Server in credit default prediction, opening avenues for further exploration and improvement in the financial services field.


Evaluating speech synthesis in many languages with SQuId

Previously, we presented the 1,000 languages initiative and the Universal Speech Model with the goal of making speech and language technologies available to billions of users around the world. Part of this commitment involves developing high-quality speech synthesis technologies, which build upon projects such as VDTTS and AudioLM, for users that speak many different languages.

After developing a new model, one must evaluate whether the speech it generates is accurate and natural: the content must be relevant to the task, the pronunciation correct, the tone appropriate, and there should be no acoustic artifacts such as cracks or signal-correlated noise. Such evaluation is a major bottleneck in the development of multilingual speech systems.

The most popular method to evaluate the quality of speech synthesis models is human evaluation: a text-to-speech (TTS) engineer produces a few thousand utterances from the latest model, sends them for human evaluation, and receives results a few days later. This evaluation phase typically involves listening tests, during which dozens of annotators listen to the utterances one after the other to determine how natural they sound. While humans are still unbeaten at detecting whether a piece of speech sounds natural, this process can be impractical — especially in the early stages of research projects, when engineers need rapid feedback to test and restrategize their approach. Human evaluation is expensive, time consuming, and may be limited by the availability of raters for the languages of interest.

Another barrier to progress is that different projects and institutions typically use various ratings, platforms and protocols, which makes apples-to-apples comparisons impossible. In this regard, speech synthesis technologies lag behind text generation, where researchers have long complemented human evaluation with automatic metrics such as BLEU or, more recently, BLEURT.

In “SQuId: Measuring Speech Naturalness in Many Languages“, to be presented at ICASSP 2023, we introduce SQuId (Speech Quality Identification), a 600M parameter regression model that describes to what extent a piece of speech sounds natural. SQuId is based on mSLAM (a pre-trained speech-text model developed by Google), fine-tuned on over a million quality ratings across 42 languages and tested in 65. We demonstrate how SQuId can be used to complement human ratings for evaluation of many languages. This is the largest published effort of this type to date.

Evaluating TTS with SQuId

The main hypothesis behind SQuId is that training a regression model on previously collected ratings can provide us with a low-cost method for assessing the quality of a TTS model. The model can therefore be a valuable addition to a TTS researcher’s evaluation toolbox, providing a near-instant, albeit less accurate alternative to human evaluation.

SQuId takes an utterance as input and an optional locale tag (i.e., a localized variant of a language, such as “Brazilian Portuguese” or “British English”). It returns a score between 1 and 5 that indicates how natural the waveform sounds, with a higher value indicating a more natural waveform.

Internally, the model includes three components: (1) an encoder, (2) a pooling / regression layer, and (3) a fully connected layer. First, the encoder takes a spectrogram as input and embeds it into a smaller 2D matrix that contains 3,200 vectors of size 1,024, where each vector encodes a time step. The pooling / regression layer aggregates the vectors, appends the locale tag, and feeds the result into a fully connected layer that returns a score. Finally, we apply application-specific post-processing that rescales or normalizes the score so it is within the [1, 5] range, which is common for naturalness human ratings. We train the whole model end-to-end with a regression loss.

The encoder is by far the largest and most important piece of the model. We used mSLAM, a pre-existing 600M-parameter Conformer pre-trained on both speech (51 languages) and text (101 languages).

The SQuId model.
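To make the data flow concrete, here is an illustrative PyTorch sketch of the three components. The mean pooling and the way the locale tag is appended are assumptions, and nn.Identity stands in for the mSLAM encoder.

import torch
import torch.nn as nn

class SQuIdSketch(nn.Module):
    def __init__(self, n_locales=65, dim=1024):
        super().__init__()
        self.encoder = nn.Identity()          # stand-in for the mSLAM Conformer
        self.locale_emb = nn.Embedding(n_locales, dim)
        self.head = nn.Linear(2 * dim, 1)     # fully connected scoring layer

    def forward(self, frames, locale_id):
        # frames: (batch, 3200, 1024) encoder vectors, one per time step.
        h = self.encoder(frames).mean(dim=1)                 # pool over time
        h = torch.cat([h, self.locale_emb(locale_id)], -1)   # append locale tag
        return self.head(h).squeeze(-1)  # raw score, rescaled to [1, 5] later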

To train and evaluate the model, we created the SQuId corpus: a collection of 1.9 million rated utterances across 66 languages, collected for over 2,000 research and product TTS projects. The SQuId corpus covers a diverse array of systems, including concatenative and neural models, for a broad range of use cases, such as driving directions and virtual assistants. Manual inspection reveals that SQuId is exposed to a vast range of TTS errors, such as acoustic artifacts (e.g., cracks and pops), incorrect prosody (e.g., questions without rising intonations in English), text normalization errors (e.g., verbalizing “7/7” as “seven divided by seven” rather than “July seventh”), or pronunciation mistakes (e.g., verbalizing “tough” as “toe”).

A common issue that arises when training multilingual systems is that the training data may not be uniformly available for all the languages of interest. SQuId was no exception. The following figure illustrates the size of the corpus for each locale. We see that the distribution is largely dominated by US English.

Locale distribution in the SQuId dataset.

How can we provide good performance for all languages when there are such variations? Inspired by previous work on machine translation, as well as past work from the speech literature, we decided to train one model for all languages, rather than using separate models for each language. The hypothesis is that if the model is large enough, then cross-locale transfer can occur: the model’s accuracy on each locale improves as a result of jointly training on the others. As our experiments show, cross-locale transfer proves to be a powerful driver of performance.

Experimental results

To understand SQuId’s overall performance, we compare it to a custom Big-SSL-MOS model (described in the paper), a competitive baseline inspired by MOS-SSL, a state-of-the-art TTS evaluation system. Big-SSL-MOS is based on w2v-BERT and was trained on the VoiceMOS’22 Challenge dataset, the most popular dataset at the time of evaluation. We experimented with several variants of the model, and found that SQuId is up to 50.0% more accurate.

SQuId versus state-of-the-art baselines. We measure agreement with human ratings using the Kendall Tau, where a higher value represents better accuracy.

To understand the impact of cross-locale transfer, we run a series of ablation studies. We vary the number of locales introduced in the training set and measure the effect on SQuId’s accuracy. In English, which is already over-represented in the dataset, the effect of adding locales is negligible.

SQuId’s performance on US English, using 1, 8, and 42 locales during fine-tuning.

However, cross-locale transfer is much more effective for most other locales:

SQuId’s performance on four selected locales (Korean, French, Thai, and Tamil), using 1, 8, and 42 locales during fine-tuning. For each locale, we also provide the training set size.

To push transfer to its limit, we held 24 locales out during training and used them for testing exclusively. Thus, we measure to what extent SQuId can deal with languages that it has never seen before. The plot below shows that although the effect is not uniform, cross-locale transfer works.

SQuId’s performance on four “zero-shot” locales, using 1, 8, and 42 locales during fine-tuning.

When does cross-locale transfer occur, and how? We present many more ablations in the paper, and show that while language similarity plays a role (e.g., training on Brazilian Portuguese helps European Portuguese), it is surprisingly far from being the only factor that matters.

Conclusion and future work

We introduce SQuId, a 600M parameter regression model that leverages the SQuId dataset and cross-locale learning to evaluate speech quality and describe how natural it sounds. We demonstrate that SQuId can complement human raters in the evaluation of many languages. Future work includes accuracy improvements, expanding the range of languages covered, and tackling new error types.

Acknowledgements

The author of this post is now part of Google DeepMind. Many thanks to all authors of the paper: Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa.


Taking AI to School: A Conversation With MIT’s Anant Agarwal

In the latest episode of NVIDIA’s AI Podcast, Anant Agarwal, founder of edX and Chief Platform Officer at 2U, shared his vision for the future of online education and how AI is revolutionizing the learning experience. Agarwal, a strong advocate for Massive Open Online Courses, or MOOCs, discussed the importance of accessibility and quality in…


What Is Photogrammetry?

Thanks to “street views,” modern mapping tools can be used to scope out a restaurant before deciding to go there, better navigate directions by viewing landmarks in the area or simulate the experience of being on the road. The technique for creating these 3D views is called photogrammetry — the process of capturing images and…


NYU, NVIDIA Collaborate on Large Language Model to Predict Patient Readmission

Getting discharged from the hospital is a major milestone for patients — but sometimes, it’s not the end of their road to recovery. Nearly 15% of hospital patients in the U.S. are readmitted within 30 days of their initial discharge, which is often associated with worse outcomes and higher costs for both patients and hospitals.


Develop Physics-Informed Machine Learning Models with Graph Neural Networks

NVIDIA Modulus is a framework for building, training, and fine-tuning deep learning models for physical systems, otherwise known as physics-informed machine learning (physics-ML) models. Modulus is available as OSS (Apache 2.0 license) to support the growing physics-ML community. 

The latest Modulus software update, version 23.05, brings together new capabilities, empowering the research community and industries to develop research into enterprise-grade solutions through open-source collaboration. 

Two major components of this update are 1) supporting new network architectures that include graph neural networks (GNNs) and recurrent neural networks (RNNs), and 2) improving the ease of use for AI practitioners.

Graph neural network support 

GNNs are transforming how researchers are addressing challenges involving intricate graph structures, such as those encountered in physics, biology, and social networks. By leveraging the structure of graphs, GNNs are capable of learning and making predictions based on the relationships among nodes in a graph.

Through the application of GNNs, researchers can model systems to be represented as graphs or meshes. This capability is useful in applications such as computational fluid dynamics, molecular dynamics simulations, and material science. 

Using GNNs, researchers can better understand the behavior of complex systems with complex geometries, and generate more accurate predictions based on the learned patterns and interactions within the data.

The latest version of NVIDIA Modulus includes support for GNNs. This enables you to develop your own GNN-based models for specific use cases. Modulus includes recipes that use the MeshGraphNet architecture based on the work presented in Learning Mesh-Based Simulation with Graph Networks. Such architectures can now be trained in Modulus, which includes a MeshGraphNet model pretrained on a parameterized vortex shedding dataset. This pretrained model is available through NVIDIA NGC.
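As background, the core of a MeshGraphNet-style GNN is a message-passing layer that updates edge and node features with learned MLPs. The following is a simplified sketch of that pattern, not the Modulus implementation.

import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # MLPs that update edge and node features, as in MeshGraphNet.
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, nodes, edges, senders, receivers):
        # Update each edge from its state and the features of its two endpoints.
        e = self.edge_mlp(torch.cat([edges, nodes[senders], nodes[receivers]], dim=-1))
        # Sum incoming messages at each receiver node, then update the nodes.
        agg = torch.zeros_like(nodes).index_add_(0, receivers, e)
        n = self.node_mlp(torch.cat([nodes, agg], dim=-1))
        return nodes + n, edges + e  # residual updates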

Modulus also includes the GraphCast architecture proposed in GraphCast: Learning Skillful Medium-Range Global Weather Forecasting. GraphCast is a novel GNN-based architecture for global weather forecasting. It significantly improves on some existing models by effectively capturing spatio-temporal relationships in weather data. The weather data is modeled as a graph, where nodes represent Earth grid cells. This graph-based representation enables the model to capture both local and non-local dependencies in the data.

The architecture of GraphCast consists of four main components: embedder, encoder, processor, and decoder. The embedder component embeds the input features into a latent representation. The encoder maps the local regions of the grid’s latent features into nodes of a multi-mesh graph representation. The processor updates each multi-mesh node using learned message-passing.

Finally, the decoder maps the processed multi-mesh features back onto the grid representation. The multi-mesh is a set of icosahedral meshes with increasing resolution, providing uniform resolution across the globe. A recipe for training GraphCast on the ERA-5 dataset with support for data parallelism is provided in Modulus-Launch. Figure 1 shows the out-of-sample prediction for the 2-meter temperature using a GraphCast model trained in Modulus on a 34-variable subset of the ERA-5 dataset.

GIF showing out-of-sample prediction results for the 2-meter temperature using a GraphCast model trained in Modulus on a 34-variable subset of the ERA-5 dataset. Starting from an initial condition for the temperature on 2018-01-01, the temperature is predicted for two months with one-day intervals.
Figure 1. The out-of-sample prediction for the 2-meter temperature using a GraphCast model trained in Modulus on a 34-variable subset of the ERA-5 dataset

The GraphCast implementation supports gradient checkpointing for reducing the memory overhead. It also provides several optimizations including CuGraphOps support, fused layer norm and Adam optimizer using Apex, efficient edge feature updates, and more. 

Recurrent neural network support

Time-series prediction is a key task in many domains. The application of deep learning architectures—particularly RNNs, long short-term memory networks (LSTMs), and similar networks—has significantly enhanced the predictive capabilities. 

These models are unique in their ability to capture temporal dependencies and learn complex patterns over time, making them well suited for forecasting time varying relationships. In physics-ML, these models are critical in predicting dynamic physical systems’ evolution, enabling better simulations, understanding of complex natural phenomena, and aiding in discoveries. 

The latest version of Modulus has added support for RNN-type layers and models. This enables you to use RNNs on 2D and 3D spatial domains in model prediction workflows. Figure 2 shows a comparison between the predictions of an RNN model in Modulus and the ground truth for a Gray-Scott system.

GIF showcasing time-series prediction for a 3D Gray-Scott system using RNNs in NVIDIA Modulus.
Figure 2. 3D transient predictions on a Gray-Scott system in NVIDIA Modulus 

Modules for ease of use

The Modulus codebase has been re-architected into modules to facilitate ease of use. This is in line with PyTorch, which has become one of the most popular deep learning frameworks for researchers over the past few years, thanks to its ease of use.

The core Modulus module consists of the core framework and algorithms for physics-ML models. The Modulus-Launch module consists of optimized training recipes for accelerating PyTorch-like workflows for training models. This module enables AI researchers to have a PyTorch-like experience. NVIDIA Modulus Sym is a module based on the symbolic partial differential equation (PDE) that domain experts can use to train PDE-based physics-ML models.

One key feature of modern deep learning frameworks is their interoperability. This Modulus release makes it easier for AI developers to bring PyTorch models into Modulus and vice versa. This helps to ensure that models can be shared and reused across different platforms and environments. 

For more details about all the new features in Modulus 23.05, see the Modulus release notes.

Start using GNNs for physics-ML today

To learn more and get started with NVIDIA Modulus, see the NVIDIA Deep Learning Institute course, Introduction to Physics-Informed Machine Learning with Modulus. Kick-start your Modulus experience with the LaunchPad for Modulus free hands-on lab. With short-term access provided, there is no need to set up your own compute environment.

To try Modulus in your own environment, download the latest Modulus container or install the Modulus pip wheels. To customize and contribute to the Modulus open-source framework, visit the NVIDIA/modulus repo on GitHub.


Visual captions: Using large language models to augment video conferences with dynamic visuals

Recent advances in video conferencing have significantly improved remote video communication through features like live captioning and noise cancellation. However, there are various situations where dynamic visual augmentation would be useful to better convey complex and nuanced information. For example, when discussing what to order at a Japanese restaurant, your friends could share visuals that would help you feel more confident about ordering the “Sukiyaki”. Or when talking about your recent family trip to San Francisco, you may want to show a photo from your personal album.

In “Visual Captions: Augmenting Verbal Communication With On-the-fly Visuals”, presented at ACM CHI 2023, we introduce a system that uses verbal cues to augment synchronous video communication with real-time visuals. We fine-tuned a large language model to proactively suggest relevant visuals in open-vocabulary conversations using a dataset we curated for this purpose. We open sourced Visual Captions as part of the ARChat project, which is designed for rapid prototyping of augmented communication with real-time transcription.

Visual Captions facilitates verbal communication with real-time visuals. The system is even robust against typical mistakes that may often appear in real-time speech-to-text transcription. For example, out of context, the transcription model misunderstood the word “pier” as “pair”, but Visual Captions still recommends images of the Santa Monica Pier.

Design space for augmenting verbal communication with dynamic visuals

We invited 10 internal participants, each with various technical and non-technical backgrounds, including software engineers, researchers, UX designers, visual artists, students, etc., to discuss their particular needs and desires for a potential real-time visual augmentation service. In two sessions, we introduced low-fidelity prototypes of the envisioned system, followed by video demos of the existing text-to-image systems. These discussions informed a design space with eight dimensions for visual augmentation of real-time conversations, labeled below as D1 to D8.

Visual augmentations could be synchronous or asynchronous with the conversation (D1: Temporal), could be used for both expressing and understanding speech content (D2: Subject), and could be applied using a wide range of different visual content, visual types, and visual sources (D3: Visual). Such visual augmentation might vary depending on the scale of the meetings (D4: Scale) and whether a meeting is in co-located or remote settings (D5: Space). These factors also influence whether the visuals should be displayed privately, shared between participants, or public to everyone (D6: Privacy). Participants also identified different ways in which they would like to interact with the system while having conversations (D7: Initiation). For example, people proposed different levels of “proactivity”, which indicates the degree to which users would like the model to take the initiative. Finally, participants envisioned different methods of interaction, for example, using speech or gestures for input (D8: Interaction).

Design space for augmenting verbal communication with dynamic visuals.

Informed by this initial feedback, we designed Visual Captions to focus on generating synchronous visuals of semantically relevant visual content, type, and source. While participants in these initial exploratory sessions were participating in one-to-one remote conversations, deployment of Visual Captions in the wild will often be in one-to-many (e.g., an individual giving a presentation to an audience) and many-to-many scenarios (e.g., a discussion among multiple people in a meeting).

Because the visual that best complements a conversation depends strongly on the context of the discussion, we needed a training set specific to this purpose. So, we collected a dataset of 1595 quadruples of language (1), visual content (2), type (3), and source (4) across a variety of contexts, including daily conversations, lectures, and travel guides. For example, “I would love to see it!” corresponds to visual content of “face smiling”, a visual type of “emoji”, and visual source of “public search”. “Did she tell you about our trip to Mexico?” corresponds to visual content of “a photo from the trip to Mexico”, a visual type of “photo”, and visual source of “personal album”. We publicly released this VC1.5K dataset for the research community.

Visual intent prediction model

To predict what visuals could supplement a conversation, we trained a visual intent prediction model based on a large language model using the VC1.5K dataset. For training, we parsed each visual intent into the format of “<Visual Type> of <Visual Content> from <Visual Source>“.

{"prompt": "<Previous Two Sentences> →", 
  "completion": 
"<Visual Type 1> of "<Visual Type 1> from "<Visual Source 1>;
 <Visual Type 2> of "<Visual Type 2> from "<Visual Source 2>; 
  ... 𝑛"}

Using this format, this system can handle open-vocabulary conversations and contextually predict visual content, visual source, and visual type. Anecdotally, we found that it outperforms keyword-based approaches, which fail to handle open-vocabulary examples like “Your aunt Amy will be visiting this Saturday,” and cannot suggest relevant visual types or visual sources.

Examples of visual intent predictions by our model.

We used 1276 (80%) examples from the VC1.5K dataset for fine-tuning the large language model and the remaining 319 (20%) examples as test data. We measured the performance of the fine-tuned model with the token accuracy metric, i.e., the percentage of tokens in a batch that were correctly predicted by the model. During training, our model reached a training token accuracy of 97% and a validation token accuracy of 87%.

Performance

To evaluate the utility of the trained Visual Captions model, we invited 89 participants to perform 846 tasks. They were asked to provide feedback on a scale of “1 — Strongly Disagree” to “7 — Strongly Agree” for six qualitative statements. Most participants preferred to have the visual during a conversation (Q1, 83% ≥ 5–Somewhat Agree). Moreover, they considered the displayed visuals to be useful and informative (Q2, 82% ≥ 5–Somewhat Agree), high-quality (Q3, 82% ≥ 5–Somewhat Agree), and relevant to the original speech (Q4, 84% ≥ 5–Somewhat Agree). Participants also found the predicted visual type (Q5, 87% ≥ 5–Somewhat Agree) and visual source (Q6, 86% ≥ 5–Somewhat Agree) to be accurate given the context of the corresponding conversation.

Technical evaluation results of the visual prediction model rated by study participants.

With this fine-tuned visual intent prediction model, we developed Visual Captions on the ARChat platform, which can add new interactive widgets directly on the camera streams of video conferencing platforms, such as Google Meet. As shown in the system workflow below, Visual Captions automatically captures the user’s speech, retrieves the last sentences, feeds them into the visual intent prediction model every 100 ms, retrieves relevant visuals, and then suggests visuals in real time.

System workflow of Visual Captions.
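The described loop can be summarized in pseudocode. Every name in this sketch is hypothetical, standing in for ARChat’s transcription, prediction, retrieval, and UI layers rather than its actual API.

import time

def visual_captions_loop(transcriber, intent_model, retriever, ui):
    while True:
        text = transcriber.last_sentences(n=2)  # most recent transcript context
        intents = intent_model.predict(text)    # "<type> of <content> from <source>"
        visuals = [retriever.fetch(i) for i in intents]
        ui.suggest(visuals)                     # shown per the proactivity level
        time.sleep(0.1)                         # the model is queried every 100 ms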

Visual Captions provides three levels of proactivity when suggesting visuals:

  • Auto-display (high-proactivity): The system autonomously searches and displays visuals publicly to all meeting participants. No user interaction required.
  • Auto-suggest (medium-proactivity): The suggested visuals are shown in a private scrolling view. A user then clicks a visual to display it publicly. In this mode, the system is proactively recommending visuals, but the user decides when and what to display.
  • On-demand-suggest (low-proactivity): The system will only suggest visuals if a user presses the spacebar.

Quantitative and qualitative evaluation: User studies

We evaluated Visual Captions in both a controlled lab study (n = 26) and in-the-wild deployment studies (n = 10). Participants found that real-time visuals facilitated live conversations by helping explain unfamiliar concepts, resolve language ambiguities, and make conversations more engaging. Participants also reported different preferences for interacting with the system in-situ, and that varying levels of proactivity were preferred in different social scenarios.

Participants’ Task Load Index and Likert scale ratings (from 1 – Strongly Disagree to 7 – Strongly Agree) of four conversations without Visual Captions (“No VC”) and the three Visual Captions modes: auto-display, auto-suggest, and on-demand suggest.

Conclusions and future directions

This work proposes a system for real-time visual augmentation of verbal communication, called Visual Captions, that was trained using a dataset of 1595 visual intents collected from 246 participants, covering 15 topic categories. We publicly release the training dataset, VC1.5K, to the research community to support further research in this space. We have also deployed Visual Captions in ARChat, which facilitates video conferences in Google Meet by transcribing meetings and augmenting the camera video streams.

Visual Captions represents a significant step towards enhancing verbal communication with on-the-fly visuals. By understanding the importance of visual cues in everyday conversations, we can create more effective communication tools and improve how people connect.

Acknowledgements

This work is a collaboration across multiple teams at Google. Key contributors to the project include Xingyu “Bruce” Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Peggy Chi, Alex Olwal, and Ruofei Du.

We would like to extend our thanks to those on the ARChat team who provided assistance, including Jason Mayes, Max Spear, Na Li, Jun Zhang, Jing Jin, Yuan Ren, Adarsh Kowdle, Ping Yu, Darcy Philippon, and Ezgi Oztelcan. We would also like to thank the many people with whom we’ve had insightful discussions and those who provided feedback on the manuscript, including Eric Turner, Yinda Zhang, Feitong Tan, Danhang Tang, and Shahram Izadi. We would also like to thank our CHI reviewers for their insightful feedback.


Unlocking Speech AI Technology for Global Language Users: Top Q&As

Voice-enabled technology is becoming ubiquitous. But many are being left behind by an anglocentric and demographically biased algorithmic world. Mozilla Common Voice (MCV) and NVIDIA are collaborating to change that by partnering on a public crowdsourced multilingual speech corpus—now the largest of its kind in the world—and open-source pretrained models. It is now easier than ever before to develop automatic speech recognition (ASR) technology that works for speakers of many languages. 

This post summarizes the top questions asked during Unlocking Speech AI Technology for Global Language Users, a recorded talk from the Speech AI Summit 2022 featuring EM Lewis-Jong of Mozilla Common Voice and Caroline de Brito Gottlieb of NVIDIA. 

Do multilingual NVIDIA NeMo open-source models exist?

Caroline de Brito Gottlieb: To make Speech AI more accessible and serve a global community, we first need to understand how the world uses language. Monolingualism is an anomaly worldwide, so researchers at NVIDIA are focused on creating state-of-the-art AI for multilingual contexts. 

Through NeMo, NVIDIA has released its first model for multilingual and code-switched/code-mixed speech recognition, which can transcribe audio samples into English, Latin/North American Spanish, as well as both English and Spanish used in the same sentence—a phenomenon called code-switching, or code-mixing. NVIDIA will soon have a multilingual model on NeMo for Indic languages as well.

The switching or mixing of codes is very common in multilingual communities and communities speaking multiple dialects or varieties of the same language. This poses unique challenges for existing speech AI solutions. However, the open-source NeMo model is an important step toward AI that accurately reflects and supports how global communities actually use speech in real-world contexts. 

Do datasets extend beyond “language” to include domain-specific vocabulary? For example, finance and healthcare datasets may differ. 

EM Lewis-Jong: Domains represented within the corpora on MCV have been historically driven by communities who choose to create datasets through the platform. That means different languages have varied domains represented in their datasets—some might be heavy on news and media, whereas others might contain more educational text. If you want to enhance domain-specific coverage in a Common Voice dataset, simply go through the process of adding text into the platform through GitHub or the Sentence Collector tool. All domains are welcome.

MCV is actively rebuilding and expanding the Sentence Collector tool to make it easier to ingest large volumes of text, and tag them appropriately. Expect to see these changes in April 2023. Also, the team has been collaborating closely with NVIDIA and other data partners to ensure the metadata schema is as interoperable as possible. Domain tagging the Common Voice corpora is a big part of that.

Caroline de Brito Gottlieb: Accounting for domain-specific language is a critical challenge, in particular when applying AI solutions across industries. That is why NVIDIA Riva offers multiple techniques, such as word boosting and vocabulary extension, for customizing ASR models to improve the recognition of specific words.

Our team primarily thinks of domain as a matter of vocabulary and terminology. This alone is a big challenge, given the different levels of specialized terminology and acronyms like GPU, FTP, and more. But it is also important to collect domain-specific data beyond just individual words to capture grammatical or structural differences; for example, the way negation is expressed in clinical practice guidelines. Designing and curating domain-specific datasets is an active area of collaboration between Common Voice and NVIDIA, and we’re excited to see progress in domain-specific ASR for languages beyond English. 

How do you differentiate varied versions of Spanish, English, Portuguese, and other languages across geographies?

EM Lewis-Jong: Historically, MCV didn’t have a great system for differentiating between varied versions of a language. Communities chose between creating an entirely new dataset (organized by language), or they could use the accent field. In 2021, MCV did an intensive research exercise and discovered the following:

  1. Limited community awareness about variants: New communities without much context weren’t always sure about how to categorize themselves. Once they’d made the decision about whether to become a new language dataset or remain as an accent, it was difficult to change their minds.
  2. Dataset fragmentation: Diverse communities, such as those with large diaspora populations, may feel they need to split up entirely and set up a whole new language. This fragments the dataset and confuses contributors.
  3. Identity and experience: Some language communities and contributors make use of accent tags, but can feel marginalized and undermined by this. Talking about language is talking about power, and some people want to have the ability to identify their speech beyond ‘accent’ in ways that respect and represent them.
  4. Linguistic and orthographic diversity: Some communities felt there was no suitable arrangement for them, as their spoken language had multiple writing systems. Currently, MCV assumes a 1:1 relationship between spoken word and written word.

For these reasons, the team enabled a new category on the platform called Variant. This is intended to help communities systematically differentiate within languages, and especially to support large languages with a diverse range of speakers.

Where possible, MCV uses BCP-47 codes for tagging. BCP-47 is a flexible system that enables communities to pull out key information such as region, dialect, and orthography.

For example, the Kiswahili community might like to differentiate between Congolese Swahili and Chimwiini. Historically on the platform, this would be framed as an ‘accent’ difference—despite the fact that the variants have different vocabulary and grammar and would not be easily mutually intelligible. In other words, speakers might struggle to understand one another. 

Communities are now free to choose whether and how they make use of the variant tag. MCV is rolling this out to language communities in phases. The team produced new definitions around language, variant, and accent to act as helpful guidelines for communities. These are living definitions that will evolve with the MCV community. For more information, check out How We’re Making Common Voice Even More Linguistically Inclusive.

What are some examples of successfully deployed use cases?

EM Lewis-Jong: MCV is used by researchers, engineers, and data scientists at most of the world’s largest tech companies, as well as by academics, startups, and civil society. It is downloaded hundreds of thousands of times a year.

Some recent use cases the team is very excited about include the Kinyarwanda Mbaza chatbot, which provides COVID-19 guidance; Thai language health-tracking wearables for the visually impaired; financial planning apps in Kiswahili, like ChamaChat; and agricultural health guidance for farmers in Kenya, like LivHealth.

Caroline de Brito Gottlieb: NeMo—which uses MCV, among other datasets—is also widely deployed. Tarteel AI is a faith-based startup focused on religious and educational tech. The Tarteel team leveraged NVIDIA Riva and NeMo AI tooling to achieve a state-of-the-art word error rate (WER) of 4% on Arabic transcription by fine-tuning an English ASR model on Arabic language data. This enabled Tarteel to develop the world’s first Quranic Arabic ASR, providing technology to support a community of 1.8 billion Muslims across the world in improving their Quran recitation through real-time feedback.

In January 2023, Riva released an out-of-the-box Arabic ASR model that can be seamlessly customized for specific dialects, accents, and domains. Another use case on Singaporean English, or Singlish, is presented in Easy Speech AI Customization for Local Singaporean Voice.

How does Mozilla collect the diversity attributes of the Common Voice data set for a language, such as age and sex?

EM Lewis-Jong: MCV enables users to self-identify and associate their clips with relevant information: variant (if your language has them), accent (an important diversity attribute), sex, and age. This year MCV will expand these options for some demographic categories, in particular sex, to be more inclusive. 

This information will be associated with your clips, and then securely and safely pseudonymised before the dataset is released. You can tell MCV about your linguistic features in the usual contribution flow; however, for sensitive demographic attributes, you must create an account. 

What type of ASR model is best to use when fine-tuning a particular language?

Caroline de Brito Gottlieb: NeMo is a toolkit with pretrained models that enables you to fine-tune for your own language and specific use case. State-of-the-art pretrained NeMo models are freely available on NGC, the NVIDIA hub for GPU-optimized software, and HuggingFace. Check out the extensive tutorials that can all be run on Google Colab, and a full suite of example scripts supporting multi-GPU/multi-node training.

In addition to the languages already offered in NeMo ASR, community members have used NeMo to obtain state-of-the-art results for new languages, dialects, variants, and accents by fine-tuning NeMo base models. Much of that work has used NVIDIA pretrained English language ASR models, but I encourage you to try fine-tuning a NeMo model for a language most closely related to the one you are working on. You can start by looking up the family and genealogical classification of a language in Glottolog.
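As a minimal sketch, loading a pretrained NeMo checkpoint and pointing it at your own data might look like the following; the manifest path is a placeholder, and a real fine-tuning run would also change the tokenizer and other config for the target language.

import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Load a pretrained English Conformer checkpoint from NGC as the base model.
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="stt_en_conformer_ctc_large"
)

# Placeholder manifest in NeMo's JSON-lines format (audio path, duration, text).
asr_model.setup_training_data(train_data_config={
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
})

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)
trainer.fit(asr_model)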

My native language, Yoruba, is not on MCV. What can be done to include it along with its different dialects?

EM Lewis-Jong: Anyone can add a new language to MCV. Reach out about adding your language.

There are two stages to the process: translating the site and collecting sentences.

Translating the site involves a Mozilla tool called Pontoon for translations. Pontoon has lots of languages, but if it doesn’t have yours you can request for your language to be added. Then, to make the language available on the Common Voice project, request the new language on GitHub. Get more details about site translation and how to use Pontoon.

Collecting sentences involves adding small numbers of sentences, or performing bulk imports using GitHub. Remember that sentences need to be CC0 (or public domain), or you can write your own. Learn more about sentence collection and using the Sentence Collector.

Does data augmentation factor into the need for more diversity? 

Caroline de Brito Gottlieb: Speech AI models need to be robust for diverse environmental factors and contextual variations, especially as the team scales up to more languages, communities, and therefore, contexts. However, authentic data is not always available to represent this diversity. 

Data augmentation is a powerful tool to enhance the size and variety of datasets by simulating speech data characteristics. When applied to training data, the resulting expanded or diversified dataset can help models generalize better to new scenarios and unseen data. 

When data augmentation techniques are applied to datasets used for testing, it enables understanding the model’s performance in an expanded variety of speech data contexts. NeMo offers various data augmentation techniques such as noise perturbation, speech perturbation, and time stretch augmentation, which can be applied to training and testing data.
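To illustrate the idea (not NeMo’s implementation), two common augmentations can be sketched in a few lines of NumPy:

import numpy as np

def noise_perturb(waveform, snr_db):
    # Mix in Gaussian noise at a target signal-to-noise ratio (in dB).
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def speed_perturb(waveform, rate):
    # Resample by `rate` (e.g., 0.9 or 1.1), changing duration and pitch.
    idx = np.arange(0, len(waveform) - 1, rate)
    return np.interp(idx, np.arange(len(waveform)), waveform)

augmented = noise_perturb(speed_perturb(np.random.randn(16000), 1.1), snr_db=20)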

Do the datasets in MCV support different accents, such as speaking German with a French accent?

EM Lewis-Jong: There are as many accents as there are speakers, and all are welcome.  As of December 2021, you can easily add multiple accents in your profile page.

Accents are not limited by what others have chosen. You can stipulate your accent on your own terms, making it easier for contributors to quickly identify their speech in a natural way. 

For example, if you’re a French speaker originally from Germany, who learned French in a Cote D’Ivoire context, you can add accents like ‘German’ and ‘Cote D’Ivoire’ to your French clip submissions. 

Summary

To create a healthier AI ecosystem, communities need to be meaningfully engaged in the data creation process. In addition, open-sourcing speech datasets and ASR models enables innovation for everyone. 

If you would like to contribute to the public crowdsourced multilingual speech corpus, check out NVIDIA NeMo on GitHub and Mozilla Common Voice to get involved. 


Fish-Farming Startup Casts AI to Make Aquaculture More Efficient, Sustainable

As a marine biology student, Josef Melchner always dreamed of spending his days cruising the oceans to find dolphins, whales and fish — but also “wanted to do something practical, something that would benefit the world,” he said. When it came time to choose a career, he dove head first into aquaculture. He’s now CEO…


Technical Artist Builds Great Woolly Mammoth With NVIDIA Omniverse USD Composer This Week ‘In the NVIDIA Studio’

Keerthan Sathya, a senior technical artist specializing in 3D, emerged trium-elephant In the NVIDIA Studio this week with the incredibly detailed, expertly constructed, jaw-droppingly beautiful animation Tiny Mammoth.