Universal Scene Description as the Language of the Metaverse

Over the past few decades, the Internet has fundamentally changed the world and set in motion an enormous transformation in the way we consume and share information. The transformation is so complete that today, a quality web presence is vital for nearly all businesses, and interacting with the web is central to functioning effectively in the modern world. 

The web has evolved from static documents to dynamic applications involving rich interactive media. Yet despite the fact that we live in a 3D world, the web remains overwhelmingly two-dimensional.

Now we find ourselves at the threshold of the web’s next major advancement: the advent of the 3D Internet or metaverse. Instead of linking together 2D pages, the metaverse will link together virtual worlds. Websites will become interconnected 3D spaces akin to the world we live in and experience every day. 

Many of these virtual worlds will be digital twins reflecting the real world, linked and synchronized in real time. Others will be designed for entertainment, socializing, gaming, learning, collaboration, or commerce. 

No matter what the purpose of any individual site, what will make the entire metaverse a success will be the same thing that has made the 2D web so successful: universal interoperability based on open standards and protocols.

The most fundamental standard needed to create the metaverse is the description of a virtual world. At NVIDIA, we believe the first version of that standard already exists. It is Universal Scene Description (USD) — an open and extensible body of software for describing 3D worlds that was originally developed by Pixar to enable their complex animated film production workflows. 

Open sourced in 2015, USD is now being used in a wide range of industries not only in media and entertainment, but also spanning architecture, engineering, design, manufacturing, retail, scientific computing, and robotics, among others.

USD is more than a file format  

USD is a scene description: a set of data structures and APIs to create, represent, and modify virtual worlds. The representation is rich. It supports not only the basics of virtual worlds, like geometry, cameras, lights, and materials, but also a wide variety of relationships among them, including property inheritance, instancing, and specialization. 

It includes features necessary for scaling to large data sets like lazy loading and efficient retrieval of time-sampled data. It is tremendously extensible, allowing users to customize data schemas, input and output formats, and methods for finding assets. In short, USD covers the very broad range of requirements that Pixar found necessary to make its feature films.
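
As an illustration, here is what a small scene might look like in USDA, USD's human-readable text format (the prim names and values here are invented for this example):

```usda
#usda 1.0
(
    defaultPrim = "World"
    upAxis = "Y"
)

def Xform "World"
{
    def Sphere "Ball"
    {
        double radius = 5
        color3f[] primvars:displayColor = [(0.8, 0.1, 0.1)]
    }

    def Camera "MainCam"
    {
        float focalLength = 35
    }
}
```

Each `def` declares a prim, the basic unit of the scene graph, and properties like `radius` and `focalLength` are the facts that a renderer or simulator reads back out of the composed scene.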

Image showing the layered workflow for a factory assembly line simulation.
Figure 1. A visual representation of how USD enables layered workflows for industry-specific use cases

Layers are probably the single most innovative feature of USD. Conceptually, they have some similarities to layers in Adobe Photoshop: the final composite is the result of combining the effects of all the layers in order. But instead of modifying the pixels of an image like Photoshop layers, USD layers modify the properties of the composed scene. Most importantly, they provide a powerful mechanism for collaboration. 

Different users can modify the composed scene on different layers, and their edits will be non-destructive. The stronger layer will win out in the composition, but the data from the weaker layer remains accessible. Beyond direct collaboration, the ability that layers provide to non-destructively modify what others have done enables the kind of composability that has made the traditional web so successful.
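
A sketch of how this looks in practice (the file and prim names are hypothetical): a root layer composes two sublayers, and an `over` authored in the stronger root layer overrides a property from a weaker layer without erasing it:

```usda
#usda 1.0
(
    subLayers = [
        @set_dressing.usda@,
        @base_room.usda@
    ]
)

over "Room"
{
    over "Sofa"
    {
        color3f[] primvars:displayColor = [(0.2, 0.4, 0.8)]
    }
}
```

Sublayers earlier in the list are stronger, and opinions authored directly in this root layer are stronger still, so the `displayColor` here wins in the composed scene while the value originally authored in `base_room.usda` remains intact on disk.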

Image showing the layers of a Brownstone room interior created with USD: the empty room, the staged room, different seating material covers, and alternate furniture layouts and colors.
Figure 2. The layers of a Brownstone room interior created with USD: the empty room, the staged room, different seating material covers, and alternate furniture layouts and colors

NVIDIA believes that USD should serve as the HTML of the metaverse: the declarative specification of the contents of a web site. But just as HTML evolved from the limited static documents of HTML 1 to the dynamic applications of HTML 5, it is clear that USD will need to evolve to meet the needs of the metaverse. To accelerate this evolution, NVIDIA has already made a number of additions to the USD ecosystem:

In the short term, NVIDIA is developing:

  • glTF interoperability: A glTF file format plugin will allow glTF assets to be referenced directly by USD scenes. This means that users who are already using glTF can take advantage of the composition and collaboration features of USD without having to alter their existing assets.
  • Geospatial schema (WGS84): NVIDIA is developing a geospatial schema and runtime behavior in USD to support the WGS84 standard for geospatial coordinates. This will facilitate full-fidelity digital twin models that need to incorporate the curvature of the earth’s surface.
  • International character (UTF-8) support: NVIDIA is working with Pixar to add support for UTF-8 identifiers to USD, allowing for full interchange of content from all over the world.
  • USD compatibility testing and certification suite: To further accelerate USD development and adoption, NVIDIA is building an open source suite for USD compatibility testing and certification. Developers will be able to test their builds of USD and certify that their custom USD components produce an expected result.

In the longer term, NVIDIA is working with partners to fill some of the larger remaining gaps in USD:

  • High-speed incremental updates: USD was not designed for high-speed dynamic scene updates, but digital twin simulations will require this. NVIDIA is developing additional libraries on top of USD that enable much higher update rates to support real-time simulation. 
  • Real-time proceduralism: USD as it currently exists is almost entirely declarative. Properties and values in the USD representation, for the most part, describe facts about the virtual world. NVIDIA has begun to augment this with a procedural graph-based execution engine called OmniGraph.
  • Compatibility with browsers: Today, USD is C++/Python based, but web browsers are not. To be accessible by everyone, everywhere, virtual worlds will need to be capable of running inside web browsers. NVIDIA will be working to ensure that proper WebAssembly builds with JavaScript bindings are available, making USD an attractive development option when running inside a browser is the best approach.
  • Real-time streaming of IoT data: Industrial virtual worlds and live digital twins require real-time streaming of IoT data. NVIDIA is working on building USD connections to IoT data streaming protocols.

Companies across industrial and manufacturing sectors—including Ericsson, Kroger, and Volvo—are adopting USD to enable their 3D virtual worlds and asset projects.

Get started building virtual worlds with USD

NVIDIA Omniverse is a scalable computing platform for full-design-fidelity 3D simulation workflows and a toolkit for building USD-based metaverse applications. Omniverse was built from the ground up as a USD engine and open toolkit for building custom, interoperable 3D pipelines. 

You can access a wealth of USD resources from NVIDIA, available online for free. A good place to start is NVIDIA’s hub of USD resources. To learn the basics of USD with examples in USDA and Python in a step-by-step web tutorial, sign up for the USD DLI course.

Experimenting with USD is easy with precompiled USD binaries. These Windows/Linux distributions will help you get started developing tools that take advantage of USD or start using USDView from Omniverse Launcher. For Python developers, the easiest way to start reading and writing USD layers is with the usd-core Python Package.

If you’re looking for USD sample data, numerous sample USD scenes are available, including a physics-based marbles mini-game sample and an attic scene with MDL materials rendered in Omniverse. In addition, USD SimReady Content includes component models from various industries prepared for simulation workflows.

Learn more in the Omniverse Resource Center, which details how developers can build custom USD-based applications and extensions for the platform. 

Follow Omniverse on Instagram, Twitter, YouTube, and Medium for additional resources and inspiration. Check out the Omniverse forums and join our Discord Server and Twitch to chat with the community.

Enter the NVIDIA #ExtendOmniverse contest with an extension created in Omniverse Code for a chance to win an NVIDIA RTX GPU. Join NVIDIA at SIGGRAPH 2022 to learn more about the latest Omniverse announcements and watch the Special Address on demand. And don’t miss the global premiere of the documentary, The Art of Collaboration: NVIDIA, Omniverse, and GTC on August 10 at 10 AM, Pacific time.

NVIDIA Announces Full Open Source of Material Definition Language to Streamline Graphics Pipelines

NVIDIA at SIGGRAPH 2022 today announced the full open sourcing of Material Definition Language (MDL)—including the MDL Distiller and GLSL backend technologies—to further expand the MDL ecosystem.

Building the world’s most accurate and scalable models for material and rendering simulation is a continuous effort, requiring flexibility and adaptability. MDL is NVIDIA’s vision for renderer algorithm-agnostic material definitions for material exchange. 

MDL unlocks material representations from current silos, allowing them to traverse software ecosystems. It can be used to define complex, physically accurate materials, or to reduce material complexity where needed to boost performance.

Dynamic materials support

NVIDIA is open sourcing the MDL Distiller to enable best-in-class implementations of MDL support for all kinds of renderers. Built as a companion technology to the MDL SDK and language, the Distiller is a fully automated solution that simplifies any MDL material to the reduced material model of a simpler renderer. As a renderer developer, you can now provide the appropriate MDL Distiller rules, rather than relying on material artists to author simpler materials for simple renderers.

MDL can now be used to author one high-quality, single-source-of-truth material without making compromises or variants for less-capable renderers. Approximations and simplifications are left to the software. When a renderer’s capabilities improve, for example, the MDL Distiller rules can be upgraded and existing materials improve without the need to re-author the content.

Workflow flexibility and efficiency

New open source GLSL backend technologies provide MDL support to renderer developers building on OpenGL or Vulkan, closing the gap to established graphics API standards. The MDL Distiller and GLSL backend will enable many more developers to leverage the power of MDL.

By unlocking material representations, artists have the freedom to work across ecosystems while maintaining material appearance. This means a material can be created once and then shared across multiple applications. 

The ability to share physically-based materials between supporting applications, coupled with the flexibility of use in renderers and platforms like NVIDIA Omniverse, is a key advantage of MDL. This flexibility saves time and effort in many workflows and pipelines.

Learn more about NVIDIA MDL and download the SDK to get started.

Watch the NVIDIA Special Address at SIGGRAPH 2022 on demand, and join NVIDIA at SIGGRAPH to see more of the latest technology breakthroughs in graphics, AI, and virtual worlds.

NVIDIA and Partners Build Out Universal Scene Description to Accelerate Industrial Metaverse and Next Wave of AI

NVIDIA today announced a broad initiative to evolve Universal Scene Description (USD), the open-source and extensible language of 3D worlds, to become a foundation of the open metaverse and 3D internet.

Upping the Standard: NVIDIA Introduces NeuralVDB, Bringing AI and GPU Optimization to Award-Winning OpenVDB

NVIDIA today announced NeuralVDB, which brings the power of AI to OpenVDB, the industry-standard library for simulating and rendering sparse volumetric data, such as water, fire, smoke, and clouds. Building on the past decade’s development of OpenVDB, the introduction at SIGGRAPH of NeuralVDB is a game-changer for professionals working in areas like scientific computing and…

New NVIDIA Neural Graphics SDKs Make Metaverse Content Creation Available to All

The creation of 3D objects for building scenes for games, virtual worlds including the metaverse, product design, or visual effects is traditionally a meticulous process, where skilled artists balance detail and photorealism against deadlines and budget pressures. It takes a long time to make something that looks and acts as it would in the physical…

NVIDIA Announces Major Release of Omniverse With New USD Connectors and Tools, Simulation Technologies and Developer Frameworks

NVIDIA today announced a new range of developer frameworks, tools, apps and plugins for NVIDIA Omniverse™, the platform for building and connecting metaverse worlds based on Universal Scene Description (USD).

Virtual Assistants and Digital Humans on Pace to Ace Turing Test With New NVIDIA Omniverse Avatar Cloud Engine

NVIDIA today announced NVIDIA Omniverse Avatar Cloud Engine (ACE), a suite of cloud-native AI models and services that make it easier to build and customize lifelike virtual assistants and digital humans.

How to Build an Edge Solution: Common Questions and Resources for Success

Learning about new technologies can sometimes be intimidating. The NVIDIA edge computing webinar series aims to present the basics of edge computing so that all attendees can understand the key concepts associated with this technology. 

NVIDIA recently hosted the webinar, Edge Computing 201: How to Build an Edge Solution, which explores the components needed to build a production edge AI deployment. During the presentation, attendees were asked various polling questions about their knowledge of edge computing, their biggest challenges, and their approaches to building solutions. 

You can see a breakdown of the results from those polling questions below, along with some answers and key resources that will help you along in your edge journey. 

What stage are you in on your edge computing journey? 

More than half (55%) of the poll respondents said they are still in the learning phase of their edge computing journey. The first two webinars in the series—Edge Computing 101: An Introduction to the Edge and Edge Computing 201: How to Build an Edge Solution—are specifically designed for those who are just starting to learn about edge computing and edge AI. 

Another resource for learning the basics of edge computing is Top Considerations for Deploying AI at the Edge. This reference guide includes all the technologies and decision points that need to be considered when building any edge computing deployment. 

In addition, 25% of respondents report that they are researching the right edge computing use cases for their organizations. By far, the most mature edge computing use case today is computer vision, or vision AI. Because computer vision workloads require high bandwidth and low latency, they are ideally suited to what edge computing has to offer. 

Let’s Build Smarter, Safer Spaces with AI provides a deep dive into computer vision use cases, and walks you through many of the technologies associated with making these use cases successful. 

What is your biggest challenge when designing an edge computing solution? 

Respondents were more evenly split across several different answers for this polling question. In each subsection below, you can read more about the challenges audience members reported experiencing while getting started with edge computing, along with some resources that can help.

Unsure what components are needed

Although each use case and environment will have unique, specific requirements, almost every edge deployment will include three components:

  1. An application that can be deployed and managed across multiple environments
  2. Infrastructure that provides the right compute and networking to enable the desired use case
  3. Security tools that will protect intellectual property and critical data

Of course, there will be additional considerations, but focusing on these three components provides organizations with what they need to get started with AI at the edge. 

Here’s what a typical edge deployment looks like:

Diagram of an edge deployment workflow
Figure 1. A typical edge deployment workflow

To learn more, see Edge Computing 201: How to Build an Edge Solution. It covers all the considerations for building an edge solution, the needed components, and how to ensure those components work together to create a seamless workflow. 

Implementation challenges

Understanding the steps involved in implementing an edge computing solution is a good way to ensure that the first solution built is a comprehensive solution, which will help eliminate future headaches when maintaining or scaling. 

This understanding will also help to eliminate unforeseen challenges. The five main steps to implementing any edge AI solution are: 

  1. Identify a use case or challenge to be solved
  2. Determine what data and application requirements exist
  3. Evaluate existing edge infrastructure and what pieces must be added
  4. Test the solution and then roll it out at scale
  5. Share success with other groups to promote additional use cases
Diagram showing the five steps to get started with an edge AI project: identify the use case for the edge, determine data requirements, analyze capabilities, roll out edge solutions, and celebrate success
Figure 2. The five steps to implement an edge AI project

To learn more about how to implement an edge computing solution, see Steps to Get Started With Edge AI, which outlines best practices and pitfalls to avoid along the way. 

Scaling across multiple sites

Scaling a solution across multiple sites (sometimes thousands) is one of the most important, yet challenging, tasks associated with edge computing. Some organizations try to manually build solutions to help manage deployments, but find that the resources required to scale these solutions are not sustainable. 

Other organizations try to repurpose data center tools to manage their applications at the edge, but doing so requires custom scripts and automation to adapt these solutions to new environments. These customizations become difficult to support as infrastructure footprints grow and new workloads are added. 

Kubernetes-based solutions can help deploy, manage, and scale applications across multiple edge locations. These tools are built specifically for edge environments and can come with enterprise support packages. Examples include Red Hat OpenShift, VMware Tanzu, and NVIDIA Fleet Command.

Fleet Command is purpose-built for AI. It’s turnkey, secure, and can scale to thousands of devices in minutes. Watch the Simplify AI Management at the Edge demo to learn more. 

Tuning an application for edge use cases

The most important aspects of an edge computing application are flexibility and performance. Applications need to be able to operate in many different environments, and need to be portable enough that they can be easily managed across distributed locations. 

In addition, organizations need applications that they can rely on. Applications need to maintain performance in sometimes extreme locations where network connectivity may be spotty, like an oil rig in the middle of the ocean. 

To fulfill both of those requirements, many organizations have turned to cloud-native technology to ensure their applications have the required level of flexibility and performance. By making an application cloud-native, organizations help ensure that the application is ready for edge deployments. 

To learn more, see Getting Applications Ready for Cloud-Native.

Justifying the cost of a solution

Justifying the cost of any technology comes down to understanding the cost variables and proving ROI. For an edge computing solution, there are three main cost variables:

  1. Infrastructure costs
  2. Application costs
  3. Management costs

Proving the ROI of a deployment will vary by use case and will be different for each organization. Generally, ROI depends a lot on the value of the AI application deployed at the edge. 

Learn more about the costs associated with an edge deployment with Building an Edge Strategy: Cost Factors.

Securing edge environments

Edge computing environments have unique security considerations. That’s because they cannot rely on the castle-and-moat security architecture of a data center. For instance, physical security of data and equipment are factors that must be considered when deploying AI at the edge. Additionally, if there are connections from edge devices back to an organization’s central network, ensuring encrypted traffic between the two devices is essential. 

The best approach is to find solutions that offer layered security from cloud-to-edge, providing several security protocols to ensure intellectual property and critical data are always protected. 

To learn more about how to secure edge environments, see Edge Computing: Considerations for Security Architects.

Do you plan to deploy containerized applications at the edge? 

Cloud-native technology was discussed in the Edge Computing 201 webinar as a way to ensure applications deployed at the edge are flexible and have a reliable level of performance. 54% of respondents reported that they plan on deploying containerized applications at the edge, while 38% said they were unsure. 

Organizations need flexible applications at the edge because the edge locations they are deploying to might have varying requirements. For instance, not all grocery stores are the same size. Some bigger grocery stores might have high power requirements with over a dozen cameras deployed, while a smaller grocery store might have extremely limited power requirements with just one or two cameras deployed. 

Despite the differences, an organization needs to be able to deploy the same application across both of these environments with confidence that the application can easily adapt. 

Cloud-native technology allows for this flexibility, while providing key reliability: applications are re-spun if there are issues, and workloads are migrated if a system fails. 

Learn more about how cloud-native technology can be used in edge environments with The Future of Edge AI Is Cloud Native.

Have you considered how you will manage applications and systems at the edge?

When asked if they have considered how they will manage applications and systems at the edge, 52% of respondents reported they are building their own solution, while 24% are buying a solution from a partner. 24% reported they have not considered a management solution. 

For AI at the edge, a management solution is a critical tool. The scale and distance of locations makes manually managing all of them very difficult for production deployments. Even managing a small handful of locations becomes more tedious than it needs to be when an application requires an update or new security patch. 

The ‘Scaling across multiple sites’ section above outlines why manual solutions are difficult to scale. They are often useful for POCs or experimental deployments, but for any production environment, a management tool will save many headaches. 

NVIDIA Fleet Command is a managed platform for container orchestration that streamlines provisioning and deployment of systems and AI applications at the edge. It simplifies the management of distributed computing environments with the scale and resiliency of the cloud, turning every site into a secure, intelligent location. 

To learn more about how Fleet Command can help manage edge deployments, watch the Simplify AI Management at the Edge demo. 

Looking ahead

Edge computing is a new yet proven concept for particular use cases. Understanding the basics of this technology can help many organizations accelerate workloads to drive their bottom line. 

While the Edge Computing 101 and Edge Computing 201 webinar sessions focused on designing and building edge solutions, Edge Computing 301: Maintaining and Optimizing Deployments dives into the ongoing, day-to-day management of edge deployments. Sign up to continue your edge computing learning journey. 

Essential Guide to Automatic Speech Recognition Technology

Interested in speech recognition technology? Sign up for the NVIDIA speech AI newsletter.

Over the past decade, AI-powered speech recognition systems have slowly become part of our everyday lives, from voice search to virtual assistants in contact centers, cars, hospitals, and restaurants. These speech recognition developments are made possible by deep learning advancements.

Developers across many industries now use automatic speech recognition (ASR) to increase business productivity, application efficiency, and even digital accessibility. Read on to learn more about ASR, how it works, use cases, advancements, and more.

What is automatic speech recognition?

Speech recognition technology is capable of converting spoken language (an audio signal) into written text that is often used as a command.

Today’s most advanced software can accurately process varying language dialects and accents. For example, ASR is commonly seen in user-facing applications such as virtual agents, live captioning, and clinical note-taking. Accurate speech transcription is essential for these use cases.

Developers in the speech AI space also use alternative terminologies to describe speech recognition such as ASR, speech-to-text (STT), and voice recognition.

ASR is a critical component of speech AI, which is a suite of technologies designed to help humans converse with computers through voice.

Why natural language processing is used in speech recognition

Developers are often unclear about the role of natural language processing (NLP) models in the ASR pipeline. Aside from being applied in language models, NLP is also used to augment generated transcripts with punctuation and capitalization at the end of the ASR pipeline.

After the transcript is post-processed with NLP, the text is used for downstream language modeling tasks including:

  • Sentiment analysis
  • Text analytics
  • Text summarization
  • Question answering

Speech recognition algorithms

Speech recognition algorithms can be implemented in a traditional way using statistical algorithms, or by using deep learning techniques such as neural networks to convert speech into text.

Traditional ASR algorithms

Hidden Markov models (HMMs) and dynamic time warping (DTW) are two examples of traditional statistical techniques for performing speech recognition.

Using a set of transcribed audio samples, an HMM is trained to predict word sequences by varying the model parameters to maximize the likelihood of the observed audio sequence.
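
Once trained, the most likely hidden-state sequence for new audio is typically recovered with the Viterbi algorithm. Below is a toy sketch; the two states, three observation symbols, and all probabilities are invented purely for illustration:

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence (toy HMM)."""
    n_states = trans_p.shape[0]
    T = len(obs)
    # Work in log space to avoid numerical underflow on long sequences.
    log_delta = np.log(start_p) + np.log(emit_p[:, obs[0]])
    back = np.zeros((T, n_states), dtype=int)
    for t in range(1, T):
        scores = log_delta[:, None] + np.log(trans_p)  # [from_state, to_state]
        back[t] = scores.argmax(axis=0)
        log_delta = scores.max(axis=0) + np.log(emit_p[:, obs[t]])
    # Backtrack from the best final state.
    path = [int(log_delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Two hidden states ("silence", "speech") and three observation symbols.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit  = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], start, trans, emit))  # -> [0, 1, 1, 1]
```

Real ASR systems chain many such models (per phoneme, with continuous acoustic features), but the dynamic-programming core is the same.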

DTW is a dynamic programming algorithm that finds the best possible word sequence by calculating the distance between time series: one representing the unknown speech and others representing the known words.
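
The DTW recurrence described above can be sketched in a few lines of Python; the sequences and word templates below are invented for illustration:

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # skip a frame of a
                                 cost[i][j - 1],      # skip a frame of b
                                 cost[i - 1][j - 1])  # match frames
    return cost[n][m]

# The unknown utterance is compared against one template per known word;
# the template with the smallest warped distance wins.
unknown   = [1, 2, 3, 3, 2]
templates = {"yes": [1, 2, 3, 2], "no": [5, 5, 4]}
best = min(templates, key=lambda w: dtw_distance(unknown, templates[w]))
print(best)  # -> yes
```

Because the warping path may repeat or skip frames, the "yes" template matches the unknown sequence exactly even though the two differ in length, which is precisely what makes DTW robust to variations in speaking rate.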

Deep learning ASR algorithms

In recent years, developers have turned to deep learning for speech recognition because statistical algorithms are less accurate. Deep learning algorithms are better at understanding dialects, accents, context, and multiple languages, and they transcribe accurately even in noisy environments.

Some of the most popular state-of-the-art speech recognition acoustic models are QuartzNet, CitriNet, and Conformer. In a typical speech recognition pipeline, you can choose and switch between acoustic models based on your use case and performance needs.

Implementation tools for deep learning models

Several tools are available for developing deep learning speech recognition models and pipelines, including Kaldi, Mozilla DeepSpeech, NVIDIA NeMo, Riva, TAO Toolkit, and services from Google, Amazon, and Microsoft.

Kaldi, DeepSpeech, and NeMo are open-source toolkits that help you build speech recognition models. TAO Toolkit and Riva are closed-source SDKs that help you develop customizable pipelines that can be deployed in production.

Cloud service providers like Google, AWS, and Microsoft offer generic services that you can easily plug into your applications.

Deep learning speech recognition pipeline

As shown in Figure 1, an ASR pipeline consists of the following components: a spectrogram generator that converts raw audio to spectrograms; an acoustic model that takes the spectrograms as input and outputs a matrix of probabilities over characters over time; a decoder (optionally coupled with a language model) that generates possible sentences from the probability matrix; and finally, a punctuation and capitalization model that formats the generated text for easier human consumption.

A typical deep learning pipeline for speech recognition includes:

  • Data preprocessing
  • Neural acoustic model
  • Decoder (optionally coupled with an n-gram language model)
  • Punctuation and capitalization model

Figure 1 shows an example of a deep learning speech recognition pipeline:

Diagram showing the ASR pipeline
Figure 1. An example of a deep learning speech recognition pipeline

Datasets are essential in any deep learning application. Neural networks function similarly to the human brain. The more data you use to teach the model, the more it learns. The same is true for the speech recognition pipeline.

A few popular speech recognition datasets are LibriSpeech, Fisher English Training Speech, Mozilla Common Voice (MCV), VoxPopuli, 2000 HUB5 English Evaluation Speech, AN4 (which includes recordings of people spelling out addresses and names), and the AISHELL-1/AISHELL-2 Mandarin speech corpora. These open-source datasets are a good starting point alongside your own proprietary datasets.

Data preprocessing is the first step. It includes augmentation techniques such as speed, time, noise, and impulse perturbation; time-stretch augmentation; fast Fourier transforms (FFT) using windowing; and normalization techniques.

For example, in Figure 2 below, the mel spectrogram is generated from a raw audio waveform after applying FFT using the windowing technique.

Diagram showing two forms of an audio recording: waveform (left) and mel spectrogram (right).
Figure 2. An audio recording raw audio waveform (left) and mel spectrogram (right)
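
The FFT-plus-windowing step can be sketched with NumPy. In this minimal example, a synthetic sine tone stands in for recorded speech, and the mel filterbank that would normally follow is omitted for brevity:

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram via windowed FFT (mel filterbank omitted)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and apply the window.
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft: FFT of real-valued input; keep the magnitude per frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

# 1 second of a 440 Hz tone sampled at 16 kHz (a stand-in for speech audio)
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(audio)
print(spec.shape)  # (time frames, frequency bins)
```

With a 400-sample frame at 16 kHz, each frequency bin spans 40 Hz, so the 440 Hz tone shows up as a bright horizontal band around bin 11; a mel filterbank would then re-bin these frequencies onto a perceptual scale.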

We can also use perturbation techniques to augment the training dataset. Figures 3 and 4 represent techniques like noise perturbation and masking being used to increase the size of the training dataset in order to avoid problems like overfitting.

Diagram showing two forms of a noise augmented audio recording: waveform (left) and mel spectrogram (right).
Figure 3. Noise augmented audio waveform to noise augmented mel spectrogram image
Diagram showing two forms of a noise augmented audio recording: mel spectrogram (left) and masked mel spectrogram (right).
Figure 4. Noise augmented mel spectrogram to noise augmented masked mel spectrogram image
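A minimal sketch of two of these augmentations, assuming NumPy: white-noise perturbation applied to the waveform at a chosen signal-to-noise ratio, and SpecAugment-style time/frequency masking applied to the spectrogram. The parameter values are illustrative defaults, not tuned settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(audio, snr_db=10.0):
    """Noise perturbation: mix in white noise at a target SNR (in dB)."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(scale=np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def spec_mask(spec, n_time_masks=2, n_freq_masks=2, max_width=10):
    """SpecAugment-style masking: zero out random time and frequency bands
    of a (time, frequency) spectrogram, returning a new array."""
    out = spec.copy()
    t, f = out.shape
    for _ in range(n_time_masks):
        w = rng.integers(1, max_width + 1)
        start = rng.integers(0, max(t - w, 1))
        out[start:start + w, :] = 0.0        # mask a band of time steps
    for _ in range(n_freq_masks):
        w = rng.integers(1, max_width + 1)
        start = rng.integers(0, max(f - w, 1))
        out[:, start:start + w] = 0.0        # mask a band of frequencies
    return out
```

Because each call draws new random noise and mask positions, the same utterance can be reused many times during training, effectively multiplying the dataset.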

The output of the data preprocessing stage is a spectrogram or mel spectrogram: a visual representation of the strength of the audio signal across frequencies over time.

Mel spectrograms are then fed into the next stage: the neural acoustic model. QuartzNet, CitriNet, ContextNet, Conformer-CTC, and Conformer-Transducer are examples of cutting-edge neural acoustic models. Multiple ASR models exist because use cases differ in their requirements for real-time performance, accuracy, memory footprint, and compute cost.

However, Conformer-based models are becoming more popular due to their improved accuracy and their ability to capture both local and global context in the audio. The acoustic model returns the probability of each character or word at each time step.

Figure 5 shows the output of the acoustic model, with time stamps. 

Diagram showing the output of acoustic model which includes probabilistic distribution over vocabulary characters per each time step.
Figure 5. The acoustic model’s output includes a probability distribution over vocabulary characters for each time step

The acoustic model’s output is fed into the decoder, optionally along with a language model. Decoders include greedy and beam search decoders; language models include n-gram models (such as those built with KenLM) and neural rescoring models. The decoder generates candidate words from the probability matrix, and the language model helps select the most likely sentence.

In Figure 6, the decoder selects the next best word based on its probability score. The word or sentence with the highest final score is selected and sent to the punctuation and capitalization model.

Diagram showing how a decoder picks the next word based on the probability scores to generate a final transcript.
Figure 6. An example of a decoder workflow
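The simplest of these decoders, the greedy decoder, can be sketched in a few lines: take the most likely symbol at each time step, collapse consecutive repeats, and drop the blank symbol (this is greedy CTC decoding; beam search with a language model is more involved). The toy vocabulary and probability matrix below are invented purely for illustration.

```python
import numpy as np

def ctc_greedy_decode(probs, vocab, blank=0):
    """Greedy CTC decoding over a (time, vocab) probability matrix:
    argmax per time step, collapse repeats, remove the blank symbol."""
    best = probs.argmax(axis=1)
    out = []
    prev = blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(vocab[idx])
        prev = idx
    return "".join(out)

# Toy probability matrix over the vocabulary [blank, 'h', 'i']:
vocab = ["_", "h", "i"]
probs = np.array([
    [0.1, 0.8, 0.1],   # 'h'
    [0.1, 0.7, 0.2],   # 'h' again -> collapsed with the previous step
    [0.8, 0.1, 0.1],   # blank -> dropped
    [0.1, 0.1, 0.8],   # 'i'
])
print(ctc_greedy_decode(probs, vocab))  # hi
```

The blank symbol is what lets CTC distinguish a collapsed repeat ("hh" emitted across adjacent steps) from a genuine double letter, which must be separated by a blank.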

The ASR pipeline generates text with no punctuation or capitalization.

Finally, a punctuation and capitalization model is used to improve the text quality for better readability. Bidirectional Encoder Representations from Transformers (BERT) models are commonly used to generate punctuated text.

Figure 7 illustrates a simple before-and-after example of the punctuation and capitalization model:

Diagram showing how a punctuation and capitalization model adds punctuations & capitalizations to a generated transcript.
Figure 7. Sample output of punctuation and capitalization model
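In production this step is typically a fine-tuned BERT-style token classifier; purely for illustration, the sketch below shows how per-word punctuation and capitalization labels, wherever they come from, are applied to a raw transcript. The label values here are supplied by hand, standing in for model predictions.

```python
def apply_punctuation(words, punct_labels, cap_labels):
    """Apply per-word predictions from a punctuation/capitalization model.

    punct_labels: the character to append to each word ('', ',', '.', '?')
    cap_labels:   True where the word should be capitalized
    In a real system both label sequences come from a BERT-style token
    classifier; here they are hand-written for illustration."""
    out = []
    for word, punct, cap in zip(words, punct_labels, cap_labels):
        out.append((word.capitalize() if cap else word) + punct)
    return " ".join(out)

words = "do you have any questions for me".split()
print(apply_punctuation(
    words,
    ["", "", "", "", "", "", "?"],
    [True, False, False, False, False, False, False],
))  # Do you have any questions for me?
```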

Speech recognition industry impact

Speech recognition could help industries such as finance, telecommunications, and Unified Communications as a Service (UCaaS) to improve customer experience, operational efficiency, and return on investment (ROI).

Video 1. How speech AI transforms customer engagement

Finance

Speech recognition is applied in the finance industry for applications such as call center agent assist and trading floor transcripts. ASR is used to transcribe conversations between customers and call center or trading floor agents. The generated transcripts can then be analyzed and used to provide real-time recommendations to agents, which can contribute to up to an 80% reduction in post-call time.

Furthermore, the generated transcripts are used for downstream tasks, including:

  • Sentiment analysis
  • Text summarization
  • Question answering
  • Intent and entity recognition

Telecommunications

Contact centers are critical components of the telecommunications industry, and speech recognition helps reimagine the customer experience they deliver. As in the finance call center use case, ASR is used in telecom contact centers to transcribe conversations between customers and contact center agents so that the conversations can be analyzed and real-time recommendations delivered to agents. T-Mobile, for example, uses ASR for quick customer resolution.

Unified Communications as a Service (UCaaS)

COVID-19 increased demand for Unified Communications as a Service (UCaaS) solutions, and vendors in the space began focusing on the use of speech AI technologies such as ASR to create more engaging meeting experiences.

For example, ASR can be used to generate live captions in video conferencing meetings. Captions generated can then be used for downstream tasks such as meeting summaries and identifying action items in notes.

Future of ASR technology

Speech recognition is not as easy as it sounds. Developing speech recognition is full of challenges, ranging from accuracy to customization for your use case to real-time performance. At the same time, businesses and academic institutions are racing to overcome some of these challenges and advance the use of speech recognition capabilities.

ASR challenges

Some of the challenges in developing and deploying speech recognition pipelines in production include:

  • A lack of tools and SDKs offering state-of-the-art (SOTA) ASR models, which makes it difficult for developers to take advantage of the best speech recognition technology
  • Limited customization capabilities for fine-tuning on domain-specific and context-specific jargon, multiple languages, dialects, and accents, so that applications understand and speak like your users
  • Restricted deployment support; depending on the use case, the software should be deployable in any cloud, on-premises, at the edge, or embedded
  • The need for real-time performance; in a call center agent assist use case, for example, you cannot wait several seconds for conversations to be transcribed before using them to empower agents
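One common way to meet that real-time requirement is to feed audio to the recognizer in small chunks rather than transcribing whole recordings at once, so partial transcripts are available while the speaker is still talking. The sketch below illustrates only the chunking loop; `FakeRecognizer` is a hypothetical stand-in for a streaming ASR model that keeps internal state between calls.

```python
import numpy as np

def stream_transcribe(audio, sr=16000, chunk_ms=300, recognize=None):
    """Feed a waveform to a recognizer in chunk_ms-sized pieces and collect
    the partial transcript returned after each chunk."""
    chunk = int(sr * chunk_ms / 1000)
    partials = []
    for start in range(0, len(audio), chunk):
        partials.append(recognize(audio[start:start + chunk]))
    return partials

class FakeRecognizer:
    """Toy stand-in for a stateful streaming ASR model: it 'transcribes'
    by reporting how much audio it has consumed so far."""
    def __init__(self, sr=16000):
        self.sr = sr
        self.samples = 0
    def __call__(self, chunk):
        self.samples += len(chunk)
        return f"[{self.samples / self.sr:.1f}s of audio processed]"
```

With 300 ms chunks, the worst-case added latency before a word can appear in the partial transcript is the chunk length plus the model's own inference time, which is why streaming deployments tune the chunk size carefully.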

ASR advancements

Numerous advancements in speech recognition are occurring on both the research and software development fronts. On the research side, several new cutting-edge ASR architectures, end-to-end (E2E) speech recognition models, and self-supervised or unsupervised training techniques have been developed.

On the software side, some tools provide quick access to SOTA models, while others enable deploying those models as services in production. 

Key takeaways

Speech recognition continues to grow in adoption thanks to advancements in deep learning-based algorithms that have brought ASR close to human-level accuracy. Breakthroughs like multilingual ASR also help companies make their applications available worldwide, while moving algorithms from the cloud to on-device saves money, protects privacy, and speeds up inference.

NVIDIA offers Riva, a speech AI SDK, to address several of the challenges discussed above. With Riva, you can quickly access the latest SOTA research models tailored for production purposes. You can customize these models to your domain and use case, deploy them in any cloud, on-premises, at the edge, or embedded, and run them in real time for engaging natural interactions.

Learn how your organization can benefit from speech recognition skills with the free ebook, Building Speech AI Applications.


Announcing the Summer of Jetson SparkFun Contest

Create a project using the NVIDIA Jetson Nano developer kit and submit it by September 30, 2022 for a chance to win a Machine Learning at Home Kit.
