Categories
Offsites

Lyra: A New Very Low-Bitrate Codec for Speech Compression

Connecting to others online via voice and video calls is something that is increasingly a part of everyday life. The real-time communication frameworks, like WebRTC, that make this possible depend on efficient compression techniques, called codecs, to encode (or decode) signals for transmission or storage. A vital part of media applications for decades, codecs allow bandwidth-hungry applications to efficiently transmit data, and have led to an expectation of high-quality communication anywhere at any time.

As such, a continuing challenge in developing codecs, both for video and audio, is to provide increasing quality, using less data, and to minimize latency for real-time communication. Even though video might seem much more bandwidth-hungry than audio, modern video codecs can reach lower bitrates than some high-quality speech codecs used today. Combining low-bitrate video and speech codecs can deliver a high-quality video call experience even in low-bandwidth networks. Yet historically, the lower the bitrate for an audio codec, the less intelligible and more robotic the voice signal becomes. Furthermore, while some people have access to a consistent high-quality, high-speed network, this level of connectivity isn’t universal, and even those in well-connected areas at times experience poor quality, low bandwidth, and congested network connections.

To solve this problem, we have created Lyra, a high-quality, very low-bitrate speech codec that makes voice communication available even on the slowest networks. To do this, we’ve applied traditional codec techniques while leveraging advances in machine learning (ML) with models trained on thousands of hours of data to create a novel method for compressing and transmitting voice signals.

Lyra Overview
The basic architecture of the Lyra codec is quite simple. Features, or distinctive speech attributes, are extracted from speech every 40ms and are then compressed for transmission. The features themselves are log mel spectrograms, a list of numbers representing the speech energy in different frequency bands, which have traditionally been used for their perceptual relevance because they are modeled after human auditory response. On the other end, a generative model uses those features to recreate the speech signal. In this sense, Lyra is very similar to other traditional parametric codecs, such as MELP.
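
To make the feature extraction concrete, here is a minimal sketch in Python using librosa; the sample rate, window size, and number of mel bands are illustrative assumptions for the example, not Lyra's actual configuration, and "speech.wav" is a hypothetical input file.

    import librosa

    # Load speech and compute a log mel spectrogram: one feature frame every 40 ms,
    # each frame holding the log energy in a set of mel-spaced frequency bands.
    audio, sr = librosa.load("speech.wav", sr=16000)     # assumed sample rate
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr,
        n_fft=1024,                    # analysis window length (assumed)
        hop_length=int(0.040 * sr),    # 40 ms hop between feature frames
        n_mels=80,                     # number of mel bands (assumed)
    )
    log_mel = librosa.power_to_db(mel)  # shape: (n_mels, n_frames)

Each column of log_mel is the kind of compact per-frame description that a codec can quantize and transmit in place of raw samples.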

However, traditional parametric codecs, which simply extract critical parameters from speech that can then be used to recreate the signal at the receiving end, achieve low bitrates but often sound robotic and unnatural. These shortcomings have led to the development of a new generation of high-quality audio generative models that have revolutionized the field by being able to not only differentiate between signals, but also generate completely new ones. DeepMind’s WaveNet was the first of these generative models that paved the way for many to come. Additionally, WaveNetEQ, the generative model-based packet-loss-concealment system currently used in Duo, has demonstrated how this technology can be used in real-world scenarios.

A New Approach to Compression with Lyra
Using these models as a baseline, we’ve developed a new model capable of reconstructing speech using minimal amounts of data. Lyra harnesses the power of these new natural-sounding generative models to maintain the low bitrate of parametric codecs while achieving high quality, on par with state-of-the-art waveform codecs used in most streaming and communication platforms today. The drawback of waveform codecs is that they achieve this high quality by compressing and sending the signal sample by sample, which requires a higher bitrate and, in most cases, isn’t necessary to achieve natural-sounding speech.

One concern with generative models is their computational complexity. Lyra avoids this issue by using a cheaper recurrent generative model, a WaveRNN variation, that works at a lower rate, but generates in parallel multiple signals in different frequency ranges that it later combines into a single output signal at the desired sample rate. This trick enables Lyra to not only run on cloud servers, but also on-device on mid-range phones in real time (with a processing latency of 90ms, which is in line with other traditional speech codecs). This generative model is then trained on thousands of hours of speech data and optimized, similarly to WaveNet, to accurately recreate the input audio.
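
As a rough illustration of the subband idea (a conceptual sketch only, not Lyra's actual filterbank; in Lyra the per-band signals come out of the generative model rather than from filtering an input), the following splits a toy signal into a few frequency ranges and then sums the band-limited signals back into a single full-band output:

    import numpy as np
    from scipy import signal

    sr = 16000                                   # assumed sample rate
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 3000 * t)  # toy input

    band_edges = [50, 2000, 4000, 7900]          # illustrative band boundaries in Hz
    bands = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = signal.butter(6, [lo, hi], btype="bandpass", fs=sr, output="sos")
        bands.append(signal.sosfilt(sos, x))     # one band-limited signal per range

    y = np.sum(bands, axis=0)                    # combine the bands into one output

Because each band only covers part of the spectrum, the per-band generators can run at a lower rate and in parallel, which is what keeps the model cheap enough for real-time, on-device use.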

Comparison with Existing Codecs
Since the inception of Lyra, our mission has been to provide the best quality audio using a fraction of the bitrate of alternatives. Currently, the royalty-free, open-source codec Opus is the most widely used codec for WebRTC-based VoIP applications and, with audio at 32kbps, typically obtains transparent speech quality, i.e., indistinguishable from the original. However, while Opus can be used in more bandwidth-constrained environments down to 6kbps, it starts to demonstrate degraded audio quality. Other codecs are capable of operating at comparable bitrates to Lyra (Speex, MELP, AMR), but each suffers from increased artifacts and results in a robotic-sounding voice.

Lyra is currently designed to operate at 3kbps, and listening tests show that Lyra outperforms any other codec at that bitrate and compares favorably to Opus at 8kbps, thus achieving more than a 60% reduction in bandwidth. Lyra can be used wherever bandwidth conditions are insufficient for higher bitrates and existing low-bitrate codecs do not provide adequate quality.
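
The numbers above imply straightforward back-of-the-envelope figures (a sketch of the arithmetic, not official packet sizes):

    # Bandwidth saving relative to Opus at 8 kbps, and the bit budget available
    # for each 40 ms feature frame at Lyra's 3 kbps operating point.
    lyra_bps, opus_bps = 3000, 8000
    saving = 1 - lyra_bps / opus_bps        # 0.625 -> a 62.5% reduction
    bits_per_frame = lyra_bps * 0.040       # 120 bits (15 bytes) per 40 ms frame
    print(f"{saving:.1%} reduction, {bits_per_frame:.0f} bits per frame")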

[Audio samples: original recordings compared with Opus@6kbps, Lyra@3kbps, and Speex@3kbps, for both clean speech and a noisy environment.]

Ensuring Fairness
As with any ML-based system, the model must be trained to make sure that it works for everyone. We’ve trained Lyra with thousands of hours of audio from speakers in over 70 languages using open-source audio libraries, and then verified the audio quality with expert and crowdsourced listeners. One of the design goals of Lyra is to ensure universally accessible, high-quality audio experiences. Lyra trains on a wide dataset, including speakers in a myriad of languages, to make sure the codec is robust to any situation it might encounter.

Societal Impact and Where We Go From Here
The implications of technologies like Lyra are far reaching, both in the short and long term. With Lyra, billions of users in emerging markets can have access to an efficient low-bitrate codec that allows them to have higher quality audio than ever before. Additionally, Lyra can be used in cloud environments enabling users with various network and device capabilities to chat seamlessly with each other. Pairing Lyra with new video compression technologies, like AV1, will allow video chats to take place, even for users connecting to the internet via a 56kbps dial-in modem.

Duo already uses ML to reduce audio interruptions, and is currently rolling out Lyra to improve audio call quality and reliability on very low bandwidth connections. We will continue to optimize Lyra’s performance and quality to ensure maximum availability of the technology, with investigations into acceleration via GPUs and TPUs. We are also beginning to research how these technologies can lead to a low-bitrate general-purpose audio codec (i.e., music and other non-speech use cases).

Acknowledgements
Thanks to everyone who made Lyra possible including Jan Skoglund, Felicia Lim, Michael Chinen, Bastiaan Kleijn, Tom Denton, Andrew Storus, Yero Yeh (Chrome Media), Henrik Lundin, Niklas Blum, Karl Wiberg (Google Duo), Chenjie Gu, Zach Gleicher, Norman Casagrande, Erich Elsen (DeepMind).

Categories
Misc

Tutorial: Accelerating Deep Learning with Apache Spark and NVIDIA GPUs on AWS

Learn how to create a cluster of GPU machines and use Apache Spark with Deep Java Library (DJL) on Amazon EMR to leverage large-scale image classification in Scala.

Categories
Misc

Tutorial: Cross-Compiling Robot Operating System Nodes for NVIDIA DRIVE AGX

In this post, we show you how ROS and DriveWorks can be used for building AV applications using a ROS package that we have put together.

Categories
Misc

Tutorial: Creating a Real-Time License Plate Detection and Recognition App

In this post, NVIDIA engineers show you how to use production-quality AI models such as License Plate Detection (LPD) and License Plate Recognition (LPR) models in conjunction with the NVIDIA Transfer Learning Toolkit (TLT).

Categories
Misc

Tutorial: Creating Voice-based Virtual Assistants Using NVIDIA Jarvis and Rasa

Step-by-step tutorial to develop a voice-based virtual assistant and learn what it takes to integrate Jarvis ASR and TTS with Rasa NLP and Dialog Management (DM).

Categories
Misc

Tutorial: Developing a Question Answering Application Quickly Using NVIDIA Jarvis

Learn how you can use Jarvis QA and the Wikipedia API action to create a simple QA application.

Categories
Misc

Couldn’t train nn for solving 2nd order ODE

I am trying to solve the 2nd-order ODE

y'' + 100y = 0,  y(0) = 0,  y'(0) = 10  on the interval [0, 1]

using a neural network. Here is the code: https://colab.research.google.com/gist/rprtr258/717c07b72f2263ca0dc401c83e9179e5/2nd-order-ode.ipynb#scrollTo=zeub0DBC9pkr

But I have two problems:

  1. I guess tf recompiles (retraces) some function during learning, which slows the learning process significantly. Putting the whole learning process into a function doesn’t help.
  2. The NN doesn’t fit at all. I guess it might be because of the gradient size on the last layer or something. Anyway, it is difficult to test while problem 1 persists.

Any help with problem 1 and maybe problem 2 will be appreciated.

submitted by /u/rprtr258
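
Below is a minimal PINN-style sketch of the stated problem in TF 2.x (not the poster's notebook; the network size, optimizer, and collocation points are assumptions). The exact solution is y(x) = sin(10x), which is a handy sanity check. The main point for problem 1 is that the whole training step is wrapped in a single @tf.function and always called with tensors of the same shape and dtype, which is the usual way to avoid repeated retracing:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="tanh", input_shape=(1,)),
        tf.keras.layers.Dense(32, activation="tanh"),
        tf.keras.layers.Dense(1),
    ])
    opt = tf.keras.optimizers.Adam(1e-3)

    @tf.function  # traced once for this input signature, then reused
    def train_step(x):
        with tf.GradientTape() as outer:
            # second derivative of the network output with respect to x
            with tf.GradientTape() as t2:
                t2.watch(x)
                with tf.GradientTape() as t1:
                    t1.watch(x)
                    y = model(x)
                dy = t1.gradient(y, x)
            d2y = t2.gradient(dy, x)

            # initial conditions at x = 0
            x0 = tf.zeros((1, 1))
            with tf.GradientTape() as t0:
                t0.watch(x0)
                y0 = model(x0)
            dy0 = t0.gradient(y0, x0)

            loss = (tf.reduce_mean(tf.square(d2y + 100.0 * y))  # ODE residual
                    + tf.square(y0[0, 0])                        # y(0) = 0
                    + tf.square(dy0[0, 0] - 10.0))               # y'(0) = 10
        grads = outer.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    x = tf.random.uniform((256, 1), 0.0, 1.0)  # fixed shape, so no retracing per call
    for _ in range(2000):
        loss = train_step(x)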

Categories
Misc

Running TensorFlow for Python on multiple cores?

Hey guys,

Currently working on a TensorFlow Python script which I plan to use on a server with multiple cores.

The problem is that if I try to run the script in separate SSH sessions, it will always default to the same core, and I need it to run on a different core each time so I can take advantage of all of the cores available.

Using TensorFlow 2.2, so tf.Session is no longer available.

Can anyone please tell me how to achieve this?

Thanks

submitted by /u/Triptonpt
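
A hedged sketch of one way to approach this (Linux-only; the PIN_CORE environment variable is just an illustrative convention, not anything TensorFlow provides): pin each process to a different CPU core with os.sched_setaffinity, and limit TF 2.x to one intra-/inter-op thread so the processes do not contend. These settings should run before TensorFlow does any real work:

    import os
    import tensorflow as tf

    core = int(os.environ.get("PIN_CORE", "0"))   # e.g. PIN_CORE=3 python script.py
    os.sched_setaffinity(0, {core})               # restrict this process to that core

    tf.config.threading.set_intra_op_parallelism_threads(1)
    tf.config.threading.set_inter_op_parallelism_threads(1)

Launching the script from each SSH session with a different PIN_CORE value (or wrapping it with taskset -c <core>) should spread the processes across cores.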

Categories
Misc

NVIDIA Deep Learning Institute Releases New Accelerated Data Science Teaching Kit for Educators

As data grows in volume, velocity and complexity, the field of data science is booming. There’s an ever-increasing demand for talent and skillsets to help design the best data science solutions. However, expertise that can help drive these breakthroughs requires students to have a foundation in various tools, programming languages, computing frameworks and libraries.


Categories
Misc

Bring AI to Market Fast with Pre-Trained Models and Transfer Learning Toolkit 3.0

Intelligent vision and speech-enabled services have now become mainstream, impacting almost every aspect of our everyday life. AI-enabled video and audio analytics are enhancing applications from consumer products to enterprise services. Smart speakers at home. Smart kiosks or chatbots in retail stores. Interactive robots on factory floors. Intelligent patient monitoring systems at hospitals. And autonomous traffic solutions in smart cities. NVIDIA has been at the forefront of inventing technologies that power these services, helping developers create high-performance products with faster time-to-market. 

Today, NVIDIA released several production-ready, pre-trained models and a developer preview of Transfer Learning Toolkit (TLT) 3.0, along with DeepStream SDK 5.1. The release includes a collection of new pre-trained models—innovative features that support conversational AI applications—delivering a more powerful solution for accelerating the developer’s journey from training to deployment. 

Accelerate Your Vision AI Production 

Creating a model from scratch can be daunting and expensive for developers, startups, and enterprises. NVIDIA TLT is the AI toolkit that abstracts away the AI/DL framework complexity and enables you to build production-quality models from pre-trained models faster, with no coding required.  

With TLT, you can bring your own data to fine-tune the model for a specific use case using one of NVIDIA’s multi-purpose, production-quality models for common AI tasks or use one of the 100+ permutations of neural network architectures like ResNet, VGG, FasterRCNN, RetinaNet, and YOLOv3/v4. All the models are readily available from NGC.
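
As a rough, framework-level illustration of the idea TLT automates (plain Keras shown here, not the TLT launcher or its spec files; the dataset and class count are placeholders), fine-tuning starts from a pre-trained backbone and trains only a new task-specific head:

    import tensorflow as tf

    # Pre-trained backbone, frozen so only the new head is trained at first.
    base = tf.keras.applications.ResNet50(
        include_top=False, weights="imagenet",
        input_shape=(224, 224, 3), pooling="avg",
    )
    base.trainable = False

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(10, activation="softmax"),  # placeholder: 10 classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_ds, epochs=5)   # fine-tune on your own dataset

TLT wraps this kind of workflow, along with capabilities such as model pruning and export for deployment, behind its CLI so that no framework code has to be written by hand.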

Key highlights for pre-trained models and TLT 3.0 (developer preview)

  • New vision AI pre-trained models: license plate detection and recognition, heart rate monitoring, gesture recognition, gaze estimation, emotion recognition, face detection, and facial landmarks estimation 
  • Support for conversational AI use cases with pre-trained models for automatic speech recognition (ASR) and natural language processing (NLP) 
  • Choice of training with popular network architectures such as EfficientNet, YoloV4, and UNET
  • Improved PeopleNet model to detect difficult scenarios such as people sitting down and rotated/warped objects
  • TLT launcher for pulling compatible containers to initialize
  • Support for NVIDIA Ampere GPUs with third-generation tensor cores for performance boost 

Get Started 

New Developer Webinar

Join the upcoming webinar “Using NVIDIA Pre-Trained Models and Transfer Learning Toolkit 3.0 to Create Gesture-based Interactions with a Robot” on March 3, 11 a.m. PT. We’ll demonstrate the entire end-to-end developer workflow, from training to deployment, to show how easy it is to build a gesture-recognition application with human-robot interaction.

What Our Customers Are Saying

“INEX RoadView, our comprehensive automatic license plate recognition system for toll roads, uses NVIDIA’s end-to-end vision AI pipeline, production ready AI models, TLT, and DeepStream SDK. Our engineering team not only slashed the development time by 60% but they also reduced the camera hardware cost by 40% using Jetson Nano and Xavier NX. This enabled our vendors to deploy RoadView, the only out of the box ALPR solution, quickly and reliably. For us, nothing else came close.”

Dr. Roman Prilutsky, CEO/CTO, INEX

“We are enabling developers and third-party vendors to readily build intelligent AI apps leveraging Optra’s skills marketplace. As a new entrant to the Edge AI market, being able to differentiate our offerings and time to market was crucial. Readily available MaskRCNN from TLT and easy integration into DeepStream saved 25% development effort right out of the box for our R&D team.” 

Chad McQuillen, Senior Technical Staff Member & Solutions Architect for Optra, Lexmark Ventures

“At Quantiphi, we use NVIDIA SDKs to build real-time video analytics workflows for many of our Fortune 500 customers across Retail and Media & Entertainment. Transfer Learning Toolkit provides an efficient way to customize training and model pruning for faster edge inference. DeepStream allows us to build high throughput inference pipelines on the Cloud and easily port them to the Jetson NX devices.”

Siddharth Kotwal, Solution Architecture Lead, Quantiphi

“KION Group is working on robust AI-based distribution autonomy solutions across its brands, to address operational needs and logistics optimization challenges and greatly reduce flow exception events. Innovation, engineering, and digital transformation services are benefiting from optimized NVIDIA pre-trained models while rapidly innovating and fine-tuning models on the fly using Transfer Learning Toolkit, and deploying with NVIDIA DeepStream, unlocking multi-stream density with Jetson platforms.”

KION Group