DataBloom - Part 442

Misc

SoftBank Solves Key Mobile Edge Computing Challenges Using NVIDIA Maxine

Post date August 13, 2021

SoftBank is a global technology player that aspires to drive the Information Revolution. The company operates in broadband, fixed-line telecommunications, ecommerce, information technology, finance, media, and marketing. To improve their users’ communication experience, and overcome the 5G capacity and coverage issues, SoftBank has used NVIDIA Maxine GPU-accelerated SDKs with state-of-the-art AI features to build virtual … Continued

In this post, you learn how SoftBank used the Maxine SuperResolution and hardware-accelerated encode-decode operations to reduce the amount of data that must be uplinked to the multi-access edge computing (MEC) servers. Besides solving the challenge of limited bandwidth, Maxine features such as noise removal and virtual background enabled SoftBank to deliver the best video conferencing solution for their users.

Benefits of using MEC

Edge computing enables providers to deploy their technology closer to users. Simply put, edge computing reduces bandwidth and latency budgets for mission-critical, high-throughput, low-latency applications. This is achieved using MEC network technology to move the computing from a remote cloud server to a node closer to the consumption source. Edge computing relies heavily on network technologies such as 4G, and more recently 5G, to provide connectivity.

Diagram demonstrating the regular pipeline in a MEC 5G infrastructure. Edge devices like mobile phones are served by 5G transmission infrastructure, which is connected to the MEC server. The MEC server is where you deploy Maxine SDKs. Finally, the MEC server is connected to the central cloud. — *Figure 1. Simplified overview of a pipeline involving MEC servers*

5G features such as ultra-high speed, ultra-low latency, and multiple simultaneous connections enable new use cases, such as telemedicine and smart factories, that were previously unfeasible with wireless connectivity. MEC is the key to supporting low-latency, high-throughput use cases. MEC reduces response delays by deploying regional MEC servers, processing as much as possible at the edge, and sending only the minimum necessary data to the cloud. MEC servers often use the massively parallel computing power of GPUs to process large amounts of data at high speed.

Challenges with the 5G network

The current 5G networks operate in a configuration called non-standalone (NSA). This configuration combines a 4G LTE network and a 5G base station, where some 5G features (such as network slicing) are not available. The 5G standalone (SA) configuration has both a 5G core and a 5G base station. With end-to-end 5G support, the SA configuration speeds up services, reduces costs, improves the quality of service, and is a better platform for deploying services.

When the 5G SA configuration is in the market, the full 5G network is complete. In other words, 5G evolves in two steps: 5G NSA and 5G SA. Capital investment is required for each step.

On the other hand, some telecom carriers, including SoftBank, have started using 4G LTE low-band frequency for 4G LTE and 5G NR. Theoretically, capacity and coverage are trade-offs in wireless communication. To ensure the high-quality, wide-area coverage for the 5G SA configuration, SoftBank uses MEC to effectively reduce service delays as much as possible.

A graph showcasing the capacity vs. coverage tradeoff for 5G frequencies. The High-band frequency band has the highest capacity and the lowest coverage and the low band frequency band has the highest coverage but low capacity. — *Figure 2. The trade-off between capacity and coverage in 5G frequencies*

In addition, there are some technical challenges. Mobile networks are generally designed to accommodate a higher downlink speed than uplink. This design philosophy works for general applications such as streaming videos on a smartphone, as most of the traffic is the downlink. However, some critical applications require a strong uplink connection. One of these is video conferencing, where the user needs considerable uplink bandwidth to stream high-resolution video and audio.

The current 5G uplink capacity is insufficient, and carrier aggregation and MIMO antennas are needed to provide more uplink allocation. As more and more devices connect to 5G, saving bandwidth, especially in the uplink, is a common challenge for all global telecom carriers.

Uplink bandwidth-intensive applications, such as video conferencing, can be served with the same quality of service at reduced uplink bandwidth (for example, 500 Kbps) as with ample bandwidth (100 Mbps). In those cases, it’s possible to connect many more devices and provide high-quality services at the same time.
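To make the scale of that saving concrete, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that the 100 Mbps figure is an uplink budget shared by the devices in one cell; the text does not say how the uplink is actually shared, and the function name is hypothetical.

```python
# Illustrative arithmetic only. Treating 100 Mbps as a shared uplink
# budget is an assumption, not something stated in the post.
UPLINK_BUDGET_KBPS = 100_000   # assumed shared uplink budget (100 Mbps)
REDUCED_STREAM_KBPS = 500      # reduced per-device rate cited in the text


def max_streams(budget_kbps: int, per_stream_kbps: int) -> int:
    """Number of simultaneous uplink streams that fit in the budget."""
    return budget_kbps // per_stream_kbps


streams = max_streams(UPLINK_BUDGET_KBPS, REDUCED_STREAM_KBPS)
print(streams)
```

Under that assumption, the same budget that one device could saturate on its own carries a couple of hundred reduced-rate streams.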

Video conferencing solution on MEC with NVIDIA Maxine

NVIDIA Maxine is a GPU-accelerated SDK platform that enables developers of video conferencing services to build and deploy AI-powered features that use state-of-the-art models in the cloud. Maxine includes APIs using the latest innovations from NVIDIA research, such as artifact reduction, body pose estimation, super resolution, and noise removal. Maxine also uses other products, like NVIDIA Riva, to provide features like closed captioning and access to virtual assistants. These capabilities are fully accelerated on NVIDIA GPUs to run real-time video streaming applications in the cloud.

An image showcasing the Super Resolution effect from the Video Effects SDK. The left half shows a 360p “before” and the right half showcases the 720p output — *Figure 3. Overview of Maxine super resolution*

Maxine applications enable service providers to offer the same features to every user on any device, including computers, tablets, and phones. The key point is that all the processing happens on the cloud so that the application running on any device requires minimal resources. Applications built with Maxine are easily deployed as microservices and scale to hundreds of thousands of streams in a Kubernetes environment.

The idea is to offload the computationally intensive processing involved in video conferencing systems and reduce the amount of data that must be uplinked to the MEC servers. This is done through a combination of video effects like super resolution and hardware-accelerated encode-decode operations. Maxine also adds quality-of-life features like noise removal, virtual background, room echo cancelation, and more.

What does this mean for end users? Essentially, an end user with a low-bandwidth connection working onsite with a wide range of background noise can get connected with clean audio and a high-definition video. For instance, a plant manager at a noisy production floor in a remote location with a 180p stream connection can seem to be in a silent conference room with a 720p stream. The offloading of compute resources also translates to longer battery life and more free memory for the end user to multitask on resource-constrained devices like mobile phones and laptops.

The features mentioned earlier are housed in the following SDKs:

In addition, the NVIDIA Video Codec SDK provides hardware-accelerated encoding and decoding to aid the infrastructure around video conferencing.

An image showcasing the Maxine AI Face codec with the benefits it provides. It has two images. The image on the left showcases the standard h.264 compression along with the bandwidth required. The image on the right showcases Maxine AI video compression, which has much lower bandwidth requirements. — *Figure 4. Overview of Maxine AI Face codec*

How SoftBank used NVIDIA Maxine

Typically, if you want to use a video conference solution on your mobile phone, you must first install a client application. In SoftBank’s case, the Zoom client is installed on the MEC server on the carrier network instead of the mobile phone. The video and microphone output of the mobile phone are inputs to the Zoom client on the MEC over the 5G network. MEC recognizes the smartphone’s microphone and camera as a virtual microphone and camera and uses them as input for the Zoom client.

An architecture diagram for Softbank’s proof-of-concept implementation showing the interplay between the client, MEC server, and Zoom server — *Figure 5. SoftBank and Maxine POC: Overview diagram*

Here are the hardware and software specifications used for the SoftBank proof of concept implementation:

Hardware
- GPU: Quadro RTX6000 (driver version: 456.43)
- CPU: Intel Xeon Gold 6244
Software
- Windows Server 2019
- WebRTC Native Client Momo
- CUDA 11.1
- NVIDIA Maxine Video Effects SDK (3/25/2021 – VFX – prerelease)
- NVIDIA Maxine Audio Effects SDK EA

This work makes use of SoftBank’s MEC servers (Windows), a modified C++-based open source WebRTC client named “WebRTC Native Client Momo,” and an application that uses the Video Effects SDK and Audio Effects SDK APIs.

The NvVFX_RUN API (NVVFX_FX_SUPER_RES) in the Video Effects SDK performs video super resolution, and the NvAFX_RUN API (NVAFX_EFFECT_DENOISER) in the Audio Effects SDK performs noise removal.

Code examples to highlight important Video Effects SDK API calls. It shows the API calls needed to initialize and run the effect. — *Figure 6. Sample code for Video Effects SDK API*

Code examples to highlight important Audio Effects SDK API calls. It shows the API calls needed to initialize and run the effect — *Figure 7. Sample code for Audio Effects SDK API*

The video stream sent from the 5G user equipment using the WebRTC protocol is uploaded to the MEC at a low bit rate (in this verification, H.264 (CBR) 180p) to conserve uplink bandwidth. MEC receives degraded audio and video at low bit rates and improves quality using Maxine SDKs. For video, the MEC server uses the Maxine SuperResolution function to resize the video sent from the user equipment at 180p to 720p. SuperResolution reduces noise and restores high-frequency components, resulting in high-quality video.

Figure 8 shows the results of SuperResolution.

An image showcasing the Super Resolution effect from the Video Effects SDK (from SoftBank). The left half shows a 360p “before” and the right half showcases the 720p output — *Figure 8. The original blocky image (the left half) vs. image after applying Maxine AI features (the right half)*

In Figure 8, the left side is the original data before applying SuperResolution, and the right side is the image upscaled. The blocky artifacts in the facial details are replaced with more pixels, leading to a high-quality image. You can replicate these results using the sample application provided with the Video Effects SDK. For a full demonstration, see this video.

As with the Super Resolution result, the noise removal results are shown in the video.

Video 1. Video showcasing the output of Noise Removal

The video shows the results of testing the Maxine noise removal feature in a scenario where the user is talking while typing on a keyboard. Here, keyboard sounds were selected as a sample, but noise removal was also useful in various situations throughout the development process of SoftBank’s PoC. SoftBank believes that noise removal makes noisy-environment meetings possible, such as outdoors or in a car.

You can replicate these results using the sample application provided with the Audio Effects SDK.

Improve the quality of your video stream

By deploying Maxine on their MEC servers, SoftBank now provides low latency along with a high-quality video and audio experience to all end users. The improved end-user experience comes with large savings in uplink bandwidth and requires no additional hardware or user equipment. To improve the video quality further, SoftBank plans to use the Maxine AI Face Codec.

For more information about SoftBank’s PoC, see the GPU Virtualization for 5G and MEC Coexistence GTC session, or download the Maxine SDKs to see how Maxine can improve your application. Contact us with any questions.

Misc

NVIDIA Brings Metaverse Momentum, Research Breakthroughs and New Pro GPU to SIGGRAPH

Post date August 13, 2021

Award-winning research, stunning demos, a sweeping vision for how NVIDIA Omniverse will accelerate the work of millions more professionals, and a new pro RTX GPU were the highlights at this week’s SIGGRAPH pro graphics conference. Kicking off the week, NVIDIA’s SIGGRAPH special address featuring Richard Kerris, vice president, Omniverse, and Sanja Fidler, senior director, AI …

The post NVIDIA Brings Metaverse Momentum, Research Breakthroughs and New Pro GPU to SIGGRAPH appeared first on The Official NVIDIA Blog.

Misc

any examples of a lightweight tensorflow server in python?

Post date August 13, 2021

I want to run a minimal TF server on my machine to do inference on a GAN. So it needs to send the image over localhost. I can’t find any comparable examples. Any help?
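One possible shape for such a server, sketched with only the Python standard library: it accepts raw bytes over a localhost POST and returns JSON. The `run_inference` function is a hypothetical placeholder; a real version would call the loaded TensorFlow model there. All function and class names below are made up for this sketch.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_inference(payload: bytes) -> dict:
    """Hypothetical placeholder for the real model call.

    Here it only reports how many bytes arrived.
    """
    return {"received_bytes": len(payload)}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the raw request body (for example, encoded image bytes).
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        body = json.dumps(run_inference(payload)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging to keep the example quiet.
        pass


def start_server(port: int = 0) -> HTTPServer:
    """Serve on localhost in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_bytes(port: int, payload: bytes) -> dict:
    """Client side: POST raw bytes to the local server and decode the reply."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        connection.request("POST", "/", body=payload)
        return json.loads(connection.getresponse().read())
    finally:
        connection.close()
```

Calling `start_server()` and then `post_bytes(server.server_address[1], image_bytes)` exercises the round trip. For anything beyond local experiments, a dedicated serving stack is likely a better fit.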

submitted by /u/diditforthevideocard

Misc

Any examples/ complete demo on how to use ParameterServerStrategy?

Post date August 13, 2021

We have two servers: server A, with a CPU and high storage capacity, and server B, an NVIDIA DGX with 4 GPUs. The tutorial code doesn’t say how to start the server on the worker (here, the DGX). Is there a fully working demo explaining it step by step?
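As a rough sketch of the usual setup, each machine describes the whole cluster and its own role in a `TF_CONFIG` environment variable before the process starts. The host names, ports, and role layout below are hypothetical, and whether the DGX runs as one worker or several is a design choice this sketch does not settle.

```python
import json
import os

# Hypothetical host names and ports; replace with the real addresses.
CLUSTER = {
    "chief": ["server-a:2222"],   # coordinator that runs the training script
    "ps": ["server-a:2223"],      # parameter server on the CPU machine
    "worker": ["dgx-b:2222"],     # worker on the DGX
}


def tf_config(task_type: str, task_index: int) -> str:
    """Build the TF_CONFIG JSON string for one task in the cluster."""
    return json.dumps(
        {"cluster": CLUSTER, "task": {"type": task_type, "index": task_index}}
    )


# On the DGX, before starting the worker process:
os.environ["TF_CONFIG"] = tf_config("worker", 0)

# The worker and ps processes would then typically start a server and block
# (requires TensorFlow, so it is not executed in this sketch):
#   resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
#   server = tf.distribute.Server(resolver.cluster_spec(),
#                                 job_name=resolver.task_type,
#                                 task_index=resolver.task_id,
#                                 protocol="grpc")
#   server.join()
```

The coordinator on server A would use the same cluster description with its own task entry and create the strategy from the resolver.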

submitted by /u/StarGazer10k

Misc

How Digitec Galaxus trains and serves millions of personalized newsletters per week with TFX

Post date August 13, 2021

submitted by /u/nbortolotti

Misc

Webinar: Learn How NVIDIA DriveWorks Gets to the Point with Lidar Sensor Processing

Post date August 12, 2021

With NVIDIA DriveWorks SDK, autonomous vehicles can bring their understanding of the world to a new dimension.

The SDK enables autonomous vehicle developers to easily process three-dimensional lidar data and apply it to specific tasks, such as perception or localization. You can learn how to implement this critical toolkit in our expert-led webinar, Point Cloud Processing on DriveWorks, Aug. 25.

Lidar sensors enhance an autonomous vehicle’s sensing capabilities, detecting the depth of surrounding objects that may not be picked up by camera or radar.

They do so by bouncing invisible lasers off the vehicle’s surrounding environment, building a 3D image based on the time it takes for those lasers to return. However, processing lidar data and extracting contextual meaning from it quickly and efficiently is not as straightforward.

Lidar point cloud processing must be performed in real-time and in tight coordination with other sensing modalities to deliver the full benefits of enhanced perception — a difficult feat to accomplish when working with third-party open source modules.

A Streamlined Solution

With DriveWorks, efficient and accelerated lidar point cloud processing can be performed right out of the gate.

The SDK provides middleware functions that are fundamental to autonomous vehicle development. These consist of the sensor abstraction layer (SAL) and sensor plugins, data recorder, vehicle I/O support and a deep neural network framework. It’s modular, open, and designed to be compliant with automotive industry software standards.

These development tools include a point cloud processing module, which works with the SAL and sensor plugin framework to provide a solid basis for developers to implement a lidar-based perception pipeline with little effort and quick results.

The module is CUDA-accelerated and straightforward to implement. It’s the same toolkit the NVIDIA autonomous driving team uses to develop our own self-driving systems, making it purpose-built for production solutions rather than purely research and development.

Register now to learn more from NVIDIA experts about the DriveWorks point cloud processing module and how to use it in your autonomous vehicle development process.

Misc

convert a .pkl file to .pb ? StyleGan2-ada TF model

Post date August 12, 2021

Hey all, I’m trying to take a trained model and move it to a local deployment software (OpenFrameworks with the ofxTensorFlow2 library), but the lib only takes .pb format models. Is there a way to convert the model from .pkl to .pb? It is a TF model, so I feel like maybe it isn’t so hard, but I have no idea how.

This is the colab I’m working from: https://colab.research.google.com/github/dvschultz/ml-art-colabs/blob/master/Stylegan2_ada_Custom_Training.ipynb

submitted by /u/diditforthevideocard

Misc

Nvidia Releases CUDA Python

Post date August 12, 2021

submitted by /u/lindaarden

Misc

Unlocking Operational Consistency with the NVIDIA User Experience CLI Object Model

Post date August 12, 2021

Cumulus Linux 4.4 introduces a new CLI, NVUE, that is more than just a CLI. NVUE provides a complete object model for Linux, unlocking incredible operational potential.

Cumulus Linux 4.4 is the first release with the NVIDIA User Experience (NVUE), a brand new CLI for Cumulus Linux. Being excited about a new networking CLI sounds a bit like being excited about your new 56k modem. What makes NVUE special isn’t that it’s a new CLI; it’s the principles it was built on. At its core, NVUE provides a full object model of Cumulus Linux, enabling advanced programmability, extensibility, and usability.

What is an object model?

Object models aren’t exactly the kind of thing network engineers think about daily. I didn’t know what an object model was before I got involved in helping the team design NVUE.

An object model defines the components of a system and their relationships to each other. For example, an interface is an object. It has components like an IP address or MTU setting. It’s not just the fact that an object model exists that is important, but the thought that is put into how those relationships between objects and components fit together.

An interface and IP address are an easy example, but what about something more complicated? Think about a “bond” interface, also called a port-channel. Is the bond a top-level interface like an Ethernet port with the components of other Ethernet interfaces as children or is being a member in a bond an element of the interface?

A circular relationship between interfaces, the bond, and Ethernet. — *Figure 1. Ethernet interfaces and bonds are at the same level with relationships between them.*

A hierarchical relationship between objects. — *Figure 2. A bond is a property of an interface, like the MTU or IP address.*

These relationships get complicated fast. Failing to think through them creates a poor user experience where you may have to define the same setting multiple times to achieve an end goal or an inconsistent configuration. An imaginary network CLI could have you define any route inside a VRF under a VRF object but any route in the global routing table at the top level, like the following example:

ip vrf red
   ip route 10.1.1.0/24 via 169.254.1.1
!
ip route 192.168.1.0/24 via 172.16.1.1

This is a trivial example, but now the way that a route is defined is not uniform, depending on where you are in the system.

What do you get with an object model?

With an understanding of what an object model is, the next question is, “Why should you care?” An object model makes it extremely easy to build ways to interact with the system. Systems talk to an API that represents the object model. The first interface is, of course, the CLI, but anything can now be an interface to the system: REST, gRPC, or even RFC1149 Avian Carriers.

CLI, REST, gRPC, Terraform, or RFC1149 Carrier Pigeons all interface with the same NVUE API. — *Figure 3. CLI and REST interfaces are available in Cumulus Linux 4.4.*

Because all the interfaces use the same object model, you get consistent results regardless of how you interface with the system. The CLI and REST API use the same methods to configure a BGP peer. There is never a chance of seeing different behaviors based on which interface you use. Because the object model is the same no matter how you interact with it, going from playing with the CLI to building full automation is an evolution, not a completely new process.
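The idea can be sketched with a toy model in Python. This is not NVUE code: the class names, the command syntax, and the request shape are all made up for illustration. The point is only that two front ends call the same validated method on the same objects, so they cannot drift apart.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class Interface:
    name: str
    state: str = "down"
    ip_address: str = ""

    def set(self, attribute: str, value: str) -> None:
        """Single validated entry point that every front end goes through."""
        if attribute == "state":
            if value not in ("up", "down"):
                raise ValueError(f"invalid state: {value}")
            self.state = value
        elif attribute == "ip_address":
            ipaddress.ip_interface(value)  # raises ValueError on a bad address
            self.ip_address = value
        else:
            raise ValueError(f"unknown attribute: {attribute}")


@dataclass
class Switch:
    interfaces: dict = field(default_factory=dict)

    def interface(self, name: str) -> Interface:
        """Return the named interface, creating it on first use."""
        return self.interfaces.setdefault(name, Interface(name))


def apply_cli(switch: Switch, command: str) -> None:
    """Toy CLI front end (hypothetical syntax): set interface <name> <attr> <value>."""
    _, _, name, attribute, value = command.split()
    switch.interface(name).set(attribute, value)


def apply_rest(switch: Switch, body: dict) -> None:
    """Toy REST-style front end: {'interface': {name: {attr: value}}}."""
    for name, attributes in body["interface"].items():
        for attribute, value in attributes.items():
            switch.interface(name).set(attribute, value)
```

Applying `set interface swp1 ip_address 10.1.1.1/24` through the CLI function and the equivalent dictionary through the REST-style function leaves two switches in identical states, and a malformed address is rejected the same way by both.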

REST and CLI are expected for any network device today. Where can we think beyond this? An object model can be directly imported into a programming language like Python or Java. This enables you to use true programming concepts to build configurations for one device or an entire fabric of devices. You can enforce inputs, values, and relationships like never before. The following code example shows what an NVUE Python interface might look like:

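from nvue import Switch

spine01 = Switch()
x = 1
# Attribute names below are assumed from the description that follows.
while x <= len(spine01.interfaces):
    spine01.interfaces[x].state = "up"
    spine01.interfaces[x].ip_address = f"10.1.{x}.1/24"
    x += 1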

In this example, I load the nvue library and create a new Switch object called spine01. I have the object tell me how many interfaces exist on the system with len(spine01.interfaces). For each interface, I put it in the up state and assign an IP address with the subnet value matching the interface number. For example, port 3 would have an IP address of 10.1.3.1/24.

This doesn’t exist yet, but it is absolutely in the realm of possibility because an object model exists. Unlike all other networking vendor systems, where the model is determined by the CLI, this CLI is based on the model. The object model is a standalone element that can be imported into programming languages, APIs, or any other system.

Try it out

One of the most valuable pieces of Cumulus Linux is the ability to try all our features and functions virtually. You can use NVIDIA Air to start using NVUE today and see what you think of the future of network CLIs and programmability.

Offsites

SoundStream: An End-to-End Neural Audio Codec

Post date August 12, 2021

Posted by Neil Zeghidour, Research Scientist and Marco Tagliasacchi, Staff Research Scientist, Google Research

Audio codecs are used to efficiently compress audio to reduce either storage requirements or network bandwidth. Ideally, audio codecs should be transparent to the end user, so that the decoded audio is perceptually indistinguishable from the original and the encoding/decoding process does not introduce perceivable latency.

Over the past few years, different audio codecs have been successfully developed to meet these requirements, including Opus and Enhanced Voice Services (EVS). Opus is a versatile speech and audio codec, supporting bitrates from 6 kbps (kilobits per second) to 510 kbps, which has been widely deployed across applications ranging from video conferencing platforms, like Google Meet, to streaming services, like YouTube. EVS is the latest codec developed by the 3GPP standardization body targeting mobile telephony. Like Opus, it is a versatile codec operating at multiple bitrates, 5.9 kbps to 128 kbps. The quality of the reconstructed audio using either of these codecs is excellent at medium-to-low bitrates (12–20 kbps), but it degrades sharply when operating at very low bitrates (⪅3 kbps). While these codecs leverage expert knowledge of human perception as well as carefully engineered signal processing pipelines to maximize the efficiency of the compression algorithms, there has been recent interest in replacing these handcrafted pipelines by machine learning approaches that learn to encode audio in a data-driven manner.

Earlier this year, we released Lyra, a neural audio codec for low-bitrate speech. In “SoundStream: an End-to-End Neural Audio Codec”, we introduce a novel neural audio codec that extends those efforts by providing higher-quality audio and expanding to encode different sound types, including clean speech, noisy and reverberant speech, music, and environmental sounds. SoundStream is the first neural network codec to work on speech and music, while being able to run in real-time on a smartphone CPU. It is able to deliver state-of-the-art quality over a broad range of bitrates with a single trained model, which represents a significant advance in learnable codecs.

Learning an Audio Codec from Data
The main technical ingredient of SoundStream is a neural network, consisting of an encoder, decoder and quantizer, all of which are trained end-to-end. The encoder converts the input audio stream into a coded signal, which is compressed using the quantizer and then converted back to audio using the decoder. SoundStream leverages state-of-the-art solutions in the field of neural audio synthesis to deliver audio at high perceptual quality, by training a discriminator that computes a combination of adversarial and reconstruction loss functions that induce the reconstructed audio to sound like the uncompressed original input. Once trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network.

SoundStream training and inference. During training, the encoder, quantizer and decoder parameters are optimized using a combination of reconstruction and adversarial losses, computed by a discriminator, which is trained to distinguish between the original input audio and the reconstructed audio. During inference, the encoder and quantizer on a transmitter client send the compressed bitstream to a receiver client that can then decode the audio signal.

Learning a Scalable Codec with Residual Vector Quantization
The encoder of SoundStream produces vectors that can take an indefinite number of values. In order to transmit them to the receiver using a limited number of bits, it is necessary to replace them by close vectors from a finite set (called a codebook), a process known as vector quantization. This approach works well at bitrates around 1 kbps or lower, but quickly reaches its limits when using higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook with more than 1 billion vectors, which is infeasible in practice.
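The codebook size in that example follows from simple arithmetic, sketched below with the numbers from the text. The function names are ours, chosen for illustration.

```python
def bits_per_vector(bitrate_bps: int, vectors_per_second: int) -> int:
    """Bit budget available for each encoder output vector."""
    return bitrate_bps // vectors_per_second


def single_codebook_size(bitrate_bps: int, vectors_per_second: int) -> int:
    """Entries needed when one codebook index must use the whole budget."""
    return 2 ** bits_per_vector(bitrate_bps, vectors_per_second)


# 3 kbps at 100 vectors per second leaves 30 bits per vector.
budget = bits_per_vector(3000, 100)
size = single_codebook_size(3000, 100)
print(budget, size)
```

Thirty bits per vector means 2^30 possible indices, which is the "more than 1 billion vectors" quoted above.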

In SoundStream, we address this issue by proposing a new residual vector quantizer (RVQ), consisting of several layers (up to 80 in our experiments). The first layer quantizes the code vectors with moderate resolution, and each of the following layers processes the residual error from the previous one. By splitting the quantization process in several layers, the codebook size can be reduced drastically. As an example, with 100 vectors per second at 3 kbps, and using 5 quantizer layers, the codebook size goes from 1 billion to 320. Moreover, we can easily increase or decrease the bitrate by adding or removing quantizer layers, respectively.
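The layered scheme can be sketched in plain Python. This is an illustration of residual vector quantization in general, not the SoundStream implementation: the codebooks here are random and untrained, the vector dimension is arbitrary, and all function names are hypothetical. Only the layer count and per-layer size come from the example in the text.

```python
import random


def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vector)),
    )


def rvq_encode(codebooks, vector):
    """Quantize layer by layer; each layer encodes the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        indices.append(index)
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct the vector as the sum of the selected entries."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, index in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[index])]
    return out


def make_codebooks(num_layers, entries_per_layer, dim, seed=0):
    """Random codebooks whose scale shrinks with depth (illustration only)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.5 ** layer) for _ in range(dim)]
         for _ in range(entries_per_layer)]
        for layer in range(num_layers)
    ]


# From the text: 30 bits per vector over 5 layers gives 6 bits per layer.
NUM_LAYERS, BITS_PER_LAYER = 5, 6
codebooks = make_codebooks(NUM_LAYERS, 2 ** BITS_PER_LAYER, dim=4)
total_entries = sum(len(codebook) for codebook in codebooks)  # 5 * 64
```

Each layer stores 64 entries, so five layers store 320 in total while still spending 30 bits per vector. Passing only the first few indices to `rvq_decode` gives a coarser reconstruction at a lower bitrate, which is the property the next paragraphs build on.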

Because network conditions can vary while transmitting audio, ideally a codec should be “scalable” so that it can change its bitrate from low to high depending on the state of the network. While most traditional codecs are scalable, previous learnable codecs need to be trained and deployed specifically for each bitrate.

To circumvent this limitation, we leverage the fact that the number of quantization layers in SoundStream controls the bitrate, and propose a new method called “quantizer dropout”. During training, we randomly drop some quantization layers to simulate a varying bitrate. This pushes the decoder to perform well at any bitrate of the incoming audio stream, and thus helps SoundStream to become “scalable” so that a single trained model can operate at any bitrate, performing as well as models trained specifically for these bitrates.
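A minimal sketch of the sampling step follows. It assumes that the kept layers are always the first n, with n drawn uniformly for each training example; the text above only says that layers are dropped at random. The frame rate and bits per layer reuse the earlier 3 kbps example, and the function names are hypothetical.

```python
import random

FRAMES_PER_SECOND = 100  # from the earlier example in the text
BITS_PER_LAYER = 6       # from the 5-layer, 3 kbps example
MAX_LAYERS = 5


def sample_active_layers(rng: random.Random, max_layers: int = MAX_LAYERS) -> int:
    """Assumed dropout rule: keep the first n layers, n uniform in 1..max_layers."""
    return rng.randint(1, max_layers)


def bitrate_bps(active_layers: int) -> int:
    """Bitrate implied by the number of quantizer layers in use."""
    return active_layers * BITS_PER_LAYER * FRAMES_PER_SECOND


rng = random.Random(0)
for step in range(3):
    n = sample_active_layers(rng)
    # A real training step would quantize with layers [0, n) only, then
    # backpropagate the reconstruction and adversarial losses.
    print(step, n, bitrate_bps(n))
```

With these numbers the model sees bitrates from 600 bps (one layer) to 3 kbps (all five layers) during training, so at inference the sender can choose the layer count to match the network.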

Comparison of SoundStream models (higher is better) that are trained at 18 kbps with quantizer dropout (bitrate scalable), without quantizer dropout (not bitrate scalable) and evaluated with a variable number of quantizers, or trained and evaluated at a fixed bitrate (bitrate specific). The bitrate-scalable model (a single model for all bitrates) does not lose any quality when compared to bitrate-specific models (a different model for each bitrate), thanks to quantizer dropout.

A State-of-the-Art Audio Codec
SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches the quality of EVS at 9.6 kbps, while using 3.2x–4x fewer bits. This means that encoding audio with SoundStream can provide a similar quality while using a significantly lower amount of bandwidth. Moreover, at the same bitrate, SoundStream outperforms the current version of Lyra, which is based on an autoregressive network. Unlike Lyra, which is already deployed and optimized for production usage, SoundStream is still at an experimental stage. In the future, Lyra will incorporate the components of SoundStream to provide both higher audio quality and reduced complexity.

SoundStream at 3kbps vs. state-of-the-art codecs. MUSHRA score is an indication of subjective quality (the higher the better).

The demonstration of SoundStream’s performance compared to Opus, EVS, and the original Lyra codec is presented in these audio examples, a selection of which are provided below.

Speech

Reference
Lyra (3kbps)
Opus (6kbps)
EVS (5.9kbps)
SoundStream (3kbps)

Music

Reference
Lyra (3kbps)
Opus (6kbps)
EVS (5.9kbps)
SoundStream (3kbps)

Joint Audio Compression and Enhancement
In traditional audio processing pipelines, compression and enhancement (the removal of background noise) are typically performed by different modules. For example, it is possible to apply an audio enhancement algorithm at the transmitter side, before audio is compressed, or at the receiver side, after audio is decoded. In such a setup, each processing step contributes to the end-to-end latency. Conversely, we design SoundStream in such a way that compression and enhancement can be carried out jointly by the same model, without increasing the overall latency. In the following examples, we show that it is possible to combine compression with background noise suppression, by activating and deactivating denoising dynamically (no denoising for 5 seconds, denoising for 5 seconds, no denoising for 5 seconds, etc.).

Original noisy audio
Denoised output*

* Demonstrated by turning denoising on and off every 5 seconds.

Conclusion
Efficient compression is necessary whenever one needs to transmit audio, whether when streaming a video, or during a conference call. SoundStream is an important step towards improving machine learning-driven audio codecs. It outperforms state-of-the-art codecs, such as Opus and EVS, can enhance audio on demand, and requires deployment of only a single scalable model, rather than many.

SoundStream will be released as a part of the next, improved version of Lyra. By integrating SoundStream with Lyra, developers can leverage the existing Lyra APIs and tools for their work, providing both flexibility and better sound quality. We will also release it as a separate TensorFlow model for experimentation.

Acknowledgments
The work described here was authored by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund and Marco Tagliasacchi. We are grateful for all discussions and feedback on this work that we received from our colleagues at Google.