Join us for these featured GTC 2022 sessions to learn about optimizing PyTorch models, accelerating graph neural networks, improving GPU performance with automated code generation, and more.
As consumers expect faster, cheaper deliveries, companies are turning to AI to rethink how they move goods. Foremost among these new systems are “hub-and-spoke,” or middle-mile, operations, where companies place distribution centers closer to retail operations for quicker access to inventory. However, faster delivery is just part of the equation. These systems must also be…
Virtual reality (VR), augmented reality (AR), and mixed reality (MR) environments can feel incredibly real due to the physically immersive experience. Adding a voice-based interface to your extended reality (XR) application can make it appear even more realistic.
Imagine using your voice to navigate through an environment or giving a verbal command and hearing a response back from a virtual entity.
The possibilities of harnessing speech AI in XR environments are fascinating. Speech AI skills, such as automatic speech recognition (ASR) and text-to-speech (TTS), make XR applications enjoyable, easy to use, and more accessible to users with speech impairments.
This post explains how speech recognition, also referred to as speech-to-text (STT), can be used in your XR app, what ASR customizations are available, and how to get started with running ASR services in your Windows applications.
Why add speech AI services to XR applications?
In most of today’s XR experiences, users don’t have access to a keyboard or mouse. The way VR game controllers typically interact with a virtual experience is clumsy and unintuitive, making navigation through menus difficult when you’re immersed in the environment.
When virtually immersed, we want our experience to feel natural, both in how we perceive it and in how we interact with it. Speech is one of the most common interactions that we use in the real world.
Adding speech AI-enabled voice commands and responses to your XR application makes interaction feel much more natural and dramatically simplifies the learning curve for users.
Examples of speech AI-enabled XR applications
Today, there are a wide array of wearable tech devices that enable people to experience immersive realities while using their voice:
AR translation glasses can provide real-time translation in AR or simply transcribe spoken audio to help people with hearing impairments.
Branded voices are customized and developed for digital avatars in the metaverse, making the experience more believable and realistic.
Social media platforms provide voice-activated AR filters for ease of search and usability. For instance, Snapchat users can search for their desired digital filter using a hands-free voice scan feature.
VR design review
VR can help businesses save costs by automating a number of tasks in the automotive industry, such as modeling cars, training assembly workers, and running driving simulations.
An added speech AI component makes hands-free interactions possible. For example, users can leverage STT skills to give commands to VR apps, and apps can respond in a way that sounds human with TTS.
Figure 1. VR car design review workflow architecture
As shown in Figure 1, a user sends an audio request to a VR application that is then converted to text using ASR. Natural language understanding takes text as an input and generates a response, which is spoken back to the user using TTS.
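To make the workflow concrete, the following minimal Python sketch wires the three stages together. The function bodies are hypothetical placeholders, not Riva APIs; in a real application the ASR and TTS steps would call a speech AI service such as Riva, and the NLU step would call your dialog or command logic.

def transcribe(audio: bytes) -> str:
    """ASR placeholder: convert spoken audio to text."""
    return "show me the rear spoiler"

def generate_response(command_text: str) -> str:
    """NLU placeholder: map the recognized command to an app action and a reply."""
    return "Highlighting the rear spoiler now."

def synthesize(reply_text: str) -> bytes:
    """TTS placeholder: convert the reply text to audio."""
    return reply_text.encode("utf-8")  # stand-in for synthesized PCM audio

def handle_voice_command(audio: bytes) -> bytes:
    text = transcribe(audio)          # speech-to-text
    reply = generate_response(text)   # natural language understanding
    return synthesize(reply)          # text-to-speech, played back to the user

if __name__ == "__main__":
    print(handle_voice_command(b"raw audio bytes"))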
Developing speech AI pipelines is not as easy as it sounds. There has traditionally been a trade-off between accuracy and real-time response when building pipelines.
This post focuses solely on ASR, and we examine some of today’s available customizations for XR app developers. We also discuss using NVIDIA Riva, a GPU-accelerated speech AI SDK, for building applications customized for specific use cases while delivering real-time performance.
Solve domain- and language-specific challenges with ASR customizations
An ASR pipeline includes a feature extractor, acoustic model, decoder or language model, and punctuation and capitalization model (Figure 2).
Figure 2. ASR pipeline
To understand the ASR customizations available, it’s important to grasp the end-to-end process. First, feature extraction turns raw audio waveforms into spectrograms or mel spectrograms. These spectrograms are then fed into an acoustic model, which generates a matrix with probabilities for all the characters at each time step.
Next, the decoder, in conjunction with the language model, uses that matrix as an input to produce a transcript. You can then run the resulting transcript through the punctuation and capitalization model to improve readability.
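As a simplified illustration of the decoding step, the following sketch greedily collapses a toy acoustic-model probability matrix into text, CTC-style: take the best symbol per frame, merge repeats, and drop blanks. The vocabulary and probabilities here are made up for the example; Riva’s decoder and language model operate on the same kind of matrix but with far more context.

import numpy as np

# Hypothetical character set; index 0 is the blank symbol.
VOCAB = ["<blank>", " ", "a", "c", "t"]

def greedy_ctc_decode(probs: np.ndarray) -> str:
    """Collapse a [time, vocab] probability matrix into text:
    take the best symbol per frame, merge repeats, drop blanks."""
    best = probs.argmax(axis=1)
    chars = []
    prev = None
    for idx in best:
        if idx != prev and idx != 0:
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars)

# Toy acoustic-model output: 6 time steps over a 5-symbol vocabulary.
probs = np.array([
    [0.1, 0.0, 0.0, 0.8, 0.1],  # "c"
    [0.1, 0.0, 0.8, 0.0, 0.1],  # "a"
    [0.8, 0.1, 0.0, 0.0, 0.1],  # blank
    [0.1, 0.0, 0.0, 0.0, 0.9],  # "t"
    [0.1, 0.0, 0.0, 0.0, 0.9],  # "t" (repeat, merged away)
    [0.9, 0.1, 0.0, 0.0, 0.0],  # blank
])
print(greedy_ctc_decode(probs))  # -> "cat"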
Advanced speech AI SDKs and workflows, such as Riva, support speech recognition pipeline customization. Customization helps you address several language-specific challenges, such as understanding one or more of the following:
Multiple accents
Word contextualization
Domain-specific jargon
Multiple dialects
Multiple languages
Users in noisy environments
Customizations in Riva can be applied in both the training and inference stages. Starting with training-level customizations, you can fine-tune acoustic models, decoder/language models, and punctuation and capitalization models. This ensures that your pipeline understands different languages, dialects, accents, and industry-specific jargon, and is robust to noise.
When it comes to inference-level customizations, you can use word boosting. With word boosting, the ASR pipeline is more likely to recognize certain words of interest by giving them a higher score when decoding the output of the acoustic model.
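Here is a minimal sketch of inference-time word boosting with the Riva Python client. The boosted terms and score are illustrative, and the add_word_boosting_to_config helper is assumed from the Riva Python client tutorials; check your installed nvidia-riva-client version for the exact API.

import riva.client

auth = riva.client.Auth(uri='server address:port number')
riva_asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig()
config.language_code = "en-US"
config.enable_automatic_punctuation = True
config.audio_channel_count = 1

# Give domain-specific terms a higher score during decoding so the pipeline
# is more likely to recognize them. Helper name, word list, and score are
# assumptions based on the Riva tutorials, not verified here.
riva.client.add_word_boosting_to_config(
    config, boosted_lm_words=["spoiler", "teleport"], boosted_lm_score=20.0
)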
Get started with integrating ASR services for XR development using NVIDIA Riva
Riva runs as a client-server model. To run Riva, you need access to a Linux server with an NVIDIA GPU, where you can install and run the Riva server (specifics and instructions are provided in this post).
The Riva client API is integrated into your Windows application. At runtime, the Windows client sends Riva requests over the network to the Riva server, and the Riva server sends back replies. A single Riva server can simultaneously support many Riva clients.
ASR services can be run in two different modes:
Offline mode: A complete speech segment is captured and then sent to Riva to be converted to text.
Streaming mode: The speech segment is streamed to the Riva server in real time, and the text result is streamed back in real time. Streaming mode is a bit more complicated, as it requires multiple threads.
Examples showing both modes are provided later in this post.
In this section, you learn several ways to integrate Riva into your Windows application:
Python ASR offline client
Python streaming ASR client
C++ offline client using Docker
C++ streaming client
First, here’s how to set up and run the Riva server.
Install the NGC CLI and follow all steps so that you can run ngc commands from a command-line interface (CLI).
You need access to an NVIDIA Volta, NVIDIA Turing, or NVIDIA Ampere architecture-based GPU, such as an A100. Linux servers with NVIDIA GPUs are also available from the major CSPs. For more information, see the support matrix.
Install Docker with support for NVIDIA GPUs. For installation instructions, see the installation guide.
Follow the instructions to install the NVIDIA Container Toolkit and then the nvidia-docker package.
Server setup
Download the scripts from NGC by running the following command:
ngc registry resource download-version nvidia/riva/riva_quickstart:2.4.0
First, run the following command to install the riva client package. Make sure that you’re using Python version 3.7.
pip install nvidia-riva-client
The following code example runs ASR transcription in offline mode. You must change the server address, provide the path to the audio file to be transcribed, and select the language code of your choice. Currently, Riva supports English, Spanish, German, Russian, and Mandarin.
import io
import IPython.display as ipd
import grpc
import riva.client

auth = riva.client.Auth(uri='server address:port number')
riva_asr = riva.client.ASRService(auth)

# Supports .wav files in LINEAR_PCM encoding, as well as .alaw, .mulaw, and .flac formats, with a single channel
# Read in an audio file from local disk
path = "audio file path"
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

# Set up an offline/batch recognition request
config = riva.client.RecognitionConfig()
#config.encoding = riva.client.AudioEncoding.LINEAR_PCM  # Audio encoding can be detected from wav
#config.sample_rate_hertz = 0                            # Sample rate can be detected from wav and resampled if needed
config.language_code = "en-US"                   # Language code of the audio clip
config.max_alternatives = 1                      # How many top-N hypotheses to return
config.enable_automatic_punctuation = True       # Add punctuation when end of VAD detected
config.audio_channel_count = 1                   # Mono channel

response = riva_asr.offline_recognize(content, config)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("\n\nFull Response Message:")
print(response)
Running the Python streaming ASR client
To run an ASR streaming client, clone the riva python-clients repository and run the streaming client script that comes with the repository.
After cloning the repository on your Windows machine, run the client with the Riva server address and the audio file to transcribe, for example:
riva_asr_client --riva_url server address:port number --audio_file audio_sample
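Alternatively, here is a short sketch of a streaming request written directly against the nvidia-riva-client Python API. It assumes the streaming helpers (StreamingRecognitionConfig, AudioChunkFileIterator, and streaming_response_generator) available in recent releases of the package; verify the exact names against your installed version.

import riva.client

auth = riva.client.Auth(uri='server address:port number')
riva_asr = riva.client.ASRService(auth)

config = riva.client.RecognitionConfig()
config.language_code = "en-US"
config.enable_automatic_punctuation = True
config.audio_channel_count = 1
streaming_config = riva.client.StreamingRecognitionConfig(config=config, interim_results=True)

# Stream the file in small chunks to simulate live microphone input
audio_chunks = riva.client.AudioChunkFileIterator("audio file path", 4800)
responses = riva_asr.streaming_response_generator(audio_chunks, streaming_config)
for response in responses:
    for result in response.results:
        if result.is_final:
            print(result.alternatives[0].transcript)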
Running the C++ ASR streaming client
To run the ASR streaming client riva_asr in C++, you must first compile the C++ sample. This is straightforward using CMake, after the following dependencies are met:
gflags
glog
grpc
rtaudio
rapidjson
protobuf
grpc_cpp_plugin
Create a build folder within the root source folder, change into it, and from the terminal run cmake .. followed by make. For more information, see the readme file included in the repository.
After the sample is compiled, run it by entering the following command:
riva_asr.exe --riva_uri={riva server url}:{riva server port} --audio_device={Input device name, e.g. "plughw:PCH,0"}
riva_uri: The address:port value of the Riva server. By default, the Riva server listens on port 50051.
audio_device: The input device (microphone) to be used.
The sample implements essentially four steps. Only a few short examples are shown in this post. For more information, see the file streaming_recognize_client.cc.
Open the input stream using the input (microphone) device specified from the command line. In this case, you are using one channel at 16K samples per second and 16 bits per sample.
Open the gRPC communication channel with the Riva server using the protocol API interface specified by the .proto files (in the source folder riva/proto):
int StreamingRecognizeClient::DoStreamingFromMicrophone(const std::string& audio_device, bool& request_exit)
{
…
std::shared_ptr<ClientCall> call = std::make_shared<ClientCall>(1, word_time_offsets_);
call->streamer = stub_->StreamingRecognize(&call->context);
// Send first request
nr_asr::StreamingRecognizeRequest request;
auto streaming_config = request.mutable_streaming_config();
streaming_config->set_interim_results(interim_results_);
auto config = streaming_config->mutable_config();
config->set_sample_rate_hertz(sampleRate);
config->set_language_code(language_code_);
config->set_encoding(encoding);
config->set_max_alternatives(max_alternatives_);
config->set_audio_channel_count(parameters.nChannels);
config->set_enable_word_time_offsets(word_time_offsets_);
config->set_enable_automatic_punctuation(automatic_punctuation_);
config->set_enable_separate_recognition_per_channel(separate_recognition_per_channel_);
config->set_verbatim_transcripts(verbatim_transcripts_);
if (model_name_ != "") {
config->set_model(model_name_);
}
call->streamer->Write(request);
Start sending the audio data received from the microphone to Riva through gRPC messages:
static int MicrophoneCallbackMain( void *outputBuffer, void *inputBuffer, unsigned int nBufferFrames, double streamTime, RtAudioStreamStatus status, void *userData )
Receive the transcribed text through gRPC responses from the server:
void
StreamingRecognizeClient::ReceiveResponses(std::shared_ptr<ClientCall> call, bool audio_device)
{
…
while (call->streamer->Read(&call->response)) {  // Returns false when no more to read.
call->recv_times.push_back(std::chrono::steady_clock::now());
// Reset the partial transcript
call->latest_result_.partial_transcript = "";
call->latest_result_.partial_time_stamps.clear();
bool is_final = false;
for (int r = 0; r < call->response.results_size(); ++r) {
const auto& result = call->response.results(r);
if (result.is_final()) {
is_final = true;
}
…
call->latest_result_.audio_processed = result.audio_processed();
if (print_transcripts_) {
call->AppendResult(result);
}
}
if (call->response.results_size() && interim_results_ && print_transcripts_) {
  std::cout << call->latest_result_.final_transcripts[0] +
                   call->latest_result_.partial_transcript
            << std::endl;
}
call->recv_final_flags.push_back(is_final);
}
Resources for developing speech AI applications
Whether it is recognizing your voice or carrying out a command, speech AI is expanding from empowering actual humans in contact centers to empowering digital humans in the metaverse.
For more information about how to add speech AI skills to your applications, see the following resources:
Access beginner and advanced scripts in the /nvidia-riva/tutorials GitHub repo to try out ASR and TTS augmentations such as ASR word boosting and adjusting TTS pitch, rate, and pronunciation settings.
Learn how to add ASR or TTS services to your specific use case by downloading the free ebook, Building Speech AI Applications.
AI processing requires full-stack innovation across hardware and software platforms to address the growing computational demands of neural networks. A key area to drive efficiency is using lower precision number formats to improve computational efficiency, reduce memory usage, and optimize for interconnect bandwidth.
To realize these benefits, the industry has moved from 32-bit precision to 16-bit, and now even 8-bit precision formats. Transformer networks, which are one of the most important innovations in AI, benefit from 8-bit floating point precision in particular. We believe that a common interchange format will enable rapid advancements and interoperability across both hardware and software platforms, advancing computing as a whole.
NVIDIA, Arm, and Intel have jointly authored a whitepaper, FP8 Formats for Deep Learning, describing an 8-bit floating point (FP8) specification. It provides a common format that accelerates AI development by optimizing memory usage and works for both AI training and inference. This FP8 specification has two variants, E5M2 and E4M3.
This format is natively implemented in the NVIDIA Hopper architecture and has shown excellent results in initial testing. It will immediately benefit from the work being done by the broader ecosystem, including the AI frameworks, in implementing it for developers.
Compatibility and flexibility
FP8 minimizes deviations from existing IEEE 754 floating point formats with a good balance between hardware and software to leverage existing implementations, accelerate adoption, and improve developer productivity.
E5M2 uses five bits for the exponent and two bits for the mantissa and is a truncated IEEE FP16 format. In circumstances where more precision is required at the expense of some numerical range, the E4M3 format makes a few adjustments to extend the range representable with a four-bit exponent and a three-bit mantissa.
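As an illustration of the two bit layouts, the following Python sketch decodes a single FP8 byte under either interpretation, following the layouts and special-value rules summarized above: E5M2 keeps IEEE-style infinities and NaNs, while E4M3 reclaims most of those encodings for normal values and keeps only the all-ones exponent and mantissa pattern as NaN. It is illustrative only; hardware additionally handles rounding, saturation, and conversion.

def decode_fp8(byte: int, fmt: str = "e4m3") -> float:
    """Decode one FP8 byte under the E4M3 or E5M2 interpretation."""
    assert 0 <= byte <= 0xFF
    if fmt == "e5m2":
        exp_bits, man_bits, bias = 5, 2, 15
    elif fmt == "e4m3":
        exp_bits, man_bits, bias = 4, 3, 7
    else:
        raise ValueError(fmt)

    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1

    if fmt == "e5m2" and exp == max_exp:                 # IEEE-style specials
        return sign * float("inf") if man == 0 else float("nan")
    if fmt == "e4m3" and exp == max_exp and man == (1 << man_bits) - 1:
        return float("nan")                              # E4M3 has no infinities
    if exp == 0:                                         # subnormal numbers
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

print(decode_fp8(0b0_1111_110, "e4m3"))  # 448.0, the E4M3 maximum
print(decode_fp8(0b0_11110_11, "e5m2"))  # 57344.0, the E5M2 maximum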
The new format saves additional computational cycles since it uses just eight bits. It can be used for both AI training and inference without requiring any re-casting between precisions. Furthermore, by minimizing deviations from existing floating point formats, it enables the greatest latitude for future AI innovation while still adhering to current conventions.
High-accuracy training and inference
Testing the proposed FP8 format shows comparable accuracy to 16-bit precisions across a wide array of use cases, architectures, and networks. Results on transformers, computer vision, and GAN networks all show that FP8 training accuracy is similar to 16-bit precisions while delivering significant speedups. For more information about accuracy studies, see the FP8 Formats for Deep Learning whitepaper.
Figure 1. Language model AI training
In Figure 1, different networks use different accuracy metrics (PPL and Loss), as indicated.
Figure 2. Language model AI inference
In MLPerf Inference v2.1, the AI industry’s leading benchmark, NVIDIA Hopper leveraged this new FP8 format to deliver a 4.5x speedup on the BERT high-accuracy model, gaining throughput without compromising on accuracy.
Moving towards standardization
NVIDIA, Arm, and Intel have published this specification in an open, license-free format to encourage broad industry adoption. They will also submit this proposal to IEEE.
With an interchangeable format that maintains accuracy, AI models can operate consistently and performantly across all hardware platforms, helping to advance the state of the art of AI.
Standards bodies and the industry as a whole are encouraged to build platforms that can efficiently adopt the new standard. This will help accelerate AI development and deployment by providing a universal, interchangeable precision.
Posted by Daniel Rebain, Student Researcher, and Mark Matthews, Senior Software Engineer, Google Research, Perception Team
An important aspect of human vision is our ability to comprehend 3D shape from the 2D images we observe. Achieving this kind of understanding with computer vision systems has been a fundamental challenge in the field. Many successful approaches rely on multi-view data, where two or more images of the same scene are available from different perspectives, which makes it much easier to infer the 3D shape of objects in the images.
There are, however, many situations where it would be useful to know 3D structure from a single image, but this problem is generally difficult or impossible to solve. For example, it isn’t necessarily possible to tell the difference between an image of an actual beach and an image of a flat poster of the same beach. However, it is possible to estimate 3D structure based on what kinds of 3D objects occur commonly and what similar structures look like from different perspectives.
In “LOLNeRF: Learn from One Look”, presented at CVPR 2022, we propose a framework that learns to model 3D structure and appearance from collections of single-view images. LOLNeRF learns the typical 3D structure of a class of objects, such as cars, human faces or cats, but only from single views of any one object, never the same object twice. We build our approach by combining Generative Latent Optimization (GLO) and neural radiance fields (NeRF) to achieve state-of-the-art results for novel view synthesis and competitive results for depth estimation.
We learn a 3D object model by reconstructing a large collection of single-view images using a neural network conditioned on latent vectors, z (left). This allows for a 3D model to be lifted from the image, and rendered from novel viewpoints. Holding the camera fixed, we can interpolate or sample novel identities (right).
Combining GLO and NeRF
GLO is a general method that learns to reconstruct a dataset (such as a set of 2D images) by co-learning a neural network (decoder) and table of codes (latents) that is also an input to the decoder. Each of these latent codes re-creates a single element (such as an image) from the dataset. Because the latent codes have fewer dimensions than the data elements themselves, the network is forced to generalize, learning common structure in the data (such as the general shape of dog snouts).
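The following PyTorch sketch shows the core GLO idea under toy, assumed dimensions: a learnable table of per-image latent codes is optimized jointly with a decoder so that each code alone reconstructs its image. It is a simplified stand-in for the paper’s setup, not the actual LOLNeRF training code.

import torch
from torch import nn

# Toy GLO setup: one learnable latent code per training image, co-optimized
# with a decoder that reconstructs that image from its code alone.
num_images, latent_dim, image_dim = 1000, 64, 32 * 32 * 3  # assumed sizes

latents = nn.Embedding(num_images, latent_dim)   # the table of codes
decoder = nn.Sequential(                         # stand-in for the real decoder
    nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, image_dim)
)
optimizer = torch.optim.Adam(
    list(latents.parameters()) + list(decoder.parameters()), lr=1e-3
)

images = torch.rand(num_images, image_dim)       # placeholder training data

for step in range(100):
    idx = torch.randint(0, num_images, (32,))    # a mini-batch of image indices
    recon = decoder(latents(idx))                # reconstruct from codes only
    loss = ((recon - images[idx]) ** 2).mean()   # per-image reconstruction loss
    optimizer.zero_grad()
    loss.backward()                              # gradients also flow into the codes
    optimizer.step()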
NeRF is a technique that is very good at reconstructing a static 3D object from 2D images. It represents an object with a neural network that outputs color and density for each point in 3D space. Color and density values are accumulated along rays, one ray for each pixel in a 2D image. These are then combined using standard computer graphics volume rendering to compute a final pixel color. Importantly, all these operations are differentiable, allowing for end-to-end supervision. By enforcing that each rendered pixel (of the 3D representation) matches the color of ground truth (2D) pixels, the neural network creates a 3D representation that can be rendered from any viewpoint.
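For reference, here is a minimal NumPy sketch of the volume rendering step described above for a single ray: per-sample densities are turned into opacities, attenuated by the transmittance accumulated along the ray, and used to weight the per-sample colors into one pixel color. The sample values are arbitrary.

import numpy as np

def render_ray(colors, densities, deltas):
    """Composite one ray: colors (N, 3), densities (N,), and deltas (N,)
    are the per-sample RGB, density, and spacing along the ray."""
    alphas = 1.0 - np.exp(-densities * deltas)                        # opacity per sample
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))    # transmittance
    weights = alphas * trans                                          # contribution per sample
    return (weights[:, None] * colors).sum(axis=0)

# Three samples along a toy ray: the dense red sample contributes most.
colors = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
densities = np.array([0.05, 5.0, 5.0])
deltas = np.array([0.1, 0.1, 0.1])
print(render_ray(colors, densities, deltas))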
We combine NeRF with GLO by assigning each object a latent code and concatenating it with standard NeRF inputs, giving it the ability to reconstruct multiple objects. Following GLO, we co-optimize these latent codes along with network weights during training to reconstruct the input images. Unlike standard NeRF, which requires multiple views of the same object, we supervise our method with only single views of any one object (but multiple examples of that type of object). Because NeRF is inherently 3D, we can then render the object from arbitrary viewpoints. Combining NeRF with GLO gives it the ability to learn common 3D structure across instances from only single views while still retaining the ability to recreate specific instances of the dataset.
Camera Estimation
In order for NeRF to work, it needs to know the exact camera location, relative to the object, for each image. Unless this was measured when the image was taken, it is generally unknown. Instead, we use the MediaPipe Face Mesh to extract five landmark locations from the images. Each of these 2D predictions corresponds to a semantically consistent point on the object (e.g., the tip of the nose or corners of the eyes). We can then derive a set of canonical 3D locations for the semantic points, along with estimates of the camera poses for each image, such that the projection of the canonical points into the images is as consistent as possible with the 2D landmarks.
We train a per-image table of latent codes alongside a NeRF model. Output is subject to per-ray RGB, mask and hardness losses. Cameras are derived from a fit of predicted landmarks to canonical 3D keypoints.
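As a simplified stand-in for that fitting step, the sketch below recovers one image’s camera pose from five 2D landmarks and a fixed set of canonical 3D points using OpenCV’s solvePnP. The landmark coordinates, canonical points, and pinhole intrinsics are all made-up placeholders; the paper jointly optimizes the canonical points and all camera poses rather than solving each image independently.

import numpy as np
import cv2

# Canonical 3D locations (in an assumed face-centered frame) for five semantic
# landmarks, and their detected 2D positions in one image. In practice the 2D
# points come from MediaPipe Face Mesh; the values below are placeholders.
canonical_3d = np.array([
    [0.0, 0.0, 0.0],         # nose tip
    [-0.03, 0.03, -0.03],    # left eye corner
    [0.03, 0.03, -0.03],     # right eye corner
    [-0.025, -0.03, -0.02],  # left mouth corner
    [0.025, -0.03, -0.02],   # right mouth corner
], dtype=np.float64)
landmarks_2d = np.array([
    [320.0, 260.0], [290.0, 230.0], [350.0, 230.0],
    [300.0, 300.0], [340.0, 300.0],
], dtype=np.float64)

focal, cx, cy = 800.0, 320.0, 240.0   # assumed pinhole intrinsics
camera_matrix = np.array([[focal, 0, cx], [0, focal, cy], [0, 0, 1]], dtype=np.float64)

# EPnP works with as few as four correspondences.
ok, rvec, tvec = cv2.solvePnP(
    canonical_3d, landmarks_2d, camera_matrix, None, flags=cv2.SOLVEPNP_EPNP
)
print(ok, rvec.ravel(), tvec.ravel())  # camera rotation and translation relative to the face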
Hard Surface and Mask Losses
Standard NeRF is effective for accurately reproducing the images, but in our single-view case, it tends to produce images that look blurry when viewed off-axis. To address this, we introduce a novel hard surface loss, which encourages the density to adopt sharp transitions from exterior to interior regions, reducing blurring. This essentially tells the network to create “solid” surfaces, and not semi-transparent ones like clouds.
We also obtained better results by splitting the network into separate foreground and background networks. We supervised this separation with a mask from the MediaPipe Selfie Segmenter and a loss to encourage network specialization. This allows the foreground network to specialize only on the object of interest, and not get “distracted” by the background, increasing its quality.
Results
We surprisingly found that fitting only five key points gave accurate enough camera estimates to train a model for cats, dogs, or human faces. This means that given only a single view of your beloved cats Schnitzel, Widget and friends, you can create a new image from any other angle.
Top: example cat images from AFHQ. Bottom: A synthesis of novel 3D views created by LOLNeRF.
Conclusion
We’ve developed a technique that is effective at discovering 3D structure from single 2D images. We see great potential in LOLNeRF for a variety of applications and are currently investigating potential use-cases.
Interpolation of feline identities from linear interpolation of learned latent codes for different examples in AFHQ.
Code Release
We acknowledge the potential for misuse and importance of acting responsibly. To that end, we will only release the code for reproducibility purposes, but will not release any trained generative models.
Acknowledgements
We would like to thank Andrea Tagliasacchi, Kwang Moo Yi, Viral Carpenter, David Fleet, Danica Matthews, Florian Schroff, Hartwig Adam and Dmitry Lagun for continuous help in building this technology.
Think fast. Enterprise AI, new gaming technology, the metaverse and the 3D internet, and advanced AI technologies tailored to just about every industry are all coming your way. NVIDIA founder and CEO Jensen Huang’s keynote at NVIDIA GTC on Tuesday, Sept. 20, is the best way to get ahead of all these trends. NVIDIA’s virtual…
More than 6 million pairs of eyes will be on real-time AI avatar technology in this week’s finale of America’s Got Talent — currently the second-most popular primetime TV show in the U.S. Metaphysic, a member of the NVIDIA Inception global network of technology startups, is one of 11 acts competing for $1 million and…
GPU-accelerated processing is vital to many automotive and embedded systems. Safety-critical and real-time applications have different requirements and deployment priorities than consumer applications, but they often are developed using GPU APIs that have been primarily designed for use in games.
Vulkan SC (Safety Critical) is a newly released open standard to streamline the use of GPUs in markets where functional safety and hitch-free performance are essential.
NVIDIA helped lead the creation of the Vulkan SC 1.0 API and is now shipping production drivers on its NVIDIA DRIVE and NVIDIA Jetson platforms.
Deterministic GPU processing
Vulkan is a royalty-free open standard from the Khronos Group standards organization. It is the only modern, cross-platform GPU API. Launched in 2016, Vulkan is primarily designed for use in games and professional design applications on desktop and mobile devices using Windows, Linux, and Android.
Khronos derived Vulkan SC from Vulkan 1.2, with the Vulkan SC 1.0 specification being released in March 2022. Vulkan SC defines the subset of the Vulkan API that is essential for embedded markets in order to reduce API surface area for streamlined implementation and testing.
Vulkan SC also increases API robustness by eliminating ignored parameters and undefined behaviors, and enhancing detection, reporting, and correction of run-time faults. Vulkan SC enables predictable, hitch-free execution by moving pipeline compilation offline, and providing sophisticated functionality for managing static memory allocation and resource management with explicit synchronization.
Vulkan SC and the NVIDIA DRIVE automotive platform
The streamlined Vulkan SC API reduces the cost and effort of system-level safety certification to standards such as ISO 26262, a functional safety standard used in the automotive industry. Simplifying system certification enables manufacturers to smoothly deploy advanced graphics capabilities in driver assistance systems on the NVIDIA DRIVE platform.
For example, Level 2 and Level 3 AI-assisted vehicles require the driver to remain in the loop during vehicle operation. Safe visualization inside the cockpit and the digital instrument cluster is key to ensuring the human driver is aware of how the system is perceiving and reacting to the surrounding environment.
The confidence view is a rendering of the mind of the vehicle’s AI and how it sees the world. It shows exactly what the sensor suite and perception system are detecting in real time using a 3D surround model. By incorporating this view in the cabin interior, the vehicle can communicate to its occupants the accuracy and reliability of the autonomous driving system at every step of the journey.
The ability to support such in-vehicle graphics safely and securely is what makes Vulkan SC critical to the next-generation intelligent vehicle experience. Production Vulkan SC 1.0 drivers are included in DRIVE OS 6.0.4.0, which shipped August 29, 2022.
Vulkan SC on the NVIDIA Jetson embedded platform
NVIDIA Jetson is the world’s leading platform for autonomous machines and other embedded applications. It includes Jetson modules, which are small form-factor, high-performance computers, the NVIDIA JetPack SDK for accelerating software, and an ecosystem with sensors, SDKs, services, and products to speed development.
Applications for Jetson-based systems typically do not require formal safety certification. However, many embedded and autonomous systems can directly benefit from the deterministic, real-time GPU graphics and compute acceleration provided by Vulkan SC. With these capabilities, the Jetson platform can support a broader diversity of applications.
The NVIDIA JetPack 5.0.2 SDK, released on August 15, 2022, includes conformant, production Vulkan SC 1.0 drivers for the Linux OS.
Ongoing NVIDIA commitment to the Vulkan SC API
NVIDIA will continue to invest in the evolution of the Vulkan SC open standard API at Khronos. We are committed to providing conformant, production drivers on platforms such as NVIDIA DRIVE and Jetson.
Later in 2022, NVIDIA will also ship support for Vulkan SC in NVIDIA Nsight developer tools. Vulkan SC streamlines the open, cross-platform Vulkan API for deterministic GPU graphics and compute, enabling advanced applications and use cases on safety-certified and real-time embedded platforms.
Now, NVIDIA provides industry-leading support for this groundbreaking open standard, enabling GPU acceleration in new classes of products. Download the latest NVIDIA DRIVE or NVIDIA JetPack releases with Vulkan SC drivers today.