Categories
Misc

Upcoming Event: Deep Learning Framework Sessions at GTC 2022

Join us for these featured GTC 2022 sessions to learn about optimizing PyTorch models, accelerating graph neural networks, improving GPU performance with automated code generation, and more.

Categories
Misc

Explainer: What Is an Exaflop?

An exaflop is a measure of performance for a supercomputer that can calculate at least one quintillion floating point operations per second.

Categories
Misc

Reinventing the Wheel: Gatik’s Apeksha Kumavat Accelerates Autonomous Delivery for Wal-Mart and More

As consumers expect faster, cheaper deliveries, companies are turning to AI to rethink how they move goods. Foremost among these new systems are “hub-and-spoke,” or middle-mile, operations, where companies place distribution centers closer to retail operations for quicker access to inventory. However, faster delivery is just part of the equation. These systems must also be…

Categories
Misc

Developing the Next Generation of Extended Reality Applications with Speech AI

Virtual reality (VR), augmented reality (AR), and mixed reality (MR) environments can feel incredibly real due to the physically immersive experience. Adding a voice-based interface to your extended reality (XR) application can make it appear even more realistic.

Imagine using your voice to navigate through an environment or giving a verbal command and hearing a response back from a virtual entity.

Sign up for the latest Speech AI News from NVIDIA.

The possibilities for harnessing speech AI in XR environments are fascinating. Speech AI skills, such as automatic speech recognition (ASR) and text-to-speech (TTS), make XR applications enjoyable, easy to use, and more accessible to users with speech impairments.

This post explains how speech recognition, also referred to as speech-to-text (STT), can be used in your XR app, what ASR customizations are available, and how to get started with running ASR services in your Windows applications.

Why add speech AI services to XR applications? 

In most of today’s XR experiences, users don’t have access to a keyboard or mouse. Interacting with a virtual experience through typical VR game controllers is clumsy and unintuitive, making it difficult to navigate menus while you’re immersed in the environment.

When virtually immersed, we want our experience to feel natural, both in how we perceive it and in how we interact with it. Speech is one of the most common interactions that we use in the real world.

Adding speech AI-enabled voice commands and responses to your XR application makes interaction feel much more natural and dramatically simplifies the learning curve for users.

Examples of speech AI-enabled XR applications

Today, there are a wide array of wearable tech devices that enable people to experience immersive realities while using their voice:

  • AR translation glasses can provide real-time translation in AR or just transcribe spoken audio in AR to help people with hearing impairments.
  • Branded voices are customized and developed for digital avatars in the metaverse, making the experience more believable and realistic.
  • Social media platforms provide voice-activated AR filters for ease of search and usability. For instance, Snapchat users can search for their desired digital filter using a hands-free voice scan feature.

VR design review

VR can help businesses save costs by automating a number of tasks in the automotive industry, such as modeling cars, training assembly workers, and simulating driving.

An added speech AI component makes hands-free interactions possible. For example, users can leverage STT skills to give commands to VR apps, and apps can respond in a way that sounds human with TTS.

Diagram showing how automatic speech recognition and text-to-speech integrate into the VR car design workflow architecture.
Figure 1. VR car design review workflow architecture

As shown in Figure 1, a user sends an audio request to a VR application that is then converted to text using ASR. Natural language understanding takes text as an input and generates a response, which is spoken back to the user using TTS.

Developing speech AI pipelines is not as easy as it sounds. Traditionally, there has always been a trade-off between accuracy and real-time response when building pipelines.

This post focuses solely on ASR, and we examine some of today’s available customizations for XR app developers. We also discuss using NVIDIA Riva, a GPU-accelerated speech AI SDK, for building applications customized for specific use cases while delivering real-time performance.

Solve domain- and language-specific challenges with ASR customizations

An ASR pipeline includes a feature extractor, acoustic model, decoder or language model, and punctuation and capitalization model (Figure 2).

Diagram shows the speech recognition pipeline process with an audio input and text output.
Figure 2. ASR pipeline

To understand the ASR customizations available, it’s important to grasp the end-to-end process. First, feature extraction takes place to turn raw audio waveforms into spectrograms / mel spectrograms. These spectrograms are then fed into an acoustic model that generates a matrix with probabilities for all the characters at each time step.

Next, the decoder, in conjunction with the language model, uses that matrix as an input to produce a transcript. You can then run the resulting transcript through the punctuation and capitalization model to improve readability.
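
As a small illustration of the first stage, the following sketch converts a waveform into a log-mel spectrogram using librosa (an assumption for this example; Riva performs feature extraction internally), producing the kind of input an acoustic model consumes.

# Minimal feature-extraction sketch (librosa assumed; not part of Riva itself)
import librosa

# Placeholder path, mirroring the examples later in this post
waveform, sample_rate = librosa.load("audio file path", sr=16000)

# 80-band mel spectrogram with a 10 ms hop at 16 kHz (160 samples)
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sample_rate, n_fft=512, hop_length=160, n_mels=80
)
log_mel = librosa.power_to_db(mel)   # shape: (80, number_of_frames)
print(log_mel.shape)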

Advanced speech AI SDKs and workflows, such as Riva, support speech recognition pipeline customization. Customization helps you address several language-specific challenges, such as understanding one or more of the following:

  • Multiple accents
  • Word contextualization
  • Domain-specific jargon
  • Multiple dialects
  • Multiple languages
  • Users in noisy environments

Customizations in Riva can be applied in both the training and inference stages. Starting with training-level customizations, you can fine-tune acoustic models, decoder/language models, and punctuation and capitalization models. This ensures that your pipeline understands different languages, dialects, accents, and industry-specific jargon, and is robust to noise.

When it comes to inference-level customizations, you can use word boosting. With word boosting, the ASR pipeline is more likely to recognize certain words of interest by giving them a higher score when decoding the output of the acoustic model.
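
For example, the same riva.client.add_word_boosting_to_config helper used in the streaming client later in this post can be applied as follows; the boosted words and score here are illustrative values only.

import riva.client

# Streaming recognition config, as in the transcribe_mic.py example below
config = riva.client.StreamingRecognitionConfig(
    config=riva.client.RecognitionConfig(language_code="en-US"),
    interim_results=True,
)

# Boost domain-specific terms so the decoder is more likely to emit them
boosted_words = ["Riva", "NVIDIA Jetson"]   # illustrative terms
riva.client.add_word_boosting_to_config(config, boosted_words, 20.0)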

Get started with integrating ASR services for XR development using NVIDIA Riva

Riva runs as a client-server model. To run Riva, you need access to a Linux server with an NVIDIA GPU, where you can install and run the Riva server (specifics and instructions are provided in this post).

The Riva client API is integrated into your Windows application. At runtime, the Windows client sends Riva requests over the network to the Riva server, and the Riva server sends back replies. A single Riva server can simultaneously support many Riva clients.

ASR services can be run in two different modes:

  • Offline mode: A complete speech segment is captured and then sent to Riva to be converted to text.
  • Streaming mode: The speech segment is streamed to the Riva server in real time, and the text results are streamed back in real time. Streaming mode is a bit more complicated, as it requires multiple threads.

Examples showing both modes are provided later in this post.

In this section, you learn several ways to integrate Riva into your Windows application:

  • Python ASR offline client
  • Python streaming ASR client
  • C++ offline client using Docker
  • C++ streaming client

First, here’s how to set up and run the Riva server.

Prerequisites

  • Access to NGC. For step-by-step instructions, see the NGC Getting Started Guide
    • Follow all steps to be able to run ngc commands from a command-line interface (CLI).
  • Access to NVIDIA Volta, NVIDIA Turing, or an NVIDIA Ampere Architecture-based A100 GPU. Linux servers with NVIDIA GPUs are also available from the major CSPs. For more information, see the support matrix.
  • Docker installation with support for NVIDIA GPUs. For installation instructions, see the installation guide.
    • Follow the instructions to install the NVIDIA Container Toolkit and then the nvidia-docker package.

Server setup

Download the scripts from NGC by running the following command:

ngc registry resource download-version nvidia/riva/riva_quickstart:2.4.0

Initialize the Riva server:

bash riva_init.sh

Start the Riva server:

bash riva_start.sh

For more information about the latest updates, see the NVIDIA Riva Quick Start Guide.

Running the Python ASR offline client

First, run the following command to install the riva client package. Make sure that you’re using Python version 3.7.

pip install nvidia-riva-client

The following code example runs ASR transcription in offline mode. You must change the server address, give the path to the audio file to be transcribed, and select the language code of your choice. Currently, Riva supports English, Spanish, German, Russian, and Mandarin.

import io
import IPython.display as ipd
import grpc

import riva.client

auth = riva.client.Auth(uri='server address:port number')

riva_asr = riva.client.ASRService(auth)

# Supports single-channel .wav files in LINEAR_PCM encoding, as well as .alaw, .mulaw, and .flac formats
# read in an audio file from local disk
path = "audio file path"
with io.open(path, 'rb') as fh:
    content = fh.read()
ipd.Audio(path)

# Set up an offline/batch recognition request
config = riva.client.RecognitionConfig()
#config.encoding = riva.client.AudioEncoding.LINEAR_PCM    # Audio encoding can be detected from wav
#config.sample_rate_hertz = 0                               # Sample rate can be detected from wav and resampled if needed
config.language_code = "en-US"                    # Language code of the audio clip
config.max_alternatives = 1                       # How many top-N hypotheses to return
config.enable_automatic_punctuation = True        # Add punctuation when end of VAD detected
config.audio_channel_count = 1                    # Mono channel

response = riva_asr.offline_recognize(content, config)
asr_best_transcript = response.results[0].alternatives[0].transcript
print("ASR Transcript:", asr_best_transcript)

print("nnFull Response Message:")
print(response)

Running the Python streaming ASR client

To run an ASR streaming client, clone the riva python-clients repository and run the file that comes with the repository.

To get the ASR streaming client to work on Windows, clone the repository by running the following command:

git clone https://github.com/nvidia-riva/python-clients.git

From the python-clients/scripts/asr folder, run the following command:

python transcribe_mic.py --server=server address:port number

Here’s the transcribe_mic.py:

import argparse

import riva.client
from riva.client.argparse_utils import add_asr_config_argparse_parameters, add_connection_argparse_parameters

import riva.client.audio_io


def parse_args() -> argparse.Namespace:
    default_device_info = riva.client.audio_io.get_default_input_device_info()
    default_device_index = None if default_device_info is None else default_device_info['index']
    parser = argparse.ArgumentParser(
        description="Streaming transcription from microphone via Riva AI Services",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--input-device", type=int, default=default_device_index, help="An input audio device to use.")
    parser.add_argument("--list-devices", action="store_true", help="List input audio device indices.")
    parser = add_asr_config_argparse_parameters(parser, profanity_filter=True)
    parser = add_connection_argparse_parameters(parser)
    parser.add_argument(
        "--sample-rate-hz",
        type=int,
        help="A number of frames per second in audio streamed from a microphone.",
        default=16000,
    )
    parser.add_argument(
        "--file-streaming-chunk",
        type=int,
        default=1600,
        help="A maximum number of frames in a audio chunk sent to server.",
    )
    args = parser.parse_args()
    return args


def main() -> None:
    args = parse_args()
    if args.list_devices:
        riva.client.audio_io.list_input_devices()
        return
    auth = riva.client.Auth(args.ssl_cert, args.use_ssl, args.server)
    asr_service = riva.client.ASRService(auth)
    config = riva.client.StreamingRecognitionConfig(
        config=riva.client.RecognitionConfig(
            encoding=riva.client.AudioEncoding.LINEAR_PCM,
            language_code=args.language_code,
            max_alternatives=1,
            profanity_filter=args.profanity_filter,
            enable_automatic_punctuation=args.automatic_punctuation,
            verbatim_transcripts=not args.no_verbatim_transcripts,
            sample_rate_hertz=args.sample_rate_hz,
            audio_channel_count=1,
        ),
        interim_results=True,
    )
    riva.client.add_word_boosting_to_config(config, args.boosted_lm_words, args.boosted_lm_score)
    with riva.client.audio_io.MicrophoneStream(
        args.sample_rate_hz,
        args.file_streaming_chunk,
        device=args.input_device,
    ) as audio_chunk_iterator:
        riva.client.print_streaming(
            responses=asr_service.streaming_response_generator(
                audio_chunks=audio_chunk_iterator,
                streaming_config=config,
            ),
            show_intermediate=True,
        )


if __name__ == '__main__':
    main()

Running the C++ ASR offline client using Docker

Here’s how to run a Riva ASR offline client using Docker in C++.

Clone the /cpp-clients GitHub repository by running the following command:

git clone https://github.com/nvidia-riva/cpp-clients.git

Build the Docker image:

DOCKER_BUILDKIT=1 docker build . --tag riva-client

Run the Docker image:

docker run -it --net=host riva-client

Start the Riva speech recognition client:

riva_asr_client --riva_url server address:port number --audio_file audio_sample

Running the C++ ASR streaming client

To run the ASR streaming client riva_asr in C++, you must first compile the C++ sample. This is straightforward with CMake after the following dependencies are met:

  • gflags
  • glog
  • grpc
  • rtaudio
  • rapidjson
  • protobuf
  • grpc_cpp_plugin

Create a build folder within the root source folder. From a terminal in that folder, run cmake .. and then make. For more information, see the readme file included in the repository.

After the sample is compiled, run it by entering the following command:

riva_asr.exe --riva_uri={riva server url}:{riva server port} --audio_device={Input device name, e.g. "plughw:PCH,0"}

  • riva_uri: The address:port value of the Riva server. By default, the Riva server listens on port 50051.
  • audio_device: The input device (microphone) to be used.

The sample implements essentially four steps. Only a few short examples are shown in this post. For more information, see the file streaming_recognize_client.cc.

Open the input stream using the input (microphone) device specified from the command line. In this case, you are using one channel at 16K samples per second and 16 bits per sample.

int StreamingRecognizeClient::DoStreamingFromMicrophone(const std::string& audio_device, bool& request_exit)
{
 nr::AudioEncoding encoding = nr::LINEAR_PCM;
 
 adc.setErrorCallback(rtErrorCallback);

 RtAudio::StreamParameters parameters;
 parameters.nChannels = 1;
 parameters.firstChannel = 0;
 unsigned int sampleRate = 16000;
 unsigned int bufferFrames = 1600; // (0.1 sec of rec) sample frames
 RtAudio::StreamOptions streamOptions;
 streamOptions.flags = RTAUDIO_MINIMIZE_LATENCY;
…

RtAudioErrorType error = adc.openStream( nullptr, &parameters, RTAUDIO_SINT16, sampleRate, &bufferFrames, &MicrophoneCallbackMain, static_cast<void *>(&uData), &streamOptions);

Open the gRPC communication channel with the Riva server using the protocol API interface specified by the .proto files (in the source, in the riva/proto folder):

int StreamingRecognizeClient::DoStreamingFromMicrophone(const std::string& audio_device, bool& request_exit)
{
…

std::shared_ptr<ClientCall> call = std::make_shared<ClientCall>(1, word_time_offsets_);
call->streamer = stub_->StreamingRecognize(&call->context);

// Send first request
nr_asr::StreamingRecognizeRequest request;
auto streaming_config = request.mutable_streaming_config();
streaming_config->set_interim_results(interim_results_);
auto config = streaming_config->mutable_config();
config->set_sample_rate_hertz(sampleRate);
config->set_language_code(language_code_);
config->set_encoding(encoding);
config->set_max_alternatives(max_alternatives_);
config->set_audio_channel_count(parameters.nChannels);
config->set_enable_word_time_offsets(word_time_offsets_);
config->set_enable_automatic_punctuation(automatic_punctuation_);
config->set_enable_separate_recognition_per_channel(separate_recognition_per_channel_);
config->set_verbatim_transcripts(verbatim_transcripts_);
if (model_name_ != "") {
 config->set_model(model_name_);
}

call->streamer->Write(request);

Start sending the audio data received from the microphone to Riva through gRPC messages:

static int MicrophoneCallbackMain( void *outputBuffer, void *inputBuffer, unsigned int nBufferFrames, double streamTime, RtAudioStreamStatus status, void *userData )

Receive the transcription results through gRPC responses from the server:

void
StreamingRecognizeClient::ReceiveResponses(std::shared_ptr<ClientCall> call, bool audio_device)
{
…

while (call->streamer->Read(&call->response)) {  // Returns false when no more to read.
 call->recv_times.push_back(std::chrono::steady_clock::now());

 // Reset the partial transcript
 call->latest_result_.partial_transcript = "";
 call->latest_result_.partial_time_stamps.clear();

 bool is_final = false;
 for (int r = 0; r < call->response.results_size(); ++r) {
   const auto& result = call->response.results(r);
   if (result.is_final()) {
     is_final = true;
   }

…

   call->latest_result_.audio_processed = result.audio_processed();
   if (print_transcripts_) {
     call->AppendResult(result);
   }
 }

 if (call->response.results_size() && interim_results_ && print_transcripts_) {
   std::cout << call->latest_result_.final_transcripts[0] +
                    call->latest_result_.partial_transcript
             << std::endl;
 }
 call->recv_final_flags.push_back(is_final);
}

Resources for developing speech AI applications

By recognizing your voice or carrying out a command, speech AI is expanding from empowering actual humans in contact centers to empowering digital humans in the metaverse.

For more information about how to add speech AI skills to your applications, see the following resources.

For more information about how businesses are deploying Riva in production to improve their services, see the customer stories in Solution Showcase.

Categories
Misc

NVIDIA, Arm, and Intel Publish FP8 Specification for Standardization as an Interchange Format for AI

AI processing requires full-stack innovation across hardware and software platforms to address the growing computational demands of neural networks. A key area to drive efficiency is using lower precision number formats to improve computational efficiency, reduce memory usage, and optimize for interconnect bandwidth.

To realize these benefits, the industry has moved from 32-bit precisions to 16-bit, and now even 8-bit precision formats. Transformer networks, which are one of the most important innovations in AI, benefit from an 8-bit floating point precision in particular. We believe that having a common interchange format will enable rapid advancements and the interoperability of both hardware and software platforms to advance computing.

NVIDIA, Arm, and Intel have jointly authored a whitepaper, FP8 Formats for Deep Learning, describing an 8-bit floating point (FP8) specification. It provides a common format that accelerates AI development by optimizing memory usage and works for both AI training and inference. This FP8 specification has two variants, E5M2 and E4M3. 

This format is natively implemented in the NVIDIA Hopper architecture and has shown excellent results in initial testing. It will immediately benefit from the work being done by the broader ecosystem, including the AI frameworks, in implementing it for developers.

Compatibility and flexibility

FP8 minimizes deviations from existing IEEE 754 floating point formats with a good balance between hardware and software to leverage existing implementations, accelerate adoption, and improve developer productivity. 

E5M2 uses five bits for the exponent and two bits for the mantissa and is a truncated IEEE FP16 format. In circumstances where more precision is required at the expense of some numerical range, the E4M3 format makes a few adjustments to extend the range representable with a four-bit exponent and a three-bit mantissa.

The new format saves additional computational cycles since it uses just eight bits. It can be used for both AI training and inference without requiring any re-casting between precisions. Furthermore, by minimizing deviations from existing floating point formats, it enables the greatest latitude for future AI innovation while still adhering to current conventions.
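
Because E5M2 is simply a truncated FP16, its values can be decoded by zero-padding the missing mantissa bits; the following sketch (NumPy assumed) illustrates the bit layout described above.

import numpy as np

def decode_e5m2(byte_value: int) -> float:
    # E5M2 keeps FP16's sign bit and 5 exponent bits but only the top 2 mantissa
    # bits, so left-aligning the byte into 16 bits recovers the exact value.
    bits = np.array([byte_value << 8], dtype=np.uint16)
    return float(bits.view(np.float16)[0])

print(decode_e5m2(0x3C))   # 1.0 (the FP16 value 0x3C00 truncated to 8 bits)
print(decode_e5m2(0x7B))   # 57344.0, the largest finite E5M2 value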

High-accuracy training and inference 

Testing the proposed FP8 format shows comparable accuracy to 16-bit precisions across a wide array of use cases, architectures, and networks. Results on transformers, computer vision, and GAN networks all show that FP8 training accuracy is similar to 16-bit precisions while delivering significant speedups. For more information about accuracy studies, see the FP8 Formats for Deep Learning whitepaper.

Chart shows the accuracy performance of AI training of language models using 16-bit and FP8 formats. Several network types (Transformer, BERT, and GPT) and multiple networks are tested in each type. The accuracy metrics that are used are PPL and Loss to evaluate performance. The results show that the accuracy of the networks is comparable using either 16-bit or FP8 training.
Figure 1. Language model AI training

In Figure 1, different networks use different accuracy metrics (PPL and Loss), as indicated.

Chart shows the accuracy performance of AI Inference of language models using 16-bit and 8-bit formats. BERT Base and BERT Large are tested. The accuracy metric used to evaluate performance is F1. The results show that the accuracy of the networks is comparable using either 16-bit floating point or FP8 inference, and both outperform INT-8.
Figure 2. Language model AI inference

In MLPerf Inference v2.1, the AI industry’s leading benchmark, NVIDIA Hopper leveraged this new FP8 format to deliver a 4.5x speedup on the BERT high-accuracy model, gaining throughput without compromising on accuracy.

Moving towards standardization

NVIDIA, Arm, and Intel have published this specification in an open, license-free format to encourage broad industry adoption. They will also submit this proposal to IEEE.

By adopting an interchangeable format that maintains accuracy, AI models can operate consistently and performantly across all hardware platforms, helping to advance the state of the art of AI.

Standards bodies and the industry as a whole are encouraged to build platforms that can efficiently adopt the new standard. This will help accelerate AI development and deployment by providing a universal, interchangeable precision.

Categories
Offsites

LOLNeRF: Learn from One Look

An important aspect of human vision is our ability to comprehend 3D shape from the 2D images we observe. Achieving this kind of understanding with computer vision systems has been a fundamental challenge in the field. Many successful approaches rely on multi-view data, where two or more images of the same scene are available from different perspectives, which makes it much easier to infer the 3D shape of objects in the images.

There are, however, many situations where it would be useful to know 3D structure from a single image, but this problem is generally difficult or impossible to solve. For example, it isn’t necessarily possible to tell the difference between an image of an actual beach and an image of a flat poster of the same beach. However, it is possible to estimate 3D structure based on what kind of 3D objects occur commonly and what similar structures look like from different perspectives.

In “LOLNeRF: Learn from One Look”, presented at CVPR 2022, we propose a framework that learns to model 3D structure and appearance from collections of single-view images. LOLNeRF learns the typical 3D structure of a class of objects, such as cars, human faces or cats, but only from single views of any one object, never the same object twice. We build our approach by combining Generative Latent Optimization (GLO) and neural radiance fields (NeRF) to achieve state-of-the-art results for novel view synthesis and competitive results for depth estimation.

We learn a 3D object model by reconstructing a large collection of single-view images using a neural network conditioned on latent vectors, z (left). This allows for a 3D model to be lifted from the image, and rendered from novel viewpoints. Holding the camera fixed, we can interpolate or sample novel identities (right).

Combining GLO and NeRF
GLO is a general method that learns to reconstruct a dataset (such as a set of 2D images) by co-learning a neural network (decoder) and table of codes (latents) that is also an input to the decoder. Each of these latent codes re-creates a single element (such as an image) from the dataset. Because the latent codes have fewer dimensions than the data elements themselves, the network is forced to generalize, learning common structure in the data (such as the general shape of dog snouts).
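
A minimal GLO-style sketch (PyTorch assumed; not the paper's code) looks like the following: a table of per-image latent codes is optimized jointly with a small decoder so that each code reconstructs its own image.

import torch
import torch.nn as nn

num_images, latent_dim, image_dim = 1000, 64, 32 * 32 * 3

latents = nn.Embedding(num_images, latent_dim)   # one learnable code per image
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim),
)
optimizer = torch.optim.Adam(
    list(latents.parameters()) + list(decoder.parameters()), lr=1e-3
)

def train_step(indices, images):
    # indices: (batch,) image ids; images: (batch, image_dim) flattened pixels
    recon = decoder(latents(indices))         # decode each image's own code
    loss = ((recon - images) ** 2).mean()     # reconstruction loss
    optimizer.zero_grad()
    loss.backward()                           # gradients reach both codes and decoder
    optimizer.step()
    return loss.item()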

NeRF is a technique that is very good at reconstructing a static 3D object from 2D images. It represents an object with a neural network that outputs color and density for each point in 3D space. Color and density values are accumulated along rays, one ray for each pixel in a 2D image. These are then combined using standard computer graphics volume rendering to compute a final pixel color. Importantly, all these operations are differentiable, allowing for end-to-end supervision. By enforcing that each rendered pixel (of the 3D representation) matches the color of ground truth (2D) pixels, the neural network creates a 3D representation that can be rendered from any viewpoint.
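
The accumulation along a single ray can be sketched in a few lines (NumPy assumed; a simplification of the rendering NeRF actually performs):

import numpy as np

def render_ray(densities, colors, deltas):
    # densities: (N,), colors: (N, 3), deltas: (N,) spacing between samples
    alphas = 1.0 - np.exp(-densities * deltas)                       # per-sample opacity
    transmittance = np.cumprod(np.concatenate(([1.0], 1.0 - alphas)))[:-1]
    weights = transmittance * alphas                                 # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                   # final pixel color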

We combine NeRF with GLO by assigning each object a latent code and concatenating it with standard NeRF inputs, giving it the ability to reconstruct multiple objects. Following GLO, we co-optimize these latent codes along with network weights during training to reconstruct the input images. Unlike standard NeRF, which requires multiple views of the same object, we supervise our method with only single views of any one object (but multiple examples of that type of object). Because NeRF is inherently 3D, we can then render the object from arbitrary viewpoints. Combining NeRF with GLO gives it the ability to learn common 3D structure across instances from only single views while still retaining the ability to recreate specific instances of the dataset.

Camera Estimation
In order for NeRF to work, it needs to know the exact camera location, relative to the object, for each image. Unless this was measured when the image was taken, it is generally unknown. Instead, we use the MediaPipe Face Mesh to extract five landmark locations from the images. Each of these 2D predictions corresponds to a semantically consistent point on the object (e.g., the tip of the nose or corners of the eyes). We can then derive a set of canonical 3D locations for the semantic points, along with estimates of the camera poses for each image, such that the projection of the canonical points into the images is as consistent as possible with the 2D landmarks.

We train a per-image table of latent codes alongside a NeRF model. Output is subject to per-ray RGB, mask and hardness losses. Cameras are derived from a fit of predicted landmarks to canonical 3D keypoints.
Example MediaPipe landmarks and segmentation masks (images from CelebA).
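
As a rough approximation of the camera-fitting step described above (OpenCV assumed; LOLNeRF actually optimizes the canonical points and poses jointly), a per-image pose can be recovered from the five landmarks with a standard PnP solve:

import numpy as np
import cv2

rng = np.random.default_rng(0)
canonical_points = rng.normal(size=(5, 3)).astype(np.float32)   # stand-in canonical 3D keypoints
K = np.array([[500.0, 0.0, 128.0],
              [0.0, 500.0, 128.0],
              [0.0, 0.0, 1.0]], dtype=np.float32)               # assumed camera intrinsics

# Synthesize 2D landmarks from a known pose so the example is self-contained
true_rvec = np.array([0.1, 0.2, 0.0], dtype=np.float32)
true_tvec = np.array([0.0, 0.0, 5.0], dtype=np.float32)
landmarks_2d, _ = cv2.projectPoints(canonical_points, true_rvec, true_tvec, K, None)

# Estimate the camera pose that best aligns the canonical points with the landmarks
ok, rvec, tvec = cv2.solvePnP(canonical_points, landmarks_2d, K, None, flags=cv2.SOLVEPNP_EPNP)
print(ok, rvec.ravel(), tvec.ravel())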

Hard Surface and Mask Losses
Standard NeRF is effective for accurately reproducing the images, but in our single-view case, it tends to produce images that look blurry when viewed off-axis. To address this, we introduce a novel hard surface loss, which encourages the density to adopt sharp transitions from exterior to interior regions, reducing blurring. This essentially tells the network to create “solid” surfaces, and not semi-transparent ones like clouds.

We also obtained better results by splitting the network into separate foreground and background networks. We supervised this separation with a mask from the MediaPipe Selfie Segmenter and a loss to encourage network specialization. This allows the foreground network to specialize only on the object of interest, and not get “distracted” by the background, increasing its quality.

Results
Surprisingly, we found that fitting only five key points gave camera estimates accurate enough to train a model for cats, dogs, or human faces. This means that given only a single view of your beloved cats Schnitzel, Widget and friends, you can create a new image from any other angle.

Top: example cat images from AFHQ. Bottom: A synthesis of novel 3D views created by LOLNeRF.

Conclusion
We’ve developed a technique that is effective at discovering 3D structure from single 2D images. We see great potential in LOLNeRF for a variety of applications and are currently investigating potential use-cases.

Interpolation of feline identities from linear interpolation of learned latent codes for different examples in AFHQ.

Code Release
We acknowledge the potential for misuse and importance of acting responsibly. To that end, we will only release the code for reproducibility purposes, but will not release any trained generative models.

Acknowledgements
We would like to thank Andrea Tagliasacchi, Kwang Moo Yi, Viral Carpenter, David Fleet, Danica Matthews, Florian Schroff, Hartwig Adam and Dmitry Lagun for continuous help in building this technology.

Categories
Misc

Get up to Speed: Five Reasons Not to Miss NVIDIA CEO Jensen Huang’s GTC Keynote Sept. 20

Think fast. Enterprise AI, new gaming technology, the metaverse and the 3D internet, and advanced AI technologies tailored to just about every industry are all coming your way. NVIDIA founder and CEO Jensen Huang’s keynote at NVIDIA GTC on Tuesday, Sept. 20, is the best way to get ahead of all these trends. NVIDIA’s virtual…

Categories
Misc

AI on the Stars: Hyperrealistic Avatars Propel Startup to ‘America’s Got Talent’ Finals

More than 6 million pairs of eyes will be on real-time AI avatar technology in this week’s finale of America’s Got Talent — currently the second-most popular primetime TV show in the U.S. Metaphysic, a member of the NVIDIA Inception global network of technology startups, is one of 11 acts competing for $1 million and…

Categories
Misc

Using Vulkan SC for Safety-Critical Graphics and Real-time GPU Processing

GPU-accelerated processing is vital to many automotive and embedded systems. Safety-critical and real-time applications have different requirements and deployment priorities than consumer applications, but they often are developed using GPU APIs that have been primarily designed for use in games.

Vulkan SC (Safety Critical) is a newly released open standard to streamline the use of GPUs in markets where functional safety and hitch-free performance are essential.

NVIDIA helped lead the creation of the Vulkan SC 1.0 API and is now shipping production drivers on its NVIDIA DRIVE and NVIDIA Jetson platforms.

Deterministic GPU processing

Vulkan is a royalty-free open standard from the Khronos Group standards organization. It is the only modern, cross-platform GPU API. Launched in 2016, Vulkan is primarily designed for use in games and professional design applications on desktop and mobile devices using Windows, Linux, and Android. 

Khronos derived Vulkan SC from Vulkan 1.2, with the Vulkan SC 1.0 specification being released in March 2022. Vulkan SC defines the subset of the Vulkan API that is essential for embedded markets in order to reduce API surface area for streamlined implementation and testing. 

Vulkan SC also increases API robustness by eliminating ignored parameters and undefined behaviors, and enhancing detection, reporting, and correction of run-time faults. Vulkan SC enables predictable, hitch-free execution by moving pipeline compilation offline, and providing sophisticated functionality for managing static memory allocation and resource management with explicit synchronization.

For more information, see Vulkan SC: Overview – and how it is different from the Vulkan you already know.

Vulkan SC and the NVIDIA DRIVE automotive platform

The streamlined Vulkan SC API reduces the cost and effort of system-level safety certification to standards such as ISO 26262, a functional safety standard used in the automotive industry. Simplifying system certification enables manufacturers to smoothly deploy advanced graphics capabilities in driver assistance systems on the NVIDIA DRIVE platform.

For example, Level 2 and Level 3 AI-assisted vehicles require the driver to remain in the loop during vehicle operation. Safe visualization inside the cockpit and the digital instrument cluster is key to ensuring the human driver is aware of how the system is perceiving and reacting to the surrounding environment. 

The confidence view is a rendering of the mind of the vehicle’s AI and how it sees the world. It shows exactly what the sensor suite and perception system are detecting in real time using a 3D surround model. By incorporating this view in the cabin interior, the vehicle can communicate to its occupants the accuracy and reliability of the autonomous driving system at every step of the journey.

The ability to support such in-vehicle graphics safely and securely is what makes Vulkan SC critical to the next-generation intelligent vehicle experience. Production Vulkan SC 1.0 drivers are included in DRIVE OS 6.0.4.0, which shipped August 29, 2022.

Vulkan SC on the NVIDIA Jetson embedded platform

NVIDIA Jetson is the world’s leading platform for autonomous machines and other embedded applications. It includes Jetson modules, which are small form-factor, high-performance computers, the NVIDIA JetPack SDK for accelerating software, and an ecosystem with sensors, SDKs, services, and products to speed development.

Applications for Jetson-based systems typically do not require formal safety certification. However, many embedded and autonomous systems can directly benefit from the deterministic, real-time GPU graphics and compute acceleration provided by Vulkan SC. With these capabilities, the Jetson platform can support a broader diversity of applications.

The NVIDIA JetPack 5.0.2 SDK, released on August 15, 2022, includes conformant, production Vulkan SC 1.0 drivers for the Linux OS.

Ongoing NVIDIA commitment to the Vulkan SC API

NVIDIA will continue to invest in the evolution of the Vulkan SC open standard API at Khronos. We are committed to providing conformant, production drivers on platforms such as NVIDIA DRIVE and Jetson.

Later in 2022, NVIDIA will also ship support for Vulkan SC in NVIDIA Nsight developer tools. Vulkan SC streamlines the open, cross-platform Vulkan API for deterministic GPU graphics and compute, enabling advanced applications and use cases on safety-certified and real-time embedded platforms.

Now, NVIDIA provides industry-leading support for this groundbreaking open standard, enabling GPU acceleration in new classes of products. Download the latest NVIDIA DRIVE or NVIDIA JetPack releases with Vulkan SC drivers today.

Categories
Misc

Top Manufacturing Sessions at GTC 2022

Discover the latest innovations in manufacturing and aerospace with GTC sessions from leaders at Siemens, Boeing, BMW, and more.
