Categories
Misc

Continental and AEye Join NVIDIA DRIVE Sim Sensor Ecosystem, Providing Rich Capabilities for AV Development

Autonomous vehicle sensors require the same rigorous testing and validation as the car itself, and one simulation platform is up to the task. Global tier-1 supplier Continental and software-defined lidar maker AEye announced this week at NVIDIA GTC that they will migrate their intelligent lidar sensor model into NVIDIA DRIVE Sim. The companies are the…

Categories
Misc

Go Hands On: Logitech G CLOUD Launches With Support for GeForce NOW

When it rains, it pours. And this GFN Thursday brings a downpour of news for GeForce NOW members. The Logitech G CLOUD is the latest gaming handheld device to support GeForce NOW, giving members a brand new way to keep the gaming going. But that’s not all: Portal with RTX joins GeForce NOW in November…

Categories
Misc

AV1 Encoding and FRUC: Video Performance Boosts and Higher Fidelity on the NVIDIA Ada Architecture

Announced at GTC 2022, the next generation of NVIDIA GPUs—the NVIDIA GeForce RTX 40 series, NVIDIA RTX 6000 Ada Generation, and NVIDIA L40 for data center—are built with the new NVIDIA Ada Architecture.

The NVIDIA Ada Architecture features third-generation ray tracing cores, fourth-generation Tensor Cores, multiple video encoders, and a new optical flow accelerator.

To enable you to fully harness the new hardware upgrades, NVIDIA is announcing accompanying updates to the Video Codec SDK and Optical Flow SDK.

NVIDIA Video Codec SDK 12.0

AV1 is the state-of-the-art video coding format, offering both substantial performance boosts and higher fidelity compared to H.264, the most widely used standard. Support for AV1 decoding was added to the Video Codec SDK with the NVIDIA Ampere Architecture. Now, with Video Codec SDK 12.0, NVIDIA Ada-generation GPUs also support AV1 encoding.

Line chart of PSNR by bit rate shows that AV1 supports higher-quality video at a lower bit rate compared to H.264.
Figure 1. PSNR compared to bit rate for AV1 and H.264

Hardware-accelerated AV1 encoding is a major milestone in the transition to AV1 as the new standard video format. Figure 1 shows how AV1 bit-rate savings translate into impressive performance boosts and higher-fidelity images.

PSNR (peak signal-to-noise ratio) is a measure of video quality. To achieve 42 dB PSNR, AV1 video requires a bit rate of about 7 Mbps, while H.264 requires upwards of 12 Mbps. Across all resolutions, AV1 encoding is on average 40% more efficient than H.264. This fundamental performance difference opens the door for AV1 to support higher-quality video, increased throughput, and high dynamic range (HDR).
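
For reference, the quality metric in Figure 1 can be computed directly from the mean squared error between a reference frame and its encoded-then-decoded counterpart. The following is a minimal NumPy sketch of the standard PSNR formula; the 8-bit peak value of 255 and the synthetic test frames are assumptions for illustration, and this is not part of the Video Codec SDK.

import numpy as np

def psnr(reference: np.ndarray, reconstructed: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between a reference frame and a decoded frame."""
    mse = np.mean((reference.astype(np.float64) - reconstructed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

# Illustrative example: an 8-bit 1080p luma plane and a lightly distorted copy
ref = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
rec = np.clip(ref.astype(np.int16) + np.random.randint(-2, 3, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, rec):.2f} dB")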

Bar chart shows that at 2160p, AV1 has a 1.45x bit-rate saving compared to NVENC H.264.
Figure 2. Bit-rate saving for AV1 compared to H.264

As Figure 2 shows, at 1440p and 2160p, NVENC AV1 is 1.45x more efficient than NVENC H.264. This new performance headroom enables higher image quality than ever, up to and including 8K.

The benefits of AV1 are best realized in unison with the multi-encoder design featured on the NVIDIA Ada Architecture. New in Video Codec SDK 12.0, on chips with multiple NVENC encoders the processing load is distributed evenly across all encoders in parallel, dramatically reducing encoding times. Multiple encoders combined with the AV1 format allow NVIDIA Ada GPUs to encode 8K 60 fps video in real time.

AV1 encoding across multiple hardware NVENC encoders enables the next generation of video performance and fidelity: broadcasters can achieve higher livestream resolutions and video editors can export video at 2x speed, all through the Video Codec SDK.

NVIDIA Video Codec SDK 12.0 will be available to download from the NVIDIA Developer Center in October 2022.

NVIDIA Optical Flow 4.0

The new NVIDIA Optical Flow SDK 4.0 release introduces engine-assisted frame rate up conversion (FRUC). FRUC generates higher frame-rate video from lower frame-rate video by inserting interpolated frames computed from optical flow vectors. The resulting video shows smoother continuity of motion across frames, improving both playback smoothness and perceived visual quality.

The NVIDIA Ada Lovelace Architecture has a new optical flow accelerator, NVOFA, that is 2.5x more performant than the NVIDIA Ampere Architecture NVOFA. It provides a 15% quality improvement on popular benchmarks including KITTI and MPI Sintel.

The FRUC library uses the NVOFA and CUDA to interpolate frames significantly faster than software-only methods. It also works seamlessly with custom DirectX or CUDA applications, making it easy for developers to integrate.

Diagram shows four frames of a low frame-rate video being interleaved with interpolated frames to make a high frame-rate video.
Figure 3. Frame rate up conversion

The Optical Flow SDK 4.0 includes the FRUC library and sample application, in addition to basic Optical Flow sample applications. The FRUC library exposes NVIDIA FRUC APIs that take two consecutive frames and return an interpolated frame in between them. These APIs can be used for the up-conversion of any video.

Frame interpolation using the FRUC library is extremely fast compared to other software-only methods. The APIs are easy to use, and support ARGB and NV12 input surface formats. It can be directly integrated into any DirectX or CUDA application.

The sample application source code included in the SDK demonstrates how to use FRUC APIs for video FRUC. This source code can be reused or modified as required to build a custom application.

The Video 1 sample was created using the FRUC library. As you can see, the motion of foreground objects and background appears much smoother in the right video compared to the left video.

Video 1. Side-by-side comparison of original video and frame rate up-converted video. (left) Original video played at 15 fps. (right) Frame rate up-converted video played at 30 fps. Video created using the FRUC library. (Source: http://ultravideo.fi/#testsequences)

Inside the FRUC library

Here is a brief explanation of how the FRUC library processes a pair of frames and generates an interpolated frame.

A pair of consecutive frames (previous and next) is input to the FRUC library (Figure 4).

GIF image shows the previous and next frames of a horse and rider.
Figure 4. Consecutive frames used as input

Using the NVIDIA Optical Flow APIs, forward and backward flow vector maps are generated.

GIF image shows forward and backward flow vector maps.
Figure 5. Forward and backward flow vector maps

Flow vectors in the maps are then validated using a forward-backward consistency check, and vectors that fail the check are rejected. The black portions in Figure 6 correspond to flow vectors that did not pass the forward-backward consistency check.

Picture shows black spots for rejected flow vectors.
Figure 6. Validated and rejected flow vectors
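
The check itself can be sketched in a few lines. The following NumPy sketch illustrates the idea only and is not the FRUC library implementation: a forward flow vector is kept only if following it into the next frame and then following the backward flow at that location approximately returns to the starting pixel. The array shapes and the one-pixel threshold are assumptions for illustration.

import numpy as np

def forward_backward_check(flow_fwd, flow_bwd, threshold=1.0):
    """flow_fwd, flow_bwd: (H, W, 2) arrays of (dx, dy) vectors. Returns a boolean validity mask."""
    h, w, _ = flow_fwd.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Where does each pixel land in the next frame under the forward flow?
    xt = np.clip(np.round(xs + flow_fwd[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fwd[..., 1]).astype(int), 0, h - 1)
    # The backward flow sampled at the landing position should roughly cancel the forward flow
    residual = flow_fwd + flow_bwd[yt, xt]
    return np.linalg.norm(residual, axis=-1) < threshold  # False entries match the black (rejected) regions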

Using available flow vectors and advanced CUDA accelerated techniques, more accurate flow vectors are generated to fill in the rejected flow vectors. Figure 7 shows the infilled flow vector map generated.

Picture shows the rejected flow vectors filled in with other colors.
Figure 7. Infilled flow vector map
Image shows a closeup of pixel regions without valid color on the interpolated frame.
Figure 8. New interpolated frame with gray regions

Using the complete flow vector map between the two frames, the algorithm generates an interpolated frame between the two input frames. This image may contain a few holes (pixels that do not have a valid color). Figure 8 shows a few small gray regions near the head of the horse and in the sky that are holes.

Holes in the interpolated frame are filled using image domain hole infilling techniques to generate the final interpolated image. This is the output of the FRUC library.

Image shows the interpolated frame with pixel holes filled in.
Figure 9. Output of the FRUC library

The calling application can interleave this interpolated frame with the original frames to increase the frame rate of a video or game. Figure 10 shows the interpolated frame interleaved between the previous and next frames.

GIF shows interpolated frame between original two-frame GIF.
Figure 10. Interpolated frame interleaved
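
As a concrete illustration of this last step, here is a minimal sketch of how a calling application might interleave interpolated frames with the original frames to double the frame rate. The interpolate argument stands in for a call into the FRUC library and is a placeholder, not an actual FRUC API.

def double_frame_rate(frames, interpolate):
    """frames: list of decoded frames; interpolate(prev, next) returns the in-between frame."""
    output = []
    for prev_frame, next_frame in zip(frames, frames[1:]):
        output.append(prev_frame)
        output.append(interpolate(prev_frame, next_frame))  # e.g., a call into the FRUC library
    output.append(frames[-1])
    return output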

Lastly, to expand the platforms that can harness the NVOFA hardware, Optical Flow SDK 4.0 also introduces support for Windows Subsystem for Linux.

Harness the NVIDIA Ada Architecture and the FRUC library when the NVIDIA Optical Flow SDK 4.0 is available in October. If you have any questions, contact Video DevTech Support.

Categories
Misc

Develop for All Six NVIDIA Jetson Orin Modules with the Power of One Developer Kit

With the Jetson Orin Nano announcement this week at GTC, the entire Jetson Orin module lineup is now revealed. With up to 40 TOPS of AI performance, Orin Nano modules set the new standard for entry-level AI, just as Jetson AGX Orin is already redefining robotics and other autonomous edge use cases with 275 TOPS of server class compute.

All Jetson Orin modules and the Jetson AGX Orin Developer Kit are based on a single SoC architecture with an NVIDIA Ampere Architecture GPU, a high-performance CPU, and the latest accelerators. This shared architecture means you can develop software for one Jetson Orin module and then easily deploy it to any of the others.

You can begin development today for any Jetson Orin module using the Jetson AGX Orin Developer Kit. The developer kit’s ability to natively emulate performance for any of the modules lets you start now and shorten your time to market. The developer kit can accurately emulate the performance of any Jetson Orin modules by configuring the hardware features and clocks to match that of the target module.

Development teams benefit from the simplicity of needing only one type of developer kit, irrespective of which modules are targeted for production. This also simplifies CI/CD infrastructure. Whether you are developing for robotics, video analytics, or any other use case, the capability of this one developer kit brings many benefits.

Transform the Jetson AGX Orin Developer Kit into any Jetson Orin module

With one step, you can transform a Jetson AGX Orin Developer Kit into any one of the Jetson Orin modules. We provide flashing configuration files for this process.

Emulating a Jetson Orin module on the Jetson AGX Orin Developer Kit follows the same steps used to flash a Jetson AGX Orin Developer Kit with the flashing utilities. After placing your developer kit in Force Recovery Mode, the flash.sh command-line tool is used to flash it with a new image. For example, the following command flashes the developer kit with its default configuration:

$ sudo ./flash.sh jetson-agx-orin-devkit mmcblk0p1

Modify this command with the name of the flash configuration appropriate for the Jetson Orin module being emulated. For example, to emulate a Jetson Orin NX 16GB module, use the following command:

$ sudo ./flash.sh jetson-agx-orin-devkit-as-nx-16gb mmcblk0p1

Table 1 lists the Jetson Orin modules and the flash.sh command appropriate for each.

Jetson Orin module to be emulated    Flashing command
Jetson AGX Orin 64GB                 sudo ./flash.sh jetson-agx-orin-devkit mmcblk0p1
Jetson AGX Orin 32GB                 sudo ./flash.sh jetson-agx-orin-devkit-as-jao-32gb mmcblk0p1
Jetson Orin NX 16GB                  sudo ./flash.sh jetson-agx-orin-devkit-as-onx16gb mmcblk0p1
Jetson Orin NX 8GB                   sudo ./flash.sh jetson-agx-orin-devkit-as-onx8gb mmcblk0p1
Jetson Orin Nano 8GB*                sudo ./flash.sh jetson-agx-orin-devkit-as-nano8gb mmcblk0p1
Jetson Orin Nano 4GB*                sudo ./flash.sh jetson-agx-orin-devkit-as-nano4gb mmcblk0p1
Table 1. flash.sh commands for Jetson Orin modules (*Jetson Orin Nano configurations require the overlay described below)

Flash configurations for Jetson Orin Nano modules are not yet included in NVIDIA JetPack, as of version 5.0.2. Use these new configurations after downloading them and applying an overlay patch on top of NVIDIA JetPack 5.0.2 per the instructions found inside the downloaded file.

For more information about the flashing configurations useful for emulation, see Emulation Flash Configurations.

After flashing finishes, complete the initial boot and configuration. Then you can install the rest of the NVIDIA JetPack components using SDK Manager, or simply use a package manager on the running developer kit:

sudo apt update
sudo apt install nvidia-jetpack

Now you have the developer kit running and NVIDIA JetPack installed. Your Jetson AGX Orin Developer Kit now emulates the performance and power of the specified Jetson Orin module.

Accurately emulate any Jetson Orin module

This native emulation is accurate because it configures the developer kit to match the clock frequencies, number of GPU and CPU cores, and hardware accelerators available with the target module.

For example, when emulating the Jetson Orin NX 16GB module:

  • The developer kit GPU is configured with 1024 CUDA cores and 32 Tensor Cores with a max frequency of 918 MHz.
  • The CPU complex is configured with eight Arm Cortex-A78AE cores running at 2 GHz.
  • The DRAM is configured to 16 GB with a bandwidth of 102 GB/s.
  • The system offers the same power profiles supported by the Jetson Orin NX 16GB module.
Screenshot shows that the available power modes on a developer kit flashed to emulate Jetson Orin NX 16GB matches the power modes available on Jetson Orin NX 16GB.
Figure 1. Available power modes

Open the Jetson Power graphical user interface from the top menu on the desktop and you can see that the system has been configured to match the target module being emulated. Max clocks can be configured by running the following command, and the Jetson Power graphical user interface will reflect the change.

sudo jetson_clocks

Figure 2 shows the Jetson Power graphical user interface after configuring max clocks when the Jetson AGX Orin Developer Kit is flashed to an emulated Jetson AGX Orin 64GB module compared to when flashed to emulate a Jetson Orin NX 16GB module.

Screenshot shows two sets of various system configurations adapted to the target modules.
Figure 2. Jetson Power graphical user interface shown on a developer kit flashed to emulate Jetson AGX Orin 64GB (left) and Jetson Orin NX 16GB with MAXN power mode selected (right)

By running various samples provided with NVIDIA JetPack, you can see that the performance is adjusted to match that of the module being emulated. For example, the benchmarking sample packaged with the VPI library can be used to show CPU, GPU, and PVA performance for Jetson AGX Orin 64GB, Jetson Orin NX 16GB, and Jetson Orin Nano 8GB modules after configuring the Jetson AGX Orin Developer Kit to emulate the respective module.

To run the VPI benchmarking sample, use the following commands:

cd /opt/nvidia/vpi2/samples/05-benchmark
sudo cmake .
sudo make
sudo ./vpi_sample_05_benchmark 

The VPI benchmarking sample outputs the latency in milliseconds for the Gaussian algorithm. Table 2 shows the results for each of the targeted modules.

Algorithm: 5x5 Gaussian filter; input image size: 1920 x 1080; input format: U16

Latency (ms)     Emulated as Jetson AGX Orin 64GB   Emulated as Jetson Orin NX 16GB   Emulated as Jetson Orin Nano 8GB
Running on CPU   0.331                              0.492                             0.838
Running on GPU   0.065                              0.143                             0.210
Running on PVA   1.169                              1.888                             n/a
Table 2. Latency in milliseconds for targeted modules

Similarly, you can run multimedia samples for encode and decode.

For decode, run the following commands:

cd /usr/src/jetson_multimedia_api/samples/00_video_decode
sudo make
sudo ./video_decode H264 --disable-rendering --stats --max-perf 

For encode, run the following commands:

cd /usr/src/jetson_multimedia_api/samples/01_video_encode
sudo make
sudo ./video_encode input.yuv 1920 1080 H264 out.h264 -fps 30 1 -ifi 1000 -idri 1000 --max-perf --stats

Table 3 reports the FPS numbers after running these encode and decode samples using H.264 1080P 30FPS video streams.

Sample   Emulated as Jetson AGX Orin 64GB   Emulated as Jetson Orin NX 16GB   Emulated as Jetson Orin Nano 8GB
Encode   178                                142                               110*
Decode   400                                374                               231
Table 3. FPS numbers after running encode and decode samples

*Jetson Orin Nano does not include an NVENC hardware encoder. For Table 3, encoding on Jetson Orin Nano was done on the CPU using ffmpeg: 110 FPS was achieved with four CPU cores, 73 FPS with two CPU cores, and 33 FPS with a single CPU core.

To demonstrate the accuracy of emulation, we ran several AI model benchmarks on the Jetson AGX Orin Developer Kit emulated as Jetson AGX Orin 32GB and compared them with results obtained by running the same benchmarks on a real Jetson AGX Orin 32GB module. As Table 4 shows, the difference between emulated and real performance is insignificant.

Although the Jetson AGX Orin Developer Kit includes a 32GB module, it delivers the same level of performance, with 275 TOPS, as the Jetson AGX Orin 64GB. No special flashing configuration is required to emulate Jetson AGX Orin 64GB, but you must use the appropriate flashing configuration to emulate Jetson AGX Orin 32GB on the Jetson AGX Orin Developer Kit.

Model                    Jetson AGX Orin 32GB (emulated)   Jetson AGX Orin 32GB (real)
PeopleNet (V2.5)         327                               320
Action Recognition 2D    1161                              1151
Action Recognition 3D    70                                71
LPR Net                  2776                              2724
Dashcam Net              1321                              1303
BodyPose Net             359                               363
Table 4. Performance comparison between real and emulated Jetson AGX Orin 32GB modules

Do end-to-end development for any Jetson Orin module

You can work with the entire Jetson software stack while emulating a Jetson Orin module. Frameworks such as NVIDIA DeepStream, NVIDIA Isaac, and NVIDIA Riva work in emulation mode, and tools like TAO Toolkit perform as expected with pretrained models from NGC. The software stack is agnostic of the emulation and the performance accurately matches that of the target being emulated.

Diagram shows the stack architecture with the NVIDIA JetPack SDK, frameworks such as NVIDIA DeepStream, NVIDIA Isaac, and NVIDIA Riva, and NVIDIA tools like Train-Adapt-Optimize and pretrained models.
Figure 4. NVIDIA Jetson software stack

Whether you are building a robotics use case or a vision AI pipeline, you can do end-to-end development today for any Jetson Orin module using the Jetson AGX Orin Developer Kit and emulation mode.

Develop robotics applications with NVIDIA Isaac ROS for any Jetson Orin module. Just use the right flashing configuration to flash and start your ROS development. Figure 5 shows running Isaac ROS Stereo Disparity DNN on the Jetson AGX Orin Developer Kit emulated as Jetson Orin Nano 8GB.

Stereo disparity is running on a warehouse scene where a front loader is traveling across the screen.
Figure 5. NVIDIA Isaac ROS Stereo Disparity DNN running on Jetson AGX Orin Developer Kit emulated as Jetson Orin Nano 8GB

Develop vision AI pipelines using DeepStream on the Jetson AGX Orin Developer Kit for any Jetson Orin module. It just works!

Figure 6 shows an IVA pipeline running people detection using DeepStream on the Jetson AGX Orin Developer Kit emulated as Jetson Orin Nano 8GB with four streams of H.265 1080P 30FPS.

People are detected with a blue bounding box and cars with a red bounding box.
Figure 6. DeepStream vision pipeline running people and car detection running on Jetson AGX Orin Developer Kit emulated as Jetson Orin Nano 8GB

Get to market faster with the Jetson AGX Orin Developer Kit 

With the emulation support, you can get to production faster by starting and finishing your application development on the Jetson AGX Orin Developer Kit. Buy the kit and start your development. We will also cover emulation in detail in the upcoming NVIDIA JetPack 5.0.2 webinar. Register for the webinar today!

Categories
Misc

Detecting Threats Faster with AI-Based Cybersecurity

Network traffic continues to increase, as the number of Internet users across the globe reached 5 billion in 2022 and continues to rise. As the number of users expands, so does the number of connected devices, which is expected to grow into the trillions.

The ever-increasing number of connected users and devices leads to an overwhelming amount of data generated across the network. According to IDC, data is growing exponentially every year, and the world is projected to generate 179.6 zettabytes of data annually by 2025. This equates to an average of 493 exabytes of data generated per day.

All this data and network traffic poses a cybersecurity challenge. Enterprises are generating more data than they can collect and analyze, and the vast majority of the data coming in goes untapped.

Without tapping into this data, an enterprise can’t build robust and rich models and detect abnormal deviations in their environment. The inability to examine this data leads to undetected security breaches, long remediation times, and ultimately huge financial losses for the company being breached.

With cyberattack attempts per week rising by an alarming 50% in 2021, cybersecurity teams must find ways to better protect these vast networks, data, and devices.

To address the cybersecurity data problem, security teams might implement smart sampling, or smart filtering, where they either analyze a subset of the data or filter out data deemed insignificant. These methods are typically adopted because it is cost-prohibitive and extremely challenging to analyze all the data across a network.

Companies may not have the infrastructure to process that scale of data or to do so in a timely manner. In fact, it takes 277 days on average to identify and contain a breach. To provide the best protection against cyberthreats, analyzing all the data quickly yields better results.

The NVIDIA Morpheus GPU-accelerated cybersecurity AI framework makes it possible, for the first time, to inspect all network traffic in real time, addressing the cybersecurity data problem at a scale previously impossible.

With Morpheus, you can build optimized AI pipelines to filter, process, and classify these large volumes of real-time data, enabling cybersecurity analysts to detect and remediate threats faster.

New visualization capabilities help pinpoint threats faster

The latest release of NVIDIA Morpheus provides visualizations for cybersecurity data, enabling cybersecurity analysts to detect and remediate threats more efficiently. Previously, cybersecurity analysts would have examined large amounts of raw data, potentially parsing through hundreds of thousands of events per week, looking for anomalies.

Morpheus includes several prebuilt, end-to-end workflows to address different cybersecurity use cases. Digital fingerprinting is one of the prebuilt workflows, designed to analyze the behavior of every human and machine across the network to detect anomalous behavior. 

The Morpheus digital fingerprinting pretrained model enables up to 100% data visibility and uniquely fingerprints every user, service, account, and machine across the enterprise data center. It uses unsupervised learning to flag when user and machine activities shift.

The digital fingerprinting workflow includes fine-tunable explainability, providing metrics behind highlighted anomalies, and thresholding to determine when certain events should be flagged. Both are customizable for your environment.
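
To make the idea of per-entity thresholding concrete, here is a conceptual sketch (plain NumPy, not the Morpheus API) that scores incoming events against an entity's historical baseline and flags those that deviate beyond a configurable threshold. The z-score statistic and the threshold value are simplifying assumptions for illustration.

import numpy as np

def flag_anomalies(baseline_values, new_values, z_threshold=4.0):
    """baseline_values: historical feature values for one user or machine.
    new_values: incoming feature values for the same entity.
    Returns a boolean array marking events that deviate from the learned baseline."""
    mean = np.mean(baseline_values)
    std = np.std(baseline_values) + 1e-9  # avoid division by zero for constant baselines
    z_scores = np.abs((np.asarray(new_values) - mean) / std)
    return z_scores > z_threshold

# Example: a user who normally logs in 3-5 times a day suddenly logs in 40 times
history = [3, 4, 5, 4, 3, 4, 5, 4]
today = [4, 40]
print(flag_anomalies(history, today))  # [False  True]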

Digital fingerprinting also now includes a new visualization tool that provides insights to a security analyst on deviations from normal behavior, including how it has deviated, and what is related to that deviation. Not only do analysts get an issue alert, they can quickly dive into the details and determine a set of actionable next steps.

This gives organizations orders of magnitude improvement in data analysis, potentially reducing the time to detect a threat from weeks to minutes for certain attack patterns.

Figure 1 shows a closer look at the visualization for the digital fingerprinting use case in Morpheus. This example looks at cybersecurity data on a massive scale: tens of thousands of users, where each hexagon represents events related to a user over a period of time. No human can keep track of this many users.

Screenshot shows a visual representation of users and events across the network, as hexagons colored in a range from bright yellow to dark red.
Figure 1. NVIDIA Morpheus visualization for digital fingerprinting workflow

NVIDIA Morpheus has parsed and prioritized the data so it’s easy to see when an anomaly has been identified. In the visualization, the data is organized such that the most important data is at the top and colors indicate the anomaly score: darker colors are good and lighter colors are bad. A security analyst can easily identify an anomaly because it’s prioritized and easy to spot. The security analyst can select a light-colored hexagon and quickly view data associated with the event.

With NVIDIA Morpheus, AI performs massive data filtration and reduction, surfacing critical behavior anomalies as they propagate throughout the network. It can provide security analysts with more context around individual events to help connect the dots to other potentially malicious activity.

NVIDIA Morpheus digital fingerprinting workflow in action

The following video shows a breach. With Morpheus, you can reduce hundreds of millions of events per week to just 8–10 potentially actionable items to investigate each day. This cuts the time to detect threats from weeks to minutes for certain attack patterns.

Video 1. NVIDIA Morpheus digital fingerprinting workflow deployed across an enterprise of 25K employees

Morpheus helps keep sensitive information safe

Another prebuilt workflow included with Morpheus is sensitive information detection, to help find and classify leaked credentials, keys, passwords, credit card numbers, bank account numbers, and more.

The sensitive information detection workflow for Morpheus now includes a visual graph-based explainer to enable security analysts to spot leaked sensitive data more easily. In the visualization for sensitive information detection, you see a representation of a network, where dots are servers and lines are packets flowing between the servers.

With Morpheus deployed, AI inferencing is enabled across the entire network. The sensitive information detection model is trained to identify sensitive information, such as AWS credentials, GitHub credentials, private keys, and passwords. If any of these are observed in the packet, they appear as red lines.

The AI model in Morpheus searches through every packet, continually flagging when it encounters sensitive data. Rather than using pattern matching, this is done with a deep neural network trained to generalize and identify patterns beyond static rulesets.

Notice all the individual lines; you can see how quickly a human can become overwhelmed by all of the data coming in. With the visualization capabilities in Morpheus, you immediately see the lines that represent leaked sensitive information. Hovering over one of the red lines shows information about the credential, making it easier to triage and remediate. 

With Morpheus, cybersecurity applications can integrate and collect information for automated incident management and action prioritization. To speed recovery, the originating servers, destination servers, exposed credentials, and even the raw data is available.

Video 2. NVIDIA Morpheus visualization for sensitive information detection workflow

Multi-process pipeline support enables new cybersecurity workflows

Multi-process pipeline support enables Morpheus to support new cybersecurity workflows, which can be intelligently batched to reduce latency. For example, a cybersecurity workflow with both deep learning and machine learning might use the same data, but with different derived features. The ensemble must ultimately come together, but machine learning is much faster than deep learning. Morpheus can now dynamically batch throughout multiple pipelines to optimize end-to-end times and minimize latency.

Enabling new AI-based cybersecurity solutions

With Morpheus, cybersecurity practitioners can access prebuilt AI workflows such as digital fingerprinting, sensitive information detection, and more:

  • Crypto-mining malware detection
  • Phishing detection
  • Fraudulent transaction and identity detection
  • Ransomware detection

Morpheus enables cybersecurity developers and ISVs to build AI-based solutions. It includes developer toolkits and fine-tuning scripts to make it easy to integrate Morpheus into existing models. NVIDIA also partners with leading systems integrators who are enabling any organization to leverage AI-based cybersecurity.

Democratizing AI-based cybersecurity

Morpheus enables enterprises to develop AI-based cybersecurity tools more easily and better protect data centers. Systems integrators and cybersecurity vendors are using Morpheus to build more advanced, higher-performing cybersecurity solutions to take to organizations across every industry.

Best Buy

Best Buy deployed Morpheus on NVIDIA DGX to improve phishing detection capabilities and accelerate proactive response. Their deployment of Morpheus for the phishing detection use case enabled them to increase suspicious message detection by 15%.

Booz Allen Hamilton

Booz Allen Hamilton is helping to better enable incident response teams, particularly those tasked with threat hunting at the tactical edge. They’ve developed a highly customized, GPU-accelerated Cyber Precog platform that integrates operationally honed cyber tooling, AI models, and modular pipelines for rapid capability deployment.

Built using the NVIDIA Morpheus framework, Cyber Precog offers an initial suite of core capabilities along with a flexible software fabric for developing, testing, and deploying new GPU-accelerated analytics for incident response.

During incident response, operators may have to evaluate data on a disconnected edge network under conditions in which they cannot exfiltrate data, so they bring with them an uncompromised flyaway kit to securely access cyberdata.

Using NVIDIA GPUs and Morpheus, Cyber Precog enables up to a 300x speedup in data ingest and pipeline, 32x faster training, and 24x faster inference compared to CPU-based solutions. Booz Allen benchmarks show that a single NVIDIA GPU-accelerated server replaces up to 135 CPU-only server nodes, providing expedited decision-making for cyber operators.

The Cyber Precog platform is available to public and private sector customers.

CyberPoint

CyberPoint is focused on zero trust across a spectrum of cybersecurity use cases with dozens of mission partners and networks across various organizations, making analysis incredibly challenging.

Delivering AI-based solutions to identify threat entities and malicious behavior is critical to security operations center analysts, enabling them to pivot and focus on only the most salient threats.

Using NVIDIA Morpheus, they’ve built user behavior models to help analysts identify threats on live data as it comes in. They’ve developed their own stages within Morpheus to fit their use cases, leveraging graph neural networks and natural language processing models, and integrating them with Graphistry to provide 360-degree views of users and devices.

By using Morpheus, CyberPoint has seen a 17x speedup for their cybersecurity workflows.

IntelliGenesis

IntelliGenesis has a flyaway kit built on NVIDIA Morpheus that is designed to be environment-agnostic and tailored for immediate detection and remediation at the edge. They’ve built an enterprise solution to conduct AI-based, real-time threat detection at scale. It’s customizable, yet simple enough for any level of data scientist or domain expert to use. Using Morpheus and GPU acceleration, they immediately saw dramatic performance gains.

Splunk

Splunk created a Copilot for Splunk SPL, enabling users to write a description of what they want to achieve in plain English and get suggested queries to execute. The Splunk team spoke about this at .conf22 and, notably, there were many machine learning engineers in the crowd. The feedback was overwhelmingly positive and indicated that we’re only scratching the surface of what can be done with NLP today.

At first blush, this may not seem like a cybersecurity project. However, in implementing this, they’re able to identify sensitive information leakage, which is a stellar example of Morpheus’s flexibility in extracting insights from data. Using Morpheus, Splunk achieved 5–10x speedups to their pipeline.

World Wide Technology

World Wide Technology (WWT) is using Morpheus and NVIDIA converged accelerators for their AI Defined Networking (AIDN) solution. AIDN expands upon existing IT monitoring infrastructure to observe and correlate telemetry, system, and application data points over time to build actionable insights and alert network operators. Alerts are then used as event triggers for scripted actions, allowing AIDN to assist operators with repetitive tasks, such as ticket submission.

Morpheus at GTC and beyond

For more information about how NVIDIA and our partners are helping to address cybersecurity challenges with AI, add the following GTC sessions to your calendar:

  • Learn About the Latest Developments with AI-Powered Cybersecurity:  Learn about the latest innovations available with NVIDIA Morpheus, being introduced in the Fall 2022 release, and find out how today’s security analysts are using Morpheus in their everyday investigations and workflows. Bartley Richardson, Director of Cybersecurity Engineering, NVIDIA
  • Deriving Cyber Resilience from the Data Supply Chain:  Hear how NVIDIA tackles these challenges through the application of zero-trust architectures in combination with AI and data analytics, combating our joint adversaries with a data-first response with the application of DPU, GPU, and AI SDKs and tools. Learn where the promise of cyber-AI is working in application. Daniel Rohrer, Vice President of Software Product Security, NVIDIA
  • Accelerating the Next Generation of Cybersecurity Research:  Discover how to apply prebuilt models for digital fingerprinting to analyze behavior of every user and machine, analyze raw emails to automatically detect phishing, find and classify leaked credentials and sensitive information, profile behaviors to detect malicious code and behavior, and leverage graph neural networks to identify fraud. Killian Sexsmith, Senior Developer Relations Manager, NVIDIA

Learn how to use Morpheus by enrolling in the free, self-paced Sensitive Information Detection with Morpheus DLI course.

Get started with Morpheus on the /nvidia/morpheus GitHub repo or NGC, or try it on NVIDIA LaunchPad.

Categories
Misc

Low-Code Building Blocks for Speech AI Robotics

When examining an intricate speech AI robotic system, it’s easy for developers to feel intimidated by its complexity. Arthur C. Clarke claimed, “Any sufficiently advanced technology is indistinguishable from magic.”

From accepting natural-language commands to safely interacting in real time with its environment and the humans around it, today’s speech AI robotics systems can perform tasks at a level previously unachievable by machines.

Join experts from Google, Meta, NVIDIA, and more at the first annual NVIDIA Speech AI Summit. Register now.

Take Spot, a speech AI-enabled robot that can fetch drinks on its own, for example. To easily add speech AI skills, such as automatic speech recognition (ASR) or text-to-speech (TTS), many developers leverage simpler low-code building blocks when building complex robot systems.

Photograph of Spot, an intelligent robotic system, after it has successfully completed a drink order.
Figure 1. Spot, a robotic dog, fetches a drink in real time after processing an order using ASR and TTS skills provided by NVIDIA Riva.

For developers creating robotic applications with speech AI skills, this post breaks down the low-code building blocks provided by the NVIDIA Riva SDK.

By following along with the provided code examples, you learn how speech AI technology makes it possible for intelligent robots to take food orders, relay those orders to a restaurant employee, and finally navigate back home when prompted.

Design an AI robotic system using building blocks

Complex systems consist of several building blocks. Each building block is much simpler to understand on its own.

When you understand the function of each component, the end product becomes less daunting. If you’re using low-code building blocks, you can now focus on domain-specific customizations requiring more effort.

Our latest project uses “Spot,” a four-legged robot, and an NVIDIA Jetson Orin, which is connected to Spot through an Ethernet cable. This project is a prime example of using AI building blocks to form a complex speech AI robot system.

Architectural diagram of an AI robotics system with Riva low-code speech AI blocks shown as core components for platform, navigation, and interaction.
Figure 2. A speech AI robot system with Riva low-code speech AI blocks to add ASR and TTS skills

Our goal was to build a robot that could fetch us snacks on its own from a local restaurant, with as little intervention from us as possible. We also set out to write as little code as possible by using what we could from open-source libraries and tools. Almost all the software used in this project was freely available.

To achieve this goal, an AI system must be able to interact with humans vocally, perceive its environment (in our case, with an embedded camera), and navigate through the surroundings safely. Figure 2 shows how interaction, platform, and navigation represent our Spot robot’s three fundamental operation components, and how those components are further subdivided into low-code building blocks.

This post focuses solely on the human interaction blocks from the Riva SDK.

Add speech recognition and speech synthesis skills using Riva

We have so many interactions with people every day that it is easy to overlook how complex those interactions actually are. Speaking comes naturally to humans, but understanding and producing speech is not nearly so simple for an intelligent machine.

Riva is a fully customizable, GPU-accelerated speech AI SDK that handles ASR and TTS skills, and is deployable on-premises, in all clouds, at the edge, and on embedded devices. It facilitates human-machine speech interactions.

Riva runs entirely locally on the Spot robot. Therefore, processing is secure and does not require internet access. It is also completely configurable with a simple parameter file, so no extra coding is needed.

Riva code examples for each speech AI task

Riva provides ready-to-use Python scripts and command-line tools for real-time transformation of audio data captured by a microphone into text (ASR, speech recognition, or speech-to-text) and for converting text into an audio output (TTS, or speech synthesis).

Adapting these scripts for compatibility with the Robot Operating System (ROS) from Open Robotics requires only minor changes, which helps simplify the robotic system development process.

ASR customizations

The Riva out-of-the-box (OOTB) Python ASR client script is named transcribe_mic.py. By default, it prints ASR output to the terminal. By modifying it, the ASR output is routed to a ROS topic and can be read by anything on the ROS network. The critical additions to the script’s main() function are shown in the following code example:

   # Assumes "import rospy" and "from std_msgs.msg import String" at the top of the script
   inter_pub = rospy.Publisher('intermediate', String, queue_size=10)  # partial transcripts
   final_pub = rospy.Publisher('final', String, queue_size=10)         # finalized transcripts
   rospy.init_node('riva_asr', anonymous=True)

The following code example includes more critical additions to main:

       for response in responses:
           if not response.results:
               continue
           partial_transcript = ""
           for result in response.results:
               if not result.alternatives:
                   continue
               transcript = result.alternatives[0].transcript
               if result.is_final:
                   for i, alternative in enumerate(result.alternatives):
                       final_pub.publish(alternative.transcript)
                else:
                    partial_transcript += transcript
           if partial_transcript:
               inter_pub.publish(partial_transcript)

TTS customizations

Riva also provides the talk.py script for TTS. By default, you enter text in a terminal or Python interpreter, from which Riva generates audio output. For Spot to speak, the talk.py script is modified so that the input text comes from a ROS callback rather than a human’s keystrokes. The key changes to the OOTB script include this function for extracting the text:

def callback(msg):
   global TTS
   TTS = msg.data   # store the received text for the main TTS loop (TTS is assumed to start as None)

They also include these additions to the main() function:

   rospy.init_node('riva_tts', anonymous=True)
   rospy.Subscriber("speak", String, callback)

These altered conditional statements in the main() function are also key:

       while not rospy.is_shutdown():
           if TTS != None:
               text = TTS

Voice interaction script

Simple scripts like voice_control.py consist primarily of the callback and talker functions. They tell Spot what words to listen for and how to respond. 

# Assumes "import rospy" and "from std_msgs.msg import String" at the top of voice_control.py
def callback(msg):
   global pub, order
   rospy.loginfo(msg.data)
   if "hey spot" in msg.data.lower() and "fetch me" in msg.data.lower():
       order_start = msg.data.index("fetch me")
       order = msg.data[order_start + 9:]   # everything after "fetch me " is the order
       pub.publish("Fetching " + order)     # spoken back through the Riva TTS node

def talker():
   global pub
   rospy.init_node("spot_voice_control", anonymous=True)
   pub = rospy.Publisher("speak", String, queue_size=10)   # text consumed by the TTS client
   rospy.Subscriber("final", String, callback)             # final ASR transcripts from Riva
   rospy.spin()

In other words, if the text contains “Hey Spot, … fetch me…” Spot saves the rest of the sentence as an order. After the ASR transcript indicates that the sentence is finished, Spot activates the TTS client and recites the word “Fetching” plus the contents of the order. Other scripts then engage a ROS action server instructing Spot to navigate to the restaurant, while taking care to avoid cars and other obstacles.

When Spot reaches the restaurant, it waits for a person to take its order by saying “Hello Spot.” If the ASR analysis script detects this sequence, Spot recites the order and ends it with “please.” The restaurant employee places the ordered food and any change in the appropriate container on Spot’s back. Spot returns home after Riva ASR recognizes that the restaurant staffer has said, “Go home, Spot.”
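
For illustration, the restaurant-side behavior described above could be handled by another callback in the same style as voice_control.py. The following is a hypothetical sketch rather than code from the project; the topic names mirror the earlier snippets, and the order variable is assumed to hold the order text saved earlier.

# Hypothetical sketch in the style of voice_control.py; not code from the project
def restaurant_callback(msg):
    global pub, order
    text = msg.data.lower()
    if "hello spot" in text and order:
        pub.publish(order + ", please")    # recite the saved order to the restaurant employee
    elif "go home" in text and "spot" in text:
        pub.publish("Heading home")        # a separate node would trigger the navigation action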

A speech AI SDK like Riva, which enables building and deploying fully customizable, real-time speech AI applications on premises, in any cloud, at the edge, and on embedded devices, brings AI robotics into the real world.

When a robot seamlessly interacts with people, it opens up a world of new areas where robots can help without needing a technical person on a computer to do the translation.

Deploy your own speech AI robot with a low-code solution

Teams such as NVIDIA, Open Robotics, and the robotics community, in general, have done a fantastic job in solving speech AI and robotics problems and making that technology available and accessible for everyday robotics users.

Anyone eager to get into the industry or to improve the technology they already have can look to these groups for inspiration and examples of cutting-edge technology. These technologies are usable through free SDKs (Riva, ROS, NVIDIA DeepStream, NVIDIA CUDA) and capable hardware (robots, NVIDIA Jetson Orin, sensors).

I am thrilled to see this level of community support from technology leaders and invite you to build your own speech AI robot. Robots are awesome!

Categories
Misc

Inside AI: NVIDIA DRIVE Ecosystem Creates Pioneering In-Cabin Features with NVIDIA DRIVE IX

As personal transportation becomes electrified and automated, time in the vehicle has begun to resemble that of a living space rather than a mind-numbing commute. Companies are creating innovative ways for drivers and passengers to make the most of this experience, using the flexibility and modularity of NVIDIA DRIVE IX. In-vehicle technology companies Cerence, Smart…

Categories
Misc

Enhancing Digital Twin Models and Simulations with NVIDIA Modulus v22.09

The latest version of NVIDIA Modulus, an AI framework that enables users to create customizable training pipelines for digital twins, climate models, and physics-based modeling and simulation, is now available for download. 

This release of the physics-ML framework, NVIDIA Modulus v22.09, includes key enhancements to increase composition flexibility for neural operator architectures, features to improve training convergence and performance, and most importantly, significant improvements to the user experience and documentation. 

You can download the latest version of the Modulus container from DevZone, NGC, or access the Modulus repo on GitLab.

Neural network architectures

This update extends the Fourier Neural Operator (FNO), physics-informed neural operator (PINO), and DeepONet network architecture implementations to support customization using other built-in networks in Modulus. More specifically, with this update, you can:

  • Achieve better initialization, customization, and generalization across problems with improved FNO, PINO, and DeepONet architectures.
  • Explore new network configurations by combining any point-wise network within Modulus such as Sirens, Fourier Feature networks, and Modified Fourier Feature networks for the decoder portion of FNO/PINO with the spectral encoder.
  • Use any network in the branch net and trunk net of DeepONet to experiment with a wide selection of architectures. This includes the physics-informed neural networks (PINNs) in the trunk net. FNO can be used in the branch net of DeepONet as well.
  • Demonstrate DeepONet improvements with a new DeepONet example for modeling Darcy flow through porous media.

Model parallelism has been introduced as a beta feature with model-parallel AFNO. This enables parallelizing the model across multiple GPUs along the channel dimension. This decomposition distributes the FFTs and IFFTs in a highly parallel fashion. The matrix multiplies are partitioned so each GPU holds a different portion of each MLP layer’s weights with appropriate gather, scatter, reductions, and other communication routines implemented for the forward and backward passes.

In addition, support for the self-scalable tanh (Stan) activation function is now available. Stan has been shown to improve convergence characteristics and increase accuracy for PINN training.

Finally, support for kernel fusion of the Sigmoid Linear Unit (SiLU) through TorchScript is now added with upstream changes to the PyTorch symbolic gradient formula. This is especially useful for problems that require computing higher-order derivatives for physics-informed training, providing up to 1.4x speedup in such instances.

Modeling enhancements and training features

Each NVIDIA Modulus release improves how partial differential equations (PDEs) are mapped to neural network models and improves training convergence.

New recommended practices in Modulus are available to facilitate scaling and nondimensionalizing PDEs to help you properly scale your system’s units, including the following steps (a conceptual sketch follows the list):

  • Defining a physical quantity with its value and its unit
  • Instantiating a nondimensionalized object to scale the quantity 
  • Tracking the nondimensionalized quantity through the algebraic manipulations
  • Scaling back the nondimensionalized quantity to any target quantity with user-specified units for post-processing purposes
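
Here is a conceptual sketch of that workflow in plain Python. It only illustrates the idea of nondimensionalization and does not use the Modulus APIs; the reference scales and quantities are arbitrary assumptions.

# Conceptual nondimensionalization sketch (not the Modulus API)
length_scale = 0.1      # reference length in meters (assumed)
velocity_scale = 2.0    # reference velocity in m/s (assumed)

# 1. Define a physical quantity with its value and unit
inlet_velocity = 1.5    # m/s

# 2. Nondimensionalize it by the chosen scale
inlet_velocity_nd = inlet_velocity / velocity_scale     # dimensionless, 0.75

# 3. Carry the nondimensionalized quantity through the problem setup
domain_length_nd = 0.5 / length_scale                   # a 0.5 m channel becomes 5.0

# 4. Scale results back to physical units for post-processing
predicted_velocity = inlet_velocity_nd * velocity_scale # back to m/s
print(inlet_velocity_nd, domain_length_nd, predicted_velocity)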

You now also have the ability to effectively handle different scales within a system with Selective Equations Term Suppression (SETS). This enables you to create different instances of the same PDE and freeze certain terms in a PDE. That way, the losses for the smaller scales are minimized, improving convergence on stiff PDEs in the PINNs. 

In addition, new Modulus APIs, configured in the Hydra configuration YAML file, enable the end user to terminate the training based on convergence criteria like total loss or individual loss terms or another metric that they can specify.

The new causal weighting scheme addresses the bias of continuous time PINNs that violate physical causality for transient problems. By reformulating the losses for the residual and initial conditions, you can get better convergence and better accuracy of PINNS for dynamic systems.
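
For intuition, here is a minimal sketch of the kind of causal weighting commonly described in the PINN literature, in which residual losses at later times are down-weighted until earlier times have converged. The exponential form and the epsilon value are assumptions for illustration and may differ from the exact scheme implemented in Modulus.

import numpy as np

def causal_weights(residual_losses_per_time, epsilon=1.0):
    """residual_losses_per_time: PDE residual loss at each time slice, ordered from t0 onward.
    Later slices are down-weighted until the losses at earlier times become small."""
    losses = np.asarray(residual_losses_per_time)
    cumulative = np.concatenate(([0.0], np.cumsum(losses)[:-1]))  # sum of losses at earlier times
    return np.exp(-epsilon * cumulative)

# Example: while early times still have large residuals, later times get near-zero weight
print(causal_weights([5.0, 4.0, 3.0]))     # weights decay quickly
print(causal_weights([0.01, 0.02, 0.01]))  # nearly uniform once early times converge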

Modulus training performance, scalability, and usability 

Each NVIDIA Modulus release focuses on improving training performance and scalability. With this latest release, FuncTorch was integrated into Modulus for faster gradient calculations in PINN training. Regular PyTorch Autograd uses reverse mode automatic differentiation and has to calculate Jacobian and Hessian terms row by row in a for loop. FuncTorch removes unnecessary weight gradient computations and can calculate Jacobian and Hessian more efficiently using a combination of reverse and forward mode automatic differentiation, thereby improving the training performance.
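
As a small, general illustration of the kind of gradient computation involved (a functorch usage sketch, not Modulus internals), forward-over-reverse differentiation composes jacfwd and jacrev to compute a full Hessian without looping over rows:

import torch
from functorch import jacrev, jacfwd

def f(x):
    # A scalar test function standing in for a PDE residual term
    return (x ** 3).sum()

x = torch.randn(4)
jacobian = jacrev(f)(x)            # reverse-mode Jacobian (here, the gradient)
hessian = jacfwd(jacrev(f))(x)     # forward-over-reverse Hessian, no row-by-row loop
print(jacobian.shape, hessian.shape)  # torch.Size([4]) torch.Size([4, 4])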

The Modulus v22.09 documentation improvements provide more context and detail about the key concepts of the framework’s workflow to help new users. 

Enhancements have been made to the Modulus Overview with more example-guided workflows for physics-only driven, purely data-driven, and combined physics- and data-driven modeling approaches. Modulus users can now follow improved introductory examples to build step by step in line with each workflow’s key concepts.

Get more details about all Modulus functionalities by visiting the Modulus User Guide and the Modulus Configuration page. You can also provide feedback and contributions through the Modulus GitLab repo.

Check out the NVIDIA Deep Learning Institute self-paced course, Introduction to Physics-Informed Machine Learning with Modulus. Join us for these GTC 2022 featured sessions to learn more about NVIDIA Modulus research and breakthroughs.

Categories
Misc

Achieving Supercomputing-Scale Quantum Circuit Simulation with the NVIDIA DGX cuQuantum Appliance

Quantum circuit simulation is critical for developing applications and algorithms for quantum computers. Because of the disruptive nature of known quantum computing algorithms and use cases, quantum algorithms researchers in government, enterprise, and academia are developing and benchmarking novel quantum algorithms on ever-larger quantum systems.

In the absence of large-scale, error-corrected quantum computers, the best way of developing these algorithms is through quantum circuit simulation. Quantum circuit simulations are computationally intensive, and GPUs are a natural tool for calculating quantum states. To simulate larger quantum systems, it is necessary to distribute the computation across multiple GPUs and multiple nodes to leverage the full computational power of a supercomputer.

NVIDIA cuQuantum is a software development kit (SDK) that enables users to easily accelerate and scale quantum circuit simulations with GPUs, facilitating a new capacity for exploration on the path to quantum advantage. 

This SDK includes the recently released NVIDIA DGX cuQuantum Appliance, a deployment-ready software container, with multi-GPU state vector simulation support. Generalized multi-GPU APIs are also now available in cuStateVec, for easy integration into any simulator. For tensor network simulation, the slicing API provided by the cuQuantum cuTensorNet library enables accelerated tensor network contractions distributed across multiple GPUs or multiple nodes. This has allowed users to take advantage of DGX A100 systems with nearly linear strong scaling. 

The NVIDIA cuQuantum SDK features libraries for state vector and tensor network methods.  This post focuses on cuStateVec for multi-node state vector simulation and the DGX cuQuantum Appliance. If you are interested in learning more about cuTensorNet and tensor network methods, see Scaling Quantum Circuit Simulation with NVIDIA cuTensorNet.

What is a multi-node, multi-GPU state vector simulation?

A node is a single package unit made up of tightly interconnected processors optimized to work together while maintaining a rack-ready form factor. Multi-node multi-GPU state vector simulations take advantage of multiple GPUs within a node and multiple nodes of GPUs to provide faster time to solution and larger problem sizes than would otherwise be possible. 

DGX enables users to take advantage of high memory, low latency, and high bandwidth. The DGX H100 system is made up of eight H100 Tensor Core GPUs, leveraging the fourth-generation NVLink and third-generation NVSwitch. This node is a powerhouse for quantum circuit simulation. 

Running on a DGX A100 node with the NVIDIA multi-GPU-enabled DGX cuQuantum Appliance on all eight GPUs resulted in 70x to 290x speedups over dual 64-core AMD EPYC 7742 processors for three common quantum computing algorithms: the Quantum Fourier Transform, Shor’s algorithm, and the Sycamore supremacy circuit. This has enabled users to simulate up to 36 qubits with full state vector methods using a single DGX A100 node (eight GPUs). The results shown in Figure 1 are 4.4x better than the benchmarks we last announced for this capability, due to software-only enhancements our team has implemented.

Graph showing acceleration ratio of simulations for popular quantum circuits show a speed up of 70-294x with GPUs over data center- and HPC-grade CPUs. Performance of GPU simulations was measured on DGX A100 and compared to the performance of two sockets of the EPYC 7742 CPU.
Figure 1. DGX cuQuantum Appliance multi-GPU speedup over state-of-the-art dual socket CPU server

The NVIDIA cuStateVec team has intensively investigated a performant means of leveraging multiple nodes in addition to multiple GPUs within a single node. Because most gate applications are perfectly parallel operations, GPUs within and across nodes can be orchestrated to divide and conquer. 

During the simulation, the state vector is split and distributed among GPUs, and each GPU can apply a gate to its part of the state vector in parallel. In many cases this can be handled locally; however, gate applications to high-order qubits require communication among distributed state vectors. 

One typical approach is to first reorder qubits and then apply gates in each GPU without accessing other GPUs or nodes. This reordering itself needs data transfers between devices. To do this efficiently, high interconnect bandwidth becomes incredibly important. Efficiently taking advantage of this parallelizability is non-trivial across multiple nodes.
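
To see why multiple GPUs and nodes are needed at all, here is a quick back-of-the-envelope sketch of full state vector memory requirements. It assumes single-precision complex amplitudes (8 bytes each); double precision doubles these figures, and a real simulation also needs workspace and communication buffers beyond the amplitudes themselves.

def state_vector_bytes(num_qubits, bytes_per_amplitude=8):
    """Memory for a full state vector: 2**n amplitudes (assuming single-precision complex, 8 bytes each)."""
    return (2 ** num_qubits) * bytes_per_amplitude

for n in (32, 36, 40):
    gigabytes = state_vector_bytes(n) / 1e9
    min_gpus = -(-state_vector_bytes(n) // (80 * 10 ** 9))  # ceiling division against 80 GB per GPU
    print(f"{n} qubits: {gigabytes:,.0f} GB, at least {min_gpus} x 80 GB GPUs for the amplitudes alone")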

Introducing the multi-node DGX cuQuantum Appliance

The answer to performantly and arbitrarily scale state vector-based quantum circuit simulation is here. NVIDIA is pleased to announce the multi-node, multi-GPU capability delivered in the new DGX cuQuantum Appliance. In our next release, any cuQuantum container user will be able to quickly and easily leverage an IBM Qiskit frontend to simulate quantum circuits on the largest NVIDIA systems in the world. 

The cuQuantum mission is to enable as many users as possible to easily accelerate and scale quantum circuit simulation. To that end, the cuQuantum team is working to productize the NVIDIA multi-node approach into APIs, which will be in general availability early next year. With this approach, you will be able to leverage a wider range of NVIDIA GPU-based systems to scale your state vector quantum circuit simulations. 

The NVIDIA multi-node DGX cuQuantum Appliance is in its final stages of development, and you will soon be able to take advantage of the best-in-class performance available with NVIDIA DGX SuperPOD systems. This will be offered as an NGC-hosted container image that you can quickly deploy with the help of Docker and a few lines of code.
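
For illustration, once the container has been pulled from NGC (the exact image name and tag are listed in the NGC catalog), a run looks like ordinary Qiskit code pointed at a GPU state vector backend. The snippet below is a generic sketch, not the appliance's exact configuration: the appliance ships a cuStateVec-accelerated Qiskit Aer backend, and the backend name, options, and import path depend on the qiskit-aer version in the image, so check the appliance documentation. Multi-node runs are launched with an MPI launcher (for example, mpirun) across the participating nodes.

```python
# Generic sketch: build a small GHZ circuit and run it on Qiskit Aer's GPU
# statevector simulator. Options shown are standard qiskit-aer, not appliance-specific.
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator   # older Aer releases: from qiskit.providers.aer import AerSimulator

qc = QuantumCircuit(30)
qc.h(0)
for i in range(29):
    qc.cx(i, i + 1)                    # 30-qubit GHZ state
qc.measure_all()

sim = AerSimulator(method="statevector", device="GPU")
result = sim.run(transpile(qc, sim), shots=1024).result()
print(result.get_counts())             # expect roughly 50/50 all-zeros and all-ones
```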

With the fastest I/O architecture of any DGX system, NVIDIA DGX H100 is the foundational building block for large AI clusters like NVIDIA DGX SuperPOD, the enterprise blueprint for scalable AI, and now, quantum circuit simulation infrastructure. The eight NVIDIA H100 GPUs in the DGX H100 use the new high-performance fourth-generation NVLink technology to interconnect through four third-generation NVSwitches. 

The fourth-generation NVLink technology delivers 1.5x the communications bandwidth of the prior generation and is up to 7x faster than PCIe Gen5. It delivers up to 7.2 TB/sec of total GPU-to-GPU throughput, an improvement of almost 1.5x compared to the prior generation DGX A100. 

Along with the eight included NVIDIA ConnectX-7 InfiniBand/Ethernet adapters, each running at 400 Gb/sec, the DGX H100 system provides a powerful, high-speed fabric that reduces the overhead of global communication among state vector partitions distributed across multiple nodes. The combination of multi-node, multi-GPU cuQuantum with massive GPU-accelerated compute, leveraging state-of-the-art networking hardware and software optimizations, means that DGX H100 systems can scale to hundreds or thousands of nodes to meet the biggest challenges, such as scaling full state vector quantum circuit simulation past 50 qubits. 

To benchmark this work, the multi-node DGX cuQuantum Appliance is run on the NVIDIA Selene Supercomputer, the reference architecture for NVIDIA DGX SuperPOD systems. As of June 2022, Selene is ranked eighth on the TOP500 list of supercomputing systems executing the High Performance Linpack (HPL) benchmark with 63.5 petaflops, and number 22 on the Green500 list with 24.0 gigaflops per watt.  

NVIDIA ran benchmarks leveraging the multi-node DGX cuQuantum Appliance: Quantum Volume, the Quantum Approximate Optimization Algorithm (QAOA), and Quantum Phase Estimation. The Quantum Volume circuit ran with a depth of 10 and a depth of 30. QAOA is a common algorithm used to solve combinatorial optimization problems on relatively near-term quantum computers; we ran it with two parameters. 
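
As a point of reference, the benchmark circuits themselves are standard constructions available in Qiskit's circuit library. The sketch below builds a Quantum Volume circuit; the seed and measurement choices are illustrative assumptions, not NVIDIA's benchmark settings. QAOA and Quantum Phase Estimation circuits can be assembled similarly from Qiskit's library or written by hand.

```python
# Sketch: a 32-qubit Quantum Volume circuit with 30 layers of random SU(4) blocks.
from qiskit.circuit.library import QuantumVolume

qv = QuantumVolume(num_qubits=32, depth=30, seed=1234)
qv.measure_all()                      # add measurements so the circuit can be sampled
```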

Both weak and strong scaling are demonstrated with these algorithms. It is clear that scaling to a supercomputer like the NVIDIA DGX SuperPOD is valuable both for accelerating time to solution and for extending the phase space researchers can explore with state vector quantum circuit simulation techniques. 

Chart showing the scaling of state vector-based quantum circuit simulations from 32 to 40 qubits for Quantum Volume with a depth of 30, the Quantum Approximate Optimization Algorithm, and Quantum Phase Estimation. All runs were conducted on multiple GPUs, going up to 256 total NVIDIA A100 GPUs on NVIDIA's Selene supercomputer, made easy by the DGX cuQuantum Appliance multi-node capability.
Figure 2. DGX cuQuantum Appliance multi-node weak scaling performance from 32 to 40 qubits 

We are further enabling users to achieve scale with our updated DGX cuQuantum Appliance. By introducing multi-node capabilities, we are enabling users to move beyond 32 qubits on one GPU and 36 qubits on one NVIDIA Ampere Architecture node. We simulated a total of 40 qubits with 32 DGX A100 nodes. Users will now be able to scale out even further depending upon system configurations, up to a software limit of 56 qubits, which would correspond to millions of DGX A100 nodes. Our other preliminary tests on NVIDIA Hopper GPUs have shown us that these numbers will be even better on our next-generation architecture. 
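
For intuition about these limits, a full state vector on n qubits stores 2^n complex amplitudes, so memory doubles with every added qubit. The quick sizing sketch below (assuming 8-byte complex64 or 16-byte complex128 amplitudes, and ignoring the extra workspace a real simulator needs) shows why each additional qubit quickly pushes a simulation from one GPU to a full node, and from a node to many nodes.

```python
# Back-of-the-envelope state vector sizing; real runs also need workspace buffers.
def statevector_gib(num_qubits, bytes_per_amplitude):
    return (1 << num_qubits) * bytes_per_amplitude / 2**30

for n in (32, 36, 40):
    print(f"{n} qubits: {statevector_gib(n, 8):7.0f} GiB (complex64), "
          f"{statevector_gib(n, 16):7.0f} GiB (complex128)")
# 32 qubits:    32 GiB (complex64),    64 GiB (complex128)
# 36 qubits:   512 GiB (complex64),  1024 GiB (complex128)
# 40 qubits:  8192 GiB (complex64), 16384 GiB (complex128)
```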

We also measured the strong scaling of our multi-node capabilities, focusing on Quantum Volume for simplicity. Figure 3 describes the performance when we solved the same problem multiple times while changing the number of GPUs. Compared to a state-of-the-art dual-socket CPU server, we obtained 320x to 340x speedups when leveraging 16 DGX A100 nodes. This is also 3.5x faster than the previous state-of-the-art implementation of Quantum Volume (depth=10 for 36 qubits with just two DGX A100 nodes). When adding more nodes, this speedup becomes more dramatic.  

Graph showing the acceleration of 32-qubit Quantum Volume implementations with a depth of 10 and a depth of 30 using GPUs in NVIDIA's Selene: you can easily take advantage of GPUs and scale to speed up quantum circuit simulations with cuStateVec when compared to CPUs. We leveraged up to 16 NVIDIA DGX A100 nodes, totaling 128 NVIDIA A100 GPUs.
Figure 3. DGX cuQuantum Appliance multi-node speedup for 32 qubit Quantum Volume compared to a state-of-the-art CPU server

Simulate and scale quantum circuits on the largest NVIDIA systems

The cuQuantum team at NVIDIA is scaling up state vector simulation to multi-node, multi-GPU. This enables end users to conduct quantum circuit simulation for full state vectors larger than ever before. cuQuantum has enabled not only scale but also performance, demonstrating both weak and strong scaling across nodes. 

In addition, this release introduces the first cuQuantum-powered IBM Qiskit image. In our next release, you will be able to pull this container, making it easier and faster to scale up quantum circuit simulations with this popular framework.

While the multi-node DGX cuQuantum Appliance is in private beta today, NVIDIA expects to release it publicly over the coming months. The cuQuantum team intends to release multi-node APIs within the cuStateVec library by spring 2023.

Getting started with DGX cuQuantum Appliance

When the multi-node DGX cuQuantum Appliance reaches general availability later this year, you will be able to pull the Docker image from the NGC container catalog.

You can reach out to the cuQuantum team with questions through the Quantum Computing Forum. Contact us on the NVIDIA/cuQuantum GitHub repo with feature requests or to report a bug. 


Categories
Offsites

View Synthesis with Transformers

A long-standing problem in the intersection of computer vision and computer graphics, view synthesis is the task of creating new views of a scene from multiple pictures of that scene. This has received increased attention [1, 2, 3] since the introduction of neural radiance fields (NeRF). The problem is challenging because to accurately synthesize new views of a scene, a model needs to capture many types of information — its detailed 3D structure, materials, and illumination — from a small set of reference images.

In this post, we present recently published deep learning models for view synthesis. In “Light Field Neural Rendering” (LFNR), presented at CVPR 2022, we address the challenge of accurately reproducing view-dependent effects by using transformers that learn to combine reference pixel colors. Then in “Generalizable Patch-Based Neural Rendering” (GPNR), to be presented at ECCV 2022, we address the challenge of generalizing to unseen scenes by using a sequence of transformers with canonicalized positional encoding that can be trained on a set of scenes to synthesize views of new scenes. These models have some unique features. They perform image-based rendering, combining colors and features from the reference images to render novel views. They are purely transformer-based, operating on sets of image patches, and they leverage a 4D light field representation for positional encoding, which helps to model view-dependent effects.

We train deep learning models that are able to produce new views of a scene given a few images of it. These models are particularly effective when handling view-dependent effects like the refractions and translucency on the test tubes. This animation is compressed; see the original-quality renderings here. Source: Lab scene from the NeX/Shiny dataset.

Overview
The input to the models consists of a set of reference images and their camera parameters (focal length, position, and orientation in space), along with the coordinates of the target ray whose color we want to determine. To produce a new image, we start from the camera parameters of the input images, obtain the coordinates of the target rays (each corresponding to a pixel), and query the model for each.

Instead of processing each reference image completely, we look only at the regions that are likely to influence the target pixel. These regions are determined via epipolar geometry, which maps each target pixel to a line on each reference frame. For robustness, we take small regions around a number of points on the epipolar line, resulting in the set of patches that will actually be processed by the model. The transformers then act on this set of patches to obtain the color of the target pixel.
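
As a rough illustration of this sampling step, here is a simplified NumPy sketch. It is not the papers' exact procedure (they sample points along the target ray between near and far bounds and project them into each reference view); instead it parameterizes the epipolar line directly with an assumed fundamental matrix F relating the target and reference cameras.

```python
import numpy as np

def epipolar_line(F, target_pixel):
    """Epipolar line l = F @ x in a reference image for a target pixel x (homogeneous coords)."""
    x = np.array([target_pixel[0], target_pixel[1], 1.0])
    return F @ x                               # coefficients (a, b, c) of a*u + b*v + c = 0

def patches_along_line(ref_image, line, num_samples=16, patch_size=8):
    """Sample evenly spaced points on the epipolar line and crop a small patch around each."""
    h, w = ref_image.shape[:2]
    a, b, c = line
    us = np.linspace(0, w - 1, num_samples)
    vs = -(a * us + c) / (b + 1e-9)            # solve the line equation for v
    half = patch_size // 2
    patches = []
    for u, v in zip(us.round().astype(int), vs.round().astype(int)):
        if half <= u < w - half and half <= v < h - half:   # keep in-bounds patches only
            patches.append(ref_image[v - half:v + half, u - half:u + half])
    return patches
```

The resulting set of patches, collected from every reference view, is what the transformers consume.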

Transformers are especially useful in this setting since their self-attention mechanism naturally takes sets as inputs, and the attention weights themselves can be used to combine reference view colors and features to predict the output pixel colors. These transformers follow the architecture introduced in ViT.

To predict the color of one pixel, the models take a set of patches extracted around the epipolar line of each reference view. Image source: LLFF dataset.

Light Field Neural Rendering
In Light Field Neural Rendering (LFNR), we use a sequence of two transformers to map the set of patches to the target pixel color. The first transformer aggregates information along each epipolar line, and the second along each reference image. We can interpret the first transformer as finding potential correspondences of the target pixel on each reference frame, and the second as reasoning about occlusion and view-dependent effects, which are common challenges of image-based rendering.

LFNR uses a sequence of two transformers to map a set of patches extracted along epipolar lines to the target pixel color.
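
To make the two-stage aggregation concrete, here is a heavily simplified NumPy sketch: single-head attention pooling first along each epipolar line and then across views, with the attention weights reused to blend reference colors. The real model uses full ViT-style transformer blocks and light-field positional encodings to form the queries; the query vectors q_line and q_view below are stand-ins for those encodings.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(query, items):
    """Single-head dot-product attention that reduces a set of vectors to one."""
    weights = softmax(items @ query / np.sqrt(query.size))
    return weights @ items, weights

def two_stage_aggregate(features, colors, q_line, q_view):
    """features: (views, points_per_line, dim) patch features; colors: (views, points_per_line, 3)."""
    view_feats, view_colors = [], []
    for f, c in zip(features, colors):
        feat, w = attention_pool(q_line, f)        # stage 1: pool along each epipolar line
        view_feats.append(feat)
        view_colors.append(w @ c)                  # reuse the weights to blend colors
    feat, w = attention_pool(q_view, np.stack(view_feats))   # stage 2: pool across views
    return w @ np.stack(view_colors)               # predicted RGB for the target pixel
```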

LFNR improved the state-of-the-art on the most popular view synthesis benchmarks (Blender and Real Forward-Facing scenes from NeRF and Shiny from NeX) with margins as large as 5 dB peak signal-to-noise ratio (PSNR). This corresponds to a reduction of the pixel-wise error by a factor of 1.8x. We show qualitative results on challenging scenes from the Shiny dataset below:

LFNR reproduces challenging view-dependent effects like the rainbow and reflections on the CD, reflections, refractions and translucency on the bottles. This animation is compressed; see the original quality renderings here. Source: CD scene from the NeX/Shiny dataset.
Prior methods such as NeX and NeRF fail to reproduce view-dependent effects like the translucency and refractions in the test tubes on the Lab scene from the NeX/Shiny dataset. See also our video of this scene at the top of the post and the original quality outputs here.

Generalizing to New Scenes
One limitation of LFNR is that the first transformer collapses the information along each epipolar line independently for each reference image. This means that it decides which information to preserve based only on the output ray coordinates and patches from each reference image, which works well when training on a single scene (as most neural rendering methods do), but it does not generalize across scenes. Generalizable methods are important because they can be applied to new scenes without needing to retrain.

We overcome this limitation of LFNR in Generalizable Patch-Based Neural Rendering (GPNR). We add a transformer that runs before the other two and exchanges information between points at the same depth over all reference images. For example, this first transformer looks at the columns of the patches from the park bench shown above and can use cues like the flower that appears at corresponding depths in two views, which indicates a potential match. Another key idea of this work is to canonicalize the positional encoding based on the target ray, because to generalize across scenes, it is necessary to represent quantities in relative and not absolute frames of reference. The animation below shows an overview of the model.

GPNR consists of a sequence of three transformers that map a set of patches extracted along epipolar lines to a pixel color. Image patches are mapped via the linear projection layer to initial features (shown as blue and green boxes). Then those features are successively refined and aggregated by the model, resulting in the final feature/color represented by the gray rectangle. Park bench image source: LLFF dataset.
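
The canonicalization idea can be sketched as follows. This is an illustrative simplification, not the paper's exact 4D light-field parameterization: express every reference ray in a frame attached to the target camera before computing a sinusoidal positional encoding, so the encoding depends only on relative geometry and can therefore transfer across scenes.

```python
import numpy as np

def canonicalize_ray(origin_w, direction_w, R_cw, t_cw):
    """Express a world-space ray in the target camera's frame.
    R_cw, t_cw map world coordinates into target-camera coordinates."""
    o = R_cw @ origin_w + t_cw
    d = R_cw @ direction_w
    return o, d / np.linalg.norm(d)

def fourier_encode(x, num_freqs=4):
    """Sinusoidal positional encoding of the canonicalized ray coordinates."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = np.outer(freqs, x).ravel()
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Usage sketch: encode a reference ray relative to the target view.
# enc = fourier_encode(np.concatenate(canonicalize_ray(o, d, R_cw, t_cw)))
```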

To evaluate the generalization performance, we train GPNR on a set of scenes and test it on new scenes. GPNR improved the state-of-the-art on several benchmarks (following IBRNet and MVSNeRF protocols) by 0.5–1.0 dB on average. On the IBRNet benchmark, GPNR outperforms the baselines while using only 11% of the training scenes. The results below show new views of unseen scenes rendered with no fine-tuning.

GPNR-generated views of held-out scenes, without any fine tuning. This animation is compressed; see the original quality renderings here. Source: IBRNet collected dataset.
Details of GPNR-generated views on held-out scenes from NeX/Shiny (left) and LLFF (right), without any fine tuning. GPNR reproduces more accurately the details on the leaf and the refractions through the lens when compared against IBRNet.

Future Work
One limitation of most neural rendering methods, including ours, is that they require camera poses for each input image. Poses are not easy to obtain and typically come from offline optimization methods that can be slow, limiting possible applications, such as those on mobile devices. Research on jointly learning view synthesis and input poses is a promising future direction. Another limitation of our models is that they are computationally expensive to train. There is an active line of research on faster transformers, which might help improve our models' efficiency. For the papers, more results, and open-source code, you can check out the project pages for "Light Field Neural Rendering" and "Generalizable Patch-Based Neural Rendering".

Potential Misuse
In our research, we aim to accurately reproduce an existing scene using images from that scene, so there is little room to generate fake or non-existing scenes. Our models assume static scenes, so synthesizing moving objects, such as people, will not work.

Acknowledgments
All the hard work was done by our amazing intern – Mohammed Suhail – a PhD student at UBC, in collaboration with Carlos Esteves and Ameesh Makadia from Google Research, and Leonid Sigal from UBC. We are thankful to Corinna Cortes for supporting and encouraging this project.

Our work is inspired by NeRF, which sparked the recent interest in view synthesis, and IBRNet, which first considered generalization to new scenes. Our light ray positional encoding is inspired by the seminal paper Light Field Rendering, and our use of transformers follows ViT.

Video results are from scenes from LLFF, Shiny, and IBRNet collected datasets.