Categories
Misc

Nvidia Releases CUDA Python

submitted by /u/lindaarden
Categories
Misc

Unlocking Operational Consistency with the NVIDIA User Experience CLI Object Model

Cumulus Linux 4.4 introduces NVUE, a new CLI that is more than just a CLI: it provides a complete object model of Cumulus Linux, unlocking incredible operational potential.

Cumulus Linux 4.4 is the first release with the NVIDIA User Experience (NVUE), a brand new CLI for Cumulus Linux. Being excited about a new networking CLI sounds a bit like being excited about your new 56k modem. What makes NVUE special isn’t simply that it’s a new CLI; it’s the principles it was built on. At its core, NVUE provides a full object model of Cumulus Linux, enabling advanced programmability, extensibility, and usability.

What is an object model?

Object models aren’t exactly the kind of thing network engineers think about daily. I didn’t know what an object model was before I got involved in helping the team design NVUE.

An object model defines the components of a system and their relationships to each other. For example, an interface is an object. It has components like an IP address or MTU setting. It’s not just the fact that an object model exists that is important, but the thought that is put into how those relationships between objects and components fit together.

An interface and an IP address are an easy example, but what about something more complicated? Think about a “bond” interface, also called a port-channel. Is the bond a top-level interface, like an Ethernet port, with other Ethernet interfaces as its children, or is bond membership a property of each Ethernet interface?

A circular relationship between interfaces, the bond, and Ethernet.
Figure 1. Ethernet interfaces and bonds are at the same level with relationships between them.
A hierarchical relationship between objects.
Figure 2. A bond is a property of an interface, like the MTU or IP address.

These relationships get complicated fast. Failing to think through them creates a poor user experience, where you may have to define the same setting multiple times to achieve an end goal, or end up with an inconsistent configuration. Imagine a network CLI that has you define any route inside a VRF under the VRF object, but any route in the global routing table at the top level, as in the following example:

ip vrf red
   ip route 10.1.1.0/24 via 169.254.1.1
!
ip route 192.168.1.0/24 via 172.16.1.1

This is a trivial example, but now the way a route is defined is not uniform; it depends on where you are in the system.
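
For contrast, here is a minimal sketch of how a consistently designed object model might express both routes the same way, by treating the global routing table as just another VRF (here called "default"). The class and attribute names are hypothetical, purely for illustration; they are not actual NVUE syntax:

# Hypothetical object model: every static route lives under a VRF object,
# and the global routing table is simply the VRF named "default".
class Route:
    def __init__(self, prefix, via):
        self.prefix = prefix
        self.via = via

class Vrf:
    def __init__(self, name):
        self.name = name
        self.routes = []

    def add_route(self, prefix, via):
        self.routes.append(Route(prefix, via))

vrfs = {name: Vrf(name) for name in ("default", "red")}

# The same operation, regardless of whether the route is global or in a VRF:
vrfs["red"].add_route("10.1.1.0/24", via="169.254.1.1")
vrfs["default"].add_route("192.168.1.0/24", via="172.16.1.1")

Because the relationship between routes and VRFs is defined only once, every way of interacting with the system inherits the same, consistent behavior.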

What do you get with an object model?

With an understanding of what an object model is, the next question is, “Why should you care?” Having an object model makes it extremely easy to build new ways of interacting with the system. Systems talk to an API that represents the object model. The first interface is, of course, the CLI, but anything can now be an interface to the system: REST, gRPC, or even RFC1149 Avian Carriers.

CLI, REST, gRPC, Terraform, or RFC1149 Carrier Pigeons all interface with the same NVUE API.
Figure 3. CLI and REST interfaces are available in Cumulus Linux 4.4.

Because all the interfaces use the same object model, you get consistent results regardless of how you interact with the system. The CLI and REST API use the same methods to configure a BGP peer. There is never a chance of seeing different behaviors based on which interface you use. And because the object model is the same no matter how you interact with it, going from playing with the CLI to building full automation is an evolution, not a completely new process.

REST and CLI are expected for any network device today. Where can we think beyond this? An object model can be directly imported into a programming language like Python or Java. This enables you to use true programming concepts to build configurations for one device or an entire fabric of devices. You can enforce inputs, values, and relationships like never before. The following code example shows what an NVUE Python interface might look like:

from nvue import Switch

spine01 = Switch()                 # one object that models the whole switch
x = 1
while x <= len(spine01.interfaces):
    spine01.interfaces[x].state = "up"                    # bring the port up
    spine01.interfaces[x].ip.address = f"10.1.{x}.1/24"   # subnet matches the port number
    x += 1



In this example, I load the nvue library and create a new Switch object called spine01. I have the object tell me how many interfaces exist on the system with len(spine01.interfaces). For each interface, I put it in the up state and assign an IP address with the subnet value matching the interface number. For example, port 3 would have an IP address of 10.1.3.1/24.

This doesn’t exist yet, but it is absolutely in the realm of possibility because an object model exists. Unlike all other networking vendor systems, where the model is determined by the CLI, this CLI is based on the model. The object model is a standalone element that can be imported into programming languages, APIs, or any other system.

Try it out

One of the most valuable pieces of Cumulus Linux is the ability to try all our features and functions virtually. You can use NVIDIA Air to start using NVUE today and see what you think of the future of network CLIs and programmability.

Categories
Offsites

SoundStream: An End-to-End Neural Audio Codec

Audio codecs are used to efficiently compress audio to reduce either storage requirements or network bandwidth. Ideally, audio codecs should be transparent to the end user, so that the decoded audio is perceptually indistinguishable from the original and the encoding/decoding process does not introduce perceivable latency.

Over the past few years, different audio codecs have been successfully developed to meet these requirements, including Opus and Enhanced Voice Services (EVS). Opus is a versatile speech and audio codec, supporting bitrates from 6 kbps (kilobits per second) to 510 kbps, which has been widely deployed across applications ranging from video conferencing platforms, like Google Meet, to streaming services, like YouTube. EVS is the latest codec developed by the 3GPP standardization body targeting mobile telephony. Like Opus, it is a versatile codec operating at multiple bitrates, 5.9 kbps to 128 kbps. The quality of the reconstructed audio using either of these codecs is excellent at medium-to-low bitrates (12–20 kbps), but it degrades sharply when operating at very low bitrates (⪅3 kbps). While these codecs leverage expert knowledge of human perception as well as carefully engineered signal processing pipelines to maximize the efficiency of the compression algorithms, there has been recent interest in replacing these handcrafted pipelines by machine learning approaches that learn to encode audio in a data-driven manner.

Earlier this year, we released Lyra, a neural audio codec for low-bitrate speech. In “SoundStream: an End-to-End Neural Audio Codec”, we introduce a novel neural audio codec that extends those efforts by providing higher-quality audio and expanding to encode different sound types, including clean speech, noisy and reverberant speech, music, and environmental sounds. SoundStream is the first neural network codec to work on speech and music, while being able to run in real-time on a smartphone CPU. It is able to deliver state-of-the-art quality over a broad range of bitrates with a single trained model, which represents a significant advance in learnable codecs.

Learning an Audio Codec from Data
The main technical ingredient of SoundStream is a neural network, consisting of an encoder, decoder and quantizer, all of which are trained end-to-end. The encoder converts the input audio stream into a coded signal, which is compressed using the quantizer and then converted back to audio using the decoder. SoundStream leverages state-of-the-art solutions in the field of neural audio synthesis to deliver audio at high perceptual quality, by training a discriminator that computes a combination of adversarial and reconstruction loss functions that induce the reconstructed audio to sound like the uncompressed original input. Once trained, the encoder and decoder can be run on separate clients to efficiently transmit high-quality audio over a network.
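
To make that structure concrete, here is a heavily simplified training-step sketch in Python. The stand-in functions (encode, quantize, decode, disc), the loss forms, and the weight are illustrative assumptions, not the actual SoundStream implementation, which combines several reconstruction and adversarial terms:

import numpy as np

def training_step(audio, encode, quantize, decode, disc, adv_weight=0.01):
    embeddings = encode(audio)                  # encoder: waveform -> latent vectors
    codes = quantize(embeddings)                # quantizer: latent -> compressed representation
    recon = decode(codes)                       # decoder: compressed representation -> waveform
    recon_loss = np.mean((recon - audio) ** 2)  # push the output toward the input
    adv_loss = -np.mean(disc(recon))            # push the output toward "sounds real"
    g_loss = recon_loss + adv_weight * adv_loss
    d_loss = np.mean(np.maximum(0.0, 1 - disc(audio)) +
                     np.maximum(0.0, 1 + disc(recon)))   # hinge discriminator loss
    return g_loss, d_loss

# Stand-ins so the sketch runs; the real modules are deep neural networks.
identity = lambda x: x
g, d = training_step(np.random.randn(16000), identity, identity, identity,
                     disc=lambda x: np.tanh(x))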

SoundStream training and inference. During training, the encoder, quantizer and decoder parameters are optimized using a combination of reconstruction and adversarial losses, computed by a discriminator, which is trained to distinguish between the original input audio and the reconstructed audio. During inference, the encoder and quantizer on a transmitter client send the compressed bitstream to a receiver client that can then decode the audio signal.

Learning a Scalable Codec with Residual Vector Quantization
The encoder of SoundStream produces vectors that can take an indefinite number of values. In order to transmit them to the receiver using a limited number of bits, it is necessary to replace them by close vectors from a finite set (called a codebook), a process known as vector quantization. This approach works well at bitrates around 1 kbps or lower, but quickly reaches its limits when using higher bitrates. For example, even at a bitrate as low as 3 kbps, and assuming the encoder produces 100 vectors per second, one would need to store a codebook with more than 1 billion vectors, which is infeasible in practice.

In SoundStream, we address this issue by proposing a new residual vector quantizer (RVQ), consisting of several layers (up to 80 in our experiments). The first layer quantizes the code vectors with moderate resolution, and each of the following layers processes the residual error from the previous one. By splitting the quantization process into several layers, the codebook size can be reduced drastically. As an example, with 100 vectors per second at 3 kbps, and using 5 quantizer layers, the codebook size goes from 1 billion to 320. Moreover, we can easily increase or decrease the bitrate by adding or removing quantizer layers, respectively.
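
A quick check of those numbers, together with a minimal residual vector quantizer in NumPy (the codebooks here are random placeholders; a real RVQ learns them during training):

import numpy as np

# 3 kbps at 100 vectors per second = 30 bits per vector.
# Single-stage VQ: 2**30 ≈ 1.07 billion codebook vectors.
# RVQ with 5 layers: 30 / 5 = 6 bits per layer, 2**6 = 64 vectors per layer,
# for a total of 5 * 64 = 320 codebook vectors.

def rvq_encode(x, codebooks):
    """Each layer quantizes the residual left over by the previous layers."""
    quantized = np.zeros_like(x)
    indices = []
    for cb in codebooks:                                   # cb has shape (64, dim)
        residual = x - quantized                           # what is still unexplained
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        indices.append(i)
        quantized += cb[i]
    return indices, quantized

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(64, 8)) for _ in range(5)]
indices, approx = rvq_encode(rng.normal(size=8), codebooks)
print(indices)   # five 6-bit indices = 30 bits for this vector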

Because network conditions can vary while transmitting audio, ideally a codec should be “scalable” so that it can change its bitrate from low to high depending on the state of the network. While most traditional codecs are scalable, previous learnable codecs need to be trained and deployed specifically for each bitrate.

To circumvent this limitation, we leverage the fact that the number of quantization layers in SoundStream controls the bitrate, and propose a new method called “quantizer dropout”. During training, we randomly drop some quantization layers to simulate a varying bitrate. This pushes the decoder to perform well at any bitrate of the incoming audio stream, and thus helps SoundStream to become “scalable” so that a single trained model can operate at any bitrate, performing as well as models trained specifically for these bitrates.
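
A minimal sketch of quantizer dropout on top of the residual quantizer above; drawing the number of active layers uniformly at random each training step is an illustrative choice:

import numpy as np

def rvq_encode_with_dropout(x, codebooks, rng, training=True):
    # During training, keep only the first n_q quantizer layers, with n_q
    # drawn at random each step; at inference, n_q is set by the target bitrate.
    n_q = rng.integers(1, len(codebooks) + 1) if training else len(codebooks)
    quantized = np.zeros_like(x)
    for cb in codebooks[:n_q]:
        residual = x - quantized
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        quantized += cb[i]
    return quantized   # the decoder sees outputs produced at every value of n_q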

Comparison of SoundStream models (higher is better) that are trained at 18 kbps with quantizer dropout (bitrate scalable), without quantizer dropout (not bitrate scalable) and evaluated with a variable number of quantizers, or trained and evaluated at a fixed bitrate (bitrate specific). The bitrate-scalable model (a single model for all bitrates) does not lose any quality when compared to bitrate-specific models (a different model for each bitrate), thanks to quantizer dropout.

A State-of-the-Art Audio Codec
SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches the quality of EVS at 9.6 kbps, while using 3.2x–4x fewer bits. This means that encoding audio with SoundStream can provide a similar quality while using a significantly lower amount of bandwidth. Moreover, at the same bitrate, SoundStream outperforms the current version of Lyra, which is based on an autoregressive network. Unlike Lyra, which is already deployed and optimized for production usage, SoundStream is still at an experimental stage. In the future, Lyra will incorporate the components of SoundStream to provide both higher audio quality and reduced complexity.

SoundStream at 3kbps vs. state-of-the-art codecs. MUSHRA score is an indication of subjective quality (the higher the better).

The demonstration of SoundStream’s performance compared to Opus, EVS, and the original Lyra codec is presented in these audio examples, a selection of which are provided below.

Speech
[Audio examples: Reference, Lyra (3 kbps), Opus (6 kbps), EVS (5.9 kbps), SoundStream (3 kbps)]

Music
[Audio examples: Reference, Lyra (3 kbps), Opus (6 kbps), EVS (5.9 kbps), SoundStream (3 kbps)]

Joint Audio Compression and Enhancement
In traditional audio processing pipelines, compression and enhancement (the removal of background noise) are typically performed by different modules. For example, it is possible to apply an audio enhancement algorithm at the transmitter side, before audio is compressed, or at the receiver side, after audio is decoded. In such a setup, each processing step contributes to the end-to-end latency. Conversely, we design SoundStream in such a way that compression and enhancement can be carried out jointly by the same model, without increasing the overall latency. In the following examples, we show that it is possible to combine compression with background noise suppression, by activating and deactivating denoising dynamically (no denoising for 5 seconds, denoising for 5 seconds, no denoising for 5 seconds, etc.).

[Audio examples: original noisy audio and denoised output*]
* Demonstrated by turning denoising on and off every 5 seconds.

Conclusion
Efficient compression is necessary whenever one needs to transmit audio, whether when streaming a video, or during a conference call. SoundStream is an important step towards improving machine learning-driven audio codecs. It outperforms state-of-the-art codecs, such as Opus and EVS, can enhance audio on demand, and requires deployment of only a single scalable model, rather than many.

SoundStream will be released as a part of the next, improved version of Lyra. By integrating SoundStream with Lyra, developers can leverage the existing Lyra APIs and tools for their work, providing both flexibility and better sound quality. We will also release it as a separate TensorFlow model for experimentation.

Acknowledgments
The work described here was authored by Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund and Marco Tagliasacchi. We are grateful for all discussions and feedback on this work that we received from our colleagues at Google.

Categories
Misc

Explore the Latest in Omniverse Create: From Material Browsers to the Animation Sequencer

NVIDIA Omniverse Create 2021.3 is now available in open beta, delivering a new set of features for Omniverse artists, designers, developers, and engineers to enhance graphics and content creation workflows.

We sat down with Frank DeLise, Senior Director of Product Management for Omniverse, to get a tour of some of the exciting new features. Get an overview through the clips or view the entirety of the livestream here.

A Beginner’s Look at Omniverse

Let’s start with a quick overview of the Omniverse Platform.


Introduction to Omniverse Create

NVIDIA Omniverse Create is an app that allows users to assemble, light, simulate, and render large-scale scenes. It is built using NVIDIA Omniverse Kit, and the scene description and in-memory model are based on Pixar’s USD.

Omniverse Create can be used on its own or as a companion application alongside popular content creation tools in a connected, collaborative workflow. Omniverse Connectors, or plug-ins to applications, can provide real-time, synchronized feedback. Having an extra viewport with physically accurate path tracing and physics simulation greatly enhances any creative and design workflow.

Zero Gravity Mode, powered by PhysX 5

Frank shows us Zero Gravity, a physics-based manipulation tool built to make scene composition intuitive for creators. With physics interactions based on NVIDIA PhysX 5, users can now nudge, slide, bump, and push objects into position with no interpenetration. Zero Gravity easily makes objects solid, making precise positioning, scattering and grouping of objects a breeze. 

Features to Simplify Workflows
Next on our tour is a trio of features: 

  • Browser Extension: A new set of windows was added for easy browsing of assets, textures, materials, samples, and more.
  • Paint Scattering: Users can select assets and randomly scatter using a paint brush. The ability to flood fill areas with percentage ratios makes it easy to create lifelike environments with realistic variety.
  • Quick Search: Users can now search for anything within Create, including connected libraries, functions, and tools, by simply typing the name. Quick Search also uses skills to provide contextual suggestions, like suggesting an HDRI map after you place a dome light. It’s a highly extensible system and can be enhanced through AI integration.

Sun Study Simulations

Omniverse users can further explore lighting options with the Sun Study extension, which offers a quick way to review a model with accurate sunlight. When the Sun Study Timeline is invoked, it will appear on the bottom of the viewport and allow the user to “scrub” or “play” through a given day/night cycle. It even includes dynamic skies with animated clouds for added realism.

Animation and Sequencer

Animation gets a massive push forward with the addition of a sequencer and key framer.

The new sequencer enables users to assemble animations through clips, easily cut from one camera to another, apply motion data to characters, and add a soundtrack or sound effects. 

The key framer extension provides a user-friendly way of adding keyframes and animations to prims in your scenes.

UsdShade Graphic Editor for Material Definition Language (MDL)

New with Create 2021.3 is the UsdShade graph editor for Material Definition Language (MDL) materials. Provided with the Material Graph is a comprehensive list of MDL BSDFs and functions. Materials and functions are represented as drag-and-drop nodes in the Material Graph Node List. Now, you can easily create custom materials by connecting shading nodes together and storing them in USD.

OpenVDB Support, Accelerated by NanoVDB

Support for OpenVDB volumes has also been added, making use of NanoVDB for acceleration. This feature helps artists visualize volumetric data created with applications like SideFX Houdini or Autodesk Bifrost.

What is Omniverse Create versus Omniverse View?

Lastly, Frank finishes our tour with an explanation of Omniverse View compared to Omniverse Create.

To learn more, look at the new features in Omniverse View 2021.3.

More Resources

  • Watch the full recording for coverage of additional features including installation using the launcher, USDZ and point cloud support, version control, Iray rendering, payloads, and more! 
  • You can get more details about the latest Create and View apps by reading the release notes in our online documentation.
  • Download the Omniverse Open Beta today and explore these new features!
  • Join us live on Twitch for interactive answers to your questions. 
  • Visit our forums or Discord server to discuss features or seek assistance.
  • Binge watch our tutorials for a deep dive into Omniverse Create and Omniverse View. 
Categories
Misc

Hooked on a Feeling: GFN Thursday Brings ‘NARAKA: BLADEPOINT’ to GeForce NOW

Calling all warriors. It’s a glorious week full of new games. This GFN Thursday comes with the exciting release of the new battle royale NARAKA: BLADEPOINT, as well as the Hello Neighbor franchise as part of the 11 great games joining the GeForce NOW library this week. Plus, the newest Assassin’s Creed Valhalla DLC has …

Categories
Misc

If I compile tensorflow from source, will it run faster than if I install it with pip?

During the configuration before compilation, it asks what CUDA capability your graphics card has (if you enable CUDA), so wouldn’t that mean that if I compile it myself and select the correct capability, it will be a better fit for my graphics card than the generic tensorflow-gpu package?

submitted by /u/NotSamar

Categories
Misc

Looking for beginner regression exercise

Hey guys, I have just gotten started with TensorFlow regression and now I want to do some practice. Can you guys suggest any simple datasets for me to practice on? Are there ‘beginner’ datasets on Kaggle?

submitted by /u/nyyirs

Categories
Misc

NVIDIA Supercharges Precision Timing for Facebook’s Next-Generation Time Keeping

NVIDIA ConnectX NIC enables precise timekeeping for social network’s mission-critical distributed applications

Facebook is open-sourcing the Open Compute Project Time Appliance Project (OCP TAP), which provides very precise time keeping and time synchronization across data centers in a cost-effective manner. The solution includes a Time Card that can turn almost any commercial off-the-shelf (COTS) server into an accurate time appliance, enabled by NVIDIA ConnectX-6 Dx network interface cards (NICs) with precision time protocol support, which share that precise timekeeping with other servers across the data center.

The combination of Facebook’s Time Card and NVIDIA’s NIC gives data center operators a modern, affordable, time synchronization solution that is open-sourced, secure, reliable, and scalable.

Why Accurate Time Matters in the Data Center

As applications scale out and IT operations span the globe, keeping data synchronized across different servers within a data center, or across data centers on different continents, becomes both more important and more difficult. If a database is distributed, it must track the exact order of events to maintain consistency and show causality. If two people try to buy the same stock, fairness (and compliance) requires knowing with certainty which order arrived first. Likewise, when thousands of people post content and millions of users like/laugh/love those posts every hour, Facebook needs to know the actual order in which each post, thumbs up, reply or emoji happened.

One way to keep data synchronized is to have each data center send its updates to the others after each transaction, but this rapidly becomes untenable because the latency between data centers is too high to support millions of events per hour.

A better way is to have each server and data center synchronized to the exact time, within less than a microsecond of each other. This enables each site to keep track of time, and when they share events with other data centers, the ordering of each event is already correct.

The more accurate the time sync, the faster the performance of the applications. A recent test showed that making the timekeeping 80x more precise (making any time discrepancies 80x smaller) made a distributed database run 3x faster — an incredible performance boost on the same server hardware, just from keeping more accurate and more reliable time.

The Role of the NIC and Network in Time Synchronization

The OCP TAP project (and Facebook’s blog post on Open Sourcing the Time Appliance) defines exactly how the Time Card receives and processes time signals from a GPS satellite network, keeps accurate time even when the satellite signal is temporarily unavailable, and shares this accurate time with the time server. But the networking — and the network card used — is also of critical importance.

Figure 1. The OCP Time Card maintains accurate time and shares it with a NIC that supports PPS in/out, such as the NVIDIA ConnectX-6 Dx (source: Facebook engineering blog).

The NIC in the time appliance must have a pulse-per-second (PPS) port to connect to the Time Card. This ensures exact time synchronization between the Time Card and NIC in each Time Server, accurate to within a few nanoseconds. ConnectX-6 Dx is one of the first modern 25/50/100/200 Gb/s NICs to support this. It also filters and checks the incoming PPS signal and maintains time internally using hardware in its ASIC to ensure accuracy and consistency.

Time Appliances with sub-microsecond accurate timing can share that timing with hundreds of regular servers using the network time protocol (NTP) or tens of thousands of servers using the precision time protocol (PTP). Since the network adds latency to the time signal, NTP and PTP timestamp packets to measure the travel time in both directions, factor in jitter and latency, and calculate the correct time on each server (PTP is far more accurate so it is starting to displace NTP).
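
As a sketch of the idea behind both protocols, here is the classic four-timestamp calculation in Python. The requester records its send and receive times (t1, t4) and the time server records its receive and send times (t2, t3); offset and delay are recovered under the usual assumption that the forward and return paths are symmetric:

def clock_offset_and_delay(t1, t2, t3, t4):
    """t1: client send, t2: server receive, t3: server send, t4: client receive.
    Assumes the outbound and return network paths add equal latency."""
    offset = ((t2 - t1) + (t3 - t4)) / 2   # how far the client clock is behind the server
    delay = (t4 - t1) - (t3 - t2)          # round-trip time spent on the network
    return offset, delay

# Example: the client clock runs 250 ns behind and one-way latency is 1 microsecond.
offset, delay = clock_offset_and_delay(t1=9.75e-6, t2=11.0e-6, t3=11.5e-6, t4=12.25e-6)
print(offset, delay)   # 2.5e-07 s offset, 2e-06 s round-trip delay

The less jitter there is in those four timestamps, the better the offset estimate, which is why hardware timestamping matters so much.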

Figure 2. The NVIDIA ConnectX-6 Dx with PPS in/out ports enables direct time synchronization with the Time Card. It also performs precision timestamping of packets in hardware.

An alternative is to timestamp packets in software, but at today’s network speeds software timestamping is too unpredictable and inaccurate, or even impossible, varying by up to milliseconds due to congestion or CPU distractions. Instead, the ConnectX-6 Dx NIC and BlueField-2 DPU apply hardware timestamps to inbound packets as soon as they arrive and to outbound packets right before they hit the network, at speeds up to 100 Gb/s. ConnectX-6 Dx can timestamp every packet with less than 4 nanoseconds (4 ns) of variance in time stamping precision, even under heavy network loads. Most other time-capable NICs stamp only some packets and show a much greater variance in precision, becoming less precise with their timestamps when network traffic is heavy.

NVIDIA networking delivers the most precise latency measurements available from a commercial NIC, leading to the most accurate time across all the servers, with application time variance typically lower than one microsecond.

Figure 3. Deploying NTP or PTP with OCP Time Servers and NVIDIA NICs or DPUs propagates extremely accurate time to all servers across the data center.

Accurate Time Synchronization, for Everyone

The OCP Time Appliance Project makes time keeping precise, accurate, and accessible to any organization. The Open Time Servers and open management tools from Facebook, NVIDIA, and OCP provide an easy-to-adopt recipe that everyone can use, just like a hyperscaler.

NVIDIA provides precision time-capable NICs and data processing units (DPUs) that deliver the ultra-precise timestamping and network synchronization features needed for precision timing appliances. If the BlueField DPU is used, it can run the PTP stack on its Arm cores, isolating the time stack from other server software while continuously verifying the accuracy of time within that server and calculating the maximum time error bound across the data center.

Cloud services and databases are already adding new time-based commands and APIs to take advantage of better time servers and time synchronization. Together, these advances enable a new era of more accurate time keeping that can improve the performance of distributed applications and enable new types of solutions in both cloud and enterprise.

Specifics about OCP TAP, including specifications, schematics, mechanics, bill of materials, and source code can be found at www.ocptap.com.

Categories
Misc

Optimizing DX12 Resource Uploads to the GPU Using CPU-Visible VRAM

How to optimize DX12 resource uploads from the CPU to the GPU over the PCIe bus is an old problem with many possible solutions, each with their pros and cons. In this post, I show how moving cherry-picked DX12 UPLOAD heaps to CPU-Visible VRAM (CVV) using NVAPI can be a simple solution to speed up PCIe-limited workloads.

CPU-Visible VRAM: A new tool in the toolbox

Take the example of a vertex buffer (VB) upload, for which the data cannot be reused across frames. The simplest way to upload a VB to the GPU is to read the CPU memory directly from the GPU:

  • First, the application creates a DX12 UPLOAD heap, or an equivalent CUSTOM heap. DX12 UPLOAD heaps are allocated in system memory, also known as CPU memory, with WRITE_COMBINE (WC) pages optimized for CPU writes. The CPU writes the VB data to this system memory heap first.
  • Second, the application binds the VB within the UPLOAD heap to a GPU draw command, by using an IASetVertexBuffers command.

When the draw executes in the GPU, vertex shaders are launched. Next, the vertex attribute fetch (VAF) unit reads the VB data through the GPU’s L2 cache, which itself loads the VB data from the DX12 UPLOAD heap stored in system memory:

The CPU writes to System Memory through the CPU Write-Combining Cache. The VAF unit fetches data from System Memory via the PCIe Bus and the GPU L2 Cache.
Figure 1. Fetching a VB directly from a DX12 UPLOAD heap.

L2 accesses from system memory have high latency, so it is preferable to hide that latency by copying the data from system memory to VRAM before the draw command is executed.

The preupload from CPU to GPU can be done by using a copy command, either asynchronously by using a COPY queue, or synchronously on the main DIRECT queue.

The CPU writes to System Memory through the CPU Write-Combining Cache. A DX12 Copy command then copies the data from System Memory to VRAM over the PCIe bus. Finally, the VAF unit fetches the data from VRAM through the GPU L2 cache in a Draw command.
Figure 2. Preloading a VB to VRAM using a copy command

Copy engines can execute copy commands in a COPY queue concurrently with other GPU work, and multiple COPY queues can be used concurrently. One problem with using async COPY queues, though, is that you must synchronize the queues with DX12 fences, which can be complicated to implement and can add significant overhead.

In the GTC 2021 session The Next Level of Optimization Advice with Nsight Graphics: GPU Trace, we announced that an alternative solution for DX12 applications on NVIDIA GPUs is to effectively use a CPU thread as a copy engine. This can be achieved by creating the DX12 UPLOAD heap in CVV by using NVAPI. CPU writes to this special UPLOAD heap are then forwarded directly to VRAM, over the PCIe bus (Figure 3).

The CPU writes to CPU-Visible VRAM through the CPU WC Cache and the PCIe Bus directly. The VAF unit then fetches the data from VRAM through the GPU L2 cache.
Figure 3. Preloading a VB to VRAM using CPU writes in a CPU thread

For DX12, the following NVAPI functions are available for querying the amount of CVV available in the system, and for allocating heaps of this new flavor (CPU-writable VRAM, with fast CPU writes and slow CPU reads):

  • NvAPI_D3D12_QueryCpuVisibleVidmem
  • NvAPI_D3D12_CreateCommittedResource
  • NvAPI_D3D12_CreateHeap2

These new functions require recent drivers: 466.11 or later.

NvAPI_D3D12_QueryCpuVisibleVidmem reports the amount of CVV memory available on the system; as noted in the Conclusion, this is 256 MB on NVIDIA RTX 20xx and 30xx GPUs running Windows 11.

Detecting and quantifying GPU performance-gain opportunities from CPU-Visible VRAM using Nsight Graphics

The GPU Trace tool within NVIDIA Nsight Graphics 2021.3 makes it easy to detect GPU performance-gain opportunities. When Advanced Mode is enabled, the Analysis panel within GPU Trace color codes the perf markers within the frame based on the projected percentage of frame time that fixing a specific issue in that GPU workload would save.

Here’s how it looks for a frame from a prerelease build of Watch Dogs: Legion (DX12), on an NVIDIA RTX 3080, after choosing Analyze:

A screenshot from the GPU Trace Analysis tool showing a breakdown of the GPU frame time by marker. The left-side panel shows the marker tree. The bottom panel shows GPU metrics and detected performance opportunities for the selected marker (by default for the whole frame).
Figure 4. The GPU Trace Analysis tool with color-coded GPU workloads
(the greener, the higher the projected gain on the frame).

Now, selecting a user interface draw command at the end of the frame, the analysis tool shows that there is a 0.9% projected reduction in the GPU frame time from fixing the L2 Misses To System Memory performance issue. The tool also shows that most of the system memory traffic transiting through the L2 cache is requested by the Primitive Engine, which includes the vertex attribute fetch unit:

The L1 L2 tab in the bottom panel shows that L2 Misses To System Memory were detected to be a performance opportunity, with a 0.20 ms projected gain.
Figure 5. GPU Trace Analysis tool, focusing on a single workload.

By allocating the VB of this draw command in CVV instead of system memory using a regular DX12 UPLOAD heap, the GPU time for this regime went from 0.2 ms to under 0.01 ms. The GPU frame time was also reduced by 0.9%. The VB data is now fetched directly from VRAM in this workload:

The bottom panel shows the L2 requested sectors by aperture, with 97.5% being in aperture VRAM.
Figure 6. GPU Trace Analysis tool, after having optimized the workload.

Avoiding CPU reads from CPU-Visible VRAM using Nsight Systems

Regular DX12 UPLOAD heaps are not supposed to be read by the CPU, only written to. Like the regular heaps, CPU memory pages for CVV heaps have write combining enabled. That provides fast CPU write performance, but slow uncached CPU read performance. Moreover, because CPU reads from CVV make a round trip through PCIe, GPU L2, and VRAM, the latency of reads from CVV is much greater than the latency of reads from regular DX12 UPLOAD heaps.

To detect whether an application’s CPU performance is negatively impacted by CPU reads from CVV, and to get information on which CPU calls are causing that, I recommend using Nsight Systems 2021.3.

Example 1: CVV CPU Reads through ReadFromSubresource

Here’s an example of a disastrous CPU read from a DX12 ReadFromSubresource call, in an Nsight Systems trace. For capturing this trace, I enabled the new Collect GPU metrics option in the Nsight Systems project configuration, along with the default settings, which include Sample target process.

Here is what Nsight Systems shows after zooming in on one representative frame:

Figure 7. Nsight Systems showing a 2.6 ms ReadFromSubresource call in a CPU thread correlated with high PCIe Read Request Counts from BAR1.

In this case (a single-GPU machine), the PCIe Read Requests to BAR1 GPU metric in Nsight Systems measures the number of CPU read requests sent to PCIe for a resource allocated in CVV (BAR1 aperture). Nsight Systems shows a clear correlation between a long DX12 ReadFromSubresource call on a CPU thread and a high number of PCIe read requests from CVV. So you can conclude that this call is most likely doing a CPU readback from CVV, and fix that in the application.

Example 2: CVV CPU reads from a mapped pointer

CPU reads from CVV are not limited to DX12 commands. They can happen in any CPU thread when using any CPU memory pointer returned by a DX12 resource Map call. That is why using Nsight Systems is recommended for debugging them, because Nsight Systems can periodically sample call stacks per CPU thread, in addition to selected GPU hardware metrics.

Here is an example of Nsight Systems showing CPU reads from CVV correlated with no DX12 API calls, but with the start of a CPU thread activity:

Nsight Systems showing GPU metric graphs and CPU thread activities.
Figure 8. Nsight Systems showing correlation between a CPU thread doing a Map call and PCIe read requests to BAR1 increasing right after.

By hovering over the orange sample points right under the CPU thread, you see that this thread is executing a C++ method named RenderCollectedTrees, which can be helpful to locate the code that is doing read/write operations to the CVV heap:

Nsight Systems showing GPU metric graphs and CPU thread activities.
Figure 9. Nsight Systems showing a call stack sample point for the CPU thread that is correlated to the high PCIe read requests to BAR1.

One way to improve the performance in this case would be to perform the read/write accesses to a separate chunk of CPU memory, not in a DX12 UPLOAD heap. When all read/write updates are finished, do a memcpy call from the CPU read/write memory to the UPLOAD heap.

Conclusion

All PC games running on Windows 11 PCs can use 256 MB of CVV on NVIDIA RTX 20xx and 30xx GPUs. NVAPI can be used to query the total amount of available CVV memory in the system and to allocate DX12 memory in this space. This makes it possible to replace DX12 UPLOAD heaps with CVV heaps by simply changing the code that allocates the heap, if the CPU never reads from the original DX12 UPLOAD heap.

To detect GPU performance-gain opportunities from moving a DX12 UPLOAD heap to CVV, I recommend using the GPU Trace Analysis tool, which is part of Nsight Graphics. To detect and debug CPU performance loss from reading from CVV, I recommend using Nsight Systems with its GPU metrics enabled.

Acknowledgments

I would like to acknowledge the following NVIDIA colleagues, who have contributed to this post: Avinash Baliga, Dana Elifaz, Daniel Horowitz, Patrick Neill, Chris Schultz, and Venkatesh Tammana.

Categories
Offsites

Demonstrating the Fundamentals of Quantum Error Correction

The Google Quantum AI team has been building quantum processors made of superconducting quantum bits (qubits) that have achieved the first beyond-classical computation, as well as the largest quantum chemical simulations to date. However, current generation quantum processors still have high operational error rates — in the range of 10⁻³ per operation, compared to the 10⁻¹² believed to be necessary for a variety of useful algorithms. Bridging this tremendous gap in error rates will require more than just making better qubits — quantum computers of the future will have to use quantum error correction (QEC).

The core idea of QEC is to make a logical qubit by distributing its quantum state across many physical data qubits. When a physical error occurs, one can detect it by repeatedly checking certain properties of the qubits, allowing it to be corrected, preventing any error from occurring on the logical qubit state. While logical errors may still occur if a series of physical qubits experience an error together, this error rate should exponentially decrease with the addition of more physical qubits (more physical qubits need to be involved to cause a logical error). This exponential scaling behavior relies on physical qubit errors being sufficiently rare and independent. In particular, it’s important to suppress correlated errors, where one physical error simultaneously affects many qubits at once or persists over many cycles of error correction. Such correlated errors produce more complex patterns of error detections that are more difficult to correct and more easily cause logical errors.

Our team has recently implemented the ideas of QEC in our Sycamore architecture using quantum repetition codes. These codes consist of one-dimensional chains of qubits that alternate between data qubits, which encode the logical qubit, and measure qubits, which we use to detect errors in the logical state. While these repetition codes can only correct for one kind of quantum error at a time¹, they contain all of the same ingredients as more sophisticated error correction codes and require fewer physical qubits per logical qubit, allowing us to better explore how logical errors decrease as logical qubit size grows.

In “Removing leakage-induced correlated errors in superconducting quantum error correction”, published in Nature Communications, we use these repetition codes to demonstrate a new technique for reducing the amount of correlated errors in our physical qubits. Then, in “Exponential suppression of bit or phase flip errors with repetitive error correction”, published in Nature, we show that the logical errors of these repetition codes are exponentially suppressed as we add more and more physical qubits, consistent with expectations from QEC theory.

Layout of the repetition code (21 qubits, 1D chain) and distance-2 surface code (7 qubits) on the Sycamore device.

Leaky Qubits
The goal of the repetition code is to detect errors on the data qubits without measuring their states directly. It does so by entangling each pair of data qubits with their shared measure qubit in a way that tells us whether those data qubit states are the same or different (i.e., their parity) without telling us the states themselves. We repeat this process over and over in rounds that last only one microsecond. When the measured parities change between rounds, we’ve detected an error.
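
As a toy classical analogue of that detection scheme (ignoring, of course, that real measure qubits extract the parity without collapsing the data qubits), each "measure" position below reports the parity of its two neighbors, and a detection is a parity that changed since the previous round:

import numpy as np

rng = np.random.default_rng(1)
n_data, rounds, p_flip = 11, 8, 0.05         # illustrative numbers, not the experiment's

data = np.zeros(n_data, dtype=int)           # logical state encoded across the chain
prev_parity = np.zeros(n_data - 1, dtype=int)
for r in range(rounds):
    # Each data qubit flips independently with probability p_flip this round.
    data ^= (rng.random(n_data) < p_flip).astype(int)
    # Each "measure qubit" reports whether its two neighbors agree (their parity).
    parity = data[:-1] ^ data[1:]
    detections = parity ^ prev_parity        # a detection = a parity that changed
    prev_parity = parity
    print(f"round {r}: detections at positions {np.flatnonzero(detections)}")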

However, one key challenge stems from how we make qubits out of superconducting circuits. While a qubit needs only two energy states, which are usually labeled |0⟩ and |1⟩, our devices feature a ladder of energy states, |0⟩, |1⟩, |2⟩, |3⟩, and so on. We use the two lowest energy states to encode our qubit with information to be used for computation (we call these the computational states). We use the higher energy states (|2⟩, |3⟩ and higher) to help achieve high-fidelity entangling operations, but these entangling operations can sometimes allow the qubit to “leak” into these higher states, earning them the name leakage states.

Population in the leakage states builds up as operations are applied, which increases the error of subsequent operations and even causes other nearby qubits to leak as well — resulting in a particularly challenging source of correlated error. In our early 2015 experiments on error correction, we observed that as more rounds of error correction were applied, performance declined as leakage began to build.

Mitigating the impact of leakage required us to develop a new kind of qubit operation that could “empty out” leakage states, called multi-level reset. We manipulate the qubit to rapidly pump energy out into the structures used for readout, where it will quickly move off the chip, leaving the qubit cooled to the |0⟩ state, even if it started in |2⟩ or |3⟩. Applying this operation to the data qubits would destroy the logical state we’re trying to protect, but we can apply it to the measure qubits without disturbing the data qubits. Resetting the measure qubits at the end of every round dynamically stabilizes the device so leakage doesn’t continue to grow and spread, allowing our devices to behave more like ideal qubits.

Applying the multi-level reset gate to the measure qubits almost totally removes leakage, while also reducing the growth of leakage on the data qubits.

Exponential Suppression
Having mitigated leakage as a significant source of correlated error, we next set out to test whether the repetition codes give us the predicted exponential reduction in error when increasing the number of qubits. Every time we run our repetition code, it produces a collection of error detections. Because the detections are linked to pairs of qubits rather than individual qubits, we have to look at all of the detections to try to piece together where the errors have occurred, a procedure known as decoding. Once we’ve decoded the errors, we then know which corrections we need to apply to the data qubits. However, decoding can fail if there are too many error detections for the number of data qubits used, resulting in a logical error.
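
To build intuition for why adding qubits suppresses logical errors, here is a toy classical simulation: independent bit flips corrected by majority vote. This is far simpler than decoding real detection events, but it shows the same exponential trend when the physical error rate is low and errors are independent:

import numpy as np

def logical_error_rate(n_data, p_flip, trials=200_000, seed=0):
    """Probability that majority vote over n_data noisy copies of a bit decodes wrongly."""
    rng = np.random.default_rng(seed)
    flips = rng.random((trials, n_data)) < p_flip
    return float(np.mean(flips.sum(axis=1) > n_data // 2))

for n in (3, 5, 7, 9, 11):
    print(n, logical_error_rate(n, p_flip=0.05))
# Each step up in code size divides the logical error rate by a roughly constant
# factor, the same qualitative behavior that Lambda quantifies below.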

To test our repetition codes, we run codes with sizes ranging from 5 to 21 qubits while also varying the number of error correction rounds. We also run two different types of repetition codes — either a phase-flip code or bit-flip code — that are sensitive to different kinds of quantum errors. By finding the logical error probability as a function of the number of rounds, we can fit a logical error rate for each code size and code type. In our data, we see that the logical error rate does in fact get suppressed exponentially as the code size is increased.

Probability of getting a logical error after decoding versus number of rounds run, shown for various sizes of phase-flip repetition code.

We can quantify the error suppression with the error scaling parameter Lambda (Λ), where a Lambda value of 2 means that we halve the logical error rate every time we add four data qubits to the repetition code. In our experiments, we find Lambda values of 3.18 for the phase-flip code and 2.99 for the bit-flip code. We can compare these experimental values to a numerical simulation of the expected Lambda based on a simple error model with no correlated errors, which predicts values of 3.34 and 3.78 for the bit- and phase-flip codes respectively.

Logical error rate per round versus number of qubits for the phase-flip (X) and bit-flip (Z) repetition codes. The line shows an exponential decay fit, and Λ is the scale factor for the exponential decay.

This is the first time Lambda has been measured in any platform while performing multiple rounds of error detection. We’re especially excited about how close the experimental and simulated Lambda values are, because it means that our system can be described with a fairly simple error model without many unexpected errors occurring. Nevertheless, the agreement is not perfect, indicating that there’s more research to be done in understanding the non-idealities of our QEC architecture, including additional sources of correlated errors.

What’s Next
This work demonstrates two important prerequisites for QEC: first, the Sycamore device can run many rounds of error correction without building up errors over time thanks to our new reset protocol, and second, we were able to validate QEC theory and error models by showing exponential suppression of error in a repetition code. These experiments were the largest stress test of a QEC system yet, using 1000 entangling gates and 500 qubit measurements in our largest test. We’re looking forward to taking what we learned from these experiments and applying it to our target QEC architecture, the 2D surface code, which will require even more qubits with even better performance.


¹ A true quantum error correcting code would require a two-dimensional array of qubits in order to correct for all of the errors that could occur.