
Join the Virtual MONAI Bootcamp, Sept. 22-24

Apply for the Sept. 22-24 MONAI virtual bootcamp featuring presentations, hands-on labs, and a mini-challenge day.

Due to the success of the 2020 MONAI Virtual Bootcamp, MONAI is hosting another Bootcamp this year from September 22 to September 24, 2021—the week before MICCAI.

The MONAI Bootcamp will be a three-day virtual event with presentations, hands-on labs, and a mini-challenge day. Basic knowledge of deep learning and Python programming is encouraged but not required.

Everyone is welcome to join and learn more about MONAI!

With the growth of MONAI, there will be content for everyone. 

  • Day one begins with a beginner-friendly introduction to MONAI, for those just getting started with deep learning in medical imaging.  
  • Day two focuses on more advanced topics in MONAI and expands on other releases, like MONAI Label and the upcoming MONAI Deploy project. 
  • Day three consists of a series of increasingly difficult mini-challenges, with beginner-friendly challenges and some that hopefully challenge even experienced researchers.

Find a tentative schedule below. A more detailed agenda will be available closer to the event.

Agenda

Day 1: September 22, 2021, 7:30 am – 12:30 pm PST
Welcome and Introductions
What is MONAI? 
Lab 1 – Getting started with MONAI
Lightning Talks
Lab 2 – MONAI Deep Dive
Lab 3 – End-to-End Workflow with MONAI

Day 2: September 23, 2021, 7:30 am – 12:30 pm PST
Opening Remarks and Overview
MONAI – Advanced Topics on Medical Imaging
MONAI Label
MONAI Deploy

Day 3: September 24, 2021, 7:30 am – 12:30 pm PST
MONAI Mini-Challenges Day


The deadline to register is September 8. Apply today!

One-click Deployment of NVIDIA Triton Inference Server to Simplify AI Inference on Google Kubernetes Engine (GKE)

NVIDIA and Google Cloud have collaborated to make it easier for enterprises to take AI to production by combining the power of NVIDIA Triton Inference Server with Google Kubernetes Engine (GKE).

The rapid growth in artificial intelligence is driving up the size of data sets, as well as the size and complexity of networks. AI-enabled applications like e-commerce product recommendations, voice-based assistants, and contact center automation require tens to hundreds of trained AI models. Inference serving helps infrastructure managers deploy, manage and scale these models with a guaranteed real-time quality-of-service (QoS) in production. Additionally, infrastructure managers look to provision and manage the right compute infrastructure on which to deploy these AI models, with maximum utilization of compute resources and flexibility to scale up or down to optimize operational costs of deployment. Taking AI to production is both an inference serving and infrastructure management challenge.

NVIDIA and Google Cloud have collaborated to make it easier for enterprises to take AI to production by combining the power of NVIDIA Triton Inference Server, a universal inference serving platform for CPUs and GPUs, with Google Kubernetes Engine (GKE), a managed environment to deploy, scale, and manage containerized AI applications on secure Google infrastructure.

Inference Serving on CPUs and GPUs on Google Cloud with NVIDIA Triton Inference Server

Operationalizing AI models within enterprise applications poses a number of challenges: serving models trained in multiple frameworks, handling different types of inference queries, and building a serving solution that can be optimized across multiple deployment platforms such as CPUs and GPUs.

Triton Inference Server addresses these challenges by providing a single standardized inference platform that can deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, OpenVINO or a custom C++/Python framework), from local storage or Google Cloud’s managed storage on any GPU- or CPU-based infrastructure.

Figure 1. Triton Inference Server deployed on Google Kubernetes Engine (GKE)

One-Click Deployment of NVIDIA Triton Inference Server on GKE Clusters

Triton on Google Kubernetes Engine (GKE) delivers the benefit of a universal inference serving platform for AI models deployed on both CPUs and GPUs combined with the ease of Kubernetes cluster management, load balancing, and auto scaling compute based on demand.

Triton can be seamlessly deployed as a containerized microservice on a Google Kubernetes Engine (GKE) managed cluster using the new One-Click Triton Inference Server App for GKE on Google Marketplace.

The Triton Inference Server App for GKE is a helm chart deployer that automatically installs and configures Triton for use on a GKE cluster with NVIDIA GPU node pools, including the NVIDIA A100 Tensor Core GPUs and NVIDIA T4 Tensor Core GPUs, and leverages Istio on Google Cloud for traffic ingress and load balancing. It also includes a horizontal pod autoscaler (HPA), which relies on the Stackdriver custom metrics adapter to monitor GPU duty cycle and autoscale the GPU nodes in the GKE cluster based on inference queries and SLA requirements.

To learn more about the One-Click Triton Inference Server in Google Kubernetes Engine (GKE), check out this in-depth blog by Google Cloud and NVIDIA and see how the solution scales to meet stringent latency budgets and optimizes operational costs for your AI deployments.

You can also register for the “Building a Computer Vision Service Using NVIDIA NGC and Google Cloud” webinar on August 25 to learn how to build an end-to-end computer vision service on Google Cloud by combining NVIDIA GPU-optimized pretrained models and the Transfer Learning Toolkit (TLT) from the NGC Catalog with the Triton Inference Server App for GKE.

Accelerating IO in the Modern Data Center: Magnum IO Storage

This is the fourth post in the Accelerating IO series. It addresses storage issues and shares recent results and directions with our partners. We cover the new GPUDirect Storage release, benefits, and implementation.

Accelerated computing needs accelerated IO. Otherwise, computing resources get starved for data. Given that the fraction of all workflows for which data fits in memory is shrinking, optimizing storage IO is of increasing importance. The value of stored data, efforts to pilfer or corrupt data, and regulatory requirements to protect it are also all ratcheting up. To that end, there is growing demand for data center infrastructure that can provide greater isolation of users from data that they shouldn’t access.

GPUDirect Storage

GPUDirect Storage streamlines the flow of data between storage and GPU buffers for applications that consume or produce data on the GPU without needing CPU processing. No extra copies that add latency and impede bandwidth are needed. This simple optimization leads to game-changing role reversals where data can be fed to GPUs faster from remote storage rather than CPU memory.

The newest member of the GPUDirect family

The GPUDirect family of technologies enables access and efficient data movement into and out of the GPU. Until recently, it was focused on memory-to-memory transfers. With the addition of GPUDirect Storage (GDS), access and data movement with storage are also accelerated. GPUDirect Storage makes the significant step of adding file IO between local and remote storage to CUDA.

Release v1.0 with CUDA 11.4

GPUDirect Storage has been vetted for more than two years and is currently available as production software. Previously available only through a separate installation, GDS is now incorporated into CUDA version 11.4 and later, and it can be either part of the CUDA installation or installed separately. For an installation of CUDA version X-Y, the libcufile-X-Y.so user library and gds-tools-X-Y are installed by default, and the nvidia-fs.ko kernel driver is an optional install. For more information, see the GDS troubleshooting and installation documentation.

GDS is now available in RAPIDS. It is also available in a PyTorch container and an MXNet container.

GDS description and benefits

GPUDirect Storage enables a direct datapath between storage and GPU memory. Data is moved using the direct memory access (DMA) engine in local NVMe drives or in a NIC that communicates with remote storage.

Use of that DMA engine means that, although the setup of the DMA is a CPU operation, the CPU and GPU are totally uninvolved in the datapath, leaving them free and unencumbered (Figure 1). On the left, data from storage comes in through a PCIe switch, goes through the CPU to system memory and all the way back down to the GPU. On the right, the datapath skips the CPU and system memory. The benefits are summarized at the bottom.

Figure 1. Datapath from storage to the GPU without GPUDirect Storage (left, through the CPU and system memory) and with GPUDirect Storage (right, direct to the GPU).

Without GPUDirect Storage:

  • Limited by bandwidth into and out of the CPU.
  • Incurs the latency of a CPU bounce buffer.
  • Memory capacity is limited to O(1 TB).
  • Storage is not part of CUDA.
  • No topology-based optimization.

With GPUDirect Storage:

  • Bandwidth into GPUs limited only by the NICs.
  • Lower latency due to direct copy.
  • Access to O(PB) capacity.
  • Simple CUDA programming model.
  • Adaptive routing through NVLink and GPU buffers.

GPUDirect Storage offers three basic performance benefits:

  • Increased bandwidth: By removing the need to go through a bounce buffer in the CPU, alternate paths become available on some platforms, including those that may offer higher bandwidth through a PCIe switch or over NVLink. DGX platforms have both PCIe switches and NVLink, but not all platforms do; we recommend using both where available to maximize performance. The Mars lander example achieved an 8x bandwidth gain.
  • Decreased latency: Reduce latency by avoiding the delay of an extra copy through CPU memory and the overhead of managing memory, which can be severe in extreme cases. A 3x reduction in latency is common.
  • Decreased CPU utilization: Use of a bounce buffer introduces extra operations on the CPU, both to perform the extra copy and to manage the memory buffers. When CPU utilization becomes the bottleneck, effective bandwidth can drop significantly. We’ve measured 3x improvements in CPU utilization with multiple file systems.

Without GDS, there’s only one available datapath: from storage to the CPU and from the CPU to the relevant GPU with cudaMemcpy. With GDS, there are additional optimizations available:

  • The CPU threads used to interact with the DMA engine are affinitized to the closest CPU core.
  • If the storage and GPU hang off different sockets and NVLink is an available connection, then data may be staged through a fast bounce buffer in the memory of a GPU near the storage, and then transferred using CUDA to the final GPU memory target buffer. This can be considerably faster than using the intersocket path, for example, UPI.
  • There is no cudaMemcpy involved to take care of segmenting the IO transfer to fit in the GPU BAR1 aperture, whose size varies by GPU SKU, or into prepinned buffers in case the target buffer is not pinned with cuFileBufRegister. These operations are managed with the libcufile.so user library code.
  • Unaligned accesses, where the offset of the data within the file to be transferred does not align with a page boundary, are handled.
  • In future GDS releases, the cuFile APIs will support asynchronous and batched operations. This enables a CUDA kernel to be sequenced after a read in the CUDA stream that provides inputs to that kernel, and a write to be sequenced after a kernel that produces data to be written. In time, cuFile APIs will be usable in the context of CUDA Graphs as well.

Table 1 shows the peak and measured bandwidths on NVIDIA DGX-2 and DGX A100 systems. This data shows that the achievable bandwidth into GPUs from local storage exceeds the maximum bandwidth from up to 1 TB of CPU memory in ideal conditions. Commonly measured bandwidths from petabytes of remote storage can be well more than double the bandwidth that CPU memory provides in practice.

Spilling data that won’t fit in GPU memory out to even petabytes of remote storage can exceed the achievable performance of paging it back to the 1 TB of memory in the CPU. This is a remarkable reversal of history.

Peak bandwidth:

Endpoint | DGX-2 (Gen3), GB/s | DGX A100 (Gen4), GB/s
CPU | 50 (peak) | 100 (peak)
Switch/GPU | 100 (peak) | 200* (peak)

Measured bandwidth:

Endpoint | Capacity | DGX-2 (Gen3), GB/s | DGX A100 (Gen4), GB/s
CPU sysmem | O(1 TB) | 48-50 @ 4 PCIe | 96-100 @ 4 PCIe
Local storage | O(100 TB) | 53+ @ 16 drives | 53+ @ 8 drives
RAID cards | O(100 TB) | 112 (MicroChip) @ 8 | N/A
NICs | O(1 PB) | 90+ @ 8 NICs | 185+ @ 8 NICs

Table 1. Access to petabytes of data is possible at bandwidths that exceed those to only 1 TB of CPU memory.

* Performance numbers shown here with NVIDIA GPUDirect Storage on NVIDIA DGX A100 slots 0-3 and 6-9 are not the officially supported network configuration and are for experimental use only. Sharing the same network adapters for both compute and storage may impact the performance of standard or other benchmarks previously published by NVIDIA on DGX A100 systems.

How GDS works

NVIDIA seeks to embrace existing standards wherever possible, and to judiciously extend them where necessary. The POSIX standard’s pread and pwrite provide copies between storage and CPU buffers, but do not yet enable copies to GPU buffers. This shortcoming of not supporting GPU buffers in the Linux kernel will be addressed over time.

A solution, called dma_buf, that enables copies among devices like a NIC or NVMe and GPU, which are peers on the PCIe bus, is in progress to address that gap. In the meantime, the performance upside from GDS is too large to wait for an upstreamed solution to propagate to all users. Alternate GDS-enabled solutions have been provided by a variety of vendors, including MLNX_OFED (Table 2). The GDS solution involves new APIs, cuFileRead or cuFileWrite, that are similar to POSIX pread and pwrite.
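
As a concrete illustration, here is a minimal sketch of reading a file directly into a GPU buffer with the cuFile APIs. The file path and transfer size are placeholders and error handling is omitted; see the cuFile API Reference Guide for the authoritative usage.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

int main()
{
    const size_t size = 1 << 20;                                   // placeholder transfer size (1 MiB)
    int fd = open("/mnt/nvme/data.bin", O_RDONLY | O_DIRECT);      // placeholder path; O_DIRECT is required for GDS

    cuFileDriverOpen();                                            // initialize the cuFile driver (talks to nvidia-fs.ko)

    // Register the file handle with cuFile
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    // Allocate the GPU target buffer and optionally pre-pin it for best performance
    void* devPtr = nullptr;
    cudaMalloc(&devPtr, size);
    cuFileBufRegister(devPtr, size, 0);

    // DMA directly from storage into GPU memory; no CPU bounce buffer is involved
    ssize_t bytesRead = cuFileRead(handle, devPtr, size, /*fileOffset=*/0, /*devPtrOffset=*/0);

    cuFileBufDeregister(devPtr);
    cudaFree(devPtr);
    cuFileHandleDeregister(handle);
    close(fd);
    cuFileDriverClose();
    return bytesRead >= 0 ? 0 : 1;
}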

Optimizations that are only available from GDS, like dynamic routing, use of NVLink, and async APIs for use in CUDA streams, make the cuFile APIs an enduring feature of the CUDA programming model, even after gaps in the Linux file system are addressed.

Here’s what the GDS implementation does. First, the fundamental problem with the current Linux implementation is passing a GPU buffer address as a DMA target down through the virtual file system (VFS) so that the DMA engine in a local NVMe or a network adapter can perform a transfer to or from GPU memory. This leads to an error condition. We have a way around this problem for now: Pass down an address for a buffer in CPU memory instead.

When the cuFile APIs like cuFileRead or cuFileWrite are used, the libcufile.so user-level library captures the GPU buffer address and substitutes a proxy CPU buffer address that’s passed to VFS. Just before the buffer address is used for a DMA, a call from a GDS-enabled driver to nvidia-fs.ko identifies the CPU buffer address and provides a substitute GPU buffer address again so that the DMA can proceed correctly.

The logic in libcufile.so performs the various optimizations described earlier, like dynamic routing, use of prepinned buffers, and alignment. Figure 2 shows the stack used for this optimization. The cuFile APIs are an example of the Magnum IO architectural principles of flexible abstraction that enable platform-specific innovation and optimization, like selective buffering and use of NVLink.

Figure 2. GDS software stack: the application uses the cuFile APIs, and the GDS-enabled storage drivers call out to the nvidia-fs.ko kernel driver to obtain the correct DMA address. The stack includes the application, the cuFile user library, the NVIDIA kernel driver, and standard or proprietary storage drivers.

To learn more

The GPUDirect Storage post was the original introduction to GPUDirect Storage. We recommend the NVIDIA GPUDirect Storage Design Guide to end customers and OEMs, and the NVIDIA GPUDirect Storage Overview Guide to end customers, OEMs, and partners. For more information about programming with GDS, see the cuFile API Reference Guide.

Software Ate the World — That Means Hardware Matters Again

Software ate the world; now new silicon is taking a seat at the table. Ten years ago venture capitalist Marc Andreessen proclaimed that “software is eating the world.” His once-radical concept, now a truism, is that innovation and corporate value creation lie in software. That led some to believe that hardware matters less.

NVIDIA Shines at 2021 SONiC PlugFest

Using the 100% open-source Pure SONiC, NVIDIA excelled in all the tests at the 2021 SONiC PlugFest.

The SONiC community came together for a unique virtual event to help define test requirements and evaluate the performance of Software for Open Networking in the Cloud (SONiC).

As an open-source network OS, SONiC runs on many different switches from multiple vendors. One of the problems with many open-source projects is that not all vendors are equally committed to them. It can be difficult and time-consuming to figure out the functionality and scale each vendor provides, as data sheets cannot be trusted.

To address this issue, AVIZ Networks and Keysight Technologies created the first annual SONiC PlugFest, where an independent group tests the interoperability, functionality, and scale of the many vendors’ implementations. The goal of the SONiC PlugFest was to test and confirm whether the currently available feature set of community SONiC is production-ready.

PlugFest testing

The PlugFest had a suite of test modules, the results of which fell into three general areas: features, scale, and operations. Results were divided into the categories of platform, management, layer 2 and 3, and system/ops.

Using the 100% open-source Pure SONiC, NVIDIA excelled in all the tests at the SONiC PlugFest. Other vendors had to provide proprietary versions.

The goals of SONiC include:

Decouple hardware from software

NVIDIA is committed to open source at all layers of the stack. Keeping the SONiC, SAI, and SDK APIs public removes potential vendor lock-in. In other words, NVIDIA Pure SONiC ensures that the freedom to choose the best ASIC, switch platform, and applications remains in the hands of the users.

Accelerate software evolution

NVIDIA is home to a team of professionals fluent in SONiC who craft a unique Pure SONiC release that is 100% based on upstream open-source code. NVIDIA is the only vendor that provides a production-hardened GitHub hash from which users can build and own their SONiC image.

Drive adoption via a strong ecosystem

At this point in the SONiC project, NVIDIA believes that open source has the power to accelerate innovation, and PlugFests like this one enable customers to test these solutions in real-world scenarios. NVIDIA will continue its mission to enhance SONiC features supporting the next generation of networking and to enable an ecosystem based on 100% upstream open-source code.


NVIDIA has qualified and deployed 100% open-source SONiC in many of the biggest data centers in the world. Now that the SONiC PlugFest has confirmed that Pure SONiC delivers everything promised, why consider a path other than Pure SONiC?

Considering deploying Pure SONiC in 2022?  Feel free to leave comments in the forum for discussion or feedback.

Read more about Pure SONiC >>

Unstung Heroes: Startup’s AI-Powered Tomato Pollinator Gives Bees a Break

There are nearly a half million acres of greenhouse tomato crops in the world, an area about 35 times the size of Manhattan. In other words, lots of tomatoes. Growing them requires more than soil, water and sunlight. The plants are self-pollinating, but they need a little help getting the pollen to drop onto the …

Encoding for DirectX 12 with NVIDIA Video Codec SDK 11.1

DirectX 12 is a low-level programming API from Microsoft that reduces driver overhead in comparison to its predecessors. DirectX 12 provides more flexibility and fine-grained control on the underlying hardware using command queues, command lists, and so on, which results in better resource utilization. You can take advantage of these functionalities and optimize your applications … Continued

DirectX 12 is a low-level programming API from Microsoft that reduces driver overhead in comparison to its predecessors. DirectX 12 provides more flexibility and fine-grained control on the underlying hardware using command queues, command lists, and so on, which results in better resource utilization. You can take advantage of these functionalities and optimize your applications and get better performance over earlier DirectX versions. At the same time, the application, on its own, must take care of resource management, synchronization, and so on.

More and more game titles and other graphics applications are adopting DirectX 12 APIs. Video Codec SDK 11.1 introduces DirectX 12 support for encoding on Windows 20H1 and later OS versions. This enables DirectX 12 applications to use NVENC across all generations of supported GPUs. The Video Codec SDK package contains the NVENCODEAPI headers, sample applications demonstrating their usage, and the programming guide for using the APIs. The sample application contains C++ wrapper classes, which can be reused or modified as required.

typedef struct _NV_ENC_FENCE_POINT_D3D12
{
    void*       pFence;       /**< [in]: Pointer to the ID3D12Fence object. */
    uint64_t    waitValue;    /**< [in]: Fence value to reach or exceed before the GPU operation. */
    uint64_t    signalValue;  /**< [in]: Fence value to set the fence to, after the GPU operation. */
    uint32_t    bWait:1;      /**< [in]: Wait on the fence before the GPU operation. */
    uint32_t    bSignal:1;    /**< [in]: Signal the fence after the GPU operation. */
    /* Reserved fields omitted; see nvEncodeAPI.h in the SDK for the complete definition. */
} NV_ENC_FENCE_POINT_D3D12;

The client application must also specify the input buffer format while initializing the NVENC.

Even though most of the parameters passed to the Encode picture API in DirectX 12 are the same as those in other interfaces, there are certain functional differences. Synchronization at the input (the client application writing to the input surface and NVENC reading it) and the output (NVENC writing the bitstream surface and the application reading it out) must be managed using fences. This is unlike previous DirectX interfaces, where synchronization was automatically taken care of by the OS runtime and driver.

In DirectX 12, additional information about fences and fence values is required as input parameters to the Encode picture API. These fences and fence values are used to synchronize the CPU-GPU and GPU-GPU operations. The application must send the following input and output struct pointers in NV_ENC_PIC_PARAMS::inputBuffer and NV_ENC_PIC_PARAMS::outputBitstream, containing the fences and fence values:

typedef struct _NV_ENC_INPUT_RESOURCE_D3D12
{
    NV_ENC_REGISTERED_PTR       pInputBuffer;
    NV_ENC_FENCE_POINT_D3D12    inputFencePoint;
    …
} NV_ENC_INPUT_RESOURCE_D3D12;

typedef struct _NV_ENC_OUTPUT_RESOURCE_D3D12
{
    NV_ENC_REGISTERED_PTR      pOutputBuffer;
    NV_ENC_FENCE_POINT_D3D12   outputFencePoint;
    …
} NV_ENC_OUTPUT_RESOURCE_D3D12;
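
For illustration, the following rough sketch shows how these structures might be populated for a single encode call. The variable names (registeredInput, registeredOutput, d3d12Fence, uploadDoneValue, currentFenceValue, nvenc, encoder) are placeholders, and the fence-point field names follow the reconstruction above; consult the SDK headers and programming guide for the authoritative definitions.

// Input: ask NVENC to wait until the fence reaches uploadDoneValue before reading the input surface.
NV_ENC_INPUT_RESOURCE_D3D12 inputResource = {};
inputResource.pInputBuffer = registeredInput;               // obtained from NvEncRegisterResource/NvEncMapInputResource
inputResource.inputFencePoint.pFence = d3d12Fence;
inputResource.inputFencePoint.waitValue = uploadDoneValue;
inputResource.inputFencePoint.bWait = 1;

// Output: ask NVENC to signal the fence when the bitstream has been written.
NV_ENC_OUTPUT_RESOURCE_D3D12 outputResource = {};
outputResource.pOutputBuffer = registeredOutput;
outputResource.outputFencePoint.pFence = d3d12Fence;
outputResource.outputFencePoint.signalValue = ++currentFenceValue;
outputResource.outputFencePoint.bSignal = 1;

// Pass the struct pointers through NV_ENC_PIC_PARAMS.
NV_ENC_PIC_PARAMS picParams = { NV_ENC_PIC_PARAMS_VER };
picParams.inputBuffer = &inputResource;
picParams.outputBitstream = &outputResource;
// ... other picture parameters (buffer format, picture struct, and so on) ...
nvenc.nvEncEncodePicture(encoder, &picParams);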

To retrieve the encoded output in asynchronous mode of operation, the application should wait on a completion event before calling NvEncLockBitstream. In the synchronous mode of operation, the application can call NvEncLockBitstream, as the NVENCODE API makes sure that encoding has finished before returning the encoded output. However, in both cases, the client application should pass a pointer to NV_ENC_OUTPUT_RESOURCE_D3D12, which was used in NvEncEncodePicture API, in NV_ENC_LOCK_BITSTREAM::outputBitstream.
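
Continuing the sketch above, retrieving the encoded output in the synchronous mode might look like the following (again, error handling is omitted and the surrounding variables are placeholders):

NV_ENC_LOCK_BITSTREAM lockParams = { NV_ENC_LOCK_BITSTREAM_VER };
lockParams.outputBitstream = &outputResource;        // the NV_ENC_OUTPUT_RESOURCE_D3D12 passed to NvEncEncodePicture

// In synchronous mode, this call returns only after encoding has finished.
nvenc.nvEncLockBitstream(encoder, &lockParams);

// Consume the encoded bits; writeOutput is a hypothetical consumer.
writeOutput(lockParams.bitstreamBufferPtr, lockParams.bitstreamSizeInBytes);

nvenc.nvEncUnlockBitstream(encoder, lockParams.outputBitstream);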

For more information, see the Video Codec SDK programming guide. Encoder performance in DirectX 12 is close to that of the other DirectX interfaces, and the encoded output quality is the same across all interfaces.

Predicting Metastatic Cancer Risk with AI

Using movies of living cancer cells, scientists create a convolutional neural network that can identify and predict aggressive metastatic melanomas.

Using a newly developed AI algorithm, researchers from the University of Texas Southwestern Medical Center are making early detection of aggressive forms of skin cancer possible. The study, recently published in Cell Systems, creates a deep learning model capable of predicting if a melanoma will aggressively spread, by examining cell features undetectable to the human eye.

“We now have a general framework that allows us to take tissue samples and predict mechanisms inside cells that drive disease, mechanisms that are currently inaccessible in any other way,” said senior author Gaudenz Danuser, the Patrick E. Haggerty Distinguished Chair in Basic Biomedical Science at the University of Texas Southwestern.

Melanoma, a serious form of skin cancer caused by changes in melanocyte cells, is the most likely of all skin cancers to spread if not caught early. Quickly identifying it helps doctors create effective treatment plans, and when diagnosed early, melanoma has a 5-year survival rate of about 99%.

Doctors often use biopsies, blood tests, or X-rays, CT, and PET scans to determine the stage of melanoma and whether it has spread to other areas of the body, known as metastasizing. Changes in cellular behavior could hint at the likelihood of the melanoma to spread, but they are too subtle for experts to observe. 

The researchers thought using AI to help determine the metastatic potential of melanoma could be very valuable, but up until now AI models have not been able to interpret these cellular characteristics.

“We propose an algorithm that combines unsupervised deep learning and supervised conventional machine learning, along with generative image models to visualize the specific cell behavior that predicts the metastatic potential. That is, we map the insight gained by AI back into a data cue that is interpretable by human intelligence,” said Andrew Jamieson, study coauthor and assistant professor in bioinformatics at UT Southwestern. 

Using tumor images from seven patients with a documented timeline of metastatic melanoma, the researchers compiled a time-lapse dataset of more than 12,000 single melanoma cells in petri dishes. The resulting dataset of approximately 1,700,000 raw images was fed to a deep learning algorithm to identify different cellular behaviors.

Time lapse of single melanoma cells cropped from the field of view and used as the input for the autoencoder.
Credit: Danuser et al/Cell Systems

Based on these features, the team then “reverse engineered” a deep convolutional neural network able to tease out the physical properties of aggressive melanoma cells and predict whether cells have high metastatic potential.

The experiments were run on the UT Southwestern Medical Center BioHPC cluster with CUDA-accelerated NVIDIA V100 Tensor Core GPUs. The team trained multiple deep learning models on the 1.7 million cell images to visualize and explore the massive dataset, which started at over 5 TB of raw microscopy data.

The researchers then tracked the spread of melanoma cells in mice and tested whether these specific predictors lead to highly metastatic cells. They found that the cell types they’d classified as highly metastatic spread throughout the entire animal, while those classified as having low metastatic potential did not.

There is more work to be done before the research can be deployed in a medical setting. The team also points out that the study raises questions about whether this applies to other cancers, or if melanoma metastasis is an outlier. 

“The result seems to suggest that the metastatic potential, at least of melanoma, is set by cell-autonomous rather than environmental factors,” Jamieson said. 

Applications of the study could also go beyond cancer, and transform diagnoses of other diseases.


Read the full article in Cell Systems>>

Writing Portable Rendering Code with NVRHI

Modern graphics APIs, such as Direct3D 12 and Vulkan, are designed to provide relatively low-level access to the GPU and eliminate the GPU driver overhead associated with API translation. This low-level interface allows applications to have more control over the system and provides the ability to manage pipelines, shader compilation, memory allocations, and resource descriptors in a way that is best for each application.

On the other hand, this closer-to-the-hardware access to the GPU means that the application must manage these things on its own, instead of relying on the GPU driver. A basic “hello world” program that draws a single triangle using these APIs can grow to a thousand lines of code or more. In a complex renderer, managing the GPU memory, descriptors, and so on, can quickly become overwhelming if not done in a systematic way.

If an application or an engine must work with more than one graphics API, it can be done in two ways:

  • Duplicate the rendering code to work with each API separately. This approach has an obvious drawback of having to develop and maintain multiple independent implementations.
  • Implement an abstraction layer over the graphics APIs that provides the necessary functionality in a common interface. This has a different drawback: the development and maintenance of the abstraction layer. Most major game engines implement this second approach.

NVIDIA Rendering Hardware Interface (NVRHI) is a library that handles these drawbacks. It defines a custom, higher-level graphics API that maps well to the three supported native graphics APIs: Vulkan, D3D12, and D3D11. It manages resources, pipelines, descriptors, and barriers in a safe and automatic way that can be easily disabled or bypassed when necessary to reduce CPU overhead. On top of that, NVRHI provides a validation layer that ensures that the application’s use of the API is correct, similar to what the Direct3D debug runtime or the Vulkan validation layers do, but on a higher level.

There are some features related to portability that NVRHI doesn’t provide. First, it doesn’t compile shaders at run time or read shader reflection data to bind resources dynamically. In fact, NVRHI doesn’t process shaders at run time at all. The application provides a platform-specific shader binary, that is, a DXBC, DXIL, or SPIR-V blob. NVRHI passes that directly to the underlying graphics API. Matching the binding layouts is left up to the application and is validated by the underlying graphics API. Second, NVRHI doesn’t create graphics devices or windows. That is also left up to the application or other libraries, such as GLFW.

In this post, I go over the main features of NVRHI and explain how each feature helps graphics engineers be more productive and write safer code.

  • Resource lifetime management
  • Binding layouts and binding sets
  • Automatic resource state tracking
  • Upload management
  • Interaction with graphics APIs
  • Shader permutations

Resource lifetime management

In Vulkan and D3D12, the application must take care to destroy only the device resources that the GPU is no longer using. This can be done with little overhead if the resource usage is planned carefully, but the problem is in the planning.

NVRHI follows the D3D11 resource lifetime model almost exactly. Resources, such as buffers, textures, or pipelines, have a reference count. When a resource handle is copied, the reference count is incremented. When the handle is destroyed, the reference count is decremented. When the last handle is destroyed and the reference count reaches zero, the resource object is destroyed, including the underlying graphics API resource. But that’s what D3D12 does as well, right? Not quite.

NVRHI also keeps internal references to resources that are used in command lists. When a command list is opened for recording, a new instance of the command list is created. That instance holds references to each resource it uses. When the command list is closed and submitted for execution, the instance is stored in a queue along with a fence or semaphore value that can be used to determine if the instance has finished executing on the GPU. The same command list can be reopened for recording immediately after that, even while the previous instance is still executing on the GPU.

The application should call the nvrhi::IDevice::runGarbageCollection method occasionally, at least once per frame. This method looks at the in-flight command list instance queue and clears the instances that have finished executing. Clearing the instance automatically removes the internal references to the resources used in the instance. If a resource has no other references left, it is destroyed at that time.

This behavior can be shown with the following code example:

{
    // Create a buffer in a scope, which starts with a reference count of 1
    nvrhi::BufferHandle buffer = device->createBuffer(...);

    // Creates an internal instance of the command list
    commandList->open();

    // Adds a buffer reference to the instance, which increases the reference count to 2
    commandList->clearBufferUInt(buffer, 0);

    commandList->close();

    // The local reference to the buffer is released here, decrementing the reference count to 1
}

// Puts the command list instance into the queue
device->executeCommandList(commandList);

// Likely doesn't do anything with the instance
// because it's just been submitted and is still executing on the GPU
device->runGarbageCollection();

device->waitForIdle();

// This time, the buffer should be destroyed because
// waitForIdle ensures that all command list instances
// have finished executing, so when the finished instance
// is cleared, the buffer reference count is decremented to zero
// and it can be safely destroyed
device->runGarbageCollection();

The “fire and forget” pattern shown here, when the application creates a resource, uses it, and then immediately releases it, is perfectly fine in NVRHI, unlike D3D12 and Vulkan.

You might wonder whether this type of resource tracking becomes expensive if the application performs many draw calls with lots of resources bound for each draw call. Not really. Draw calls and dispatches do not deal with individual resources. Textures and buffers are grouped into immutable binding sets, which are created, hold permanent references to their resources, and are tracked as a single object.

So, when a certain binding set is used in a command list, the command list instance only stores a reference to the binding set. And that store is skipped if the binding set is already bound, so that repeated draw calls with the same bindings do not add tracking cost. I explain binding sets in more detail in the next section.

Another thing that can help reduce the CPU overhead imposed by resource lifetime tracking is the trackLiveness setting that is available on binding sets and acceleration structures. When this parameter is set to false, the internal references are not created for that particular resource. In this case, the application is responsible for keeping its own reference and not releasing it while the resource is in use.

Binding layouts and binding sets

NVRHI features a unique resource binding model designed for safety and runtime efficiency. As mentioned earlier, various resources that are used by graphics or compute pipelines are grouped into binding sets.

Put simply, a binding set is an array of resource views that are bound to particular slots in a pipeline. For example, a binding set may contain a structured buffer SRV bound to slot t1, a UAV for a single texture mip level bound to slot u0, and a constant buffer bound to slot b2. All the bindings in a set share the same visibility mask (which shader stages will see that binding) and register space, both dictated by the binding layout.

Binding layouts are the NVRHI version of D3D12 root signatures and Vulkan descriptor set layouts. A binding layout is like a template for a binding set. It declares what resource types are bound to which slots, but does not tell which specific resources are used.

Like the root signatures and descriptor set layouts, NVRHI binding layouts are used to create pipelines. A single pipeline may be created with multiple binding layouts. These can be useful to bin resources into different groups according to their modification frequency, or to bind different sets of resources to different pipeline stages.

The following code example shows how a basic compute pipeline can be created with one binding layout:

auto layoutDesc = nvrhi::BindingLayoutDesc()
     .setVisibility(nvrhi::ShaderType::All)
     .addItem(nvrhi::BindingLayoutItem::Texture_SRV(0))     // texture at t0
     .addItem(nvrhi::BindingLayoutItem::ConstantBuffer(2)); // constants at b2
  
// Create a binding layout.
nvrhi::BindingLayoutHandle bindingLayout = device->createBindingLayout(layoutDesc);
  
auto pipelineDesc = nvrhi::ComputePipelineDesc()
       .setComputeShader(shader)
       .addBindingLayout(bindingLayout);
  
// Use the layout to create a compute pipeline.
nvrhi::ComputePipelineHandle computePipeline = device->createComputePipeline(pipelineDesc); 

Binding sets can only be created from a matching binding layout. Matching means that the layout must have the same number of items, of the same types, bound to the same slots, in the same order. This may look redundant, given that the D3D12 and Vulkan APIs have less redundancy in their descriptor systems, but the redundancy is useful: it makes the code more obvious, and it allows the NVRHI validation layer to catch more bugs.

auto bindingSetDesc = nvrhi::BindingSetDesc()
       // An SRV for two mip levels of myTexture.
       // Subresource specification is optional, default is the entire texture.
     .addItem(nvrhi::BindingSetItem::Texture_SRV(0, myTexture, nvrhi::Format::UNKNOWN,
       nvrhi::TextureSubresourceSet().setBaseMipLevel(2).setNumMipLevels(2)))
     .addItem(nvrhi::BindingSetItem::ConstantBuffer(2, constantBuffer));
  
// Create a binding set using the layout created in the previous code snippet.
nvrhi::BindingSetHandle bindingSet = device->createBindingSet(bindingSetDesc, bindingLayout); 

Because the binding set descriptor contains almost all the information necessary to create the binding layout as well, it is possible to create both with one function call. That may be useful when creating some render passes that only need one binding set.

#include <nvrhi/utils.h>
...
nvrhi::BindingLayoutHandle bindingLayout;
nvrhi::BindingSetHandle bindingSet;
nvrhi::utils::CreateBindingSetAndLayout(device, /* visibility = */ nvrhi::ShaderType::All,
       /* registerSpace = */ 0, bindingSetDesc, /* out */ bindingLayout, /* out */ bindingSet);
  
// Now you can create the pipeline using bindingLayout. 

Binding sets are immutable. When you create a binding set, NVRHI allocates the descriptors from the heap on D3D12 or creates a descriptor set on Vulkan and populates it with the necessary resource views.

Later, when the binding set is used in a draw or dispatch call, the binding operation is lightweight and translates to the corresponding graphics API binding calls. There is no descriptor creation or copying happening at render time.
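
As an illustration, using a binding set in a dispatch might look roughly like the following sketch. The fluent setters mirror the style of the earlier snippets, but treat the exact state-object API (nvrhi::ComputeState, setComputeState, dispatch) as something to verify against the NVRHI headers.

// Bind the pipeline and the binding set created above, then dispatch.
auto state = nvrhi::ComputeState()
    .setPipeline(computePipeline)
    .addBindingSet(bindingSet);

commandList->open();
commandList->setComputeState(state);      // lightweight: translates to native binding calls, no descriptor copies
commandList->dispatch(/* groupsX = */ 64, /* groupsY = */ 1, /* groupsZ = */ 1);
commandList->close();
device->executeCommandList(commandList);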

Automatic resource state tracking

Explicit barriers that change resource states and introduce dependencies in the graphics pipelines are an important part of both D3D12 and Vulkan APIs. They allow applications to minimize the number of pipeline dependencies and bubbles and to optimize their placement. They reduce CPU overhead at the same time by removing that logic from the driver. That’s relevant mostly to tight render loops that draw lots of geometry. Most of the time, especially when writing new rendering code, dealing with barriers is just annoying and bug-prone.

NVRHI implements a system that tracks the state of each resource and, optionally, subresource per command list. When a command interacts with a resource, the resource is transitioned into the state required for that command, if it’s not already in that state. For example, a writeTexture command transitions the texture into the CopyDest state, and a subsequent draw operation that reads from the texture transitions it into the ShaderResource state.

Special handling is applied when a resource is in the UnorderedAccess state for two consecutive commands: there is no transition involved, but a UAV barrier is inserted between the commands. It is possible to disable the insertion of UAV barriers temporarily, if necessary.

I said earlier that NVRHI tracks the state of each resource per command list. An application may record multiple command lists in any order or in parallel and use the same resource differently in each command list. Therefore, you can’t track the resource states globally or per-device because the barriers need to be derived while the command lists are being recorded. Global tracking may not happen in the same order as actual resource usage on the device command queue when the command lists are executed.

So, you can track resource states in each command list separately. In a sense, this can be viewed as a differential equation. You know how the state changes inside the command list, but you don’t know the boundary conditions, that is, which state each resource is in when you enter and exit the command list in their order of execution.

The application must provide the boundary conditions for each resource. There are two ways to do that:

  • Explicit: Use the beginTrackingTextureState and beginTrackingBufferState functions after opening the command list and the setTextureState and setBufferState functions before closing it.
  • Automatic: Use the initialState and keepInitialState fields of the TextureDesc and BufferDesc structures when creating the resource. Then, each command list that uses the resource assumes that it’s in the initial state upon entering the command list, and transitions it back into the initial state before leaving the command list.

Here, you might wonder about avoiding the CPU overhead of resource state tracking, or manually optimizing barrier placement. Well, you can! The command lists have the setEnableAutomaticBarriers function that can completely disable automatic barriers. In this mode, use the setTextureState and setBufferState functions where a barrier is necessary. It still uses the same state tracking logic but potentially at a lower frequency.
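
For example, a pass that manages its own barriers might look roughly like this sketch. It uses only the calls named above plus commitBarriers, and the exact signatures (including the nvrhi::AllSubresources and nvrhi::ResourceStates::ShaderResource arguments shown here) should be checked against the NVRHI headers.

commandList->open();
commandList->setEnableAutomaticBarriers(false);

// Place the transitions explicitly, once, instead of tracking every resource use.
commandList->setTextureState(shadowMap, nvrhi::AllSubresources, nvrhi::ResourceStates::ShaderResource);
commandList->setBufferState(instanceBuffer, nvrhi::ResourceStates::ShaderResource);
commandList->commitBarriers();

// ... issue the draw calls that read shadowMap and instanceBuffer ...

commandList->setEnableAutomaticBarriers(true);
commandList->close();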

Upload management

NVRHI automates another aspect of modern graphics APIs that is often annoying to deal with. That’s the management of upload buffers and the tracking of their usage by the GPU.

Typically, when some texture or buffer must be updated from the CPU on every frame or multiple times per frame, a staging buffer is allocated whose size is multiple times larger than the resource memory requirements. This enables multiple frames in flight on the GPU. Alternatively, portions of a large staging buffer are suballocated at run time. It is possible to implement the same strategy using NVRHI, but there is a built-in implementation that works well for most use cases.

Each NVRHI command list has its own upload manager. When writeBuffer or writeTexture is called, the upload manager tries to find an existing buffer that is no longer used by the GPU that can fit the necessary data. If no such buffer is available, a new buffer is created and added to the upload manager’s pool. The provided data is copied into that buffer, and then a copy command is added to the command list. The tracking of which buffers are used by the GPU is performed automatically.

ConstantBufferStruct myConstants;
myConstants.member = value;
  
// This is all that's necessary to fill the constant buffer with data and have it ready for rendering.
commandList->writeBuffer(constantBuffer, myConstants, sizeof(myConstants)); 

The upload manager never releases its buffers, nor shares them with other command lists. If an application performs a large number of uploads, such as during scene loading, and then switches to a less upload-intensive mode of operation, it’s better to create a separate command list for the uploading activity and release it when the uploads are done. That releases the upload buffers associated with the command list.

It’s not necessary to wait for the GPU to finish copying data from the upload buffers. The resource lifetime tracking system described earlier does not release the upload buffers until the copies are done.

Interaction with graphics APIs

Sometimes, it is necessary to escape the abstraction layers and do something with the underlying graphics API directly. Maybe you have to use some feature that is not supported by NVRHI, demonstrate some API usage in a sample application, or make the portable rendering code work with a native resource coming from elsewhere. NVRHI makes it relatively easy to do these things.

Every NVRHI object has a getNativeObject function that returns an underlying API resource of the necessary type. The expected type is passed to that function, and it only returns a non-NULL value if that type is available, to provide some type safety.

Supported types include interfaces like ID3D11Device or ID3D12Resource and handles like vk::Image. In addition, the NVRHI texture objects have a getNativeView function that can create and return texture views, such as SRV or UAV.

For example, to issue some native D3D12 rendering commands in the middle of an NVRHI command list, you might use code like the following example:

ID3D12GraphicsCommandList* d3dCmdList = nvrhiCommandList->getNativeObject(
       nvrhi::ObjectTypes::D3D12_GraphicsCommandList);
  
D3D12_CPU_DESCRIPTOR_HANDLE d3dTextureRTV = nvrhiTexture->getNativeView(
       nvrhi::ObjectTypes::D3D12_RenderTargetViewDescriptor);
  
const float clearColor[4] = { 0.f, 0.f, 0.f, 0.f };
d3dCmdList->ClearRenderTargetView(d3dTextureRTV, clearColor, 0, nullptr); 

Shader permutations

The final productivity feature to mention here is the batch shader compiler that comes with NVRHI. It is an optional feature, and NVRHI can be completely functional without it. NVRHI accepts shaders compiled through other means. Still, it is a useful tool.

It is often necessary to compile the same shader with multiple combinations of preprocessor definitions. However, the native tools that Visual Studio provides for shader compilation, for example, do not make this task easy at all.

The NVRHI shader compiler solves exactly this problem. Driven by a text file that lists the shader source files and compilation options, it generates option permutations and calls the underlying compiler (DXC or FXC) to generate the binaries. The binaries for different versions of the same shader are then packaged into one file of a custom chunk-based format that can be processed using functions provided with NVRHI.

The application can load the file with all the shader permutations and pass it to nvrhi::utils::createShaderPermutation or nvrhi::utils::createShaderLibraryPermutation, along with the list of preprocessor definitions and their values. If the requested permutation exists in the file, the corresponding shader object is created. If it doesn’t, an error message is generated.

In addition to permutation processing, the shader compiler has other nice features. First, it scans the source files to build a tree of headers included in each one. It detects if any of the headers have been modified, and whether a particular shader must be rebuilt. Second, it can build all the outdated shaders in parallel using all available CPU cores.

Conclusion

In this post, I covered some of the most important features of NVRHI that, in my opinion, make it a pleasure to use. For more information about NVRHI, see the NVIDIAGameWorks/nvrhi GitHub repo, which includes a tutorial and a more detailed programming guide. The Donut Examples repository on GitHub has several complete applications written with NVRHI.

If you have questions about NVRHI, post an issue on GitHub or send me a message on Twitter (@more_fps).