Reducing Acceleration Structure Memory with NVIDIA RTXMU

RTXMU (RTX Memory Utility) combines both compaction and suballocation techniques to optimize and reduce memory consumption of acceleration structures for any DXR or Vulkan Ray Tracing application.

Acceleration structures spatially organize geometry to accelerate ray tracing traversal performance. When you create an acceleration structure, a conservative memory size is allocated.

Upon initial build, the graphics runtime doesn’t know how optimally a geometry fits into the oversized acceleration structure memory allocation.
After the build is executed on the GPU, the graphics runtime reports back the smallest memory allocation that the acceleration structure can fit into.

This process is called compacting the acceleration structure and it is important for reducing the memory overhead of acceleration structures.

Another key ingredient to reducing memory is suballocating acceleration structures. Suballocation enables acceleration structures to be tightly packed together in memory by using a smaller memory alignment than is required by the graphics API.

Typically, buffer allocation alignment is at a minimum of 64 KB while the acceleration structure memory alignment requirement is only 256 B. Games using many small acceleration structures greatly benefit from suballocation, enabling the tight packaging of many small allocations.

The NVIDIA RTX Memory Utility (RTXMU) SDK is designed to reduce the coding complexity associated with optimal memory management of acceleration structures. RTXMU provides compaction and suballocation solutions for both DXR and Vulkan Ray Tracing while the client manages synchronization and execution of acceleration structure building. The SDK provides sample implementations of suballocators and compaction managers for both APIs while providing flexibility for the client to implement their own version.

For more information about why compaction and suballocation are so important in reducing acceleration structure memory overhead, see Tips: Acceleration Structure Compaction.

Why Use RTXMU?

RTXMU allows you to quickly integrate acceleration structure memory reduction techniques into their game engine. Below is a summary of these techniques along with some key benefits in using RTXMU

Reduces the memory footprint of acceleration structures involves both compaction and suballocation code, which are not trivial to implement. RTXMU can do the heavy lifting.
Abstracts away memory management of bottom level acceleration structures (BLASes) but is also flexible enough to enable users to provide their own implementation based on their engine’s needs.
Manages all barriers required for compaction size readback and compaction copies.
Passes back handles to the client that refer to complex BLAS data structures. This prevents any mismanagement of CPU memory, which could include accessing a BLAS that has already been deallocated or doesn’t exist.
Can help reduce BLAS memory by up to 50%.
Gives the benefit of less translation lookaside buffer (TLB) misses by packing more BLASes into 64 KB or 4 MB pages.

RTXMU Design

RTXMU has a design philosophy that should reduce integration complexities for most developers. The key principles of that design philosophy are as follows:

All functions are thread-safe. If simultaneous access occurs, they are blocking.
The client passes in client-owned command lists and RTXMU populates them.
The client is responsible for synchronizing command list execution.

API Function Calls

RTXMU abstracts away the coding complexities associated with compaction and suballocation. The functions detailed in this section describe the interface entry points for RTXMU.

Initialize—Specifies the suballocator block size.
PopulateBuildCommandList—Receives an array of D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS and returns a vector of acceleration structure handles for the client to fetch acceleration structure GPUVAs later during top-level acceleration structure (TLAS) construction, and so on.
PopulateUAVBarriersCommandList – Receives acceleration structure inputs and places UAV barriers for them
PopulateCompactionSizeCopiesCommandList – Performs copies to bring over any compaction size data
PopulateUpdateCommandList—Receives an array of D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS and valid acceleration structure handles so that updates can be recorded.
PopulateCompactionCommandList—Receives an array of valid acceleration structure handles and records compaction commands and barriers.
RemoveAccelerationStructures—Receives an array of acceleration structure handles that specify which acceleration structure can be completely deallocated.
GarbageCollection—Receives an array of acceleration structure handles that specify that build resources (scratch and result buffer memory) can be deallocated.
GetAccelStructGPUVA—Receives an acceleration structure handle and returns a GPUVA of the result or compacted buffer based on state.
Reset—Deallocates all memory associated with current acceleration structure handles.

Suballocator DXR design

The BLAS suballocator works around the 64 KB and 4 MB buffer alignment requirement by placing small BLAS allocations within a larger memory heap. The BLAS suballocator still must fulfill the 256 B alignment required for BLAS allocations.

If the application requests 4 MB or larger suballocation blocks, then RTXMU uses placed resources with heaps that can provide 4 MB alignment.
If the application requests fewer than 4 MB suballocation blocks, then RTXMU uses committed resources, which only provide 64 KB alignment.

The BLAS suballocator reuses freed suballocations within blocks by maintaining a free list. If a memory request is greater than the suballocator block size, then a single allocation is created that can’t be suballocated.

Compaction DXR design

If the build requests compaction, then RTXMU requests that the compaction size be written out to a chunk of video memory. After the compaction size has been copied from video memory to system memory, then RTXMU allocates a suballocated compaction buffer to be used as the destination for the compaction copy.

The compaction copy takes the original build containing unused memory segments and truncates it down to the smallest memory footprint it can fit in. The original noncompacted build and scratch memory gets released back to the suballocator after compaction is complete. The only thing you have worry about is passing in the allow compaction flag and calling GetGPUVA with a BLAS handle. The GPUVA could either be the original build or the compacted build, based on what state the BLAS is in.

How to Use RTXMU

In this section, I detail the RTXMU sequence loop and synchronization.

RTXMU sequence loop

Figure 1 shows a normal usage pattern of RTXMU. The client manages the command list execution, while everything else is a call into RTXMU

First, initialize RTXMU by passing in the suballocator block size and the device responsible for allocating suballocation blocks. In each frame, the engine builds new acceleration structures while also compacting acceleration structures built in previous frames.

After RTXMU populates the client’s command lists, the client is free to execute and manage the synchronization of the initial build to the final compaction copy build. It’s important to make sure that each acceleration structure build has been fully executed before calling PopulateCompactionCommandList. This is left to the client to manage properly.

When an acceleration structure has finally reached the compaction state, then the client can choose to call GarbageCollection, which notifies RTXMU that the scratch and original acceleration structure buffer can be deallocated. If the engine does heavy streaming of assets, then the client can deallocate all acceleration structure resources by calling RemoveAS with a valid acceleration structure handle.

RTXMU process loop beginning with Initialize and ending with RemoveAS. — *Figure 1. RTXMU flow chart of a typical use case delineating client and RTXMU code*

Client Acceleration Structure Build Synchronization

Figure 2 shows the synchronization required by the client to manage compaction-ready workloads properly. The example here is a triple frame buffered loop in which the client can have up to three asynchronous frames being built on the CPU and executed on the GPU.

To get the compaction size available on the CPU side, Build 0 must have finished executing on the GPU. After the client has received a fence signal back from the GPU, the client can then call RTXMU to start the compaction command list recording.

A helpful way to manage the synchronization of compaction for acceleration structures is to employ a key/value pair data structure of some kind that tracks the state of each acceleration structure handle given by RTXMU. The four basic states of an acceleration structure can be described as follows:

Prebuilt—The build command is recorded on a command list but hasn’t finished executing on the GPU.
Built— The initial build has been executed on the GPU and is ready for compaction commands.
Compacted—The compaction copy has been finished on the GPU and is ready for GarbageCollection to release the scratch and initial build buffers.
Released—The client releases the acceleration structure from memory because it is no longer in the scene. At that point, all memory associated with an acceleration structure handle is freed back to the OS.

Diagram shows staggered RTXMU client code GPU execution by CPU threads. — *Figure 2. Client code can only initiate compaction workloads after the initial acceleration structures builds have finished execution on the GPU.*

RTXMU Test Scenes

RTXMU was tested with six text scenes to provide real use case data about the benefits of compaction and suballocation. The following figures show just some of the scenes.

Figure 3. Zero Day scene
10,740 BLAS, uncompacted acceleration structure memory size 458.9 MB, compacted acceleration structure memory size 208.3 MB, 55% memory reduction, suballocating memory saved 71.3 MB

Figure 4. Emerald Square scene
281 BLAS, uncompacted acceleration structure memory size 189.7 MB, compacted acceleration structure memory size 106.1 MB, 44% memory reduction, suballocating memory saved 8.3 MB

Figure 5. Sun Temple scene
1,056 BLAS, uncompacted acceleration structure memory size 45.8 MB, compacted acceleration structure memory size 20.7 MB, 55% memory reduction, suballocating memory saved 56.8 MB

RTXMU integration results

On average, compaction on NVIDIA RTX cards reduced acceleration structure by 52% for the test scenes. The standard deviation of compaction memory reduction was 2.8%, which is quite stable.

*Figure 6. Bar graph comparing compaction on versus off on NVIDIA RTX 3000 series GPUs*

When enabling compaction on NVIDIA and AMD HW, the memory savings on NVIDIA HW is much improved compared to AMD. NVIDIA ends up being on average 3.26x smaller than AMD for acceleration structure memory when enabling compaction. The reason for such a huge reduction in memory footprint on NVIDIA is that AMD without compaction uses double the memory as is when compared to NVIDIA. Compaction then also reduces the NVIDIA memory by another 50% on average while AMD tends to reduce memory only by 75%.

NVIDIA GPUs significantly reduce the memory size of acceleration structures compared to AMD GPUs. — *Figure 7. Bar graph comparing compaction of NVIDIA 3000 series versus AMD 6000 series GPUs*

Suballocation tells a slightly different story here in which scenes with many small acceleration structures like Zero Day benefit greatly. The average memory savings from suballocation ends up being 123 MB but the standard deviation is rather large at 153 MB. From this data, we can assert that suballocation is highly dependent on the scene geometry and benefits from thousands of small triangle count BLAS geometry.

Suballocation results are dependent upon the scene geometry. — *Figure 8. Bar graph showing the memory savings of suballocation for specific scenes*

Source Code

NVIDIA is open-sourcing the RTXMU SDK along with a sample application integrating RTXMU. Maintaining RTXMU as an open-source project on GitHub helps developers understand the logic flow and provides access to modifying the underlying implementation. The RT Bindless sample application provides an example of an RTXMU integration for both Vulkan Ray Tracing and DXR backends.

Here’s how to build and run the sample application integrating RTXMU. You must have the following resources:

Windows, Linux, or an OS that supports DXR or Vulkan Ray Tracing
CMake 3.12
C++ 17
Git

First, clone the repository using the following command:

git clone --recursive https://github.com/NVIDIAGameWorks/donut_examples.git

Next, open CMake. For Where is the source code, enter the /donut_examples folder. Create a build folder in the /donut_examples folder. For Where to build the binaries, enter the new build folder. Select the cmake variable NVRHI_WITH_RTXMU to ON, choose Configure, wait for it to complete and then click Generate.

If you are building with Visual Studio, then select 2019 and x64 version. Open the donut_examples.sln file in Visual Studio and build the entire project.

Find the rt_bindless application folder under /Examples/Bindless Ray Tracing, choose the project context (right-click) menu, and choose Startup Project.

By default, bindless ray tracing runs on DXR. To run the Vulkan version, add -vk as a command-line argument in the project.

Summary

RTXMU combines both compaction and suballocation techniques to optimize and reduce memory consumption of acceleration structures for any DXR or Vulkan Ray Tracing application. The data shows that using RTXMU significantly reduces acceleration structure memory. This enables you to either add more geometry to your ray-traced scenes or use the extra memory for other resources.

Get started using the RTXMU SDK today: