Improve Shader Performance and In-Game Frame Rates with Shader Execution Reordering

NVIDIA is excited to introduce a new feature available in the next generation of GPUs called Shader Execution Reordering (SER). SER is a performance optimization that unlocks the potential for better ray and memory coherency in ray tracing shaders, and thus increased shading efficiency.

Background and overview

Shading divergence is a long-standing problem in ray tracing. With increasingly complex renderer implementations, more workloads are becoming limited by shader execution rather than the tracing of rays. One way to mitigate this problem is to reduce the divergence affecting the GPU when executing shader code.

SER helps to alleviate two types of divergence: execution divergence and data divergence. Execution divergence occurs when different threads execute different shaders or branches within a shader. Data divergence occurs when different threads access memory resources in patterns that are hard to cache.

SER mitigates divergence by reordering threads, on the fly, across the GPU so that they can continue execution with increased coherence. It also enables the decoupling of ray intersection and shading.
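
To make the effect of reordering concrete, here is a minimal CPU-side sketch in C++. It is a toy model, not a description of how the hardware or driver implements SER. It assumes that threads run in fixed-size groups (warps), that each thread is tagged with the ID of the shader it needs to run, and that a warp pays one pass per distinct shader among its threads. The function names countShaderPasses and reorderByShader are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy cost model (assumption): a warp executes once per distinct shader ID
// found among its threads, so fewer distinct IDs per warp means less divergence.
std::size_t countShaderPasses(const std::vector<int>& shaderIds, std::size_t warpSize) {
    assert(warpSize > 0);
    std::size_t total = 0;
    for (std::size_t begin = 0; begin < shaderIds.size(); begin += warpSize) {
        const std::size_t end = std::min(begin + warpSize, shaderIds.size());
        const std::set<int> distinct(shaderIds.begin() + begin, shaderIds.begin() + end);
        total += distinct.size();
    }
    return total;
}

// Stand-in for reordering: group threads that run the same shader next to
// each other so they land in the same warp.
std::vector<int> reorderByShader(std::vector<int> shaderIds) {
    std::stable_sort(shaderIds.begin(), shaderIds.end());
    return shaderIds;
}
```

For eight threads in warps of four with shader IDs 0, 1, 2, 3, 0, 1, 2, 3, this model counts 8 shader passes before reordering and 4 after.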

For a more comprehensive overview of SER, see the Shader Execution Reordering whitepaper.

Integration using NVIDIA API

Prerequisites

To access the feature set and optimizations provided by SER, you will need the following:

GPU that supports DXR 1.0 or higher
Driver that supports SER (R520 or newer)
HLSL extension headers, which can be found in the latest NVIDIA API
Link against nvapi64.lib, included in the packages containing the headers above
A recent version of DXC / dxcompiler.dll that supports templates (optional). If you’re compiling shaders from Visual Studio, make sure that your project is configured to use this version of the compiler executable.

Get started with SER and NVIDIA API.

Initialize NVAPI and enable SER API

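First, initialize NVAPI at application startup and unload it at shutdown, using the following calls:

NvAPI_Initialize();   // call once at startup, before any other NVAPI function
NvAPI_Unload();       // call once at shutdown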

Next, verify that the SER API is supported, using the following call:

bool supported = false; 
NvAPI_D3D12_IsNvShaderExtnOpCodeSupported(pDevice, NV_EXTN_OP_HIT_OBJECT_REORDER_THREAD, &supported); 
if (!supported)  
{  
    /* Don't use SER */  
}

Host side integration

Before ray tracing state object creation, set up a fake UAV slot and register it:

#define NV_SHADER_EXTN_SLOT           999999   // pick an arbitrary unused slot 
#define NV_SHADER_EXTN_REGISTER_SPACE 999999   // pick an arbitrary unused space 

NvAPI_D3D12_SetNvShaderExtnSlotSpace(pDevice, NV_SHADER_EXTN_SLOT,
NV_SHADER_EXTN_REGISTER_SPACE);

If you need a thread-local variant, use the related function: NvAPI_D3D12_SetNvShaderExtnSlotSpaceLocalThread.

Next, add the fake UAV slot to the global root signature used to compile ray tracing pipelines. You do not need to allocate or bind a resource for it. Below is an example of augmenting D3D12 sample code with a fake UAV slot; the additions are the ranges[4] descriptor range and the SerUavSlot root parameter. The root signature creation in your application will likely look quite different.

// Global Root Signature 
// This is a root signature that is shared across all raytracing shaders invoked during a
// DispatchRays() call. 
{ 
    // Performance TIP: Order from most frequent to least frequent.
    CD3DX12_DESCRIPTOR_RANGE ranges[5]; 

    ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, 0);     // output texture 
    ranges[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 2, 0, 1);  // static index buffers 
    ranges[2].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 2, 0, 2);  // static vertex buffers 
    ranges[3].Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 1, 0, 3);  // static vertex buffers 
     
    // fake UAV for shader execution reordering 
    ranges[4].Init(
        D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 1, NV_SHADER_EXTN_SLOT, NV_SHADER_EXTN_REGISTER_SPACE);   
 
    CD3DX12_ROOT_PARAMETER rootParameters[GlobalRootSignatureParams::Count]; 
    rootParameters[GlobalRootSignatureParams::OutputViewSlot].InitAsDescriptorTable(1, &ranges[0]); 
    rootParameters[GlobalRootSignatureParams::AccelerationStructureSlot].InitAsShaderResourceView(0); 
    rootParameters[GlobalRootSignatureParams::SceneConstantSlot].InitAsConstantBufferView(0); 
    rootParameters[GlobalRootSignatureParams::VertexBuffersSlot].InitAsDescriptorTable(3, &ranges[1]); 
     
    rootParameters[GlobalRootSignatureParams::SerUavSlot].InitAsDescriptorTable(1, &ranges[4]); 
     
    CD3DX12_ROOT_SIGNATURE_DESC globalRootSignatureDesc(ARRAYSIZE(rootParameters), rootParameters); 
    SerializeAndCreateRaytracingRootSignature(
        globalRootSignatureDesc, &m_raytracingGlobalRootSignature); 
}

Use of API in shader code

In shader code, define the fake UAV slot and register again, using the same values:

#define NV_SHADER_EXTN_SLOT           u999999       // matches slot number in NvAPI_D3D12_SetNvShaderExtnSlotSpace 
#define NV_SHADER_EXTN_REGISTER_SPACE space999999   // matches space number in NvAPI_D3D12_SetNvShaderExtnSlotSpace 

#include "SER/nvHLSLExtns.h"

Now the SER API may be used in ray generation shaders:

NvHitObject hitObject = NvTraceRayHitObject(TLAS, RAY_FLAG_NONE, 0xff, 0, 1, 0, ray, payload); 
NvReorderThread(hitObject, 0, 0); 
NvInvokeHitObject(TLAS, hitObject, payload);
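
The last two arguments to NvReorderThread in the call above are a coherence hint and the number of hint bits to consider; passing 0, 0 reorders on the hit object alone. The C++ sketch below is a toy CPU model of that idea, not the hardware algorithm. It assumes the ordering key is the shader selected by the hit object, refined by the lowest numHintBits bits of an application-defined hint. ThreadState, makeSortKey, and reorderThreads are hypothetical names, and the actual ordering on the GPU is implementation-defined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct ThreadState {
    std::uint32_t shaderId;       // stands in for the shader the hit object would invoke
    std::uint32_t coherenceHint;  // application-defined hint, e.g. a material or bounce class
};

// Hypothetical key: shader ID has priority; the hint only breaks ties.
std::uint64_t makeSortKey(const ThreadState& t, unsigned numHintBits) {
    assert(numHintBits <= 32);
    const std::uint64_t mask =
        (numHintBits == 0) ? 0 : ((std::uint64_t{1} << numHintBits) - 1);
    return (std::uint64_t{t.shaderId} << 32) | (t.coherenceHint & mask);
}

// Stand-in for the reorder step: threads with equal keys end up adjacent.
void reorderThreads(std::vector<ThreadState>& threads, unsigned numHintBits) {
    std::stable_sort(threads.begin(), threads.end(),
                     [numHintBits](const ThreadState& a, const ThreadState& b) {
                         return makeSortKey(a, numHintBits) < makeSortKey(b, numHintBits);
                     });
}
```

With numHintBits set to 0, threads are grouped by shader only and keep their relative order within each group. With a nonzero bit count, threads that run the same shader are further grouped by hint.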

When compiling HLSL to DXIL, either:

1) ensure that templates are enabled in DXC by specifying the command line argument -HV 2021

2) use the macro version of the API that does not require templates. The macro version can be enabled by #defining NV_HITOBJECT_USE_MACRO_API before #including nvHLSLExtns.h. This is intended for use in legacy codebases which have difficulty switching to HLSL 2021. The recommended path is using templates if the codebase can support it.

Integration of Unreal Engine 5 NvRTX

Unreal Engine developers can take advantage of SER within the NVIDIA branch of Unreal Engine (NvRTX). The following section explains how SER provides performance gains in ray tracing operations and provides optimization tips for specific use cases.

The NVIDIA Unreal Engine 5 NvRTX 5.0.3 release will feature SER integration to support optimization of many of its ray tracing paths. With SER, NvRTX developers will see additional frame rate optimization on 40 series cards with up to 40% increased speeds in ray tracing operations and zero impact on quality or content authoring. This improves efficiency of complex ray tracing calculations and will provide greater gains in scenes that take full advantage of what ray tracing has to offer.

Benefits of SER in Unreal Engine 5

SER in Unreal Engine 5 (UE5) enables better offline path tracing, arguably the most complex tracing operation in UE5. Likewise, hardware ray traced reflections and translucency, which have complex interactions with materials and lighting, will also see benefits.

SER also improves Lumen performance when hardware ray tracing is enabled. In some cases, the changes required to do this are trivial, regardless of how complex the system was initially. In other cases, they add substantial complexity. Three different examples are explored in more detail below.

Simple case: Path tracing

Path tracing presents a highly divergent workload, making it a great candidate for applying SER.

Figure 1. Grandma’s Kitchen by Richard Cowgill, path traced in Unreal Engine 5. Path tracing is 40% faster with SER for this view.

Applying SER allows the path tracer to reduce divergence in its material evaluation, rather than only in the number of bounces. This offers a 20-50% gain in performance with the code change provided below:

#if !PATH_TRACER_USE_REORDERING 

// Trace the ray, including evaluation of hit data 
TraceRay( 
    TLAS, 
    RayFlags, 
    RAY_TRACING_MASK_ALL, 
    RAY_TRACING_SHADER_SLOT_MATERIAL, 
    RAY_TRACING_NUM_SHADER_SLOTS, 
    MissShaderIndex, 
    PathState.Ray, 
    PackedPayload); 
#else 
{ 
    NvHitObject Hit; 
 
    // Trace ray to produce hit object 
    NvTraceRayHitObject(
        TLAS,
        RayFlags,
        RAY_TRACING_MASK_ALL,
        RAY_TRACING_SHADER_SLOT_MATERIAL,
        RAY_TRACING_NUM_SHADER_SLOTS,
        MissShaderIndex,
        PathState.Ray,
        PackedPayload, Hit); 
 
    // Reorder threads to have coherent hit evaluation 
    NvReorderThread(Hit);
 
    // Evaluate hit data in the now coherent environment 
    NvInvokeHitObject(TLAS, Hit, PackedPayload); 
}
#endif

This improvement can be accomplished by replacing the DXR TraceRay function with an equivalent set of NvTraceRayHitObject, NvReorderThread, and NvInvokeHitObject. A key aspect is that the optimization is only applied selectively. The change only applies to the TraceTransparentRay function within the UE5 path tracing code, as this is the source of most material evaluation divergence. Other rays are performing cheaper operations and are less important to reorder, so they may not be worth the extra cost of attempting to reorder.

This example is the tip of the iceberg when it comes to the potential of the path tracer code. More careful analysis will almost certainly allow additional gains, including possibly eliminating the need to use multiple passes to compact longer rays.

Unusual case: Work compaction in Lumen global illumination

Typically, one thinks of reordering to handle the execution divergence experienced by hit shading. While the ray tracing passes used in Lumen global illumination do not run a divergent hit shader, they still benefit from the mechanisms provided by SER.

For large scenes, like the UE5 City Sample, traces are broken into the near and far field, which are run as separate tracing passes with compaction in between. The multiple passes and compaction can be replaced by a single NvReorderThread call. This avoids the idle bubbles on the GPU required to compact the results of near-field tracing, and then launch far-field rays.

Removing the extra overhead of storing, compacting, and relaunching work is often worth a 20% savings. The shader changes can be more intensive due to assumptions in the original code (functions using macros to permute behaviors rather than arguments). However, the logical changes amounted to adding two reorder calls with a single Boolean expression for whether a trace had hit or missed.
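
The benefit of grouping the surviving rays can be sketched with a small C++ model. This is an illustration under stated assumptions, not Lumen or driver code: it assumes fixed-size warps, treats a warp as occupied if any of its threads still needs a far-field trace, and ignores the store, compact, and relaunch overhead of a separate pass. The names farFieldWarpsInPlace and farFieldWarpsGrouped are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Threads stay where they are: every warp that contains at least one
// near-field miss must stay resident for the far-field trace.
std::size_t farFieldWarpsInPlace(const std::vector<bool>& nearHit, std::size_t warpSize) {
    assert(warpSize > 0);
    std::size_t warps = 0;
    for (std::size_t begin = 0; begin < nearHit.size(); begin += warpSize) {
        const std::size_t end = std::min(begin + warpSize, nearHit.size());
        const bool anyMiss =
            std::find(nearHit.begin() + begin, nearHit.begin() + end, false) !=
            nearHit.begin() + end;
        if (anyMiss) {
            ++warps;
        }
    }
    return warps;
}

// Misses are grouped together (by a compaction pass, or by a reorder keyed
// on hit versus miss): only ceil(misses / warpSize) warps are needed.
std::size_t farFieldWarpsGrouped(const std::vector<bool>& nearHit, std::size_t warpSize) {
    assert(warpSize > 0);
    const std::size_t misses =
        static_cast<std::size_t>(std::count(nearHit.begin(), nearHit.end(), false));
    return (misses + warpSize - 1) / warpSize;
}
```

For twelve rays in warps of four with one near-field miss per warp, the in-place count is 3 warps and the grouped count is 1. A separate compaction pass and an in-shader reorder both reach the grouped count; the difference described above is that the reorder does so without storing results and relaunching work.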

Complex case: Lumen reflections

Lumen is a system contained in UE5 which implements global illumination and reflections. It has a high degree of complexity, and a thorough discussion of it is well beyond the scope of this blog post. The description below is heavily distilled and focuses on one specific configuration: Lumen reflections with hardware ray tracing (HWRT) hit lighting enabled. Note that Lumen is also able to leverage software ray tracing by way of signed distance fields, which will not be discussed here.

To render reflections, the Lumen HWRT hit lighting path uses multiple passes:

1) Near field tracing – extract material ID

2) Compact rays

3) Far field tracing (optional) – extract material ID

4) Compact rays

5) Append far field rays (optional)

6) Sort rays by material

7) Re-trace with hit lighting

In addition, the following important details about how Lumen works help explain the differences in approach between SER and non-SER.

Near field and far field in Lumen correspond to different sections of the TLAS: objects close to the camera and objects far away from the camera, respectively. Both near field and far field are contained in the same TLAS.
Two different ray tracing pipelines are used in the passes above. Near and far field both use a simplified (fast) tracing path, while hit lighting has full material support. This is the reason for the separate re-tracing path with hit lighting.

For further technical details on these passes, see Lumen Technical Details. With SER enabled, the passes can be combined because separate compaction and sorting phases are no longer necessary. The combined pass roughly becomes: trace the near field; if nothing was hit, trace the far field; if either trace hit, use the hit object to evaluate the material and perform lighting. This is possible due to the decoupling of tracing and shading.

The relevant sections of the shader are provided below:

NvHitObject SERHitObject; 
 
// Near field 
NvTraceRayHitObject(..., SERHitObject); 
NvReorderThread(SERHitObject); 
Result.bIsHit = SERHitObject.IsHit(); 
 
// Far field 
if (!Result.bIsHit) 
{ 
    // Transform ray into far field space of TLAS 
    ... 
 
    NvTraceRayHitObject(..., SERHitObject); 
    NvReorderThread(SERHitObject); 
    Result.bIsHit = SERHitObject.IsHit(); 
} 
 
// Compute result 
if (Result.bIsHit) 
{ 
    NvInvokeHitObject(Context.TLAS, SERHitObject, Payload); 
    Result.Radiance = CalculateRayTracedLighting(); 
} 
 
// Handle miss

This is one example where the availability of SER has higher-level implications for the rendering architecture, rather than just replacing TraceRay with its NVAPI equivalent. The implementation described above resulted in a 20-30% speed increase in Lumen reflections on the GPU, measured when profiling a typical workload in UE5 City Sample.

Conclusion

Shading divergence can pose performance problems when considering both data and execution. The Shader Execution Reordering API gives developers a powerful tool to mitigate these penalties, with relatively little effort required to get started. The optimizations discussed above represent only the initial stages of introducing the possibilities provided by SER to a large codebase, such as Unreal Engine 5. We look forward to seeing SER realize more of its potential as its use evolves.