DataBloom - Part 184

Misc

Supercharge Ransomware Detection with AI-Enhanced Cybersecurity Solutions

Post date September 6, 2023

Ransomware attacks have become more common, more sophisticated, and harder to detect. For example, in 2022, a destructive ransomware attack took 233 days to identify and 91 days to contain, for a total lifecycle of 324 days. Going undetected for this amount of time can cause irreversible damage. Faster and smarter detection capabilities are critical to addressing these attacks.

Behavioral ransomware detection with NVIDIA DPUs and GPUs

Adversaries and malware are evolving faster than defenders, making it hard for security teams to track changes and maintain signatures for known threats. To address this, a combination of AI and advanced security monitoring is needed. Developers can build solutions for detecting ransomware attacks faster using advanced technologies including NVIDIA BlueField Data Processing Units (DPUs), the NVIDIA DOCA SDK with DOCA App Shield, and NVIDIA Morpheus cybersecurity AI framework.

Intrusion detection with BlueField DPU

BlueField DPUs are ideal for enabling best-in-class, zero-trust security, and extending that security to include host-based protection. With built-in isolation, this creates a separate trust domain from the host system, where intrusion detection system (IDS) security agents are deployed. If a host is compromised, the isolation layer between the security control agents on the DPU and the host prevents the attack from spreading throughout the data center.

DOCA App-Shield is one of the libraries provided with the NVIDIA DOCA software framework. It is a security framework for host monitoring, enabling cybersecurity vendors to create IDS solutions that can quickly identify an attack on any physical server or virtual machine.

DOCA App-Shield runs on the NVIDIA DPU as an out-of-band (OOB) device in a separate domain from the host CPU and OS and is:

Resilient against attacks on a host machine.
Least disruptive to the execution of host applications.

DOCA App Shield exposes an API to users developing security applications. To detect malicious activity from the DPU Arm processor, it uses direct memory access (DMA) without involving the host OS or CPU. In contrast, a standard antivirus or endpoint detection and response (EDR) agent runs on the host and can be seen or compromised by an attacker or malware.

*Figure 1. NVIDIA BlueField-3 DPU 400 Gb/s infrastructure compute platform*

Morpheus AI framework for cybersecurity

Morpheus is part of the NVIDIA AI Enterprise software product family and is designed to build complex ML and AI-based pipelines. It provides significant acceleration of AI pipelines to deal with high data volumes, classify data, and identify anomalies, vulnerabilities, phishing, compromised machines, and many other security issues.

Morpheus can be deployed on-premise with a GPU-accelerated server like the NVIDIA EGX Enterprise Platform, and it is also accessible through cloud deployment.

A workflow showing Morpheus consisting of a GPU-accelerated server with SmartNic/DPU and software stack of RAPIDS, Cyber Logs Accelerator, NVIDIA Triton, and NVIDIA TensorRT for real-time telemetry from BlueField DPUs. — *Figure 2. NVIDIA Morpheus with BlueField DPU Telemetry*

Addressing ransomware with AI

One of the pretrained AI models in Morpheus is the ransomware detection pipeline that leverages NVIDIA DOCA App-Shield as a data source. This brings a new level of security for detecting ransomware attacks that were previously impossible to detect in real time.

Ransomware detection AI pipeline showing a DPU monitoring virtual machines. The Morpheus AI server receives DOCA AppShield events and alerts high anomaly processes. — *Figure 3. Ransomware detection AI pipeline*

Inside BlueField DPU

BlueField DPU offers the new OS-Inspector app to leverage DOCA App-Shield host monitoring capabilities and enables a constant collection of OS attributes from the monitored host or virtual machine. OS-Inspector app is now available through early access. Contact us for more information.

The collected operating system attributes include processes, threads, libraries, handles, and virtual address descriptors (VADs). For a complete API list, see the App-Shield programming guide.

OS-Inspector App then uses DOCA Telemetry Service to stream the attributes to the Morpheus inference server using the Kafka event streaming platform.

Inside the Morpheus Inference Framework

The Morpheus ransomware detection AI pipeline processes the data using GPU acceleration and feeds the data to the ransomware detection AI model.

This tree-based model detects ransomware attacks based on suspicious attributes in the servers. It uses N-gram features to capture the change in attributes through time and detect any suspicious anomaly.
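To illustrate the idea of N-gram features over time, the following minimal Python sketch slides a window over the changes between consecutive snapshots of a single process attribute. The attribute (thread count), the snapshot values, and the function name `ngram_deltas` are hypothetical. The actual Morpheus feature set is not described in this post.

```python
from typing import List, Tuple


def ngram_deltas(snapshots: List[int], n: int = 3) -> List[Tuple[int, ...]]:
    """Return sliding windows of length n over the step-to-step changes.

    Each window summarizes how an attribute changed across n consecutive
    snapshot transitions.
    """
    # Change between each pair of consecutive snapshots.
    deltas = [b - a for a, b in zip(snapshots, snapshots[1:])]
    # All contiguous runs of n deltas (the "N-grams").
    return [tuple(deltas[i:i + n]) for i in range(len(deltas) - n + 1)]


# Hypothetical thread counts for one process across six snapshots.
thread_counts = [4, 4, 5, 9, 20, 41]
features = ngram_deltas(thread_counts, n=3)
print(features)
```

A tree-based classifier could then consume such windows as input rows, so that a sudden acceleration in an attribute shows up as a distinctive pattern rather than a single value.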

When an attack is detected, Morpheus generates an inference event and triggers a real-time alert to the security team for further mitigation steps.

A ransomware detection model detects a ransomware process named sample.exe. — *Figure 4. Ransomware detection model*

FinSec lab use case

NVIDIA partner FinSec Innovation Lab, a joint venture between Mastercard and Enel X, demonstrated their solution for combating ransomware attacks at NVIDIA GTC 2023.

FinSec ran a POC, which used BlueField DPUs and the Morpheus cybersecurity AI framework to train a model that detected a ransomware attack in less than 12 seconds. This real-time response enabled them to isolate a virtual machine and save 80% of the data on the infected servers.

Learn more

BlueField DPU running DOCA App Shield enables OOB host monitoring. Together with Morpheus, developers can quickly build AI models to protect against cyber attacks, better than ever before.

Misc

GPUs for ETL? Optimizing ETL Architecture for Apache Spark SQL Operations

Post date September 6, 2023

Extract-transform-load (ETL) operations with GPUs using the NVIDIA RAPIDS Accelerator for Apache Spark running on large-scale data can produce both cost savings and performance gains. We demonstrated this in our previous post, GPUs for ETL? Run Faster, Less Costly Workloads with NVIDIA RAPIDS Accelerator for Apache Spark and Databricks. In this post, we dive deeper to identify precisely which Apache Spark SQL operations are accelerated for a given processing architecture.

This post is part of a series on GPUs and extract-transform-load (ETL) operations.

Migrating ETL to GPUs

Should all ETL be migrated to GPUs? Or is there an advantage to evaluating which processing architecture is best suited to specific Spark SQL operations?

CPUs are optimized for sequential processing with significantly fewer yet faster individual cores. There are clear computational advantages for memory management, handling I/O operations, running operating systems, and so on.

GPUs are optimized for parallel processing with significantly more yet slower cores. GPUs excel at rendering graphics, training machine learning and deep learning models, performing matrix calculations, and other operations that benefit from parallelization.

Experimental design

We created three large, complex datasets modeled after real client retail sales data using computationally expensive ETL operations:

Aggregation (SUM + GROUP BY)
CROSS JOIN
UNION

Each dataset was specifically curated to test the limits and value of specific Spark SQL operations. All three datasets were modeled based on a transactional sales dataset from a global retailer. The row size, column count, and type were selected to balance experimental processing costs while performing tests that would demonstrate and evaluate the benefits of both CPU and GPU architectures under specific operating conditions. See Table 1 for data profiles.

Operation	Rows	# COLUMNS: Structured data	# COLUMNS: Unstructured data	Size (MB)
Aggregation (SUM + GROUP BY)	94.4 million	2	0	3,200
CROSS JOIN	63 billion	6	1	983
UNION	447 million	10	2	721

Table 1. Summary of experimental datasets

The following computational configurations were evaluated for this experiment:

Worker and driver type
Workers [minimum and maximum]
RAPIDS or Photon deployment
Maximal hourly limits on Databricks units (DBUs)—a proprietary measure of Databricks compute cost

Worker and driver type	Workers [min/max]	RAPIDS Accelerator / PHOTON	Max DBUs / hour
Standard_NC4as_T4_v3	1/1	RAPIDS Accelerator	2
Standard_NC4as_T4_v3	2/8	RAPIDS Accelerator	9
Standard_NC8as_T4_v3	2/2	RAPIDS Accelerator	4.5
Standard_NC8as_T4_v3	2/8	RAPIDS Accelerator	14
Standard_NC16as_T4_v3	2/2	RAPIDS Accelerator	7.5
Standard_NC16as_T4_v3	2/8	RAPIDS Accelerator	23
Standard_E16_v3	2/2	Photon	24
Standard_E16_v3	2/8	Photon	72

Table 2. Experimental computational configurations

Other experimental considerations

In addition to building industry-representative test datasets, other experimental factors are listed below.

Datasets were run using several different worker and driver configurations on pay-as-you-go instances (as opposed to spot instances), because their inherent availability establishes pricing consistency across experiments.
For GPU testing, we leveraged RAPIDS Accelerator on T4 GPUs, which are optimized for analytics-heavy loads, and carry a substantially lower cost per DBU.
The CPU worker type is an in-memory optimized architecture which uses Intel Xeon Platinum 8370C (Ice Lake) CPUs.
We also leveraged Databricks Photon, a native CPU accelerator solution and accelerated version of their traditional Java runtime, rewritten in C++.

These parameters were chosen to ensure experimental repeatability and applicability to common use cases.

Results

To evaluate experimental results in a consistent fashion, we developed a composite metric named adjusted DBUs per minute (ADBUs). ADBUs are based on DBUs and computed as follows:

$\text{Adjusted DBUs per Minute} = \frac{\text{Runtime (mins)}}{\text{Cluster DBUs Cost per Hour}}$
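As a worked illustration, the sketch below evaluates the formula exactly as written above for two configurations from Table 2. The maximum DBUs per hour are taken from Table 2. The runtimes are hypothetical placeholders, not measured results, and the function name is an assumption.

```python
def adjusted_dbus_per_minute(runtime_mins: float, cluster_dbus_per_hour: float) -> float:
    """Composite metric as defined in the text: runtime over cluster DBU cost per hour."""
    if cluster_dbus_per_hour <= 0:
        raise ValueError("cluster DBU cost per hour must be positive")
    return runtime_mins / cluster_dbus_per_hour


# (runtime in minutes, max DBUs per hour); runtimes are hypothetical.
configs = {
    "Standard_NC4as_T4_v3 (1/1, RAPIDS Accelerator)": (30.0, 2.0),
    "Standard_E16_v3 (2/2, Photon)": (12.0, 24.0),
}
for name, (runtime, dbus) in configs.items():
    print(f"{name}: {adjusted_dbus_per_minute(runtime, dbus):.2f}")
```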

Experimental results demonstrate that neither chipset, GPU or CPU, dominates across all Spark SQL tasks. As Figure 1 shows, dataset characteristics and the suitability of a cluster configuration have the strongest impact on which framework to choose for a specific task. Although this is unsurprising, the question remains: which ETL processes should be migrated to GPUs?

UNION operations

Although RAPIDS Accelerator on T4 GPUs produces results with both lower costs and shorter execution times for UNION operations, the difference compared with CPUs is negligible. Moving an existing ETL pipeline from CPUs to GPUs seems unwarranted for this combination of dataset and Spark SQL operation. It is possible, although untested in this research, that a larger dataset would generate results that warrant a move to GPUs.

CROSS JOIN operations

For the compute-heavy CROSS JOIN operation, we observed order-of-magnitude savings in both time and cost by employing RAPIDS Accelerator (GPUs) over Photon (CPUs).

One possible explanation for these performance gains is that the CROSS JOIN is a Cartesian product that involves an unstructured data column being multiplied with itself. The number of output rows therefore grows with the square of the number of input rows. GPUs are well suited to this type of large-scale parallelizable operation.
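The growth in output size can be illustrated with a minimal Python sketch that uses `itertools.product` as a stand-in for a CROSS JOIN of a column with itself. The row values and the helper name `cross_join` are hypothetical, and this is not how Spark executes the join.

```python
from itertools import product


def cross_join(left, right):
    """Cartesian product: every left row paired with every right row."""
    return list(product(left, right))


# Hypothetical unstructured text column.
rows = ["red shoes", "blue hat", "green scarf"]
pairs = cross_join(rows, rows)
print(len(pairs))  # 3 rows joined with themselves give 3 * 3 pairs

# For a self-join, the output size is the square of the input size.
for n in (10, 1_000, 100_000):
    print(n, n * n)
```

Each output pair can be produced independently of the others, which is why this kind of operation parallelizes well.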

The main driver of cost differences is that the CPU clusters we experimented with had a much higher DBU rating than the chosen GPU clusters.

SUM + GROUP BY operations

For aggregation operations (SUM + GROUP BY), we observed mixed results. Photon (CPUs) delivered notably faster compute times, whereas RAPIDS Accelerator (GPUs) provided lower overall costs. Looking at individual experimental runs, we observed that the higher Photon costs result from the higher DBU ratings of the CPU clusters, whereas the costs associated with T4s are significantly lower.

This explains the lower overall cost using RAPIDS Accelerator in this part of the experiment. In summary, if speed is the objective, Photon is the clear winner. More price-conscious users may prefer the longer compute times of RAPIDS Accelerator for notable cost savings.

Bar graphs showing the trade-off between compute time and cost for UNION, CROSS JOIN, and SUM + GROUP operations in Spark SQL for both Photon and RAPIDS Accelerator — *Figure 1. Comparison of mean compute time and mean cost*

Deciding which architecture to use

The CPU cluster gained performance in execution time in the commonly used aggregation (SUM + GROUP BY) experiment. However, this came at the price of higher associated cluster costs. For CROSS JOINs, a less common high-compute and highly-parallelizable operation, GPUs dominated both in higher speed and lower costs. UNIONs showed negligible comparative differences in compute time and cost.

Where GPUs (and by association RAPIDS Accelerator) will excel depends largely on the data structure, the scale of the data, the ETL operation(s) performed, and the user’s technical depth.

GPUs for ETL

In general, GPUs are well suited to large, complex datasets and Spark SQL operations that are highly parallelizable. The experimental results suggest using GPUs for CROSS JOIN situations, as they are amenable to parallelization, and can also scale easily as data grows in size and complexity.

It is important to note the scale of data is less important than the complexity of the data and the selected operation, as shown in the SUM + GROUP BY experiment. (This experiment involved more data, but less computational complexity compared to CROSS JOINs.) You can work with NVIDIA free of charge to estimate expected GPU acceleration gains based on analyses of Spark log files.

CPUs for ETL

Based on the experiments, certain Spark SQL operations such as UNIONs showed a negligible difference in cost and compute time. A shift to GPUs may not be warranted in this case. Moreover, for aggregations (SUM + GROUP BY), a conscious choice of speed over cost can be made based on situational requirements, where CPUs will execute faster, but at a higher cost.

In cases where in-memory calculations are straightforward, staying with an established CPU ETL architecture may be ideal.

Discussion and future considerations

This experiment explored one-step Spark SQL operations, such as a single CROSS JOIN or a single UNION, and omitted more complex ETL jobs that involve multiple steps. An interesting future experiment might include optimizing ETL processing at a granular level, sending individual Spark SQL operations to CPUs or GPUs in a single job or script, and optimizing for both time and compute cost.

A savvy Spark user might try to focus on implementing scripting strategies to make the most of the default runtime, rather than implementing a more efficient paradigm. Examples include:

Spark SQL join strategies (broadcast join, shuffle merge, hash join, and so on)
High-performing data structures (storing data in parquet files that are highly performant in a cloud architecture as compared to text files, for example)
Strategic data caching for reuse

The results of our experiment indicate that leveraging GPUs for ETL can supply additional performance sufficient to warrant the effort to implement a GPU architecture.

Although supported, RAPIDS Accelerator for Apache Spark is not available by default on Azure Databricks. It requires the installation of .jar files, which may necessitate some debugging. This setup effort was largely a one-time cost, as subsequent uses of RAPIDS Accelerator were seamless and straightforward. NVIDIA support was always readily available to help if and when necessary.

Finally, we opted to keep all created clusters under 100 DBUs per hour to manage experimental costs. We tried only one size of Photon cluster. Experimental results may change by varying the cluster size, number of workers, and other experimental parameters. We feel these results are sufficiently robust and relevant for many typical use cases in organizations running ETL jobs.

Conclusion

NVIDIA T4 GPUs, designed specifically for analytics workloads, accomplish a leap in the price/performance ratio associated with leveraging GPU-based compute. NVIDIA RAPIDS Accelerator for Apache Spark, especially when run on NVIDIA T4 GPUs, has the potential to significantly reduce costs and execution times for certain common ETL SparkSQL operations, particularly those that are highly parallelizable.

To implement this solution on your own Apache Spark workload with no code changes, visit the NVIDIA/spark-rapids-examples GitHub repo or the Apache Spark tool page for sample code and applications that showcase the performance and benefits of using RAPIDS Accelerator in your data processing or machine learning pipelines.

Misc

A Powerful Legacy: Researcher’s Mom Fueled Passion for Nuclear Fusion

Post date September 6, 2023

Before she entered high school, Ge Dong wanted to be a physicist like her mom, a professor at Shanghai Jiao Tong University.

Misc

‘Arteana’s Art Squad’ Assembles — Indie Showrunner Rafi Nizam Creates High-End Children’s Show on a Budget

Post date September 6, 2023

Rafi Nizam is an award-winning independent animator, director, character designer and more. He’s developed feature films at Sony Pictures, children’s series and comedies at BBC and global transmedia content at NBCUniversal.

Misc

Webinar: Build Realistic Robot Simulations with NVIDIA Isaac Sim and MATLAB

Post date September 5, 2023

On Sept. 12, learn about the connection between MATLAB and NVIDIA Isaac Sim through ROS.

Misc

The Halo Effect: AI Deep Dives Into Coral Reef Conservation

Post date September 5, 2023

With coral reefs in rapid decline across the globe, researchers from the University of Hawaii at Mānoa have pioneered an AI-based surveying tool that monitors reef health from the sky. Using deep learning models and high-resolution satellite imagery powered by NVIDIA GPUs, the researchers have developed a new method for spotting and tracking coral reef…

Misc

A Perfect Pair: adidas and Covision Media Use AI, NVIDIA RTX to Create Photorealistic 3D Content

Post date September 5, 2023

Creating 3D scans of physical products can be time consuming. Businesses often use traditional methods, like photogrammetry-based apps and scanners, but these can take hours or even days. They also don’t always provide the 3D quality and level of detail needed to make models look realistic in all their applications. Italy-based startup Covision Media is…

Misc

NVIDIA CEO Meets with India Prime Minister Narendra Modi

Post date September 4, 2023

Underscoring NVIDIA’s growing relationship with the global technology superpower, Indian Prime Minister Narendra Modi met with NVIDIA founder and CEO Jensen Huang Monday evening. The meeting at 7 Lok Kalyan Marg (as the Prime Minister’s official residence in New Delhi is known) comes as Modi prepares to host a gathering of leaders from…

Misc

Speeding Up Text-To-Speech Diffusion Models by Distillation

Post date September 1, 2023

Every year, as part of their coursework, students from the University of Warsaw, Poland get to work under the supervision of engineers from the NVIDIA Warsaw office on challenging problems in deep learning and accelerated computing. We present the work of three M.Sc. students—Alicja Ziarko, Paweł Pawlik, and Michał Siennicki—who managed to significantly reduce the latency in TorToiSe, a multi-stage, diffusion-based, text-to-speech (TTS) model.

Alicja, Paweł, and Michał first learned about the recent advancements in speech synthesis and diffusion models. They chose the combination of classifier-free guidance and progressive distillation, which performs well in computer vision, and adapted it to speech synthesis, achieving a 5x reduction in diffusion latency without a regression in speech quality. Small perceptual speech tests confirmed the results. Notably, this approach does not require costly training from scratch on the original model.

Why speed up diffusion-based TTS?

Since the publication of WaveNet in 2016, neural networks have become the primary models for speech synthesis. In simple applications, such as synthesis for AI-based voice assistants, synthetic voices are almost indistinguishable from human speech. Such voices can be synthesized orders of magnitude faster than real time, for instance with the NVIDIA NeMo AI toolkit.

However, achieving high expressivity or imitating a voice based on a few seconds of recorded speech (few-shot) is still considered challenging.

Denoising Diffusion Probabilistic Models (DDPMs) emerged as a generative technique that enables the generation of images of great quality and expressivity based on input text. DDPMs can be readily applied to TTS because a frequency-based spectrogram, which graphically represents a speech signal, can be processed like an image.

For instance, in TorToiSe, which is a guided diffusion-based TTS model, a spectrogram is generated by combining the results of two diffusion models (Figure 1). The iterative diffusion process involves hundreds of steps to achieve a high-quality output, significantly increasing latency compared to state-of-the-art TTS methods, which severely limits its applications.

In Figure 1, the unconditional diffusion model iteratively refines the initial noise until a high-quality spectrogram is obtained. The second diffusion model is further conditioned on the text embeddings produced by the language model.

Diagram shows a speech spectrogram generated by combining the results of two diffusion models. After numerous iterations, the expected speech spectrogram is obtained. — *Figure 1. Architecture of TorToiSe, a diffusion-based neural network for TTS*
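The combination of the two model outputs can be sketched with the standard classifier-free guidance formula. This is a minimal illustration in plain Python, not TorToiSe code: the function name, the toy noise predictions, and the guidance scale of 2.0 are assumptions.

```python
from typing import List


def guided_prediction(
    unconditional: List[float],
    conditional: List[float],
    guidance_scale: float,
) -> List[float]:
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one, scaled by guidance_scale."""
    if len(unconditional) != len(conditional):
        raise ValueError("predictions must have the same length")
    return [
        u + guidance_scale * (c - u)
        for u, c in zip(unconditional, conditional)
    ]


# Toy noise predictions for three spectrogram bins (hypothetical values).
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]
print(guided_prediction(eps_uncond, eps_cond, guidance_scale=2.0))
```

Because both predictions are needed at every diffusion step, guidance of this kind requires two model evaluations per step.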

Methods for speeding up diffusion

Existing latency reduction techniques in diffusion-based TTS can be divided into training-free and training-based methods.

Training-free methods do not involve training the network used to generate images by reversing the diffusion process. Instead, they only focus on optimizing the multi-step diffusion process. The diffusion process can be seen as solving ODE/SDE equations, so one way to optimize it is to create a better solver like DDPM, DDIM, and DPM, which lowers the number of diffusion steps. Parallel sampling methods, such as those based on Picard iterations or Normalizing Flows, can parallelize the diffusion process to benefit from parallel computing on GPUs.

Training-based methods focus on optimizing the network used in the diffusion process. The network can be pruned, quantized, or sparsified, and then fine-tuned for higher accuracy. Alternatively, its neural architecture can be changed manually or automatically using NAS. Knowledge distillation techniques enable distilling the student network from the teacher network to reduce the number of steps in the diffusion process.

Distillation in diffusion-based TTS

Alicja, Paweł, and Michał decided to use the distillation approach based on promising results in computer vision and its potential for an estimated 5x reduction in latency of the diffusion model at inference. They have managed to adapt progressive distillation to the diffusion part of a pretrained TorToiSe model, overcoming problems like the lack of access to the original training data.

Their approach consists of two knowledge distillation phases:

Mimicking the guided diffusion model output
Training another student model

In the first knowledge distillation phase (Figure 2), the student model is trained to mimic the output of the guided diffusion model at each diffusion step. This phase reduces latency by half by combining the two diffusion models into one model.

To address the lack of access to the original training data, text embeddings from the language model are passed through the original teacher model to generate synthetic data used in distillation. The use of synthetic data also makes the distillation process more efficient because the entire TTS, guided diffusion pipeline does not have to be invoked at each distillation step.

Diagram shows a two-step distillation pipeline. First, the student model is trained (distilled) to mimic the output of the guided diffusion model at each diffusion step. In the second phase, the newly trained student model serves as a teacher to another student model, with a reduced number of steps, using progressive distillation. — *Figure 2. Distillation of guided diffusion-based TTS model*

In the second progressive distillation phase (Figure 3), the newly trained student model serves as a teacher to train another student model. In this technique, the student model is trained to mimic the teacher model while reducing the number of diffusion steps by a factor of two. This process is repeated many times to further reduce the number of steps, while each time, a new student serves as the teacher for the next round of distillation.

Progressive distillation with seven iterations reduces the number of inference steps by a factor of 2^7 = 128, from the 4,000 steps on which the model was trained to 31 steps. This reduction results in a 5x speedup compared to the guided diffusion model, excluding the text embedding calculation cost.
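The step count can be checked with a short Python sketch. It assumes that each iteration halves the step count with integer (floor) division, which is one way to arrive at 31; the post does not state how odd step counts are rounded. The baseline of 2×80 diffusion steps is taken from the Table 1 header later in this post, and the function name is hypothetical.

```python
def steps_after_distillation(initial_steps: int, iterations: int) -> int:
    """Halve the number of diffusion steps once per distillation iteration."""
    steps = initial_steps
    for _ in range(iterations):
        steps //= 2  # assumption: odd counts round down
    return steps


final_steps = steps_after_distillation(4000, 7)
print(final_steps)  # 4000 -> 2000 -> 1000 -> 500 -> 250 -> 125 -> 62 -> 31

# Baseline: two diffusion models at 80 steps each (from the Table 1 header).
baseline_evaluations = 2 * 80
print(round(baseline_evaluations / final_steps, 1))  # roughly 5x fewer evaluations
```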

Diagram shows two steps of progressive distillation. In each step, the number of steps required to transform the Gaussian noise to the output speech spectrogram is reduced by a factor of two, from 4 to 2, and then 2 to 1. — *Figure 3. Example of two iterations of progressive distillation*

The perceptual pairwise speech test shows that the distilled model (after the second phase) matches the quality of speech produced by the TTS model based on guided diffusion.

As an example, listen to audio samples in Table 1 generated by the progressive distillation-based TTS model. The samples match the quality of the audio samples from the guided diffusion-based TTS model. If the number of diffusion steps is simply reduced to 31, instead of using progressive distillation, the quality of the generated speech deteriorates significantly.

Speaker	Guided diffusion-based TTS model (2×80 diffusion steps)	Diffusion-based TTS after progressive distillation (31 diffusion steps)	Guided diffusion-based TTS model (naive reduction to 31 diffusion steps)
Female 1	Audio	Audio	Audio
Female 2	Audio	Audio	Audio
Female 3	Audio	Audio	Audio
Male 1	Audio	Audio	Audio

Table 1: Audio samples generated by diffusion-based TTS compared to the two baseline models

Conclusion

Collaborating with academia and assisting young students in shaping their future in science and engineering is one of the core NVIDIA values. Alicja, Paweł, and Michał’s successful project exemplifies the NVIDIA Warsaw, Poland office partnership with local universities.

The students managed to solve the challenging problem of speeding up the pretrained, diffusion-based, text-to-speech (TTS) model. They designed and implemented a knowledge distillation-based solution in the complex field of diffusion-based TTS, achieving a 5x speedup of the diffusion process. Most notably, their unique solution based on synthetic data generation is applicable to pretrained TTS models without access to the original training data.

We encourage you to explore NVIDIA Academic Programs and try out the NVIDIA NeMo Framework to create complete conversational AI (TTS, ASR, or NLP/LLM) solutions for the new era of generative AI.

Misc

Advanced API Performance: Shaders

Post date September 1, 2023

This post covers best practices when working with shaders on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Shaders play a critical role in graphics programming by enabling you to control various aspects of the rendering process. They run on the GPU and are responsible for manipulating vertices, pixels, and other data.

General shaders
Compute shaders
Pixel shaders
Vertex shaders
Geometry, domain, and hull shaders

General shaders

These tips apply to all types of shaders.

Avoid warp-divergent constant buffer view (CBV) and immediate constant buffer (ICB) reads.
- Constant buffer reads are most effective when threads in a warp access data uniformly. If you need divergent reads, use shader resource views (SRVs).
- Typical cases where SRVs should be preferred over CBVs include the following:
  - Bones or skinning data
  - Lookup tables, like precomputed random numbers
To optimize buffers and group shared memory, use manual bit packing. When creating structures for packing data, consider the range of values a field can hold and choose the smallest datatype that can encompass this range.
Optimize control flow by providing hints of the expected runtime behavior.
- Make sure to enable compile flag -all-resources-bound for DXC (or D3DCOMPILE_ALL_RESOURCES_BOUND in FXC) if possible. This enables a larger set of driver-side optimizations.
- Consider using the [FLATTEN] and [BRANCH] keywords where appropriate.
  - A conditional branch may prevent the compiler from hoisting long-latency instructions, such as texture fetches.
  - The [FLATTEN] keyword hints that the compiler is free to hoist and start the load operations before the statement has been evaluated.
Use Root Signature 1.1 to specify static data and descriptors to enable the driver to make the most optimal shader optimizations.
Keep the register use to a minimum. Register allocation could limit occupancy and may force the driver to spill registers to memory.
Prefer the use of gather instructions when loading single channel texture quads.
- This will cut down the expected latency by almost 4x compared to the equivalent operation constructed from consecutive sample instructions.
Prefer structured buffers over raw buffers.
- Structured buffers have stricter alignment requirements, which enables the driver to schedule more efficient load instructions.
Consider using numerical approximations or precomputed lookup tables of transcendental functions (exp, log, sin, cos, sqrt) in math-intensive shaders, for instance, physics simulations and denoisers.
To promote a fast path in the TEX unit, with up to 2x speedup, use point filtering in certain circumstances:
- Low-resolution textures where point filtering is already an accurate representation.
- Textures that are being accessed at their native resolution.

Not recommended

Don’t assume that half-precision floats are always faster than full precision, or the reverse.
- On NVIDIA Ampere GPUs, it’s just as efficient to execute FP32 as FP16 instructions. The overhead of converting between precision formats may result in a net loss.
- NVIDIA Turing GPUs may benefit from using FP16 math, as FP16 can be issued at twice the rate of FP32.

Compute shaders

Compute shaders are used for general-purpose computations, from data processing and simulations to machine learning.

Consider using wave intrinsics over group shared memory when possible for communication across threads.
- Wave intrinsics don’t require explicit thread synchronization.
- Starting from SM 6.0, HLSL supports warp-wide wave intrinsics natively without the need for vendor-specific HLSL extensions. Consider using vendor-specific APIs only when the expected functionality is missing. For more information, see Unlocking GPU Intrinsics in HLSL.
- To increase atomic throughput, use wave instructions to coalesce atomic operations across a warp.
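The effect of coalescing atomics across a warp can be illustrated with a CPU-side model. The Python sketch below is a simulation, not shader code: the per-warp sum stands in for a wave reduction such as WaveActiveSum, and the single accumulation per warp stands in for an InterlockedAdd issued by the first lane only. The function names are hypothetical.

```python
WARP_SIZE = 32  # number of threads in an NVIDIA warp


def naive_atomic_add(values):
    """Every thread issues its own atomic add to the shared counter."""
    total = 0
    atomics = 0
    for v in values:
        total += v  # models one atomic add per thread
        atomics += 1
    return total, atomics


def wave_coalesced_atomic_add(values):
    """Each warp reduces its values first, then issues a single atomic add."""
    total = 0
    atomics = 0
    for start in range(0, len(values), WARP_SIZE):
        warp_values = values[start:start + WARP_SIZE]
        wave_sum = sum(warp_values)  # models a warp-wide reduction
        total += wave_sum            # models one atomic add by the first lane
        atomics += 1
    return total, atomics


values = list(range(256))  # one value per thread, eight warps in total
print(naive_atomic_add(values))           # (32640, 256)
print(wave_coalesced_atomic_add(values))  # (32640, 8)
```

Both variants produce the same total, but the coalesced version issues one atomic per warp instead of one per thread, which reduces contention on the destination address.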
To maximize cache locality and to improve L1 and L2 hit rate, try thread group ID swizzling for full-screen compute passes.
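One possible form of thread group ID swizzling is sketched below in Python. It remaps a linear dispatch index so that groups are visited in vertical strips rather than full-width rows, which keeps consecutively launched groups closer together in screen space. The strip width and the function name are assumptions for illustration, and this is not necessarily the remapping any particular driver or sample uses. In a shader, the same integer arithmetic would be applied to the group ID before computing pixel coordinates.

```python
def swizzle_group_id(linear_id, grid_w, grid_h, strip_w):
    """Map a linear group index to (x, y), walking the grid strip by strip."""
    full_strips = grid_w // strip_w
    groups_per_full_strip = strip_w * grid_h
    full_count = full_strips * groups_per_full_strip

    if linear_id < full_count:
        strip = linear_id // groups_per_full_strip
        within = linear_id % groups_per_full_strip
        width = strip_w
    else:
        # Remainder strip when grid_w is not a multiple of strip_w.
        strip = full_strips
        within = linear_id - full_count
        width = grid_w - full_strips * strip_w

    x = strip * strip_w + within % width
    y = within // width
    return x, y


grid_w, grid_h, strip_w = 10, 4, 4
order = [swizzle_group_id(i, grid_w, grid_h, strip_w) for i in range(grid_w * grid_h)]
print(order[:8])  # the first two rows of the first strip
```

The mapping is a bijection, so every group is still processed exactly once. Only the launch order changes, and the best strip width has to be found by profiling.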
A good starting point is to target a thread group size corresponding to between two and eight warps. For instance, use a thread group size of 8x8x1 or 16x16x1 for full-screen passes. Make sure to profile your shader and tune the dimensions based on profiling results.
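The relationship between thread group dimensions and warps is simple arithmetic, shown here as a short Python sketch. It assumes the 32-thread warp used by NVIDIA GPUs; the helper names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warps_per_group(x, y, z=1):
    """Number of warps needed to run one thread group (rounded up)."""
    threads = x * y * z
    return -(-threads // WARP_SIZE)  # ceiling division


def idle_lanes(x, y, z=1):
    """Lanes left unused when the group size is not a multiple of a warp."""
    return warps_per_group(x, y, z) * WARP_SIZE - x * y * z


print(warps_per_group(8, 8, 1))    # 2
print(warps_per_group(16, 16, 1))  # 8
print(idle_lanes(8, 5, 1))         # 24
```

Group sizes that are not a multiple of 32 threads, such as 8x5x1, leave lanes idle in the last warp, which is one reason to prefer dimensions whose product is a multiple of the warp size.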

Not recommended

Don’t choose a thread group size that is difficult to scale across platforms and GPU architectures.
- Specialization constants can be used in Vulkan to set the dimensions at pipeline creation time whereas HLSL requires the thread group size to be known at shader compile time.
Don’t ignore thread group launch latency.
- If your CS has early-out conditions that are expected to early out in most cases, it might be better to choose larger thread group dimensions and cut down on the total number of thread groups launched.
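To see how group dimensions affect the number of groups launched, the Python sketch below counts thread groups for a full-screen dispatch. The 1920x1080 resolution is only an example, and the helper names are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def group_count(width, height, group_x, group_y):
    """Thread groups needed to cover a width x height full-screen pass."""
    return ceil_div(width, group_x) * ceil_div(height, group_y)


print(group_count(1920, 1080, 8, 8))    # 32400
print(group_count(1920, 1080, 16, 16))  # 8160
```

Moving from 8x8 to 16x16 groups cuts the launch count by roughly a factor of four. When most groups exit early, that reduction in launches can outweigh the finer granularity of the smaller groups.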

Pixel shaders

Pixel shaders, also known as fragment shaders, are used to calculate effects on a per-pixel basis.

Prefer the use of depth bounds test or stencil and depth testing over manual depth tests in pixel shaders.
Depth and stencil tests may discard entire 16×16 raster tiles down to individual pixels. Make sure that Early-Z is enabled.
Be mindful of the use patterns that may force the driver to disable Early-Z testing:
- Conditional z-writes such as clip and discard
  - As an alternative, consider using null blend ops.
- Pixel shader depth write
- Writing to UAV resources
Consider converting your full-screen pass to a compute shader if there’s a large difference in latency between warps.

Not recommended

Don’t use raster order view (ROV) techniques pervasively.
- Guaranteeing order doesn’t come for free.
- Always compare with alternative approaches like advanced blending ops and atomics.

Vertex shaders

Vertex shaders are used to calculate effects on a per-vertex basis.

Geometry, domain, and hull shaders

Geometry, domain, and hull shaders are used to control, evaluate, and generate geometry, enabling tessellation to generate surfaces and objects dynamically.

Replace the geometry, domain, and hull shaders with the mesh shading capabilities introduced in NVIDIA Turing.
Enable the fast geometry path with the following configuration:
- Fixed topology: Neither an expansion nor a reduction in the number of vertices.
- Fixed primitive type: The input primitive type is equal to the output primitive type.
- Immutable per-vertex attributes: The application cannot change the vertex attributes and can only copy them from the input to the output.
- Mutable per-primitive attributes: The application can compute a single value for the whole primitive, which then is passed to the fragment shader stage. For example, it can compute the area of the triangle.

Acknowledgments

Thanks to Ryan Prescott, Ana Mihut, Katherine Sun, and Ivan Fedorov.
