Categories
Misc

What Is Robotics Simulation?

Robots are moving goods in warehouses, packaging foods and helping assemble vehicles — when they’re not flipping burgers or serving lattes. How did they get so skilled so fast? Robotics simulation. Making rapid progress, it’s transforming industries all around us. Robotics simulation summarized: a robotics simulator places a virtual robot in virtual environments to…

Categories
Misc

‘Remnant II’ Headlines 14 Games Joining GeForce NOW in July

It’s a jam-packed July with 14 newly supported titles in the GeForce NOW library, including Remnant II from Gunfire Games and Gearbox Publishing. Need a new adventure? Check out the nine additions streaming from the cloud this week. Plus, the Steam Summer Sale kicks off this week, and many supported titles in the GeForce NOW library…

Categories
Misc

Calm, Cool and Creative: MUE Studio Showcases 3D Scenes ‘In the NVIDIA Studio’

MUE Studio, founded by 3D artists Minjin Kang and Mijoo Kim, specializes in art direction, photography and 3D design for campaigns and installations.

Categories
Misc

Matice Founder and Harvard Professor Jessica Whited on Harnessing Regenerative Species – and AI – for Medical Breakthroughs

Scientists at Matice Biosciences are using AI to study the regeneration of tissues in animals known as super-regenerators, such as salamanders and planarians. The goal of the research is to develop new treatments that will help humans heal from injuries without scarring. On the latest episode of NVIDIA’s AI Podcast, host Noah Kravitz spoke with Whited…

Categories
Misc

How to Deploy an AI Model in Python with PyTriton

AI models are everywhere, in the form of chatbots, classification and summarization tools, image models for segmentation and detection, recommendation models, and more. AI and machine learning (ML) models help automate many business processes, generate insights from data, and deliver new experiences. 

Python is one of the most popular languages used in AI/ML development. In this post, you will learn how to use NVIDIA Triton Inference Server to serve models within your Python code and environment using the new PyTriton interface.

More specifically, you will learn how to prototype and test inference of an AI model in a Python development environment with a production-class tool, and how to go to production with the PyTriton interface. You will also learn the advantages of using PyTriton, compared to a generic web framework like FastAPI or Flask. The post includes several code examples to illustrate how you can activate high-performance batching, preprocessing, and multi-node inference; and implement online learning.

What is PyTriton?

PyTriton is a simple interface that enables Python developers to use Triton Inference Server to serve AI models, simple processing functions, or entire inference pipelines within Python code. Triton Inference Server is an open-source multi-framework inference serving software with high performance on CPUs and GPUs.

PyTriton enables rapid prototyping and testing of ML models while achieving performance and efficiency with, for example, high GPU utilization. A single line of code brings up Triton Inference Server, providing benefits such as dynamic batching, concurrent model execution, and support for GPU and CPU from within the Python code. 

PyTriton removes the need to set up model repositories and port models from the development environment to production. Existing inference pipeline code can also be used without modification. This is especially useful for newer types of frameworks like JAX, or complex pipelines that are part of the application code without dedicated backends in Triton Inference Server.

Simplicity of Flask

Flask and FastAPI are generic Python web frameworks used to deploy a wide variety of Python applications. Because of their simplicity and widespread adoption, many developers use them to deploy and run AI models in production. However, significant drawbacks to this approach include the following:

  • General-purpose web servers lack support for AI inference features. There is no out-of-box support to take advantage of accelerators like GPUs, or to turn on dynamic batching or multi-node inference.
  • Users need to build logic to meet the demands of specific use cases, like audio/video streaming input, stateful processing, or preprocessing the input data to fit the model.
  • Metrics on compute and memory utilization or inference latency are not easily accessible to monitor application performance and scale.

Triton Inference Server includes built-in support for features like those listed above, and many more. PyTriton provides the simplicity of Flask and the benefits of Triton in Python. An example deployment of a Hugging Face BERT model (using JAX) with PyTriton is shown below:

import logging
 
import numpy as np
from transformers import BertTokenizer, FlaxBertModel  # pytype: disable=import-error
 
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton
 
logger = logging.getLogger("examples.huggingface_bert_jax.server")
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(name)s: %(message)s")
 
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = FlaxBertModel.from_pretrained("bert-base-uncased")
 
 
@batch
def _infer_fn(**inputs: np.ndarray):
    (sequence_batch,) = inputs.values()

    # need to convert dtype=object to bytes first
    # and decode unicode bytes
    sequence_batch = np.char.decode(sequence_batch.astype("bytes"), "utf-8")

    last_hidden_states = []
    for sequence_item in sequence_batch:
        tokenized_sequence = tokenizer(sequence_item.item(), return_tensors="jax")
        results = model(**tokenized_sequence)
        last_hidden_states.append(results.last_hidden_state)
    last_hidden_states = np.array(last_hidden_states, dtype=np.float32)
    return [last_hidden_states]


with Triton() as triton:
    logger.info("Loading BERT model.")
    triton.bind(
        model_name="BERT",
        infer_func=_infer_fn,
        inputs=[
            Tensor(name="sequence", dtype=np.bytes_, shape=(1,)),
        ],
        outputs=[
            Tensor(name="last_hidden_state", dtype=np.float32, shape=(-1,)),
        ],
        config=ModelConfig(max_batch_size=128),
    )
    logger.info("Serving inference")
    triton.serve()
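
With the server above running, a separate Python process can send it requests over HTTP. The snippet below is a minimal client sketch, assuming the server is reachable on the default local port; the example sentence is arbitrary, and PyTriton’s ModelClient is used to query the "BERT" endpoint:

import numpy as np

from pytriton.client import ModelClient

# Assumes the BERT server above is listening on the default local HTTP port.
sequence = np.array([["PyTriton makes serving this model straightforward."]])
sequence = np.char.encode(sequence, "utf-8")  # match the np.bytes_ "sequence" input

with ModelClient("localhost:8000", "BERT") as client:
    result = client.infer_batch(sequence=sequence)

print(result["last_hidden_state"].shape)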

PyTriton offers an interface familiar to Flask users for easy installation and setup, and provides the following benefits: 

  • Bring up NVIDIA Triton with a single line of code
  • No need to set up model repositories and model format conversion (important for a high-performance implementation using Triton Inference Server)
  • Use of existing inference pipeline code without modification
  • Support for many decorators to adapt model input  

Whether working on a generative AI application or any other model, PyTriton enables you to gain the benefits of Triton Inference Server in your own development environment. It helps take advantage of the GPU to produce an inference response in a very short time (milliseconds or seconds, depending on the use case). It also helps run the GPU at high capacity and serve many inference requests at the same time, keeping infrastructure costs low.

PyTriton code examples

This section provides a few code examples you can use to get started with PyTriton. They begin on a local machine, which is ideal for testing and prototyping, and provide Kubernetes configurations for scaled deployment. 

Dynamic batching support

Dynamic batching is a key differentiator between Flask/FastAPI and PyTriton: it batches inference requests for the model from multiple calling applications while still meeting latency requirements. Two examples are HuggingFace BART PyTorch and HuggingFace ResNET PyTorch.
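
The sketch below shows roughly what this looks like in code; the model name, tensor shapes, batch size, and queue delay are illustrative assumptions rather than values taken from those examples. The @batch decorator hands the inference callable an already merged batch, and a DynamicBatcher in the ModelConfig tells Triton how long it may wait while grouping incoming requests:

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import DynamicBatcher, ModelConfig, Tensor
from pytriton.triton import Triton


@batch
def _infer_fn(**inputs: np.ndarray):
    # Requests from many clients arrive here already merged into one batch.
    (data,) = inputs.values()
    return [data * 2]  # placeholder computation standing in for a real model


with Triton() as triton:
    triton.bind(
        model_name="batched_model",
        infer_func=_infer_fn,
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="result", dtype=np.float32, shape=(-1,))],
        # Collect requests for up to 100 microseconds to form batches of up to 64.
        config=ModelConfig(max_batch_size=64, batcher=DynamicBatcher(max_queue_delay_microseconds=100)),
    )
    triton.serve()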

Online learning

Online learning is learning from new data continuously in production. With PyTriton, you can control the number of distinct model instances backing your inference server. This enables you to train and serve the same model simultaneously from two different endpoints. Learn more about how to use PyTriton to train and infer models at the same time on the MNIST dataset.
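
A rough sketch of the pattern follows; the endpoint names, tensor shapes, and update rule are hypothetical stand-ins rather than the MNIST example itself. One Triton instance binds an inference callable and a training callable that share the same in-process model state:

import numpy as np

from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Hypothetical shared model state updated by the training endpoint.
model_weights = {"w": np.zeros((10,), dtype=np.float32)}


@batch
def _infer_fn(**inputs: np.ndarray):
    (features,) = inputs.values()
    scores = features @ model_weights["w"]  # placeholder forward pass
    return [scores.astype(np.float32).reshape(-1, 1)]


@batch
def _train_fn(**inputs: np.ndarray):
    (features,) = inputs.values()
    model_weights["w"] += features.mean(axis=0)  # placeholder update step
    return [np.ones((features.shape[0], 1), dtype=np.float32)]  # acknowledge the batch


with Triton() as triton:
    # Two endpoints in one process: predictions keep flowing while updates arrive.
    triton.bind(
        model_name="ModelInfer",
        infer_func=_infer_fn,
        inputs=[Tensor(name="features", dtype=np.float32, shape=(10,))],
        outputs=[Tensor(name="score", dtype=np.float32, shape=(1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.bind(
        model_name="ModelTrain",
        infer_func=_train_fn,
        inputs=[Tensor(name="features", dtype=np.float32, shape=(10,))],
        outputs=[Tensor(name="ack", dtype=np.float32, shape=(1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.serve()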

Multi-node inference of large language models

Large language models (LLMs) that are too large to fit into a single GPU's memory must be partitioned across multiple GPUs, and in certain cases across multiple nodes, for inference. Check out an example using the Hugging Face OPT model in JAX with inference done on multiple nodes. 

See NeMo Megatron GPT model deployment for a second example that uses the NVIDIA NeMo 1.3B parameter model. The multi-node inference deployment orchestration is shown using both Slurm and Kubernetes.

Stable Diffusion

With PyTriton, you can use preprocessing decorators to perform advanced batching operations, like batching together images of the same size using simple definitions:

@batch
@group_by_values("img_size")
@first_value("img_size")
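
As a minimal sketch of how these decorators compose (the input names, the stubbed generation function, and the output layout are assumptions for illustration, not the Hugging Face example itself): group_by_values splits each incoming batch into sub-batches that share the same img_size, and first_value then passes that common value through as a single scalar:

import numpy as np

from pytriton.decorators import batch, first_value, group_by_values


def generate_images(prompts, height, width):
    # Stand-in for a real image-generation pipeline call (for example, a diffusers pipeline).
    return [np.zeros((height, width, 3), dtype=np.uint8) for _ in prompts]


@batch
@group_by_values("img_size")
@first_value("img_size")
def _infer_fn(prompt: np.ndarray, img_size: np.int64):
    # Every prompt in this call shares one img_size, so the images can be generated together.
    prompts = [p.decode("utf-8") for p in prompt.ravel()]
    images = generate_images(prompts, height=int(img_size), width=int(img_size))
    return {"image": np.stack(images)}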

To learn more, check out this example that uses the Stable Diffusion 1.5 image generation pipeline from Hugging Face.

Summary

PyTriton provides a simple interface that enables Python developers to use NVIDIA Triton Inference Server to serve a model, a simple processing function, or an entire inference pipeline. This native support for Triton Inference Server in Python enables rapid prototyping and testing of ML models with performance and efficiency. A single line of code brings up Triton Inference Server. Dynamic batching, concurrent model execution, and support for GPU and CPU from within the Python code are among the benefits. PyTriton offers the simplicity of Flask and the benefits of Triton Inference Server in Python. 

Try PyTriton using the examples in this post, or using your own model. See Migrating to the Triton Inference Server for information on migrating from Flask to PyTriton and Triton Inference Server. To learn more, visit the Triton Inference Server page and PyTriton repository on GitHub.

Categories
Misc

ICYMI: Exploring Challenges Posed by Biased Datasets Using RAPIDS cuDF

Read about an innovative GPU solution that solves limitations using small biased datasets with RAPIDS cuDF.

Categories
Offsites

How They Fool Ya | Mathematical parody of Hallelujah

Categories
Misc

Optimizing Memory with NVIDIA Nsight Systems

NVIDIA Nsight Systems is a comprehensive tool for tracking application performance across CPU and GPU resources. It helps ensure that hardware is being efficiently used, traces API calls, and gives insight into inter-node network communication by describing how low-level metrics sum to application performance and finding where it can be improved.

Nsight Systems can scale to cluster-size problems with multi-node analysis, but it can also be used to find simple performance improvements when you’re just starting your optimization journey. For example, Nsight Systems can be used to see where memory transfers are more expensive than expected. A quick look at memory activity can catch and correlate performance penalties and suggest how to resolve them.

In this deep dive, I look at the changes between GROMACS 2019 and GROMACS 2020. I go step-by-step with Nsight Systems to find GPU and memory optimization opportunities in the previous version and examine how they were resolved in the newer one.

GROMACS 2019

GROMACS is a versatile and widely used software package designed for molecular dynamics (MD) simulations of biomolecules, such as proteins, lipids, and nucleic acids. It’s used to help researchers examine important biological processes for disease prevention and treatment.

I analyzed GROMACS 2019 using Nsight Systems on an Arm server system with NVIDIA Volta GPUs. Nsight Systems includes both a user interface and a CLI, called nsys.

The following command runs nsys to collect trace information for CUDA and NVTX with no CPU sampling. It’s used here to gather performance data from the target Arm server, which can then be analyzed in the Nsight Systems user interface as usual. For more information about nsys and the profiling options, see Example Single Command Lines.

nsys profile -t cuda,nvtx -s none gmx mdrun -dlb no -notunepme -noconfout -nsteps 3000

Figure 1 shows executing this command and opening the report file.

Figure 1. Nsight Systems command line performance analysis of GROMACS

Nsight Systems uses level of detail (LOD) scaling to show you how much GPU usage is under each pixel on the timeline. However, at this zoom level, the data is way too dense to see real patterns. Zooming in reveals the granularity of what Nsight Systems captured.

Likewise, there is information about the GPU activity, but details are organized under drop-downs. Expand the GPU line to reveal information about streams and contexts.

Figure 2. Hardware-level throughput metrics for GROMACS

In Figure 2, you can start to see a repetitive pattern of memory transfers and kernel activity on the GPU. Repetitive patterns are normal, but you can also see that a lot of time is being spent on memory transfers. There are gaps in the GPU activity that indicate opportunities for improvement.

Figure 3 zooms in further to 0.1 sec clock time.

Figure 3. GPU activity pattern

At this resolution, you can clearly see the pattern of activity on the GPU. In each repetition, there is a host-to-device memory transfer (green), some kernel work (blue), and then a memory transfer back to the host (magenta). You can also see the pause after each kernel is run, before the new memory transfer starts.

Figure 3 shows shift-dragging across a single iteration, from the end of the previous device-host transfer to the end of this iteration’s device-host transfer (the selection marked in green). Right-clicking here brings up the menu, and you can zoom in to isolate this iteration.

Figure 4. Profiling GPU throughput patterns with Nsight Systems

A few things marked on Figure 4 are worth pointing out:

  1. The [All Streams] row is an amalgamation of the streams. Focus on the breakdown, and notice that there is one active context with two active streams (Stream 22 and 23) and that the Default stream is inactive. Each stream is spending more than half of its time on memory transfers. This high percentage is a strong indication that you must optimize the size of memory transfers and consider moving them to the default stream.
  2. The iteration takes ~19 ms, but 3 ms of that time is in an empty gap between chunks of activity. This implies that there is a ~17% speedup waiting to be captured. A good next step would be to run a CPU profile to check for a programmatic reason for the pause.
  3. Memory transfers are serialized. Stream 23 gets data from the host, while Stream 22 waits, doing no work until it can get its data. The same thing happens after the work is complete.
  4. The most dominant work kernel is nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_cuda. I suggest resolving gaps on the CPU and GPU timelines before diving deeper into individual kernels.

To look for the most expensive kernels, right-click the [All Streams] row and choose Show in Events View (Figure 5).

Figure 5. CUDA kernel profiling in Nsight Systems

Sorting the Events view by duration reveals the most expensive kernels. Different instances vary a bit, but the most expensive is currently ~10 msec. Clicking on a particular kernel instance highlights all the events that correlate with it on the timeline in cyan.

In Figure 6, a cyan marker on a closed row indicates that something in that row is highlighted. Arrows at the top or left-hand side of the timeline indicate that something highlighted is currently off the screen.

Figure 6. Nsight Systems kernel analysis

Alternatively, you can right-click in the event table and choose Show Current in Timeline to zoom the timeline to that kernel.

Figure 7. Launching Nsight Compute from Nsight Systems

Most of the time, you concentrate on removing timeline gaps and making sure that work is distributed to all CPUs and GPUs before diving into individual kernels. When that time comes, or if kernel performance is out of line with your expectations, you can right-click the kernel of interest and choose Analyze the Selected Kernel with NVIDIA Nsight Compute.

Nsight Compute is a powerful standalone tool for profiling, debugging, and optimizing CUDA kernels. You can launch its user interface or CLI directly from the problematic kernel in Nsight Systems, if it’s installed on your machine.

GROMACS 2020

Improvements were made to GROMACS in collaboration between NVIDIA and the Stockholm GROMACS development team. For more information about the specifics, see Bringing Gromacs Up to Speed on Modern Multi-GPU Systems.

There are a couple of changes that you may notice:

  • The nbnxn_gpu_x_to_nbat_x_ kernel is the GPU implementation of a function to convert coordinate data of atoms from “XYZ” into a structure that also has charge in it for data locality: XYZQ. In GROMACS 2019, this was done on the CPU. The 2020 release resized and moved memory transfers to the default stream.
  • The long kernel nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_cuda now enables overlapped kernel execution between the two streams.

Figure 8. Nsight Systems profile of GROMACS 2020

Compare the results in Figure 8 to Figure 3. Both show ~100 msec of wall clock time. As you can see in Figure 8, a lot more of the GPU time is actually doing work. GPU usage has clearly been optimized between versions.

Almost all the memory transfer work has been moved onto the default stream, meaning that the two active streams have little time spent in memory transfer operations (4% and 2.5% compared to over 50% in GROMACS 2019). Small memory transfers have also been combined to larger sizes that still fit in transfer buffers.

Figure 9. Zooming in to a single iteration of GROMACS 2020

Comparing Figure 4 (one work slice of GROMACS 2019) to Figure 9 (one work slice of GROMACS 2020) is not quite a fair comparison. As I mentioned earlier, much of the memory transfer was bundled up and shifted to the default stream. There are several interesting things to note:

  • There is much more parallel kernel execution between the two streams.
  • The nbnxn_gpu_x_to_nbat_x_ kernel is being run on the GPU before the expensive calculation. Moving this to the GPU instead of the CPU has not only filled up some of the gap time, but it also removed some of the memory transfer overhead.
  • The expensive kernel, nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_cuda, has not gotten any faster. The longest instance is still roughly 10 ms. However, because it is fed data more efficiently, the whole iteration of work is reduced to roughly 10 ms instead of the 19 ms seen in GROMACS 2019.

Now that the gaps are cleared out, nbnxn_kernel_ElecEw_VdwLJCombGeom_VF_cuda is dominating the run time, indicating that your performance tuning journey should move on to that kernel next.

Key takeaways

Improving CUDA kernels is not the first thing that you should do when optimizing GPU code. Improving memory utilization and making sure that the GPU is “fed” enough makes the most immediate impact.

This can be especially important when initially refactoring for GPU or when moving to new generations of GPU hardware. If the GPU is starved for work, you will never realize the performance gains available.

Ready to get started?

This is just the beginning of all the system performance information that Nsight Systems has to offer.

NVIDIA Nsight Systems is available from the NVIDIA CUDA Toolkit public download. You can also obtain the most recent, updated Nsight Systems with enhancements and fixes beyond the version shipping in the NVIDIA CUDA Toolkit on the Nsight Systems page.

For more information about GROMACS, see http://manual.gromacs.org/documentation/.

For other recent posts, and for a full list of posts, videos, and other tutorials, see the Other Resources section of the Nsight Systems documentation.

Have a question? Post it to the NVIDIA forums using NVIDIA Nsight Systems. Drop a message at nsight-systems-feedback@nvidia.com. Or just choose Feedback in the application to let us know what you are seeing and what you think.

Categories
Misc

Improving GPU Performance by Reducing Instruction Cache Misses

GPUs are specially designed to crunch through massive amounts of data at high speed. They have a large amount of compute resources, called streaming multiprocessors (SMs), and an array of facilities to keep them fed with data: high bandwidth to memory, sizable data caches, and the capability to switch to other teams of workers (warps) without any overhead if an active team has run out of data. 

Yet data starvation may still occur, and much of code optimization focuses on that issue. In some cases, SMs are starved not for data, but for instructions. This post presents an investigation of a GPU workload that experiences a slowdown due to instruction cache misses. It describes how to identify this bottleneck, as well as techniques for removing it to improve performance.

Recognizing the problem

The origin of this investigation is an application from the domain of genomics in which many small, independent problems related to aligning small sections of a DNA sample with a reference genome need to be solved. The background is the well-known Smith-Waterman algorithm (but that by itself is not important for the discussion). 

Running the program on a medium-sized dataset on the powerful NVIDIA H100 Hopper GPU, with 114 SMs, showed good promise. The NVIDIA Nsight Compute (NCU) tool, which analyzes a program’s execution on the GPU, confirmed that the SMs were quite busy with useful computations, but there was a snag.

So many of the small problems composing the overall workload—each handled by its own thread—could be run on the GPU simultaneously that not all compute resources were fully used all the time. This is expressed as a small and non-integral number of waves. Work for the GPU is divided into chunks called thread blocks, and one or more can reside on an SM. If some SMs receive fewer thread blocks than others, they will run out of work and must idle while the other SMs continue working.

Filling up all SMs completely with thread blocks constitutes one wave. NCU dutifully reports the number of waves per SM. If that number happens to be 100.5, it means that not all SMs have the same amount of work to do, and that some are forced to idle. But the impact of the uneven distribution is not substantial. Most of the time, the load on the SMs is balanced. That situation changes if the number of waves is just 0.5, for example. For a much larger percentage of the time, SMs experience an uneven work distribution, which is called the “tail” effect.
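
As a back-of-the-envelope illustration of that metric (the block counts below are invented for the example, not measurements from this workload), waves per SM is simply the launched grid size divided by the number of thread blocks the whole GPU can hold at once:

import math

num_sms = 114      # SM count of the NVIDIA H100 used in this post
blocks_per_sm = 4  # assumed number of resident thread blocks per SM for this kernel
grid_size = 740    # assumed number of thread blocks launched

capacity = num_sms * blocks_per_sm  # blocks the GPU can run concurrently
waves = grid_size / capacity        # what NCU reports as waves per SM
tail = waves - math.floor(waves)    # the fractional last wave leaves some SMs idle

print(f"{waves:.2f} waves: {math.floor(waves)} full wave(s) plus a {tail:.0%} tail")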

Addressing the tail effect

This phenomenon is exactly what materialized with the genomics workload. The number of waves was just 1.6. The obvious solution is to give the GPU more work to do (more threads, leading to more warps of 32 threads each), which is usually not a problem. The original workload was relatively modest, and in a practical environment larger problems need to be completed. However, increasing the original workload (1x) by doubling (2x), tripling (3x), and quadrupling (4x) the number of subproblems caused performance to deteriorate rather than improve. What could cause this outcome?

The combined NCU report of those four workload sizes sheds light on the situation. In the section called Warp State, which lists the reasons threads cannot make progress, the value for ‘No Instruction’ stands out for increasing significantly with workload size (Figure 1). 

‘No Instruction’ means the SMs could not be fed instructions fast enough from memory. ‘Long Scoreboard’ indicates the SMs could not be fed data fast enough from memory. Fetching instructions on time is so critical that the GPU provides a number of stations where instructions can be placed once they have been fetched to keep them close to the SMs. Those stations are called instruction caches, and there are even more levels of them than data caches.

Figure 1. Screenshot of warp stall reasons for four workload sizes from the combined NVIDIA Nsight Compute report

To understand where the instruction caching bottleneck occurred, our team ran the same workloads again, but this time instructed NCU to gather more information than before, using a feature called Metrics. This feature is used to specify a user-defined list of performance counters that are not included in the regular performance reports. In this particular case, a broad array of counters was used related to instruction caches:

gcc__raw_l15_instr_hit, gcc__raw_l15_instr_hit_under_miss, gcc__raw_l15_instr_miss, sm__icc_requests, sm__icc_requests_lookup_hit, sm__icc_requests_lookup_miss, sm__icc_requests_lookup_miss_covered, sm__icc_requests_lookup_miss_to_gcc, sm__raw_icc_covered_miss, sm__raw_icc_covered_miss_tpc, sm__raw_icc_hit, sm__raw_icc_hit_tpc, sm__raw_icc_request_tpc_1b_apm, sm__raw_icc_true_hits_tpc_1b_apm, sm__raw_icc_true_miss, sm__raw_icc_true_miss_tpc, sm__raw_icc_unlock_all_tpc, sm__raw_l0icache_hits_sctlall, sm__raw_l0icache_requests_sctlall, sm__raw_l0icache_requests_to_icc_sctlall, smsp__l0icache_fills, smsp__l0icache_requests, smsp__l0icache_requests_hit, smsp__l0icache_requests_miss, smsp__raw_l0icache_hits, smsp__raw_l0icache_requests_to_icc

The result is that of all the measured quantities, the relatively costly icc cache misses, in particular, increase disproportionately with increasing workload size (Figure 2). The icc cache is an instruction cache that resides in the SM itself and is very close to the actual instruction execution engine.

Figure 2. Performance counters related to icc instruction cache requests, including the fast increasing icc misses, for workloads of increasing sizes

The fact that icc misses increase so quickly implies that, first, not all instructions in the busiest part of the code fit in icc. Second, the need for more different instructions increases as the workload size increases. The reason for the latter is somewhat subtle. Multiple thread blocks, composed of warps, reside on the SM simultaneously, but not all warps execute simultaneously.

The SM is internally divided into four partitions, each of which can generally execute one warp instruction per clock cycle. When a warp is stalled for any reason, another warp that also resides on the SM can take over. Each warp can execute its own instruction stream independently from other warps. At the start of the main kernel in this program, warps running on each SM are mostly in sync. They start with the first instruction and keep chugging along. 

However, they are not explicitly synchronizing, and as time goes on and warps take turns idling and executing, they will drift further and further apart in terms of the instructions they execute. This means that a growing set of different instructions must be active as the execution progresses, and this in turn means that the icc overflows more frequently. Instruction cache pressure builds, and more misses occur.

Solving the problem

The gradual drifting apart of the warp instruction streams cannot be controlled, except by synchronizing the streams. But synchronization typically reduces performance, because it requires warps to wait for each other when there is no fundamental need. However, decreasing the overall instruction footprint can be attempted, such that instruction spilling from icc occurs less frequently, and perhaps not at all.

The code in question contains a collection of nested loops, and most of the loops are unrolled. Unrolling improves performance by enabling the compiler to:

  • Reorder (independent) instructions for better scheduling
  • Eliminate some instructions that can be shared by successive iterations of the loop
  • Reduce branching
  • Allocate different registers to the same variable referenced in different iterations of the loop, to avoid having to wait for specific registers to become available

Unrolling loops brings many benefits, but it does increase the number of instructions. It also tends to increase the number of registers used, which may depress performance, because fewer warps may reside on the SMs simultaneously. This reduced warp occupancy comes with less latency hiding.

The two outermost loops of the kernel are the focus. Practical unrolling is best left to the compiler, which has myriad heuristics to generate good code. That is, the user expresses that there is an expected benefit of unrolling by using hints, called pragmas in C++, before the top of the loop. These take the following form:

#pragma unroll X

When X is blank (canonical unroll), the compiler is only told that unrolling may be beneficial, but is not given any suggestion of how many iterations to unroll. Alternatively, X can be (n), where n is a positive number that suggests unrolling in groups of n iterations. For convenience, the following notation has been adopted: an unroll factor of 0 means no unroll pragma at all, an unroll factor of 1 means an unroll pragma without any number (canonical), and an unroll factor of n larger than 1 means:

#pragma unroll (n)

The next experiment comprises a suite of runs in which the unroll factor varies between 0 and 4 for both levels of the two outermost loops in the code, producing a performance figure for each of the four workload sizes. Unrolling by more is not needed because experiments show that the compiler does not generate different code for higher unroll factors for this particular program. Figure 3 shows the outcome of the suite.

The top horizontal axis shows unroll factors for the outermost loop (top level). The bottom horizontal axis shows unroll factors for the second-level loop. Each point on any of the four performance curves (higher is better) corresponds to two unroll factors, one for each of the outermost loops as indicated on the horizontal axes. 

Figure 3 also shows, for each instance of unroll factors, the size of the executable in units of 500 KB. While the expectation might be to see increasing executable sizes with each higher level of unrolling, that is not consistently the case. Unroll pragmas are hints that may be ignored by the compiler if they are not deemed beneficial.

Figure 3. Performance of Smith-Waterman code for different workload sizes and different loop unroll factors

Measurements corresponding to the initial version of the code (indicated by the ellipse labeled A) are for the canonical unrolling of the top-level loop, and no unrolling of the second-level loop. The anomalous behavior of the code is apparent, where larger workload sizes lead to poorer performance, due to increasing icc misses. 

In the next isolated experiment (indicated by the ellipse labeled B), attempted before the full suite of runs, neither of the outermost loops is unrolled. Now the anomalous behavior is gone and larger workload sizes lead to the expected better performance. However, absolute performance is reduced, especially for the original workload (1x) size. Two phenomena, revealed by NCU, help explain this result. Due to the smaller instruction footprint, icc misses have dropped virtually to zero for all sizes of the workload. However, the compiler assigned a relatively large number of registers to each thread, such that the number of warps that can reside on the SM is not optimal. 

Doing the full sweep of unroll factors suggests that the experiment in the ellipse labeled C is the proverbial sweet spot. It corresponds to no unrolling of the top-level loop, and unrolling by a factor of 2 of the second-level loop. NCU still shows virtually no icc misses, and a reduced number of registers per thread, such that more warps can fit on the SM than in experiment B, leading to more latency hiding.

While the absolute performance of the smallest workload still lags behind that of experiment A, the difference is small, and larger workloads fare increasingly better, leading to the best average performance across all workload sizes.

Conclusion

Instruction cache misses can cause performance degradation for kernels that have a large instruction footprint, which is often caused by substantial loop unrolling. When the compiler is put in charge of unrolling through pragmas, the heuristics it applies to the code to determine the best actual level of unrolling are necessarily complicated and are not always predictable by the programmer. It may pay to experiment with different compiler hints regarding loop unrolling to arrive at the optimal code with good warp occupancy and reduced instruction cache misses.

Categories
Misc

Field to Fork: Startup Serves Food Industry an AI Smorgasbord

It worked like magic. Computer vision algorithms running in a data center saw that a disease was about to infect a distant wheat field in India. Sixteen days later, workers in the field found the first evidence of the outbreak. It was the kind of wizardry people like Vinay Indraganti call digital transformation. He’s practiced…