
Community Spotlight: Democratizing Computer Vision and Conversational AI in Kenya

Jacques Khisa, community leader at Africa Data School Emerging Chapters Nairobi, shares his experience of getting started with AI in Africa.

In the quest to understand data, I never pictured my passion shifting toward AI. As a matter of fact, AI is all about data!

For context, the major hindrance to implementing AI projects across the African continent has been the lack of digitized data on which AI algorithms are built. In my home country of Kenya, for instance, we have struggled to digitize data locked away in traditional formats in both public and private data silos, despite a higher penetration of digital products over the past decade compared to neighboring countries.

Ironically, this only fueled my enthusiasm for AI and created the need to help democratize it. As Paulo Coelho said in The Alchemist, “And, when you want something, all the universe conspires in helping you to achieve it.”

NVIDIA Emerging Chapters

With my particular interest in natural language processing, I attended the AI Expo Africa virtual conference to network with local developers, experts, and researchers in the field of AI.

There, I had a life-changing conversation with the head of Developer Ecosystems and Strategic Partnerships at NVIDIA, Amulya Vishwanath, about the Emerging Chapters program. This is a program that enables local communities in emerging areas to build and scale AI, data science, and graphics projects by providing the following:

  • Technological tools
  • Educational resources
  • Co-marketing opportunities

In Kenya, the academic and entrepreneurial communities are particularly active. Emerging AI hotspots are mostly in academia.

As a young conversational AI developer, I faced constraints in obtaining compute resources, research papers, and an accessible guide to the immense field of deep learning. So I looked for educational opportunities to help other young enthusiasts easily access these resources and practice AI for good in my local community.

Training opportunities

As an NVIDIA DLI Ambassador and certified DLI instructor in deep learning and conversational AI at the Africa Data School community, I helped members of the Emerging Chapters program gain access to training and development opportunities through the NVIDIA Deep Learning Institute (DLI). These include free passes to select self-paced or instructor-led courses on AI and data science. Developers receive an NVIDIA DLI certificate upon course completion that highlights their skills, thereby advancing their careers.

Since partnering with NVIDIA, members have gained great exposure through high participation in the NVIDIA GTC conference and DLI workshops. I was able to help facilitate these workshops at the Nairobi Garage co-working space, which not only connected attendees to a dynamic community of innovative companies and professionals but also increased our scale and impact.

The training gave participants access to world-class best practices and the knowledge to grow as AI engineers. Although some students find the content challenging, their enthusiasm is contagious. The content uses real-life case studies and shows how different deep learning algorithms are applied in end-to-end applications at startups.

As individuals in my local community make full use of these resources, more talent becomes available, which consequently attracts and increases investments, accelerating growth.

After our in-person workshop, I realized how much more talent there is to educate and inspire. As it was our first workshop, we could only provide 20 students with GPU instances and course materials from NVIDIA DLI, but there will be many more workshops to come. We are also offering the free DLI courses granted to us through the Emerging Chapters program to frequent training participants.

Working with the Africa Data School Emerging Chapter community has enabled the democratization of AI in my region through the provision of educational resources and development opportunities. Our goal is to create a community of young researchers, developers, AI engineers, and students passionate about NLP and computer vision in fintech, education, and agriculture.

These projects are in line with Kenya Vision 2030, which aims to transform Kenya into a newly industrializing, middle-income country that provides a high quality of life to all its citizens.

Student feedback from the first workshop

“Deep Learning doesn’t have to be a black box and is a potent tool in the right context with proper constraints. We discussed and implemented the various aspects and techniques fundamental to deep learning at the workshop. The level of discussion and implementation continues to showcase the sheer engineering talent in Kenya and the deep technical talent pool that we are known for across the continent. More efforts such as this will be vital in cementing our position as the Silicon Savannah. We appreciate NVIDIA for providing their state-of-the-art cloud-based GPU compute resources.”

Wilfred Odero

“The AI space is evidently a partnership-intensive space ranging from data collectors, developers, computing resources manufacturers, data regulators, etc. I may not have a clear bird’s-eye view of the scale of what’s happening on the ground, but from where I sit, the continent is taking off in terms of organizing itself toward a structure/ecosystem of some sort that supports the continent’s unanimous AI strategy and AI policy frameworks, with efforts such as the AU-commissioned ‘African stance on Artificial Intelligence’ and a number of big-tech sponsored tech hubs specializing on AI/ML-focused solutions. At the moment, most of the effort is being put into developing ready talent, though all stakeholders need to be ready.”

Rita Grace

“The exciting world of deep learning was introduced with practical examples and by the end of the day we could train models with over 95% accuracy. The training was well planned and our instructor Jacques Khisa explained all the topics in detail. It was a great experience to set up my own AI application development environment and earn a certificate in Fundamentals of Deep Learning. I would like to thank NVIDIA AI Emerging Chapters and Africa Data School for their workshops and commitment to developing future leaders in AI.”

Ibrahim Abdi

Conclusion

Joining the NVIDIA developer program and making Africa Data School a part of the Emerging Chapters community has helped us elevate our technology skills and connect with like-minded local and global professionals. 

The NVIDIA Emerging Chapters program is for developer communities. If you are interested in starting a local chapter, apply to the NVIDIA Emerging Chapters pilot program.

For more about developer communities and upcoming educational series webinars, see the NVIDIA Emerging Chapters program page.


Advanced API Performance: SetStablePowerState

This post covers best practices for using SetStablePowerState on NVIDIA GPUs. To get a high and consistent frame rate in your applications, see all Advanced API Performance tips.

Most modern processors, including GPUs, change processor core and memory clock rates during application execution. These changes can cause performance to vary, introducing errors into measurements and making comparisons between runs difficult.

Recommended

  • Use the nvidia-smi utility to set the GPU core and memory clocks before attempting measurements. This command is installed by typical driver installations on Windows and Linux. Installation locations may vary by OS version but should be fairly stable.
    • Run commands on an administrator console on Windows, or prepend sudo to the following commands on Linux-like OSs.
    • To query supported clock rates:
      • nvidia-smi --query-supported-clocks=timestamp,gpu_name,gpu_uuid,memory,graphics --format=csv
    • To set the core and memory clock rates, respectively:
      • nvidia-smi --lock-gpu-clocks=<minGpuClock>,<maxGpuClock>
      • nvidia-smi --lock-memory-clocks=<minMemClock>,<maxMemClock>
    • Perform performance capture or other work.
    • To reset the core and memory clock rates, respectively:
      • nvidia-smi --reset-gpu-clocks
      • nvidia-smi --reset-memory-clocks
    • For general use during a project, it may be convenient to write a simple script to lock the clocks, launch your application, and after exit, reset the clocks.
    • For command-line help, run nvidia-smi --help. There are shortened versions of the commands listed earlier for your convenience.
  • Use the DX12 function SetStablePowerState to read the GPU’s predetermined stable power clock rate. The stable GPU clock rate may vary by board.
    • Modify a DX12 sample to invoke SetStablePowerState (a minimal sketch appears after Figure 1).
    • Execute nvidia-smi -q -d CLOCK, and record the Graphics clock frequency with the SetStablePowerState sample running. Use this frequency with the --lock-gpu-clocks option.
  • Use Nsight Graphics’s GPU Trace activity with the option to lock core and memory clock rates during profiling (Figure 1).
Figure 1. Lock Clocks to Base checkbox
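
To make the DX12 step concrete, here is a minimal C++ sketch of invoking SetStablePowerState on an existing device. The helper name EnableStablePowerState and the error handling are illustrative assumptions; only the ID3D12Device::SetStablePowerState call itself comes from the D3D12 API, and it succeeds only when Windows developer mode is enabled.

// Minimal sketch: request the GPU's predetermined stable clock rate through D3D12.
// Assumes a valid ID3D12Device* created elsewhere and Windows developer mode enabled;
// otherwise the call may fail or trigger device removal.
#include <d3d12.h>
#include <cstdio>

void EnableStablePowerState(ID3D12Device* device)
{
    const HRESULT hr = device->SetStablePowerState(TRUE);
    if (FAILED(hr))
    {
        std::printf("SetStablePowerState failed (0x%08lx). Is developer mode enabled?\n",
                    static_cast<unsigned long>(hr));
    }
    // While the sample runs in this state, nvidia-smi -q -d CLOCK reports the stable
    // graphics clock, which can then be reused with --lock-gpu-clocks.
}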

Not recommended

  • Don’t lock the GPU core clock using DX12’s SetStablePowerState function alone. This does not lock the memory clock, and the results are less comparable than those achievable with nvidia-smi.

Detect to Protect: Taiwan Hospital Deploys Real-Time AI Risk Prediction for Kidney Patients

Taiwan has nearly 85,000 kidney dialysis patients — the highest prevalence in the world based on population density. Taipei Veterans General Hospital (TVGH) is working to improve outcomes for these patients with an AI model that predicts heart failure risk in real time during dialysis procedures. Cardiovascular disease is the leading cause of death for…



Tensorflow v2.6 CUDA v11.7 not utilising GPU

I am a beginner. I have installed TensorFlow version 2.6.0 and CUDA version 11.7. However, TensorFlow is not utilizing the GPU. When I use tf.config.list_physical_devices('GPU'), it returns an empty list. Could anyone help me with this?

submitted by /u/ArunabhB


How does Tensorflow calculate mean squared error under the hood (cannot reproduce with custom loop)

Hi all,

My question is linked to a question I asked recently: post

I need to loop over individual samples when training due to too large a batch size to hold in memory. I have had good success generating reproducible losses and accumulated gradients with one of the training loops I am carrying out, and the gradients applied to the weights are accurate (plus floating point errors).

Another custom loop I am carrying out computes, on a batch, the mean squared error between a predicted label and the real label. Again, I need to iterate over the batch of samples manually due to a large batch size. To confirm it works and that I get the same losses and gradients, I am comparing my custom loop against the batch method on a batch of 100 samples, so I can compare both methods using 'GradientTape()'.

My code snippets are as follows. For batch training:

with tf.GradientTape() as tape:
    value_loss = tf.reduce_mean((return_buffer - critic_model([degree_buffer, graph_adj_buffer, action_vect_buffer])) ** 2)

value_grads = tape.gradient(value_loss, critic_model.trainable_variables)
value_optimizer.apply_gradients(zip(value_grads, critic_model.trainable_variables))

For individual samples:

value_loss_tracking = []
total_loss = 0
train_vars_val = critic_model_individual.trainable_variables
accum_gradient_val = [tf.zeros_like(this_var) for this_var in train_vars_val]

for adj_ind, degree_ind, action_vect_ind, return_ind in zip(graph_adj_buffer, degree_buffer, action_vect_buffer, return_buffer_):
    adj_ind = adjacency_normed_tensor(adj_ind)
    degree_ind = tf.expand_dims(degree_ind, 0)
    action_vect_ind = tf.expand_dims(action_vect_ind, 0)

    with tf.GradientTape() as tape:
        ind_value_loss = tf.square(return_ind - critic_model_individual([degree_ind, adj_ind, action_vect_ind]))

    value_loss_tracking.append(ind_value_loss)
    total_loss += ind_value_loss
    gradients = tape.gradient(ind_value_loss, train_vars_val)
    accum_gradient_val = [(acum_grad + grad) for acum_grad, grad in zip(accum_gradient_val, gradients)]

accum_gradient_vals_final = [this_grad / steps_per_epoch for this_grad in accum_gradient_val]
policy_optimizer_ind.apply_gradients(zip(accum_gradient_vals_final, train_vars_val))

mean_loss = tf.reduce_mean(value_loss_tracking)

Forgive the lack of indentation, but both loops work fine (the loss is in bold). However, when I look at the loss in my custom loop relative to the mean squared error in the batch loop, the values differ starting sometimes from the first decimal place, and they do not look like floating point errors to me, e.g. 0.43429542 versus 0.4318762. These seem too different to be floating point errors; in the other custom loop, I see values changing only after about 5 decimal places, and that is not the case here. Sometimes I will even see losses like 0.39 compared to 0.40, which seems not right to me. Does anybody know if this makes sense, or agree that this does not look right? I have also tried np.mean and np.square. I have looked at the source code and cannot see exactly how TensorFlow does this under the hood!

any help is appreciated!

submitted by /u/amjass12


Tensorflow-Lite not recognizing interpreter

This is my code:

#include <iostream>
#include <cstdio>
#include <iomanip>
#include "src/VideoProcessing.h"
#include <opencv2/opencv.hpp>
#include <opencv2/videoio.hpp>
#include <opencv2/highgui.hpp>
#include <interpreter.h>
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/model_builder.h"
#include "tensorflow/lite/interpreter_builder.h"
#include "tensorflow/lite/optional_debug_tools.h"
#include "tensorflow/lite/tools/gen_op_registration.h"

typedef cv::Point3_<float> Pixel;

void normalize(Pixel &pixel) {...}

int main() {
    ...
    auto model = tflite::FlatBufferModel::BuildFromFile("/home/me/tensorflow_src/tensorflow/lite/examples/model-verification/pose_landmark_full.tflite");
    if (!model) {
        printf("Failed to mmap model\n");
        exit(0);
    }
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    ...

The last line std::unique_ptr<tflite::Interpreter> interpreter; is throwing an error, suggesting that interpreter, and associated classes, are undefined. This is the error:

/usr/bin/ld: tensorflow-lite/libtensorflow-lite.a(interpreter.cc.o): in function `tflite::Interpreter::SetProfilerImpl(std::unique_ptr<tflite::Profiler, std::default_delete<tflite::Profiler> >)':
interpreter.cc:(.text+0x2a66): undefined reference to `tflite::profiling::RootProfiler::RemoveChildProfilers()'
/usr/bin/ld: interpreter.cc:(.text+0x2a75): undefined reference to `tflite::profiling::RootProfiler::AddProfiler(std::unique_ptr<tflite::Profiler, std::default_delete<tflite::Profiler> >&&)'
/usr/bin/ld: interpreter.cc:(.text+0x2ab2): undefined reference to `vtable for tflite::profiling::RootProfiler'
/usr/bin/ld: interpreter.cc:(.text+0x2b19): undefined reference to `vtable for tflite::profiling::RootProfiler'
/usr/bin/ld: tensorflow-lite/libtensorflow-lite.a(interpreter.cc.o): in function `tflite::Interpreter::~Interpreter()':
interpreter.cc:(.text+0x307e): undefined reference to `vtable for tflite::profiling::RootProfiler'
/usr/bin/ld: tensorflow-lite/libtensorflow-lite.a(interpreter.cc.o): in function `tflite::profiling::RootProfiler::~RootProfiler()':
interpreter.cc:(.text._ZN6tflite9profiling12RootProfilerD0Ev[_ZN6tflite9profiling12RootProfilerD5Ev]+0x7): undefined reference to `vtable for tflite::profiling::RootProfiler'
/usr/bin/ld: tensorflow-lite/libtensorflow-lite.a(interpreter.cc.o): in function `tflite::profiling::RootProfiler::~RootProfiler()':
interpreter.cc:(.text._ZN6tflite9profiling12RootProfilerD2Ev[_ZN6tflite9profiling12RootProfilerD5Ev]+0x7): undefined reference to `vtable for tflite::profiling::RootProfiler'
collect2: error: ld returned 1 exit status
make[2]: *** [CMakeFiles/model-verification.dir/build.make:247: model-verification] Error 1
make[1]: *** [CMakeFiles/Makefile2:1374: CMakeFiles/model-verification.dir/all] Error 2
make: *** [Makefile:149: all] Error 2

And I only get this error when I use `tflite::interpreter` despite having the correct `interpreter.h` file.

This is how I compile:

cmake ../tensorflow/lite/examples/model-verification/
make
./model-verification

This is my CMake output:

cmake ../tensorflow/lite/examples/model-verification/
-- Setting build type to Release, for debug builds use '-DCMAKE_BUILD_TYPE=Debug'.
CMake Warning at /home/me/tensorflow_src/build/abseil-cpp/CMakeLists.txt:74 (message):
  A future Abseil release will default ABSL_PROPAGATE_CXX_STD to ON for CMake
  3.8 and up.  We recommend enabling this option to ensure your project still
  builds correctly.
-- Standard libraries to link to explicitly: none
-- The Fortran compiler identification is GNU 11.2.0
-- Could NOT find CLANG_FORMAT: Found unsuitable version "0.0", but required is exact version "9" (found CLANG_FORMAT_EXECUTABLE-NOTFOUND)
--
-- Configured Eigen 3.4.90
--
-- Proceeding with version: 2.0.6.v2.0.6
-- CMAKE_CXX_FLAGS: -std=c++0x -Wall -pedantic -Werror -Wextra -Werror=shadow -faligned-new -Werror=implicit-fallthrough=2 -Wunused-result -Werror=unused-result -Wunused-parameter -Werror=unused-parameter -fsigned-char
-- Configuring done
-- Generating done
-- Build files have been written to: /home/onur/tensorflow_src/build

submitted by /u/janissary2016


Converting TensorFlow Keras model API to model subclassing

For a simple TF2 Object detection CNN architecture defined using Keras’s functional API, a batch of data is obtained as:

example, label = next(data_generator(batch_size = 32))
example.keys()
# dict_keys(['image'])

image = example['image']
image.shape
# (32, 144, 144, 3)

label.keys()
# dict_keys(['class_out', 'box_out'])
label['class_out'].shape, label['box_out'].shape
# ((32, 9), (32, 2))

The CNN architecture defined using Keras’s functional API is:

input_ = Input(shape = (144, 144, 3), name = 'image')
# name - An optional name string for the Input layer. Should be unique in
# a model (do not reuse the same name twice). It will be autogenerated if it isn't provided.
# Here 'image' is the Python3 dict's key used to map the data to one of the layers in the model.
x = input_

# Define a conv block-
x = Conv2D(filters = 64, kernel_size = 3, activation = 'relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D(pool_size = 2)(x)

x = Flatten()(x)  # flatten the last pooling layer's output volume
x = Dense(256, activation='relu')(x)

# We are using a data generator which yields dictionaries. Using 'name' argument makes it
# possible to map the correct data generator's output to the appropriate layer
class_out = Dense(units = 9, activation = 'softmax', name = 'class_out')(x)  # classification output
box_out = Dense(units = 2, activation = 'linear', name = 'box_out')(x)       # regression output

# Define the CNN model-
model = tf.keras.models.Model(input_, [class_out, box_out])  # since we have 2 outputs, we use a list

I am attempting to define it using Model sub-classing as:

class OD(Model):
    def __init__(self):
        super(OD, self).__init__()
        self.conv1 = Conv2D(filters = 64, kernel_size = 3, activation = None)
        self.bn = BatchNormalization()
        self.pool = MaxPool2D(pool_size = 2)
        self.flatten = Flatten()
        self.dense = Dense(256, activation = None)
        self.class_out = Dense(units = 9, activation = None, name = 'class_out')
        self.box_out = Dense(units = 2, activation = 'linear', name = 'box_out')

    def call(self, x):
        x = tf.nn.relu(self.bn(self.conv1(x)))
        x = self.pool(x)
        x = self.flatten(x)
        x = tf.nn.relu(self.dense(x))
        x = [tf.nn.softmax(self.class_out(x)), self.box_out(x)]
        return x

A batch of training data is obtained as:

example, label = next(data_generator(batch_size = 32))
example.keys()
# dict_keys(['image'])
image = example['image']
image.shape
# (32, 144, 144, 3)
label.keys()
# dict_keys(['class_out', 'box_out'])
label['class_out'].shape, label['box_out'].shape
# ((32, 9), (32, 2))

Is my Model sub-classing architecture equivalent to Keras’s functional API?

submitted by /u/grid_world


Explore Resources and Activities for Jetson Nano Users with the “Summer of Jetson” from NVIDIA and SparkFun

Experience the “Summer of Jetson” now through Sept. 30, with quizzes, prizes, and a project showcase to learn about the joys of working with the Jetson Nano Developer Kit.


Object Localization from scratch TF2

Object localization trained from scratch for an emoji dataset in TensorFlow 2.8. Getting an IoU = 0.5969 and classification output accuracy = 100%. The code can be found here. Though in fairness, I am using only 9 classes out of the emoji dataset. Thoughts?

submitted by /u/grid_world


Boosting Application Performance with GPU Memory Access Tuning

In this post, we examine a method programmers can use to saturate memory bandwidth on a GPU.

Introduction

NVIDIA GPUs have enormous compute power and typically need to be fed data at high speed to deploy that power. That is possible, in principle, since GPUs also have high memory bandwidth, but sometimes they need the programmer’s help to saturate that bandwidth. In this blog post, we examine one method to accomplish that and apply it to an example taken from financial computing. We will explain under what circumstances this method can be expected to work well, and how to find out whether these circumstances apply to your workload.

Context

NVIDIA GPUs derive their power from massive parallelism. Many warps of 32 threads can be placed on a Streaming Multiprocessor (SM), awaiting their turn to execute. When one warp is stalled for whatever reason, the warp scheduler switches to another with zero overhead, making sure that the SM always has work to do. On the high-performance NVIDIA A100 GPU, up to 64 active warps can share an SM, each with its own resources. On top of that, the A100 has 108 SMs that can all execute warp instructions simultaneously. Most instructions must operate on data, and that data almost always originates in the device memory (DRAM) attached to the GPU. One of the main reasons even an abundance of warps on an SM can run out of work is that they are all waiting for data to arrive from memory. If this happens and the bandwidth to memory is not fully utilized, it may be possible to reorganize the program to improve memory access and reduce warp stalls, which in turn makes the program complete faster.

First step: wide loads

In a previous blog post, we examined a workload that did not fully utilize the available compute and memory bandwidth resources of the GPU. We determined that prefetching data from memory before it is needed substantially reduced memory stalls and improved performance. When prefetching is not applicable, the quest is to determine what other factors may be limiting performance of the memory subsystem. One possibility is that the rate at which requests are made of that subsystem is too high. Intuitively, we may reduce the request rate by fetching multiple words per load instruction. It is best illustrated with an example.

In all code examples in this post, uppercase variables are compile-time constants. BLOCKDIMX assumes the value of the predefined variable blockDim.x. For some purposes, it must be a constant known at compile time, whereas for other purposes, it is useful for avoiding computations at run time.
The original code looked like this, where index is a helper function to compute array indices. It implicitly assumes that just a single, one-dimensional thread block is being used, which is not the case for the motivating application from which it was derived. However, it reduces code clutter and does not change the argument.

for (pt = threadIdx.x; pt < ptmax; pt += BLOCKDIMX) { /* inner loop over k uses big_array[index(pt,k)] and small_array[k]; body elided */ }

Observe that each thread loads kmax consecutive values from the suggestively named small_array. This array is sufficiently small that it fits entirely in the L1 cache, but asking it to return data at a very high rate may become problematic. The following change recognizes that each thread can issue requests for two double-precision words in the same instruction if we restructure the code slightly and introduce the double2 data type, which is supported natively on NVIDIA GPUs; it stores two double-precision words in adjacent memory locations, which can be accessed with field selectors “x” and “y.” The reason this works is that each thread accesses successive elements of small_array. We call this technique wide loads. Notice that the inner loop over index “k” is now incremented by two instead of one.

for (pt = threadIdx.x; pt < ptmax; pt += BLOCKDIMX) { /* inner loop over k now steps by 2 and loads small_array as double2; body elided */ }

A few caveats are in order. First, we did not check whether kmax is even. If not, the modified loop over k would execute an extra iteration, and we would need to write some special code to prevent that. Second, we did not confirm that small_array is properly aligned on a 16-byte boundary. If not, the wide loads would fail. If it was allocated using cudaMalloc, it would automatically be aligned on a 256-byte boundary. But if it was passed to the kernel using pointer arithmetic, some checks would need to be carried out.
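
To make the wide-load idea concrete, here is a minimal CUDA sketch of how the small_array change could look. It is not the application's actual kernel: the kernel name, the accumulation in the loop body, the result array, and the index helper are assumptions used only to illustrate the pattern, and the even-kmax and 16-byte-alignment caveats above are assumed to hold.

#include <cuda_runtime.h>

#define BLOCKDIMX 256  // compile-time stand-in for blockDim.x, as described in the post

// Hypothetical index helper: linear in pt with coefficient 1, so consecutive
// threads touch consecutive elements of big_array.
__device__ __forceinline__ int index(int pt, int k, int ptmax) { return pt + k * ptmax; }

// Sketch of the wide load of small_array: the inner loop over k steps by 2 and each
// iteration issues a single double2 load. Assumes kmax is even and small_array is
// 16-byte aligned (true for cudaMalloc allocations).
__global__ void kernel_wide_small(const double* big_array, const double* small_array,
                                  double* result, int ptmax, int kmax)
{
    for (int pt = threadIdx.x; pt < ptmax; pt += BLOCKDIMX) {
        double sum = 0.0;  // placeholder accumulation; the real loop body is application specific
        for (int k = 0; k < kmax; k += 2) {
            double2 s = *reinterpret_cast<const double2*>(&small_array[k]);
            sum += big_array[index(pt, k,     ptmax)] * s.x;
            sum += big_array[index(pt, k + 1, ptmax)] * s.y;
        }
        result[pt] = sum;
    }
}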

Next, we inspect the helper function index and discover that it is linear in pt with coefficient 1. Consequently, we can apply a similar wide-load approach to values fetched from big_array by requesting two double-precision values in one instruction. The difference between accesses to big_array and to small_array is that now successive threads within a warp access adjacent array elements. The restructured code below doubles the increment of the loop over elements of array big_array, and now each thread processes two array elements in each iteration.

for (pt = 2*threadIdx.x; pt < ptmax; pt += 2*BLOCKDIMX) { /* each thread now handles elements pt and pt+1; body elided */ }
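
Continuing the sketch under the same assumptions (hypothetical kernel name, loop body, and index helper, reused from the previous sketch), the big_array version might look as follows. Each thread now covers elements pt and pt+1, and because index is linear in pt with coefficient 1, the two values sit next to each other in memory and can be fetched with a single double2 load. This also assumes ptmax is even and big_array is 16-byte aligned.

// Sketch of wide loads for both arrays: the pt loop strides by 2*BLOCKDIMX and each
// thread processes two adjacent elements per iteration.
__global__ void kernel_wide_both(const double* big_array, const double* small_array,
                                 double* result, int ptmax, int kmax)
{
    for (int pt = 2 * threadIdx.x; pt < ptmax; pt += 2 * BLOCKDIMX) {
        double sum0 = 0.0, sum1 = 0.0;  // accumulators for elements pt and pt+1
        for (int k = 0; k < kmax; k += 2) {
            double2 s  = *reinterpret_cast<const double2*>(&small_array[k]);
            // b0 holds big_array values for (pt, k) and (pt+1, k); b1 for (pt, k+1) and (pt+1, k+1)
            double2 b0 = *reinterpret_cast<const double2*>(&big_array[index(pt, k,     ptmax)]);
            double2 b1 = *reinterpret_cast<const double2*>(&big_array[index(pt, k + 1, ptmax)]);
            sum0 += b0.x * s.x + b1.x * s.y;
            sum1 += b0.y * s.x + b1.y * s.y;
        }
        result[pt]     = sum0;
        result[pt + 1] = sum1;
    }
}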

The same caveats as before apply, and they should now be extended to parity of ptmax and alignment of big_array. Fortunately, the application from which this example was derived satisfies all the requirements. The figure below shows the duration (in nanoseconds) of a set of kernels that gets repeated identically multiple times in the application. The average speedup of the kernel was 1.63x for the combination of wide loads.

Figure 1. Reduction of kernel durations due to wide load

Second step: register use

We could be tempted to stop here and declare success, but a deeper analysis of the execution of the program, using NVIDIA Nsight Compute, shows that we have not fundamentally changed the rate of requests to the memory subsystem, even though we have halved the number of load instructions. The reason is that a warp load instruction—i.e. 32 threads simultaneously issuing load instructions—results in one or more sector requests, which is the actual unit of memory access processed by the hardware. Each sector is 32 bytes, so one warp load instruction of one 8-byte double-precision word per thread results in 8 sector requests (accesses are with unit stride), and one warp load instruction of double2 words results in 16 sector requests. The total number of sector requests is the same for plain and wide loads. So, what caused the performance improvement?
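
As a quick check on the sector arithmetic (a warp is 32 threads, a sector is 32 bytes, and accesses have unit stride):

  • Plain loads: 32 threads × 8 bytes (one double each) = 256 bytes = 8 sectors per warp load, and two such loads are needed for two values per thread, for 16 sectors in total.
  • Wide loads: 32 threads × 16 bytes (one double2 each) = 512 bytes = 16 sectors in a single warp load.

Either way, the memory subsystem sees 16 sector requests for the same data, so the request count alone does not explain the speedup.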

To understand the code behavior we need to consider a resource we have not yet discussed, namely registers. These are used to store the data loaded from memory and serve as input for arithmetic instructions. Registers are a finite resource. If a Streaming Multiprocessor (SM) hosts the maximum number of warps possible on the A100 GPU, 32 4-byte registers are available to each thread, which together can hold 16 double-precision words. The compiler that translates our code into machine language is aware of this and will limit the number of registers per thread. How do we determine the register use of our code and the role it plays in performance? We use the “source” view in Nsight Compute to see assembly code (“SASS”) and C source code side by side.
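
For reference, the arithmetic behind that per-thread limit, assuming the A100's register file of 65,536 32-bit registers per SM:

65,536 registers ÷ (64 warps × 32 threads) = 32 registers per thread, i.e. room for 16 double-precision (8-byte) words.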

The innermost loop of the code is the one that is executed most, so if we select in the navigation menu “instructions executed” and subsequently ask to be taken to the line in the SASS code that has the highest number of those, we automatically land in the inner loop. If you are uncertain, you can compare SASS and the highlighted corresponding source code to confirm. Next, we identify in the SASS code of the inner loop all the instructions that load data from memory (LDG). Figure 2 shows a snippet of the SASS where we have panned around to find the start of the inner loop; it is on line 166 where the number of times an instruction is executed suddenly jumps to its maximum value.

Figure 2. SASS code snippet demonstrating start of inner loop (line 166)

LDG.E.64 is the instruction we are after. It LoaDs from Global memory (DRAM) a 64-bit word with an Extended address. The load of a wide word corresponds to LDG.E.128. The first parameter after the name of the load instruction (R34 in Figure 2) is the register that receives the value. Since a double-precision value occupies two adjacent registers, R35 is implied in the load instruction. Next, we compare for the three versions of our code (1. baseline, 2. wide loads of small_array, 3. wide loads of small_array and big_array) the way registers are used in the inner loop. Recall that the compiler tries to stay within limits and sometimes needs to play musical chairs with the registers. That is, if not enough registers are available to receive each unique value from memory it will reuse a register previously used in the inner loop.

The effect of that is that the previous value needs to be used by an arithmetic instruction so that it can be overwritten by the new value. At this time the load from memory needs to wait until that instruction completes: a memory latency is exposed. On all modern computer architectures, this latency constitutes a significant delay. On the GPU some of it can be hidden by switching to another warp, but often not all of it. Consequently, the number of times a register is reused in the inner loop can be an indication of the slowdown of the code.

With this insight, we analyze the three versions of our code and find that they experience 8, 6, and 3 memory latencies per inner loop, respectively, which explains the differences in performance shown in Figure 1. The main reason behind the different register reuse patterns is that when two plain loads are fused into a single wide load, typically fewer address calculations are needed, and the result of an address calculation also goes into a register. With more registers holding addresses, fewer addresses are left over to act as “landing zones” for values fetched from memory, and we lose seats in the musical chairs game; the register pressure grows.

Third step: launch bounds

We are not yet done. Now that we know the critical role registers play in the performance of our program, we review total register use by the three versions of the code. Easiest is to inspect Nsight Compute reports again. We find that the numbers of registers used are 40, 36, and 44, respectively.

The way the compiler determines these numbers is by using sophisticated heuristics that take a large number of factors into account, including how many active warps may be present on an SM, the number of unique values to be loaded in busy loops, and the number of registers required for each operation. If the compiler has no knowledge of the number of warps that may be present on an SM, it will try to limit the number of registers per thread to 32, because that is the number that would be available if the absolute maximum simultaneous number of warps allowed by the hardware (64) were present. In our case we did not tell the compiler what to expect, so it did its best, but evidently determined that the code generated using just 32 registers would be too inefficient.

However, the actual size of the thread block specified in the launch statement of the kernel is 1,024 threads, or 32 warps. This means that if only a single thread block is present on the SM, each thread can use up to 64 registers (the full register file divided over 1,024 threads). At 40, 36, and 44 registers per thread of actual use, not enough registers would be available to support two or more thread blocks per SM, so exactly one will be launched, and we leave 24, 28, and 20 registers per thread unused, respectively.

We can do a lot better by informing the compiler of our intent through the use of launch bounds. By telling the compiler the maximum number of threads in a thread block (1024) and also the minimum number of blocks to support simultaneously (1), it relaxes and is happy to use 63, 56, and 64 registers per thread, respectively.
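
For reference, a minimal sketch of how such launch bounds are expressed in CUDA; the kernel name and parameter list are placeholders rather than the application's actual kernel:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) tells the compiler
// the kernel is launched with at most 1024 threads per block and that at least one
// block should be resident per SM, so it may budget up to 64 registers per thread.
__global__ void __launch_bounds__(1024, 1)
my_kernel(const double* big_array, const double* small_array, double* result,
          int ptmax, int kmax)
{
    // kernel body unchanged
}

// Launched as before, e.g.:
// my_kernel<<<numBlocks, 1024>>>(big_array, small_array, result, ptmax, kmax);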

Interestingly, the fastest version of the code is now the baseline version without any wide loads. While the combined wide loads without launch bounds gave a speedup of 1.64x, with launch bounds the speedup with wide loads becomes 1.76x, whereas the baseline code speeds up by 1.77x. This means we did not have to go to the trouble of modifying the kernel definition; merely supplying launch bounds was enough in this case to obtain optimal performance for this particular thread block size.

Experimenting a little more with thread block sizes and the minimum number of thread blocks to be expected on the SM, we reach a speedup of 1.79x at 2 thread blocks of 512 threads each per SM, also for the baseline version without wide loads.

Conclusions

Efficient use of registers is critical to obtaining good performance of GPU kernels. Sometimes a technique called “wide loads” can give significant benefits. It reduces the number of memory addresses that are computed and need to be stored in registers, leaving a larger number of registers to receive data from memory. However, giving the compiler hints about the way you launch kernels in your application may give the same benefit without having to change the kernel itself.

Acknowledgements

The author would like to thank Mark Gebhart and Jerry Zheng of NVIDIA for providing the expertise to analyze register use in the example discussed in this blog.