Categories
Misc

Knight Rider Rides a GAN: Bringing KITT to Life with AI, NVIDIA Omniverse

Fasten your seatbelts. NVIDIA Research is revving up a new deep learning engine that creates 3D object models from standard 2D images — and can bring iconic cars like the Knight Rider’s AI-powered KITT to life — in NVIDIA Omniverse. Developed by the NVIDIA AI Research Lab in Toronto, the GANverse3D application inflates flat images …


Categories
Misc

Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain


The CUDA 11.3 release of the CUDA C++ compiler toolchain incorporates new features aimed at improving developer productivity and code performance. NVIDIA is introducing cu++flt, a standalone demangler tool that allows you to decode mangled function names to aid source code correlation.

Starting with this release, the NVRTC shared library versioning scheme is relaxed to facilitate compatible upgrades of the library within a CUDA major release sequence. The alloca built-in function that can be used to allocate dynamic memory out of the stack frame is now available for use in device code as a preview feature.

With the CUDA 11.3 release, the CUDA C++ language is extended to enable the use of the constexpr and auto keywords in broader contexts. The CUDA device linker has also been extended with options that can be used to dump the call graph for device code along with register usage information to facilitate performance analysis and tuning.

We are again proud to help enhance the developer experience on the CUDA platform.

Standalone demangler tool: cu++filt

To facilitate function overloading in CUDA C++, the NVCC compiler frontend mangles (or encodes) function identifiers to include information about their return types and arguments. The compiler follows the Itanium C++ (IA-64) mangling scheme, with some added CUDA-specific extensions.

When disassembling or debugging CUDA programs, it is hard to trace a mangled identifier back to its original function name because the encoded names are not human readable. To simplify debugging and to improve readability of PTX assembly, we introduced a new tool in the CUDA SDK: cu++filt.

The cu++filt tool demangles or decodes these mangled function names back to their original identifiers for readability. You can use the demangled names to precisely trace the call flow. We modeled this tool after the GNU C++ demangler, c++filt, with a similar user interface. The tool can be found in the bin directory of the CUDA SDK and is available on the Linux and Windows operating systems.

Example:

Demangling a regular C++ identifier:

#cu++filt  _ZSt7forwardIRPcEOT_RNSt16remove_referenceIS2_E4typeE
T1 && std::forward<T1>(std::remove_reference<T1>::type &)

Demangling a CUDA device function call from within a CUDA kernel:

#cu++filt  $_Z5helloPi$_Z7displayv
hello(int *)::display()
#cu++filt $_Z5helloPc$_Z7displayv
	hello(char *)::display()

Demangling a CUDA kernel with static (internal) linkage:

#cu++filt __nv_static_21__12_test_cpp1_ii_main__Z5helloc
hello(char)

Demangling a non-compliant identifier:

#cu++filt _InV@LiD_mAnGled_n@M3
_InV@LiD_mAnGled_n@M3

NVRTC supports enhanced compatibility

Starting with the CUDA 11.3 release, the NVRTC shared library versioning scheme and the library naming convention are relaxed to allow you to use newer NVRTC libraries on older toolkits, but only within a major CUDA release series.

Previously, an NVRTC library’s SONAME value (Linux) or DLL filename (Windows) encoded both the major and minor number of the CUDA toolkit version to which it belonged. As a result, developers were unable to upgrade to the latest NVRTC library without upgrading the entire CUDA toolkit.

Figure 1. The relaxed NVRTC shared library versioning scheme allows a newer NVRTC library (SONAME 11.2 on Linux, or the matching DLL filename on Windows) to be used as a drop-in replacement for the NVRTC shared library in any CUDA 11.x toolkit.

In CUDA toolkits prior to CUDA 11.3, the SONAME value was in the form MAJOR.MINOR and the DLL filename was in the form nvrtc64_XY_0.dll, where X=MAJOR, Y=MINOR. Starting from CUDA 11.3, and for all future CUDA 11.x toolkit releases, the NVRTC shared library version will not change and will be frozen at 11.2. The SONAME in the Linux version of the library is 11.2 and the corresponding DLL filename in Windows is nvrtc64_112_0.dll.

From the next major CUDA release onwards (major version X greater than 11), the NVRTC shared library’s SONAME and its DLL filename equivalent will encode only the CUDA major version. On Linux, the SONAME will be X and on Windows the DLL filename will be nvrtc64_X0_0.dll, where X is the major version.

Figure 1 shows that this relaxed versioning scheme enables you to easily upgrade to a newer NVRTC library within the same major release stream and take advantage of bug fixes and performance improvements. The current version of the NVRTC library in use can be found by using the nvrtcVersion API:

nvrtcResult nvrtcVersion(int *major, int *minor);
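
As a quick illustration, here is a minimal sketch that prints the NVRTC version at run time; it assumes you compile and link against the NVRTC headers and library:

// Query and print the NVRTC version in use.
#include <nvrtc.h>
#include <cstdio>

int main() {
    int major = 0, minor = 0;
    if (nvrtcVersion(&major, &minor) == NVRTC_SUCCESS) {
        printf("NVRTC version: %d.%d\n", major, minor);
    }
    return 0;
}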

However, there is a caveat. A more recent NVRTC library may generate PTX with a version that is not accepted by the CUDA Driver API functions of an older CUDA driver. In the event of such an incompatibility between the CUDA Driver and the newer NVRTC library, you have two options:

  • Install a more recent CUDA driver that is compatible with the CUDA toolkit containing the NVRTC library being used.
  • Compile device code directly to SASS instead of PTX with NVRTC, using the nvrtcGetCUBIN API introduced in 11.2.

This versioning scheme allows applications developed using different toolkits to coexist, and allows NVRTC to be redistributed along with them without a dependency on the toolkit version. It also allows applications to take advantage of the latest compiler enhancements by updating the library transparently.

However, those updates could impact performance in some cases, especially for highly tuned code that depends on compiler heuristics that may change across CUDA versions. Expert users who would like to optimize for a specific version of NVRTC and want to maintain that dependency can do so using the dlopen (Linux) or LoadLibrary (Windows) API functions to use a specific library version at run time on an existing installation from a compatible minor release.
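
As a hedged sketch of the Linux path, the following binds to one specific NVRTC shared library at run time and resolves nvrtcVersion from it; the SONAME string is illustrative, so substitute the library version you validated against:

// Bind to a specific NVRTC shared library at run time (Linux; link with -ldl).
#include <nvrtc.h>
#include <dlfcn.h>
#include <cstdio>

typedef nvrtcResult (*nvrtcVersion_fn)(int *, int *);

int main() {
    void *handle = dlopen("libnvrtc.so.11.2", RTLD_NOW);   // illustrative SONAME
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    nvrtcVersion_fn versionFn =
        reinterpret_cast<nvrtcVersion_fn>(dlsym(handle, "nvrtcVersion"));
    if (versionFn) {
        int major = 0, minor = 0;
        if (versionFn(&major, &minor) == NVRTC_SUCCESS)
            printf("Loaded NVRTC %d.%d\n", major, minor);
    }
    dlclose(handle);
    return 0;
}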

Preview support for alloca

CUDA C++ supports dynamic memory allocation using either the built-in function malloc or the operator new. However, allocations through malloc and new incur significant runtime overhead because they allocate dynamically on the heap.

In CUDA 11.3, CUDA C++ introduces support for using the memory allocator alloca in device code as a preview feature. Unlike malloc and new, the built-in function alloca allocates memory on the current thread’s stack, offering a faster and more convenient way to allocate small chunks of memory dynamically. This is especially useful when the size of an allocation is not known in advance at compile time.

When memory is allocated using alloca, the stack pointer of the thread’s stack is moved based on the requested memory allocation size to reserve or otherwise allocate the memory. The memory allocated is aligned at a 16-byte boundary, making possible accesses using all basic types, including vector types, without alignment constraints.

There are some caveats that you should pay attention to when using alloca, so that you don’t risk introducing memory corruptions or undefined behaviors during program execution. Consider the following code sample of allocate.cu:

$ cat allocate.cu
...
#ifdef USE_MALLOC
#define ALLOC(sz) malloc((sz))
#define FREE(ptr) free((ptr))
#else
#define ALLOC(sz) alloca((sz))
#define FREE(ptr)
#endif

__device__ int out;
__device__ int foo(int *ptr1, int *ptr2, int len)
{
    int ret = 0;
    for (int i = 0; i < len; i++) {
        ...
    }
    return ret;
}



Figure 2. Thread stack frames of the launch kernel (which allocates lptr1 and lptr2) and its callee bar (which allocates bptr1 and bptr2). Memory allocated by bar must not be returned to or referenced from launch.
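
The bar and launch routines themselves are truncated in the listing above. A minimal sketch consistent with Figure 2 follows; the element values, sizes, and arithmetic are illustrative, not the original source:

// Hedged reconstruction, consistent with Figure 2; details are illustrative.
__device__ int bar(int len)
{
    // With alloca, bptr1 and bptr2 live on bar's stack frame and must not be
    // returned to or referenced by the caller after bar returns.
    int *bptr1 = (int *)ALLOC(len * sizeof(int));
    int *bptr2 = (int *)ALLOC(len * sizeof(int));
    for (int i = 0; i < len; i++) {
        bptr1[i] = i;
        bptr2[i] = 2 * i;
    }
    int ret = foo(bptr1, bptr2, len);
    FREE(bptr1);   // no-op for alloca; the stack memory is reclaimed on return
    FREE(bptr2);
    return ret;
}

__global__ void launch(int len)
{
    // lptr1 and lptr2 are allocated on the calling kernel's stack frame.
    int *lptr1 = (int *)ALLOC(len * sizeof(int));
    int *lptr2 = (int *)ALLOC(len * sizeof(int));
    for (int i = 0; i < len; i++) {
        lptr1[i] = i;
        lptr2[i] = i + 1;
    }
    out = foo(lptr1, lptr2, len) + bar(len);
    FREE(lptr1);
    FREE(lptr2);
}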

Unlike memory allocated using malloc or new that must be explicitly freed, memory allocated by bar using alloca is part of the stack, so it should not be freed or accessed after the stack unwinds.

Thread stack space is a limited resource. Be wary of a possible stack overflow when using alloca. Currently, you can’t determine ahead of time whether the stack is going to overflow. To aid you, a ptxas warning is shown when compiling code that uses alloca, reminding you that the stack size cannot be determined at compile time.

$ nvcc.exe -arch=sm_80 allocate.cu -o allocate.exe

ptxas warning : Stack size for entry function '_Z6launchi' cannot be statically determined
   Creating library alloc.lib and object alloc.exp

Because the CUDA driver cannot determine the correct stack size for the program, the default stack size is used. Set the stack size according to the actual stack memory usage in the program, for example with cudaDeviceSetLimit(cudaLimitStackSize, bytesPerThread).

Despite the caveats, the potential performance benefits of using alloca, combined with automatic memory management, make alloca an attractive alternative to dynamic memory allocation on the heap.

Comparing alloca and malloc usage and performance

The performance benefits of allocating memory on the thread stack using alloca are significant.

The earlier allocate.cu example showed the difference in usage and performance between stack-based alloca and heap-based, per-thread malloc. Before launching the kernel, you must set the device limits properly, with cudaDeviceSetLimit(cudaLimitStackSize, bytesPerThread) for stack size, or cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapSize) for heap size. The FREE(ptr) is defined as free(ptr) only when USE_MALLOC is defined; otherwise, it is empty.

For this test, we set the following limits:

#ifdef USE_MALLOC
  cudaDeviceSetLimit (cudaLimitMallocHeapSize, 500000000);
#else
  cudaDeviceSetLimit (cudaLimitStackSize, 1024*50);
#endif

In the first performance measurement, we executed alloca.exe and malloc.exe with different launch configurations. At launch configurations of <64, 512> (grid size 64, block size 512) and larger, malloc.exe ran out of memory for the heap size limit of 500000000 bytes.

Figure 3. Execution time speedup of allocate.cu when allocating many small chunks of memory with alloca rather than malloc, for various launch configurations. At the larger configurations <64, 512>, <128, 512>, and <256, 512>, malloc failed with out-of-memory errors.

In the next measurement, we used a fixed launch configuration of <8, 512> but doubled the number of iterations for which bar is invoked on each run. Figure 4 shows the results.

Figure 4. Execution time speedup of allocate.cu when using alloca vs. malloc for launch configuration <8, 512>, with the number of iterations for which bar is invoked doubled on each run.

In CUDA 11.3, the cuda-gdb/classic backend debugger returns a truncated stack. You can see the first device function that invokes alloca. Full support for alloca by CUDA tools may be available in the next release.

CUDA C++ support for new keywords

CUDA 11.3 has added device code support for new C++ keywords: constexpr and auto.

Support for constexpr

In CUDA C++, __device__ and __constant__ variables can now be declared constexpr. The constexpr variables can be used in constant expressions, where they are evaluated at compile time, or as normal variables in contexts where constant expressions are not required. While CUDA C++ allowed some uses of host constexpr variables from device code in constant expressions in 11.2 and earlier, using them in other contexts would result in errors. For this case, constexpr device variables can now be used instead.

Example:

constexpr int host_var = 10;
__device__ constexpr int dev_var = 10;

__device__ void foo(int idx) {
  constexpr int vx = host_var; // ok
  constexpr int vy = dev_var; // also ok
  
  const int& rx = host_var; // error, host_var is not defined in device code.
  const int& ry = dev_var; // ok
}

Support for auto

In CUDA C++, we are introducing support for the auto placeholder type for namespace-scope device variables. With a placeholder type, the type of the variable being declared is deduced from its initializer. This can be useful as shorthand when the type of the variable has a long name. It also enables the declaration of namespace-scope variable templates where the type of the initializer is not known until instantiation.

Example:

namespace N1 { namespace N2 { struct longStructName { int x; }; } }
constexpr __device__ N1::N2::longStructName foo() { return N1::N2::longStructName{10}; }

__device__ auto x = foo(); // x has type N1::N2::longStructName

template <typename T> constexpr __device__
  auto foo() -> decltype(+T{}) { return {}; }

template <typename T> __device__ auto y = foo<T>();

__global__ void test() {
  auto i = y<int>;   // i has type int
  auto f = y<float>; // f has type float
}

NVLINK call graph and register usage support

Optimizing for register usage can improve the performance of device code. To get the best performance in device code, it is important to consider the usage of limited GPU resources like registers, as using fewer registers can increase occupancy and parallelism. When using separate compilation, the linker builds a call graph and then propagates the register usage of the called device functions, up to the kernel function representing the root node of the call graph.

However, if there are indirect calls through function pointers, then the call graph conservatively adds an edge for every potential target. A function is considered a potential target if its prototype (function signature) matches the prototype of the function pointer call and its address is taken somewhere. This can result in the call graph reaching functions that you know are not real targets. If these false targets increase the register usage, that can in turn affect occupancy, as we show later in this section.

In large CUDA C++ applications with complex call graphs or precompiled device libraries, it can be difficult to know what the device linker infers to be potential indirect function call targets. So, we’ve added an option to dump the call graph information. The option is specific to the device linker nvlink, which is invoked as follows:

nvcc -Xnvlink -dump-callgraph

By default, this dumps demangled names. To avoid demangled names, use the following:

nvcc -Xnvlink -dump-callgraph-no-demangle

The format of the -dump-callgraph output is as follows:

# A: s -> B // Function s is given a number #A, and s potentially calls function number B
# s [N]     // s uses N registers
# ^s        // s is entry point
# &s        // s has address taken

For the CUDA sample in 0_Simple/simpleSeparateCompilation, the following code is in one file:

__device__ float multiplyByTwo(float number)
{
    return number * 2.0f;
}

__device__ float divideByTwo(float number)
{
    return number * 0.5f;
}

Then another file has the following:

__device__ deviceFunc dMultiplyByTwoPtr = multiplyByTwo;
__device__ deviceFunc dDivideByTwoPtr = divideByTwo;

//! Applies the __device__ function "f" to each element of the vector "v".
__global__ void transformVector(float *v, deviceFunc f, uint size)
{
    uint tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size)
    {
        v[tid] = (*f)(v[tid]);
    }
}



This is invoked as follows:

cudaMemcpyFromSymbol(&hFunctionPtr, dMultiplyByTwoPtr, sizeof(deviceFunc));
transformVector<<<dimGrid, dimBlock>>>(dVector, hFunctionPtr, kVectorSize);

The call graph for this on sm_52 is as follows:

1: ^transformVector(float *, float (*)(float), unsigned int) [20] -> 4 3
2: 
3: &multiplyByTwo(float) [5] ->
4: &divideByTwo(float) [5] ->

According to the call graph, the transformVector kernel calls two functions, divideByTwo (#4) and multiplyByTwo (#3). The called functions each use fewer registers (5) than transformVector (20), so the final register count stays at 20.

Figure 5. Call graphs and corresponding register reservations. In the first graph, transformVector calls multiplyByTwo and divideByTwo, each using only 5 registers. In the second graph, Kernel calls Transform, whose potential targets Action1 and Action2 use 5 and 12 registers; Debug uses 40 registers and is never invoked, but its reservation bubbles up so that Kernel is reserved 40 registers.

Consider a more interesting case, where a Transform function calls either Action1 or Action2, but also potentially matches a Debug function:

1: &Debug(int, int) [40] ->
2: &Action1(int, int) [5] ->
3: &Action2(int, int) [12] ->
4: Transform(int, int, int (*)(int, int)) [7] -> 3 2 1
5: ^Kernel(int *) [8] -> 4

In this case, Kernel calls Transform (function #4) which potentially calls Action2 (#3), Action1 (#2), and Debug (#1). The max register count for Action2, Action1, and Debug is 40 (for Debug), so a register usage of 40 ends up being propagated into Kernel. But if you know that Debug is not called by Transform, you could restructure your code to remove Debug from the call graph. Either modify the prototype for Debug or don’t have the address taken for Debug. The result would be that Transform would only call Action1 or Action2, which would then have a max register count of 12.

The resulting reduced register reservation increases the availability of unused registers for other kernels, increasing the throughput of kernel execution.

Try out the CUDA 11.3 compiler features

Whether it is the cu++filt demangler tool, the redistributable NVRTC versioning scheme, or the NVLINK call graph option, the compiler features and tools in CUDA 11.3 are aimed at improving your development experience on the CUDA platform. There is preview support for alloca in this release as well. Download today!

As always, please share any feedback or questions that you may have in the CUDA Forum or leave a comment here.

Categories
Misc

Exploring the New Features of CUDA 11.3


CUDA is the software development platform for building GPU-accelerated applications, providing all the components you need to develop applications that use NVIDIA GPUs. CUDA is ideal for diverse workloads from high performance computing, data science analytics, and AI applications. The latest release, CUDA 11.3, and its features are focused on enhancing the programming model and performance of your CUDA applications.

CUDA 11.3 has several important features. In this post, I offer an overview of the key capabilities:

  • CUDA programming model
    • CUDA Graph enhancements
    • Stream-ordered memory allocator enhancements
  • Language support: CUDA
    • C++ support enhancements
    • Python support
  • Compiler enhancements

Download CUDA 11.3 today.

CUDA programming model enhancements

With every CUDA release, NVIDIA continues to enhance the CUDA programming model to enable you to get the most out of NVIDIA GPUs, while maintaining the programming flexibility of the higher-level APIs. In this release, we extended several of the CUDA APIs to improve the ease of use of CUDA graphs and to enhance the stream-ordered memory allocator feature introduced in 11.2, among other updates.

CUDA Graphs

A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly, thus reducing overhead. CUDA graphs were introduced in CUDA 10.0 to allow work to be defined as graphs rather than single operations. For more information, see CUDA Graphs and Getting Started with CUDA Graphs.

Subsequent CUDA releases have seen a steady progression of new features. CUDA 11.3 introduces new features to improve the flexibility and the experience of using CUDA Graphs, such as:

  • Stream capture composability
  • User objects
  • Debug API

There are two ways to create CUDA graphs: construct the graph from scratch using graph APIs or use stream capture, which wraps a section of the application code and records the launched work from CUDA streams into a CUDA graph.

The new enhancements provide mutable access to the graph and to the set of dependencies of a capturing stream while capture is in progress, enabling new use patterns. You can intermix native graph node creation with stream capture more easily in the same graph. You can also pipeline tasks in new ways with the event-based dependency mechanics. Such access to the graph handle is required for the next feature, user objects.
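
As a brief, hedged sketch of the stream-capture path (the kernel, sizes, and launch loop below are illustrative, not from the post), the basic pattern is to record work from a stream into a graph, instantiate it once, and then relaunch it with low overhead:

#include <cuda_runtime.h>

__global__ void scaleKernel(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record a short sequence of kernel launches into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n, 2.0f);
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n, 0.5f);
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once, then replay the captured work repeatedly.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
    for (int i = 0; i < 100; ++i) {
        cudaGraphLaunch(graphExec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}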

User objects

Dynamic resources referenced by a graph must persist for the lifetime of the graph and until all executions have completed. This can be particularly challenging with stream capture, when the code responsible for the resource, such as a library, is not the same code managing the graph, such as the application code calling a library inside stream capture. User objects is a new feature to assist with the lifetime management of such resources in graphs by assisting with reference-counting the resource.

The following pseudo-code example shows how you can use the CUDA user object and the corresponding APIs to address this. The cudaUserObjectCreate API provides the mechanism to associate a user-specified, destructor callback with an internal refcount for managing a dynamic resource.

Object *object = new Object;              // Arbitrary C++ object
cudaUserObject_t cuObject;
// Template to wrap a delete statement in an extern "C" callback for CUDA:
cudaUserObjectCreate(&cuObject, object);
// Transfer our one reference (from calling create) to a graph:
cudaGraphRetainUserObject(graph, cuObject, cudaGraphUserObjectMove);
// No more references owned by this thread; no need to call release.
cudaGraphInstantiate(&graphExec, graph);  // Retains an additional reference
cudaGraphDestroy(graph);                  // graphExec still owns a reference
cudaGraphLaunch(graphExec, stream);       // graphExec has access while executing
// The execution is not synchronized yet, so the release may be deferred past the destroy call:
cudaGraphExecDestroy(graphExec);
cudaStreamSynchronize(0);
// The final release, and the call to the destructor, are now queued if they have not already executed. This completes asynchronously.

For more information, see CUDA User Objects.

Debug API

The new graph debug API provides a fast and convenient way to gain a high-level understanding of a given graph by creating a comprehensive overview of the entire graph, without you having to call individual APIs like the following to compose that view yourself:

  • cudaGraphGetNodes
  • cudaGraphGetEdges
  • cudaGraphHostNodeGetParams
  • cudaGraphKernelNodeGetParams

The single cudaGraphDebugDotPrint API can construct a detailed view of any uninstantiated graph, demonstrating the topology, node geometry, attribute configurations, and parameter values. Given a CUDA graph, it outputs a DOT graph, where DOT is a graph description language. This detailed view of the graph makes it easier for you to identify obvious configuration issues and enables you to create easy-to-understand bug reports that can be used by others to analyze and debug the issue. Combining this API with a debugger increases the usefulness, by isolating issues to specific nodes.

cudaGraphDebugDotPrint(cudaGraph_t graph, const char *path, unsigned int flags);
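
For example, assuming graph holds an uninstantiated graph such as the one built in the stream-capture sketch earlier (the output path here is illustrative), a minimal call looks like this:

// Dump the graph topology, node parameters, and attributes to a DOT file,
// which can then be rendered offline (for example: dot -Tpng graph.dot -o graph.png).
cudaGraphDebugDotPrint(graph, "graph.dot", 0);   // 0 selects the default flags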

Stream-ordered memory allocator

One of the highlights of CUDA 11.2 was the stream-ordered CUDA memory allocator feature that enables applications to order memory allocation and deallocation with other work launched into a CUDA stream. It also enables sharing memory pools across entities within an application. For more information, see Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features.

In CUDA 11.3, we added new APIs to enhance this feature:

  • A pointer query to obtain the handle to the memory pool for pointers obtained from the async allocator: cuPointerGetAttribute has a CU_POINTER_ATTRIBUTE_MEMPOOL_HANDLE attribute to retrieve the mempool corresponding to an allocation.
  • A device query to check whether mempool-based inter-process communication (IPC) is supported for a particular mempool handle type: cuDeviceGetAttribute has a CU_DEVICE_ATTRIBUTE_MEMPOOL_SUPPORTED_HANDLE_TYPES attribute.
  • Mempool usage statistics that provide a way to obtain allocated memory details: attributes such as the amount of physical memory currently allocated to the pool and the sum of the sizes of allocations that have not been freed can be queried using the cudaMemPoolGetAttribute API, as sketched after this list.

For more information, see Stream Ordered Memory Allocator.
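
As a hedged sketch of the runtime-API side of these queries (the allocation size and stream setup are illustrative), the default mempool’s usage statistics can be read as follows:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Stream-ordered allocation from the device's default memory pool.
    void *ptr = nullptr;
    cudaMallocAsync(&ptr, 1 << 20, stream);

    // Query the default pool and its usage statistics.
    cudaMemPool_t pool;
    cudaDeviceGetDefaultMemPool(&pool, 0);

    unsigned long long reserved = 0, used = 0;   // both attributes are 64-bit counters
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrReservedMemCurrent, &reserved);
    cudaMemPoolGetAttribute(pool, cudaMemPoolAttrUsedMemCurrent, &used);
    printf("pool reserved: %llu bytes, in use: %llu bytes\n", reserved, used);

    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}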

Other programming model enhancements

CUDA 11.3 formally supports virtual aliasing, a process where an application accesses two different virtual addresses, but they end up referencing the same physical allocation, creating a synonym in the memory address space. The CUDA programming model has been updated to provide guidelines and guarantees for this previously undefined behavior. The Virtual Memory Management APIs provide a way to create multiple virtual memory mappings to the same allocation using multiple calls to cuMemMap with different virtual addresses, that is, virtual aliasing.

CUDA 11.3 also introduces a new driver and runtime API to query the entry points (function addresses) of driver API functions. Previously, there was no direct way to obtain function pointers to the CUDA driver symbols: you had to call dlopen, dlsym, or GetProcAddress. This feature adds a new driver API, cuGetProcAddress, and a corresponding runtime API, cudaGetDriverEntryPoint.

This enables you to use the runtime API to call into driver APIs like cuCtxCreate and cuModuleLoad that do not have a runtime wrapper. It also enables access to new CUDA features with the newer driver on older toolkits or requesting for a per-thread default stream version of a CUDA driver function. For more information, see CUDA Driver API and Driver Entry Point Access.
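
A minimal sketch of the runtime-side lookup, assuming a CUDA 11.3 driver and toolkit (the symbol chosen here, cuCtxGetCurrent, is just an example):

#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaFree(0);   // force runtime initialization so a driver context exists

    // Look up a driver API entry point through the runtime.
    void *fn = nullptr;
    if (cudaGetDriverEntryPoint("cuCtxGetCurrent", &fn, cudaEnableDefault) == cudaSuccess && fn) {
        auto cuCtxGetCurrent_fn = reinterpret_cast<CUresult (*)(CUcontext *)>(fn);
        CUcontext ctx = nullptr;
        cuCtxGetCurrent_fn(&ctx);
        printf("Current driver context: %p\n", (void *)ctx);
    }
    return 0;
}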

Language extensions: CUDA

We continue to enrich and extend the CUDA software environment through extensions to industry-standard programming languages. More recently, we’ve added enhancements to C++ and CUDA Python to help simplify the developer experience.

C++ support enhancements

The NVIDIA C++ Standard Library (libcu++) is the C++ Standard Library for your entire system and ships with the CUDA toolkit. It provides a heterogeneous implementation of the C++ Standard Library that can be used in and between CPU and GPU code. It is an open-source project available on GitHub. libcu++ 1.4.1 is being released with the CUDA 11.3 toolkit.

The CUDA 11.3 toolkit release also includes CUB 1.11.0 and Thrust 1.11.0, which are major releases providing bug fixes and performance enhancements. CUB 1.11.0 includes a new DeviceRadixSort backend that improves performance by up to 2x on supported keys and hardware. Thrust 1.11.0 includes a new sort algorithm that provides up to 2x more performance from thrust::sort when used with certain key types and hardware. The new thrust::shuffle algorithm has been tweaked to improve the randomness of the output. For more information, see CUB and Thrust.
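
As a small, hedged illustration (the array size and key type are arbitrary), integer keys sorted with thrust::sort may take the improved radix-sort path on supported hardware:

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/functional.h>

int main() {
    thrust::device_vector<int> keys(1 << 20);
    thrust::sequence(keys.begin(), keys.end());                      // 0, 1, 2, ...
    thrust::sort(keys.begin(), keys.end(), thrust::greater<int>());  // sort descending on the device
    return 0;
}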

Python support (preview release on GitHub)

NVIDIA is excited by the vast developer community demand for support of the Python programming language. Python plays a key role within the science, engineering, data analytics, and deep learning application ecosystem. We’ve long been committed to helping the Python ecosystem use the accelerated computing performance of GPUs to deliver standardized libraries, tools, and applications. This is another step towards improved Python code portability and compatibility.

Our goal is to help unify the Python CUDA ecosystem with a single standard set of low-level interfaces, providing full coverage of and access to the CUDA host APIs from Python. We want to provide a foundation for the ecosystem to build on in unison, to allow interoperability amongst different accelerated libraries. Most importantly, we want to make it easy for you to use NVIDIA GPUs with Python:

  • Cython/Python wrappers for CUDA driver and Runtime APIs.
  • CUDA Python is a preview release on GitHub aligned with the CUDA 11.3 release.

For more information about the CUDA Python initiative, see Unifying the CUDA Python Ecosystem.

CUDA compiler

The CUDA 11.3 release of the CUDA C++ compiler toolchain incorporates new features aimed at improving productivity and code performance:

  • cu++filt—A standalone demangler tool that allows you to decode mangled function names to aid source code correlation.
  • NVRTC shared library versioning scheme—Relaxed to facilitate compatible upgrades of the library within a CUDA major release sequence.
  • Built-in function alloca—Used to allocate dynamic memory out of the stack frame, now available for use in device code as a preview feature.
  • CUDA C++ language—Extended to enable the use of the constexpr and auto keywords in broader contexts.
  • CUDA device linker— Also extended, with options that can be used to dump the call graph for device code along with register usage information to facilitate performance analysis and tuning.

For more information about these features,  see Programming Efficiently with the NVIDIA CUDA 11.3 Compiler Toolchain.

NVIDIA Nsight developer tools

The NVIDIA Nsight toolset contains Nsight Systems, Nsight Compute, and Nsight Graphics for GPU profiling and performance optimization. NVIDIA Nsight developer tools are seamlessly integrated into software development environments such as Microsoft Visual Studio and Eclipse for ease of execution and testing.

Nsight VS Code is our latest addition to the series of toolsets. It is an extension to Visual Studio Code for CUDA-based applications. As VS Code is widely adopted by the developer community, Nsight VS Code supports the latest features.

Nsight Systems 2021.2 introduces support for GPU metrics sampling. These metrics chart an overview of GPU efficiency over time within compute, graphics, and IO activities:

  • IO throughputs: PCIe, NVLink, and memory bandwidth
  • SM utilization: SMs active, Tensor Core activity, instructions issued, warp occupancy, and unassigned warp slots

This expands Nsight Systems’ existing ability to profile system-wide activity, helping you in the investigative work of tracking GPU workloads back to their CPU origins. It provides a deeper understanding of GPU utilization levels and the combined effect of multiple processes.

Nsight Compute 2021.1 adds a new NVLink topology and properties, OptiX 7 API stepping, MacOS 11 Big Sur host support, and improved resource tracking capabilities for user objects, stream capture, and asynchronous sub allocations. These new features give you increased visibility into the dynamic performance behavior of your workloads and how you are using hardware and software resources.


Categories
Misc

Using Tensor Cores in CUDA Fortran


Tensor Cores, which are programmable matrix multiply and accumulate units, were first introduced in the V100 GPUs where they operated on half-precision (16-bit) multiplicands. Tensor Core functionality has been expanded in the following architectures, and in the Ampere A100 GPUs (compute capability 8.0) support for other data types was added, including double precision.

Access to programming Tensor Cores in CUDA C became available in the CUDA 9.0 release for Volta GPUs through the WMMA (Warp Matrix Multiply and Accumulate) API, which was extended in CUDA 11.0 to support Ampere GPUs. This post describes a CUDA Fortran interface to this same functionality, focusing on the third-generation Tensor Cores of the Ampere architecture.

With the WMMA interface, a single warp of 32 threads performs D = A∗B +C. This operation is the building block to construct GEMM-like operations. The size of the matrices (C and D are m×n, A is m×k, and B is k×n) in this operation depends on the precision:

  • For real(2) multiplicands, m×n×k can be 16×16×16, 32×8×16, or 8×32×16.
  • For real(4) data using the TensorFloat32 format, m×n×k is 16×16×8.
  • For real(8) data, m×n×k is 8×8×4.

For the case where the multiplicands A and B contain real(2) data, C and D can be either both real(2) or both real(4) matrices. The Volta and Turing architectures support only the cases where the multiplicands are real(2) data. All this is summarized in Table 1. Before the WMMA operation can take place, the operand matrices must be loaded into registers and then distributed amongst the threads in the warp. The mapping of threads to matrix elements is opaque: the WMMA submatrix datatype (equivalent to the fragment in CUDA C) is used to represent the elements each thread holds of the matrix represented by the warp of threads, along with other metadata.

In this post, I focus on the WMMA interface for double precision or real(8) data.

Multiplicand Precision   Accumulator Precision       WMMA Tile Sizes (m×n×k)       Architecture
real(2) / 16-bit         real(2) / 16-bit or         16×16×16, 32×8×16, 8×32×16    Volta, Turing, NVIDIA Ampere
                         real(4) / 32-bit
real(4) / TF32           real(4) / TF32              16×16×8                       NVIDIA Ampere
real(8) / 64-bit         real(8) / 64-bit            8×8×4                         NVIDIA Ampere

Table 1. CUDA Fortran Tensor Core data precisions and WMMA tile sizes.

CUDA Fortran wmma module

The use of Tensor Cores through the WMMA API in CUDA Fortran requires the wmma module as well as the cuf_macros.CUF macro file. These provide Tensor Core–specific data types, along with routines to load and store data and perform warp-based matrix multiplications using these data types.

WMMASubMatrix datatype

Tiles of matrices used by a warp of threads to perform matrix multiplication are stored in variables of the WMMASubMatrix datatype in device code. For those familiar with the CUDA C API to Tensor Cores, WMMASubMatrix corresponds to the fragment template. There are different WMMASubMatrix types based on use (i.e., which operand in D = A ∗ B + C), precision, storage order, and dimensions, which are specified as type parameters in CUDA Fortran. Typical declarations of WMMASubMatrix variables used in device code are:

WMMASubMatrix(WMMAMatrixA, 8, 8, 4, Real, WMMAColMajorKind8) :: sa
WMMASubMatrix(WMMAMatrixB, 8, 8, 4, Real, WMMAColMajorKind8) :: sb 
WMMASubMatrix(WMMAMatrixC, 8, 8, 4, Real, WMMAKind8)   :: sc

The first parameter indicates the operand, corresponding to the A matrix, B matrix, and the accumulator. The following three parameters are the tile sizes, in this case m×n×k = 8×8×4. The datatype Real is specified next and is currently the only allowed value. The last parameter in WMMAMatrixA and WMMAMatrixB matrix declaration specifies row- or column-ordering (WMMAColMajor) as well as the precision (Kind8). The storage order for the accumulators is not specified at declaration, therefore the last parameter in the WMMAMatrixC declaration specifies only the precision.

Compilation

When compiling code using Tensor Cores for the A100 GPU, a compute capability of 8.0 and a CUDA runtime version of at least 11.0 should be used (-cuda -gpu=cc80,cuda11.0).

In addition, the preprocessor must be invoked due to the included macro file cuf_macros.CUF, either explicitly through the -Mpreprocess compiler option or implicitly by using an uppercase file extension such as .CUF or .F90. The cuf_macros.CUF file is included in the HPC SDK in the examples/CUDA-Fortran/TensorCores/Utils directory.

WMMA programming basics

This section presents a sequence of small example programs showing the concepts of WMMA programming in CUDA Fortran. The sequence begins with launching a kernel using a single warp of threads that performs a matrix multiplication of an 8×4 matrix by a 4×8 matrix. Each following example then adds complexity, until performance issues are addressed in the Performance of WMMA code section.

Example 1: 8×8×4 matrix multiply

The simplest possible matrix multiply using WMMA is the operation C = A ∗ B where the matrix sizes correspond to the WMMA tile sizes, where A, B, and C are 8×4, 4×8, and 8×8, respectively. The kernel code for this is as follows:

#include "cuf_macros.CUF"
module m
  integer, parameter :: wmma_m = 8
  integer, parameter :: wmma_n = 8
  integer, parameter :: wmma_k = 4
contains 
  ! kernel for C = A B  (C(8x8), A(8x4), B(4x8)) using wmma
  ! Should be launched with one block of 32 threads 

  attributes(global) subroutine wmma_single(a, b, c)
    use wmma   
    implicit none
    real(8) :: a(wmma_m,*), b(wmma_k,*), c(wmma_m,*)
    WMMASubMatrix(WMMAMatrixA, 8, 8, 4, Real, WMMAColMajorKind8) :: sa
    WMMASubMatrix(WMMAMatrixB, 8, 8, 4, Real, WMMAColMajorKind8) :: sb
    WMMASubMatrix(WMMAMatrixC, 8, 8, 4, Real, WMMAKind8)   :: sc
    integer :: lda, ldb, ldc

    lda = wmma_m
    ldb = wmma_k
    ldc = wmma_m

    sc = 0.0_8
    call wmmaLoadMatrix(sa, a(1,1), lda)
    call wmmaLoadMatrix(sb, b(1,1), ldb)
    call wmmaMatMul(sc, sa, sb, sc)
    call wmmaStoreMatrix(c(1,1), sc, ldc)
  end subroutine wmma_single
 end module m 

This is launched in host code with a single thread block containing a single warp of threads:

    call wmma_single<<<1, 32>>>(a_d, b_d, c_d)

The device arrays a_d, b_d, and c_d are declared in host code with dimensions corresponding to the WMMA submatrices:

  integer, parameter :: m = wmma_m, n = wmma_n, k = wmma_k
  real(8), device :: a_d(m,k), b_d(k,n), c_d(m,n)

Because the matrices passed in as arguments to the kernel are the same sizes as the WMMA submatrices, the matrix multiplication is accomplished by initializing the C WMMA submatrix to 0.0, loading the A and B matrices from global memory to WMMA submatrices, performing the matrix multiplication on the submatrices, and storing the result in the WMMA submatrix to global memory:

    sc = 0.0_8
    call wmmaLoadMatrix(sa, a(1,1), lda)
    call wmmaLoadMatrix(sb, b(1,1), ldb)
    call wmmaMatMul(sc, sa, sb, sc)
    call wmmaStoreMatrix(c(1,1), sc, ldc)

You may have noticed that the thread index threadIdx does not appear at all in this code. This underlies the important concept to take away from this example: the threads in a warp work collectively to accomplish these tasks. So, when dealing with the WMMA submatrices, you are doing warp-level programming rather than thread-level programming.

This kernel is launched with a single warp of 32 threads, yet your WMMA C submatrix has 8×8 or 64 elements. When the sc = 0.0_8 initialization statement is executed, each thread sets two elements in the 8×8 submatrix to zero. For those familiar with the CUDA C Tensor Core API, assignment to WMMA submatrices is overloaded to invoke the fill_fragment method. The mapping of threads to submatrix elements is opaque for this and other operations involving WMMA submatrices. From a programming standpoint, you only address what happens collectively by a warp of threads on WMMA submatrices.

The following statements load A and B from global memory to WMMA submatrices:

    call wmmaLoadMatrix(sa, a(1,1), lda)
    call wmmaLoadMatrix(sb, b(1,1), ldb)

They also work collectively. In these calls, the WMMA submatrices are specified as the first argument. The second arguments contain the addresses of the upper-left element of the tiles in global or shared memory to be loaded to the WMMA submatrices. The leading dimension of the matrices in global or shared memory is the third argument. The arguments passed to wmmaLoadMatrix are the same for all threads in the warp. Because the mapping of elements to threads in a warp is opaque, each thread just passes the address of the first element in the 8×4 or 4×8 matrix along with the leading dimension as the third parameter, and the load operation is distributed amongst the threads in the warp.

The matrix multiplication on the WMMA submatrices is performed by the following statement:

    call wmmaMatMul(sc, sa, sb, sc)

This statement is again performed collectively by a warp of threads. Here, you have used the same accumulator submatrix for the first and last arguments in the wmmaMatMul call, which is why its initialization to zero was required.

The following wmmaStoreMatrix call is analogous to the prior wmmaLoadMatrix calls:

    call wmmaStoreMatrix(c(1,1), sc, ldc)

Here, however, the first argument is the address of the upper left element of the tile in global memory, and the second argument is the WMMA submatrix whose values are stored. When both wmmaLoadMatrix and wmmaStoreMatrix are called with accumulator (WMMAMatrixC) arguments, there is an optional fourth argument that specifies the storage order—recall that declarations for WMMA accumulators do not specify storage order. In CUDA Fortran, the default order is WMMAColMajor or column-major storage order.

One final note on arguments to the wmmaLoadMatrix and wmmaStoreMatrix routines. There is a requirement that the leading dimension of the matrices, specified by the third argument of these routines, must be a multiple of 16 bytes, such as two real(8) elements, four real(4) elements, or eight real(2) elements. This becomes important when you use shared memory for the input arrays or more specifically, when you pad shared memory arrays.

Example 2: 8×8×16 matrix multiply

The second example performs the matrix multiplication C = A ∗ B where A is 8×16 and B is 16×8. A common theme in all these examples of this section is that a warp of threads calculates an 8×8 tile of the resultant matrix. As such, this kernel is still launched with a single warp of threads. The kernel for this code is as follows:

#include "cuf_macros.CUF"
module m
  integer, parameter :: wmma_m = 8
  integer, parameter :: wmma_n = 8
  integer, parameter :: wmma_k = 4
contains

  ! kernel for wmma on a single blockwise row in A
  !   C(8x8) = A(8xK) B(Kx8)
  ! with K a multiple of 4
  ! Launched with a single block of 32 threads 

  attributes(global) subroutine wmma_row(a, b, c, k)
    use wmma
    implicit none
    real(8) :: a(wmma_m,*), b(k,*), c(wmma_m,*)
    integer, value :: k 
    integer :: i

    WMMASubMatrix(WMMAMatrixA, 8, 8, 4, Real, WMMAColMajorKind8) :: sa
    WMMASubMatrix(WMMAMatrixB, 8, 8, 4, Real, WMMAColMajorKind8) :: sb
    WMMASubMatrix(WMMAMatrixC, 8, 8, 4, Real, WMMAKind8) :: sc
    integer :: lda, ldb, ldc

    lda = wmma_m 
    ldb = k
    ldc = wmma_m 

    sc = 0.0_8
    do i = 1, k, wmma_k
       call wmmaLoadMatrix(sa, a(1,i), lda)
       call wmmaLoadMatrix(sb, b(i,1), ldb)
       call wmmaMatMul(sc, sa, sb, sc)
    enddo
    call wmmaStoreMatrix(c(1,1), sc, ldc)
  end subroutine wmma_row
end module m

The only difference between this kernel and the kernel from the previous example is that the loading of A and B submatrices and the matrix multiplication on these submatrices occur inside a loop, where for each iteration a matrix multiplication of 8×4 by 4×8 tiles is performed. The second argument to wmmaLoadMatrix is again the same for all threads in the warp, it is the address of the first element of the 8×4 or 4×8 tile used in that iteration. The results are accumulated in the submatrix sc because the first and last arguments in wmmaMatMul are the same.

Example 3: 16×16×16 matrix multiply

For this example, you move past launching a kernel with a single warp of threads. The kernel still uses a single warp to calculate each 8×8 tile of the resultant matrix, but for a 16×16 resultant matrix, the host code now launches the kernel with four warps of threads, for now with one warp per thread block. The associated parameters in host code are as follows:

  ! m=n=k=16
  integer, parameter :: m_blocks = 2, n_blocks = 2, k_blocks= 4
  integer, parameter :: m = m_blocks*wmma_m, &
       n = n_blocks*wmma_n, &
       k = k_blocks*wmma_k

The host code launches the kernel with the following:

  grid = dim3(m_blocks,n_blocks,1)
  call wmma_8x8<<<grid, 32>>>(a_d, b_d, c_d, m, n, k)

The kernel itself is as follows:

#include "cuf_macros.CUF"
module m
  integer, parameter :: wmma_m = 8
  integer, parameter :: wmma_n = 8
  integer, parameter :: wmma_k = 4
contains
  ! kernel where each block performs matmul for a 8x8 submatrix of C

  attributes(global) subroutine wmma_8x8(a, b, c, m, n, k)
    use wmma
    implicit none
    real(8) :: a(m,*), b(k,*), c(m,*)
    integer, value :: m, n, k 
    integer :: i, row_t, col_t

    WMMASubMatrix(WMMAMatrixA, 8, 8, 4, Real, WMMAColMajorKind8) :: sa
    WMMASubMatrix(WMMAMatrixB, 8, 8, 4, Real, WMMAColMajorKind8) :: sb
    WMMASubMatrix(WMMAMatrixC, 8, 8, 4, Real, WMMAKind8) :: sc
    integer :: lda, ldb, ldc
    lda = m 
    ldb = k
    ldc = m 

    row_t = (blockIdx%x - 1)*wmma_m + 1
    col_t = (blockIdx%y - 1)*wmma_n + 1

    sc = 0.0_8
    do i = 1, k, wmma_k
       call wmmaLoadMatrix(sa, a(row_t,i), lda)
       call wmmaLoadMatrix(sb, b(i,col_t), ldb)
       call wmmaMatMul(sc, sa, sb, sc)
    enddo
    call wmmaStoreMatrix(c(row_t,col_t), sc, ldc)

  end subroutine wmma_8x8
end module m

Each thread block (consisting of a single warp of threads) calculates the results in the tile c(row_t:row_t+7, col_t:col_t+7). With one warp of threads per thread block, the row_t and col_t indices can be calculated from the blockIdx values:

    row_t = (blockIdx%x - 1)*wmma_m + 1
    col_t = (blockIdx%y - 1)*wmma_n + 1

While this kernel code is general and can be used for large matrices, with only 32 threads per block it is inefficient. In the next example, I address this inefficiency.

Example 4: 16×16×16 single-block matrix multiply

In this example, the kernel performs the same matrix multiplication as in example 3, but uses a single block of four warps of threads rather than four blocks of one warp each. The kernel is launched with the following code:

  tpb = dim3(tile_m/wmma_m*32, tile_n/wmma_n, 1)
  grid = dim3((m-1)/tile_m+1, (n-1)/tile_n+1, 1)
  call wmma_tile<<<grid, tpb>>>(a_d, b_d, c_d, m, n, k)

The module containing the kernel is as follows:

module m
  integer, parameter :: wmma_m = 8
  integer, parameter :: wmma_n = 8
  integer, parameter :: wmma_k = 4
  ! tile_m and tile_n are the size of submatrix of C that
  !   gets calculated per thread block and should be
  !   integral multiples of wmma_m and wmma_n, respectively
  integer, parameter :: tile_m = 16, tile_n = 16

contains

  ! launch with blocksize of
  !   dim3(tile_m/wmma_m*32, tile_n/wmma_n, 1)
  !   [= dim3(64, 2, 1) in this case]
 
  attributes(global) subroutine wmma_tile(a, b, c, m, n, k)
    use wmma
    use cudadevice
    implicit none
    real(8) :: a(m,*), b(k,*), c(m,*)
    integer, value :: m, n, k

    WMMASubMatrix(WMMAMatrixA, 8, 8, 4, Real, WMMAColMajorKind8) :: sa
    WMMASubMatrix(WMMAMatrixB, 8, 8, 4, Real, WMMAColMajorKind8) :: sb
    WMMASubMatrix(WMMAMatrixC, 8, 8, 4, Real, WMMAKind8) :: sc
    integer :: lda, ldb, ldc
    type(dim3) :: warpIdx
    integer :: i, row_t, col_t   

    lda = m 
    ldb = k
    ldc = m 
   
    warpIdx%x = (threadIdx%x - 1)/warpsize + 1
    warpIdx%y = threadIdx%y
   
    row_t = (blockIdx%x-1)*tile_m + (warpIdx%x - 1)*wmma_m + 1
    col_t = (blockIdx%y-1)*tile_n + (warpIdx%y - 1)*wmma_n + 1
   
    sc = 0.0_8

    do i = 1, k, wmma_k
       call wmmaLoadMatrix(sa, a(row_t,i), lda)
       call wmmaLoadMatrix(sb, b(i,col_t), ldb)
       call wmmaMatMul(sc, sa, sb, sc)
    enddo
    call wmmaStoreMatrix(c(row_t,col_t), sc, ldc)

  end subroutine wmma_tile

end module m

The parameters tile_m and tile_n denote the size of the tile of C that gets calculated per thread block (16×16 in this example). It is convenient to define a warpIdx variable of type(dim3) that is analogous to threadIdx, as all the WMMA operations are done on a per-warp basis. The threadIdx variable appears in the code but is only used to calculate the warpIdx values. The row_t and col_t indices now depend on the warpIdx values as well as the blockIdx values, but aside from that the code is like the code in Example 3.

While this code addresses the low occupancy of previous examples, it is inefficient in that it loads each 8×4 tile of A and 4×8 tile of B twice. The performance impact of such redundant loads is addressed in the next section.

Performance of WMMA code

One of the most important aspects affecting performance of Tensor Core code is data reuse. The following code example is an artificial benchmark that helps quantify the cost of performing loads from global memory to the WMMA submatrices, which live in registers, as well as the cost of the store operation from WMMA submatrices to global memory.  This code example uses the same 8×4 A and 4×8 B matrices to calculate their product in a tile of a larger C matrix:

  attributes(global) subroutine dmma_peak(a, b, c, m, n, niter)
    use cudadevice
    use wmma
    implicit none
    real(8) :: a(wmma_m,wmma_k), b(wmma_k,wmma_n), c(m, n)
    integer, value :: m, n, niter

    WMMASubMatrix(WMMAMatrixA, 8, 8, 4, Real, WMMAColMajorKind8) :: sa
    WMMASubMatrix(WMMAMatrixB, 8, 8, 4, Real, WMMAColMajorKind8) :: sb
    WMMASubMatrix(WMMAMatrixC, 8, 8, 4, Real, WMMAKind8)   :: sc
    type(dim3) :: warpIdx
    integer :: i, row_t, col_t   

    warpIdx%x = (threadIdx%x - 1)/warpsize + 1
    warpIdx%y = threadIdx%y   

    row_t = (blockIdx%x-1)*tile_m + (warpIdx%x - 1)*wmma_m + 1
    col_t = (blockIdx%y-1)*tile_n + (warpIdx%y - 1)*wmma_n + 1

    sc = 0.0_8
    call wmmaLoadMatrix(sa, a, wmma_m)
    call wmmaLoadMatrix(sb, b, wmma_k)
    do i = 1, niter
       call wmmaMatMul(sc, sa, sb, sc)
    enddo
    call wmmaStoreMatrix(c(row_t,col_t), sc, m)

  end subroutine dmma_peak

A loop in the host code launches the kernel with different values of niter, the number of times a matrix multiplication occurs using a single load/store of the input/output matrices. CUDA events are used to time the kernel and the resulting teraflops are reported.

Figure 1 shows that for large values of niter, the cost of loading and storing matrices is largely amortized away. The wmmaMatMul performance tapers off at about 18.5 TFlops. However, if there is no reuse of the A and B submatrices, so that each load is used in only one wmmaMatMul call, the kernel operates at around 1 TFlops. As a result, the first step in achieving good performance is to reuse the input matrices as much as possible, which brings you to the next example.

Figure 1. Artificial benchmark results where wmmaMatMul is called niter times within the kernel. This illustrates the effects of amortizing the loads and stores of the input and resultant matrices on the performance.

Example 5: Shared-memory matrix multiply

From this point on, the performance of the code is measured, and therefore larger matrix sizes are used. For example, the following code example has values of m, n, and k of 3,200. To illustrate the shared memory strategy, each thread block calculates a 32×32 tile of C. The parameters for this case are as follows:

  ! dmma tile sizes
  integer, parameter :: wmma_m = 8
  integer, parameter :: wmma_n = 8
  integer, parameter :: wmma_k = 4

  ! C tile of size tile_m x tile_n is calculated per thread block
  ! tile_m must be a multiple of 32
  integer, parameter :: tile_m = 32 
  integer, parameter :: tile_n = tile_m/wmma_m*wmma_n

  ! problem matrix sizes
  ! m, n, and k must be multiples of tile_m, tile_n, and wmma_k
  integer, parameter :: m = (3200/tile_m)*tile_m
  integer, parameter :: n = (3200/tile_n)*tile_n
  integer, parameter :: k = (3200/wmma_k)*wmma_k

  ! shared memory padding for A tile in terms of real(8) elements
  ! padding must be a multiple of 16 bytes, so smPad must be a multiple of 2
  integer, parameter :: smPad = 0

  ! Number of times kernel is called and timed
  integer, parameter :: nRuns = 20

  ! dependent parameters
  integer, parameter :: blockRows_as = tile_m/wmma_m

The expression for tile_n evaluates to 32. The parameter smPad specifies shared memory padding, which I discuss shortly. The parameter nRuns specifies the number of times that this kernel is run and its performance measured. When performed on an idle GPU, a single run may complete before the clocks reach their optimal level. Because these BLAS-like routines are typically used in large code blocks where such transitional periods are a small part of the overall run, you want to measure the asymptotic state. Alternatively, you could run untimed kernels before measuring performance as was done in the code that produced Figure 1. Another option is to fix the GPU clocks using nvidia-smi -ac.

A strategy for optimizing reuse of data in WMMA submatrices is to load the 32×4 A tile into shared memory, and for each warp of threads to load a single 4×8 submatrix of B and use that to calculate a 32×8 column of the C tile that is calculated by the thread block (Figure 2). To load the 32×4 A tile used here, this kernel is launched with four warps of threads per thread block.

Figure 2. Depiction of the tiles used by a thread block to calculate a 32×32 tile of the C matrix. The 32×4 tile of A is stored in shared memory, and each warp of threads loads a 4×8 submatrix of B. Each warp uses its B submatrix to calculate a 32×8 tile of C.

In the kernel, the shared memory and WMMA submatrix declarations are as follows:

    real(8), shared :: a_s(tile_m+smPad, wmma_k)
    WMMASubMatrix(WMMAMatrixA, 8, 8, 4, Real, WMMAColMajorKind8) :: sa
    WMMASubMatrix(WMMAMatrixB, 8, 8, 4, Real, WMMAColMajorKind8) :: sb
    WMMASubMatrix(WMMAMatrixC, 8, 8, 4, Real, WMMAKind8)   :: &
         sc(blockRows_as)

To avoid shared memory bank conflicts, a variable amount of padding is added to the first index of the shared memory tile. This padding can be specified using the smPad parameter. The leading dimension of the source matrix, for the shared memory array a_s, is as follows:

    lda_s = tile_m + smPad

This must be a multiple of 16 bytes, thus the smPad parameter must be a multiple of 2. Because each warp of threads is calculating a 32×8 tile of matrix C, an array of 4 (blockRows_as) submatrix accumulators is required. The kernel code performs the following index calculations prior to entering the main loop over the k dimension:

    ! row and column indices to the first element
    !   in the tile_m x tile_n tile of C
    row_t = (blockIdx%x-1)*tile_m + 1
    col_t = (blockIdx%y-1)*tile_n + 1

    ! C column index for each warp
    col_w = col_t + (warpIdx-1)*wmma_n

col_w is the column index used to load the warp’s B tile. After initializing the array of accumulator submatrices, the main loop over the k dimension and the loop to store the resultant C submatrices are as follows:

    do i = 1, k, wmma_k
       ! load the tile_m x wmma_k tile of A into shared memory;
       !   each thread loads a single element

       ! make sure all warps are done with a_s from the
       !   previous iteration before overwriting it
       call syncthreads()
       a_s(rowIdx_as, colIdx_as) = &
             a(row_t-1 + rowIdx_as, i-1 + colIdx_as)

       ! load B into wmma submatrices,
       !   each warp gets a different 4x8 block
       call wmmaLoadMatrix(sb, b(i,col_w), ldb)

       ! make sure a_s is fully populated before it is read
       call syncthreads()

       ! loop down column of C tile calculating sc()
       do j = 1, blockRows_as
          ! each warp loads the same 8x4 block of A
          call wmmaLoadMatrix(sa, a_s(1+(j-1)*wmma_m, 1), lda_s)
          call wmmaMatMul(sc(j), sa, sb, sc(j))
       enddo
    end do

    do j = 1, blockRows_as
       call wmmaStoreMatrix(c(row_t+(j-1)*wmma_m,col_w), sc(j), ldc)
    end do

For each iteration over the k dimension, the tile_m×4 A tile is loaded into shared memory. The size of the shared memory tile is chosen largely to keep the indexing for loading A into shared memory as simple and efficient as possible. Because tile_m is restricted to a multiple of 32, each column of the A tile is loaded by a whole number of warps, which keeps the global memory loads coalesced. The thread-to-element mappings are:

    colIdx_as = (threadIdx%x - 1)/(tile_m) + 1
    rowIdx_as = threadIdx%x - (colIdx_as-1)*(tile_m)
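
To make these mappings concrete, for tile_m = 32 and a 128-thread block they assign threads to elements of the shared A tile as follows, so that each warp of 32 consecutive threads fills one 32-element column:

    ! threadIdx%x =   1  ->  (rowIdx_as, colIdx_as) = ( 1, 1)
    ! threadIdx%x =  32  ->  (rowIdx_as, colIdx_as) = (32, 1)
    ! threadIdx%x =  33  ->  (rowIdx_as, colIdx_as) = ( 1, 2)
    ! threadIdx%x = 128  ->  (rowIdx_as, colIdx_as) = (32, 4)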

These mappings are used only when loading the A tile from global memory into shared memory on the line:

    a_s(rowIdx_as, colIdx_as) = &
          a(row_t-1 + rowIdx_as, i-1 + colIdx_as)

The number of warps used to read in a tile of A into shared memory dictates the number of warps per thread block and hence the number of columns in the B and C tiles:

    integer, parameter :: tile_m = 32
    integer, parameter :: tile_n = tile_m/wmma_m*wmma_n  ! = 32, since wmma_m = wmma_n = 8

In the main loop of the kernel over the k dimension, after issuing loads for the shared memory A tile, each warp of threads loads a B WMMA submatrix using the col_w index. A thread synchronization ensures all the elements in a_s are available to all threads. The code then loops over block rows, loading the WMMA submatrices from shared memory and performing the matrix multiplication. After the main loop over the k dimension completes, the results in the sc(j) tile are stored to global memory.

The code can be compiled with the following command:

nvfortran -cuda -gpu=cc80,cuda11.1 -O3 -o tensorCoreR8SharedA  tensorCoreR8SharedA.CUF
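
If you also want to see the per-kernel register and shared memory usage (relevant to the discussion of register pressure later in this post), the nvfortran -gpu option accepts a ptxinfo sub-option in recent releases; assuming your compiler version supports it, the command becomes:

nvfortran -cuda -gpu=cc80,cuda11.1,ptxinfo -O3 -o tensorCoreR8SharedA  tensorCoreR8SharedA.CUF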

Executing the code should produce results comparable to the following:

$ ./tensorCoreR8SharedA
 Device: A100-PCIE-40GB
 M = 3200, N = 3200, K = 3200
 tile_m = 32, tile_n = 32
 thread block = 128x1x1
 grid = 100x100x1
 shared memory padding (elements): 0
 nRuns: 1

 Test passed, TFlops:     6.324040 

Adding padding to the shared memory tile to avoid bank conflicts (here, setting smPad = 2 and recompiling) yields a significant bump in performance:

$ ./tensorCoreR8SharedA
 Device: A100-PCIE-40GB
 M = 3200, N = 3200, K = 3200
 tile_m = 32, tile_n = 32
 thread block = 128x1x1
 grid = 100x100x1
 shared memory padding (elements): 2
 nRuns: 1

 Test passed, TFlops:     8.234760

Table 2 shows the results for various cases.

Kernel       smPad = 0   smPad = 2   smPad = 4   smPad = 6
32×32×4         6.3         8.2         9.2         8.2
64×64×4         6.6         8.8        10.0         8.8
96×96×4         4.9         6.2         6.3         6.2
128×128×4       6.6         9.3         9.6         9.3
Table 2. Performance (TFlops) of Tensor Core double-precision matrix multiply on A100 PCIe for various kernels and shared memory padding values.

Tensor Core submatrices are register-intensive. While a single submatrix is declared for the A and B matrices, an array of submatrices is declared for the accumulator. Because the number of accumulator submatrices per block column grows with the C tile size, as does the number of block columns, register utilization becomes a limiting factor in performance for larger tiles. In the next example, I investigate a way to increase data reuse without increasing register usage.

Example 6: Matrix multiply with multiple wmma_k blocks

The code in Example 5 used a single block of wmma_k=4 for the A and B tiles. In this section, you generalize this to allow tile_k to be multiples of wmma_k. Increasing the size of the k dimension in the A and B tiles increases data reuse without increasing the register utilization by the Tensor Core submatrices, which is a function of tile_m and tile_n.
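
A representative set of tile parameters for this kernel, following the naming pattern of Example 5 (the exact values and declarations in the downloadable source may differ), is:

    ! tile sizes; tile_k is now a multiple of wmma_k = 4
    integer, parameter :: tile_m = 64, tile_n = 64, tile_k = 64

    ! separate shared memory padding for the A and B tiles
    integer, parameter :: smPadA = 4, smPadB = 4

    ! dependent parameters
    integer, parameter :: blockRows_as = tile_m/wmma_m
    integer, parameter :: blockCols_bs = tile_n/wmma_n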

To facilitate the larger A and B tiles in this example, both tiles are placed in shared memory. As a result, a single thread block may require more shared memory than the 48-KB limit on static shared memory, so you use dynamic shared memory instead. When multiple dynamic shared memory arrays are declared in a kernel, all such arrays point to the head of the dynamic shared memory block, and it is up to you to do the bookkeeping needed to partition that block among the different arrays. This can be done by defining offsets, but to accommodate multiple two-dimensional arrays of different shapes, this example uses Cray pointers. Declare a block of dynamic shared memory in the kernel with the following line:

    real(8), shared :: dynSM(*)

You also declare Cray pointers used to partition this array into the A and B tiles:

 real(8), shared :: a_s(tile_m+smPadA, tile_k); pointer(a_sCrayP, a_s)
 real(8), shared :: b_s(tile_k+smPadB, tile_n); pointer(b_sCrayP, b_s)

The pointers a_sCrayP and b_sCrayP are associated with portions of the dynamic shared memory array in the block:

    block
      integer :: offset
      offset = 0
      a_sCrayP = loc(dynSM(offset+1))
      offset = offset + (tile_m+smPadA)*tile_k
      b_sCrayP = loc(dynSM(offset+1))
    end block

After the pointers are associated, the pointees a_s and b_s can be used like any other two-dimensional arrays. In host code, you must set the maximum amount of dynamic shared memory with the cudaFuncSetAttribute function, as the amount needed may exceed the default limit of 48 KB:

   smSizeInBytes = 8*((tile_m+smPadA)*tile_k &
       + (tile_k+smPadB)*tile_n)

   i = cudaFuncSetAttribute(dmm, &
       cudaFuncAttributeMaxDynamicSharedMemorySize, &
       smSizeInBytes)
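
The kernel launch must then pass the same byte count as the optional third argument of the execution configuration; the argument list and the grid and thread block variables below are placeholders:

   ! dynamic shared memory size is the third execution configuration parameter
   call dmm<<<grid, tBlock, smSizeInBytes>>>(a_d, b_d, c_d, m, n, k)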

In Example 5, you used shared memory for the A tile rather than the B tile. With a leading dimension of tile_m+smPad, where tile_m is a multiple of 32, the A tile can be loaded into shared memory in a coalesced fashion, and relatively little shared memory is spent on padding. For the B tile, by contrast, with a leading dimension of wmma_k+smPad, loading data into shared memory would not be coalesced, and for any nonzero padding a significant portion of shared memory would be dedicated to the padding. The choice of having a warp of threads calculate a block column of the C tile followed from using shared memory for the A tile.

With both A and B tiles in shared memory, and with typical values of tile_k equal to or greater than tile_m, having a warp of threads calculate either a block row or a block column of C is valid from a performance standpoint. It turns out that having a warp calculate a block row of C performs better. The following example code performs the WMMA submatrix loads and matrix multiplications after loading the A and B tiles into shared memory:

       ! nested loop over WMMA submatrices
       !
       ! k is outermost loop so 8x4 WMMA submatrix of A is reused
       do kt = 1, tile_k, wmma_k
          ! each warp gets a different 8x4 block of A
          call wmmaLoadMatrix(sa, a_s(rowIdx_w, kt), lda_s)
            
          ! now go across block row of B/C
          do j = 1, blockCols_bs
             nt = (j-1)*wmma_n + 1
             ! each warp gets the same block of B
             call wmmaLoadMatrix(sb, b_s(kt,nt), ldb_s)
             call wmmaMatMul(sc(j), sa, sb, sc(j))
          enddo
       enddo
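
The warp index warpIdx, the loop bound blockCols_bs, and the row index rowIdx_w are not shown in this excerpt. Definitions consistent with one warp per 8-row block row of the C tile would be the following sketch, which may differ from the downloadable source:

       ! warps are numbered 1..tile_m/wmma_m within the thread block
       warpIdx  = (threadIdx%x - 1)/warpsize + 1

       ! first row, within a_s, of the block row this warp computes
       rowIdx_w = (warpIdx - 1)*wmma_m + 1

The loop bound blockCols_bs = tile_n/wmma_n is simply the number of 8-column blocks across the C tile.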

Figure 3 shows a graphical depiction of the process. For the first iteration of the outermost loop over the k dimension, the warps in a thread block collectively load a block column of the A matrix, where each warp loads a different 8×4 submatrix of A. Each warp then loads the same 4×8 submatrix of B, and C is updated with the resulting matrix multiplications. Keeping the same A submatrices in registers, the kernel iterates over block columns of B, updating the corresponding C submatrices within each iteration. The kernel then advances to the next block in k, loading the second block column of A and iterating over the second block row of B. Placing k as the outermost loop facilitates the reuse of data already loaded into submatrices (the block column of A).

Figure 3. Tile configuration with multiple wmma_k blocks in tile_k.

Table 3 shows the results for a sample of parameters. This is just a small sampling of the available parameter space. To keep the indexing relatively simple when loading shared memory, the choice of tile_k is limited, in addition to the previous restrictions on tile_m and tile_n.

tile_m × tile_n tile_k = 32 tile_k = 64 tile_k = 96 tile_k = 128
32×32 10.6 9.0 6.4
64×64 14.5 14.4 8.65
96×96 10.7 10.9 11.5
128×128 12.86 13.5
Table 3. Performance (TFlops) of Tensor Core double-precision matrix multiply with tile_k a multiple of wmma_k; smPadA = smPadB = 4 for all cases. Due to restrictions on tile sizes and limits on shared memory, certain combinations of tile sizes are not allowed.

Conclusion

I’ve included the source code for all the examples in this post: tensorCore_source_code.zip.

As mentioned earlier, to keep the indexing used to load the shared memory tiles simple in the examples, I’ve restricted the parameter space of tile_m, tile_n, and tile_k. These restrictions can be relaxed with some additional indexing arithmetic and guards to prevent out-of-bounds accesses. Even with a limited parameter space and relatively simple kernels, you were able to reach 14.5 out of a peak 19.5 TFlops.

The earlier examples can be used as a template for other code blocks that use Tensor Cores on double precision data. Another, more hands-off approach to leveraging the power of Tensor Cores is through the cuTensor library, which has CUDA Fortran interfaces. For more information, see Bringing Tensor Cores to Standard Fortran.

Categories
Misc

Healthcare Headliners Put AI Under the Microscope at GTC

Two revolutions are meeting in the field of life sciences — the explosion of digital data and the rise of AI computing to help healthcare professionals make sense of it all, said Daphne Koller and Kimberly Powell at this week’s GPU Technology Conference. Powell, NVIDIA’s vice president of healthcare, presented an overview of AI innovation Read article >

The post Healthcare Headliners Put AI Under the Microscope at GTC appeared first on The Official NVIDIA Blog.

Categories
Misc

NVIDIA RTX Lights Up the Night in Stunning Demos at GTC

NVIDIA is putting complex night scenes in a good light. A demo at GTC21 this week showcased how NVIDIA RTX Direct Illumination (RTXDI) technology is paving the way for realistic lighting in graphics. The clip shows thousands of dynamic lights as they move, turn on and off, change color, show reflections and cast shadows. People Read article >

The post NVIDIA RTX Lights Up the Night in Stunning Demos at GTC appeared first on The Official NVIDIA Blog.

Categories
Misc

You Put a Spell on Me: GFN Thursdays Are Rewarding, 15 New Games Added This Week

This GFN Thursday — when GeForce NOW members can learn what new games and updates are streaming from the cloud — we’re adding 15 games to the service, with new content, including NVIDIA RTX and DLSS in a number of games. Plus, we have a GeForce NOW Reward for Spellbreak from our friends at Proletariat. Read article >

The post You Put a Spell on Me: GFN Thursdays Are Rewarding, 15 New Games Added This Week appeared first on The Official NVIDIA Blog.

Categories
Misc

how can I save an NVIDIA StyleGan2 Ada .pkl file to .pb for use in other applications?

Title says it all. I’ve trained a sg2-ada GAN and it is saved as a pickled checkpoint, but I don’t know how to get it out.

For reference, the github with all the goods: https://github.com/dvschultz/stylegan2-ada/blob/main/train.py

submitted by /u/diditforthevideocard
[visit reddit] [comments]

Categories
Misc

Tensorflow Returning "ValueError: ‘outputs’ must be defined before the loop"

Hey everyone, I hope you are having a fabulous day!

I am trying to train a ResNet50 model on a dataset represented by a tensorflow.python.keras.utils.data_utils.Sequence that I created. When I run the code, I keep getting errors saying that my dataset/generator is returning Nones instead of images and I would really like some help figuring out why and how to fix it.

I posted the issue on Stack Overflow, which seems unusually quiet.

Could you guys please help me out?

Thanks a bunch for your time and efforts in advance!

submitted by /u/Revolutionary-Tie412
[visit reddit] [comments]

Categories
Misc

Ray Tracing Gems II: Available August 4th

We are pleased to announce that Ray Tracing Gems II, the follow up to 2019’s Ray Tracing Gems, will be available for digital download and print on August 4th, 2021.


Today, as nearly every hardware vendor and 3D software platform embraces ray tracing, it is clear that real-time ray tracing is here to stay. Ray Tracing Gems II brings the community of rendering experts back together again to unearth true “gems” for developers of games, architectural applications, visualizations, and more in this exciting new era of real-time rendering. Rendering experts share their knowledge by explaining everything from basic ray tracing concepts geared toward beginners all the way to full ray tracing deployment in shipping AAA games.

Just like the first book, the digital edition of Ray Tracing Gems II will be free to download and the print edition will be available for purchase from Apress and Amazon. 

Sharing and using knowledge widely and freely is an important pillar of the Ray Tracing Gems series. We are pleased to announce that Ray Tracing Gems II will be “Open Access”, under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND), “which permits use, duplication, distribution, and reproduction in any medium or format, as long as appropriate credit is given to the original author(s) and the source, a link is provided to the Creative Commons license, and any changes made are indicated”.

In the coming months, leading up to the book’s release in August, we’ll be sharing a preview of some of the incredible content coming in Ray Tracing Gems II. In the meantime, you can get the original Ray Tracing Gems (in hardcover from Apress or as a free digital download) here.

Adam Marrs, Peter Shirley, and Ingo Wald