NVIDIA announces the newest release of the CUDA development environment, CUDA 11.4. This release includes GPU-accelerated libraries, debugging and optimization tools, programming language enhancements, and a runtime library to build and deploy your application on GPUs across the major CPU architectures: x86, Arm, and POWER.
CUDA 11.4 is focused on enhancing the programming model and performance of your CUDA applications. CUDA continues to push the boundaries of GPU acceleration and lay the foundation for new applications in HPC, graphics, CAE applications, AI and deep learning, automotive, healthcare, and data sciences.
CUDA 11.4 has several important features. This post offers an overview of the key capabilities:
CUDA programming model enhancements:
  CUDA Graphs
  Multi-Process Service (MPS)
  Formalizing asynchronous data movement
C++ language support for CUDA
CUDA compiler enhancements
CUDA driver enhancements
CUDA 11.4 ships with the R470 driver, which is a long-term support branch. GPUDirect RDMA and GPUDirect Storage (GDS) are now part of the CUDA Driver and Toolkit. This streamlines the workflow and enables developers to leverage these technologies without installing additional packages separately. The driver also enables new MIG configurations for the recently launched NVIDIA A30 GPU, doubling the memory per MIG slice. This results in greater peak performance for various workloads on the A30 GPU, especially AI inference workloads.
This release introduces key enhancements to improve the performance of CUDA Graphs, without requiring any modifications to the application or any other user intervention. It also improves the ease of use of Multi-Process Service (MPS). In addition, we formalized the asynchronous programming model in the CUDA Programming Guide.
CUDA Graphs
Reducing graph launch latency is a common request from the developer community, especially in applications that have real-time constraints, such as 5G telecom workloads or AI inference workloads. CUDA 11.4 delivers performance improvements in reducing the CUDA graph launch times. In addition, we also integrated the stream-ordered memory allocation feature that was introduced in CUDA 11.2.
CUDA graphs are ideal for workloads that are executed multiple times, so a key tradeoff in choosing graphs for a workload is amortizing the cost of creating a graph over repeated launches. The higher the number of repetitions or iterations, the larger the performance improvement.
In CUDA 11.4, we made a couple of key changes to CUDA graph internals that further improve launch performance. CUDA Graphs already sidestep streams to enable lower-latency runtime execution. We extended this to bypass streams at the launch phase as well, submitting a graph as a single block of work directly to the hardware. We’ve seen good performance gains from these host improvements, both for single-threaded and multithreaded applications.
Figure 1 shows the relative improvement in launch latency for the re-launch of different graph patterns. There is significant benefit for graphs that have a fork or join pattern.
Figure 1. Launch latency performance improvement for a single graph for repeat launches, relative to the CUDA 11.3 baseline.
Multithreaded launch performance is particularly affected by the resource contention that happens when launching multiple graphs in parallel. We’ve optimized the interthread locking to reduce contention, and so multithreaded launch is now significantly more efficient. Figure 2 shows the relative performance benefits of the changes in CUDA 11.4 to ease resource contention and how it scales with the number of threads.
Figure 2. Multithreaded launch latency improvement for a straight-line graph for repeat launches, relative to the CUDA 11.3 baseline.
Stream-ordered memory allocator support
The stream-ordered memory allocator enables applications to order memory allocation and deallocation with respect to other work launched into a CUDA stream. This also enables allocation re-use, which can significantly improve application performance. For more information about the feature and capabilities, see Enhancing Memory Allocation with New NVIDIA CUDA 11.2 Features.
In CUDA 11.4, CUDA Graphs now supports stream-ordered memory allocation, either through stream capture or natively in graph construction through new allocate and free node types, enabling the same efficient, deferred memory-reuse logic within graphs.
These node types are collectively referred to as memory nodes. They can be created in two ways:
Using the explicit API: cudaGraphAddMemAllocNode and cudaGraphAddMemFreeNode, respectively
Using stream capture: cudaMallocAsync/cudaMallocFromPoolAsync and cudaFreeAsync, respectively
In the same way that stream-ordered allocation uses implicit stream ordering and event dependencies to reuse memory, graph-ordered allocation uses the dependency information defined by the edges of the graph to do the same.
Figure 3. Intra-graph memory reuse: when a MemAlloc node is created, it attempts to reuse memory that was freed by MemFree nodes it depends on.
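As a minimal sketch of the stream-capture path (the stream, size, kernel, and launch configuration below are illustrative placeholders), stream-ordered allocation calls captured into a graph become memory nodes:
// Minimal sketch: capture stream-ordered allocation calls into a graph as memory nodes.
// The stream, kernel, and size names below are illustrative placeholders.
cudaGraph_t graph;
cudaGraphExec_t graphExec;
void *ptr;
dim3 grid(1), block(128);

cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
cudaMallocAsync(&ptr, size, stream);       // captured as a MemAlloc node
myKernel<<<grid, block, 0, stream>>>(ptr); // ordered after the allocation
cudaFreeAsync(ptr, stream);                // captured as a MemFree node
cudaStreamEndCapture(stream, &graph);

cudaGraphInstantiate(&graphExec, graph, NULL, NULL, 0);
cudaGraphLaunch(graphExec, stream);        // repeated launches reuse memory inside the graph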
Multi-Process Service
The Multi-Process Service (MPS) is a binary-compatible client-server runtime implementation of the CUDA API designed to transparently enable cooperative multiprocess CUDA applications.
It consists of a control daemon process, client runtime, and server process. MPS enables better GPU utilization in cases where a single process does not use all the compute and memory-bandwidth capacity. MPS also reduces on-GPU context storage and context switching. For more information, see Multi-Process Service in the GPU Management and Deployment guide.
In this release, we made a couple of key enhancements to improve the ease of use of MPS.
Figure 4. Schematic representation of MPS, with reduced on-GPU context storage and context switching to improve ease of use.
Programmatic configuration of SM partitions
There are certain use cases that share the following characteristics:
They consist of kernels that have little to no interaction, which enables concurrent execution.
The ratio of SMs required by these workloads may change and requires flexibility in allocating the right number of SMs.
The MPS active thread percentage setting enables you to limit the execution to a portion of the SMs. Before CUDA 11.4, this was a fixed value that was set equally for all clients within the process. In CUDA 11.4, this has been extended to offer a mechanism to partition the SMs at a per-client level through a programmatic interface. This enables you to create contexts that have different SM partitions within the same application process.
A new resource type called CU_EXEC_AFFINITY_TYPE_SM_COUNT enables you to specify a minimum number N that the context requires. The system guarantees that at least this many SMs are assigned, although more may be reserved. CUDA 11.4 also introduces a related affinity API cuCtxGetExecAffinity, which queries the exact amount of a resource (such as the SM count) allocated for a context. For more information, see the cuCtxGetExecAffinity section in the API documentation.
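For illustration, here is a sketch of how a client might request an SM partition through the driver API; the SM count and device ordinal are assumed values, not recommendations:
// Sketch: create a context guaranteed at least 8 SMs (values are illustrative).
CUdevice dev;
CUcontext ctx;
CUexecAffinityParam affinity;

cuInit(0);
cuDeviceGet(&dev, 0);

affinity.type = CU_EXEC_AFFINITY_TYPE_SM_COUNT;
affinity.param.smCount.val = 8;             // request at least 8 SMs
cuCtxCreate_v3(&ctx, &affinity, 1, 0, dev); // more SMs than requested may be reserved

// Query how many SMs were actually granted to this context
CUexecAffinityParam granted;
cuCtxGetExecAffinity(&granted, CU_EXEC_AFFINITY_TYPE_SM_COUNT);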
Error reporting
To improve the error reporting and ease of diagnosing the root cause of MPS issues, we introduced new and detailed driver and runtime error codes. These error codes provide more specificity regarding the type of error. They supplement the common MPS error codes with additional information to help you trace down the cause of the failures. Use these error codes in your applications with the error messages in the server log, as part of the root cause analysis.
Formalizing asynchronous data movement
To support the asynchronous memory transfer operations enabled by the NVIDIA A100 GPU microarchitecture, CUDA 11.4 formalizes and defines the asynchronous SIMT programming model. The asynchronous programming model defines the behavior and the APIs for C++20 barriers and cuda::memcpy_async on the GPU.
For more information about how you can use the asynchronous APIs to overlap memory operations from global memory, with computations in the streaming multiprocessors (SMs), see Asynchronous SIMT Programming Model.
Other enhancements
In addition to the key capabilities listed earlier, there are a few enhancements in CUDA 11.4 geared towards improving multithreaded submission throughput and extending CUDA forward compatibility support to NVIDIA RTX GPUs.
Multithread submission throughput
In CUDA 11.4, we reduced the serialization of CUDA API overheads between CPU threads. These changes are enabled by default. However, to assist with the triage of possible issues caused by the underlying changes, we provide an environment variable, CUDA_REDUCE_API_SERIALIZATION, to turn off these changes. This was one of the underlying changes discussed earlier that contributed to the performance improvements for CUDA graphs.
CUDA forward compatibility
To enable use cases where you want to update your CUDA toolkit but stay on your current driver version, for example to reduce the risk or the overhead of additional validation needed to move to a new driver, CUDA offers the CUDA forward compatibility path. This was introduced in CUDA 10.0 but was initially limited to data-center GPUs. CUDA 11.4 eases those restrictions and you can now take advantage of the forward compatibility path for NVIDIA RTX GPUs as well.
Figure 5. Forward compatibility upgrade path between pre-CUDA 11.0 releases and current CUDA 11.x drivers.
C++ language support for CUDA
Here are some key enhancements included with C++ language support in CUDA 11.4.
Major releases:
NVIDIA C++ Standard Library (libcu++) 1.5.0 was released with CUDA 11.4.
Thrust 1.12.0 has the new thrust::universal_vector API that enables you to use the CUDA unified memory with Thrust.
Bug fix release: The CUDA 11.4 toolkit release includes CUB 1.12.0.
New asynchronous thrust::async::exclusive_scan and thrust::async::inclusive_scan algorithms have been added, and the synchronous versions of these were updated to use cub::DeviceScan directly.
CUDA compiler enhancements
The CUDA 11.4 NVCC C++ compiler has JIT LTO support in preview, offers more L1 and L2 cache control, and exposes a C++ symbol demangling static library, along with NVIDIA Nsight debugger support for alloca.
JIT link time optimization
JIT link-time optimization (LTO) is a preview feature and is available only on CUDA Toolkit 11.4, not on embedded platforms. This feature enables LTO to be performed at runtime. Use NVRTC to generate NVVM IR, and then use the cuLink driver APIs to link the NVVM IR and do LTO.
The following code example shows how runtime JIT LTO can be used in your program. Generate NVVM IR using nvrtcCompileProgram with the -dlto option and retrieve the generated NVVM IR using the newly introduced nvrtcGetNVVM. Existing cuLink APIs are augmented to take newly introduced JIT LTO options to accept NVVM IR as input and to perform JIT LTO. Pass the CU_JIT_LTO option to cuLinkCreate API to instantiate the linker and then use CU_JIT_INPUT_NVVM as option to cuLinkAddFile or cuLinkAddData API for further linking of NVVM IR.
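A sketch of that flow follows; error checks and includes are omitted, and the kernelSource string, file names, and option values are illustrative assumptions rather than required settings:
// Sketch of runtime JIT LTO (CUDA 11.4 preview).
nvrtcProgram prog;
nvrtcCreateProgram(&prog, kernelSource, "kernel.cu", 0, NULL, NULL);

const char *opts[] = { "-dlto", "-arch=compute_80" };    // generate NVVM IR instead of PTX/SASS
nvrtcCompileProgram(prog, 2, opts);

size_t nvvmSize;
nvrtcGetNVVMSize(prog, &nvvmSize);
char *nvvmIR = (char *)malloc(nvvmSize);
nvrtcGetNVVM(prog, nvvmIR);                              // retrieve the generated NVVM IR

CUjit_option options[] = { CU_JIT_LTO };                 // instantiate the linker with LTO enabled
void *optionValues[] = { (void *)1 };
CUlinkState linkState;
cuLinkCreate(1, options, optionValues, &linkState);
cuLinkAddData(linkState, CU_JIT_INPUT_NVVM, nvvmIR, nvvmSize,
              "kernel.nvvm", 0, NULL, NULL);             // NVVM IR input for JIT LTO

void *cubin; size_t cubinSize;
cuLinkComplete(linkState, &cubin, &cubinSize);           // link-time optimization happens here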
The CUDA SDK now ships with libcu++filt, a static library that converts compiler-mangled C++ symbols into user-readable names. The following API, found in the nv_decode.h header file, is the entry point to the library:
char* __cu_demangle(const char* id, char *output_buffer, size_t *length, int *status)
The following C++ example code shows usage:
#include <iostream>
#include <cstdlib>
#include "/usr/local/cuda-11.4/bin/nv_decode.h"
using namespace std;
int main(int argc, char **argv)
{
    const char* mangled_name = "_ZN6Scope15Func1Enez";
    int status = 1;
    char* w = __cu_demangle(mangled_name, 0, 0, &status);
    if (status != 0)
        cout << "Demangling failed, status: " << status << endl;
    else
        cout << "Demangling Succeeded: " << w << endl;
    free(w);
    return 0;
}
This code example outputs as follows:
Demangling Succeeded: Scope1::Func1(__int128, long double, ...)
For more information, see Library Availability in the CUDA Binary Utilities documentation.
Configuring cache behavior in PTX
PTX ISA 7.4 gives you more control over caching behavior of both L1 and L2 caches. The following capabilities are introduced in this PTX ISA version:
Enhanced data prefetching: The new .level::prefetch_size qualifier can be used to prefetch additional data along with memory load or store operations. This enables exploiting the spatial locality of data.
Eviction priority control: PTX ISA 7.4 introduces four cache eviction priorities. These eviction priorities can be specified with the .level::eviction_priority qualifier on memory load or store operations (applicable to the L1 cache) and on the prefetch instruction (applicable to the L2 cache).
evict_normal (default)
evict_last (useful when the data should be kept in the cache longer)
evict_first (useful for streaming data)
no_allocate (avoid data from being cached at all)
Enhanced L2 cache control: This comes in two flavors:
Cache control on specific addresses: The new discard instruction enables discarding data from the cache without writing it back to memory. It should only be used when the data is no longer required. The new applypriority instruction sets the eviction priority of specific data to evict_normal. This is particularly useful for downgrading the eviction priority from evict_last when the data no longer needs to persist in the cache.
Cache hints on memory operations: The new createpolicy instruction enables creating a cache policy descriptor that encodes one or more cache eviction priorities for different data regions. Several memory operations, including load, store, asynchronous copy, atom, and red, can accept a cache policy descriptor as an operand when the .level::cache_hint qualifier is used.
These extensions are treated as performance hints only. The caching behavior specified using these extensions is not guaranteed by the caching system. For more information about usage, see the PTX ISA specification.
Other compiler enhancements in CUDA 11.4 include support for a new host compiler: ICC 2021. The diagnostics emitted by the CUDA frontend compiler are now ANSI colored and Nsight debugger can now correctly unwind CUDA applications with alloca calls, in the Call Stack view.
Nsight Developer Tools
New versions are now available for NVIDIA Nsight Visual Studio Code Edition (VSCE) and Nsight Compute 2021.2, adding enhancements to the developer experience for CUDA programming.
NVIDIA Nsight VSCE is an application development environment for heterogeneous platforms bringing CUDA development for GPUs into Microsoft Visual Studio Code. NVIDIA Nsight VSCE enables you to build and debug GPU kernels and native CPU code in the same session as well as inspect the state of the GPU and memory.
It includes IntelliSense code highlighting for CUDA applications and an integrated GPU debugging experience from the IDE, with support for stepping through code, setting breakpoints, and inspecting memory states and system information in CUDA kernels. Now it’s easy to develop and debug CUDA applications directly from Visual Studio Code.
Nsight Compute 2021.2 adds new features that help detect more performance issues and make it easier to understand and fix them. The new register dependency visualization (Figure 6) helps identify long dependency chains and inefficient register usage that can limit performance. This release also adds a frequently requested feature to enable you to view the side-by-side assembly and correlated source code for CUDA kernels in the source view, without needing to collect a profile. This standalone source viewer feature enables you to open .cubin files directly from disk in the GUI to see the code correlation.
Figure 6. Nsight Compute Register Dependency Visualization capability helps identify long dependency chains and inefficient register usage to help improve performance.
Several features, including highlighted focus metrics, report cross-links, increased rule visibility, and documentation references all add to the built-in profile and optimization guided analysis that Nsight Compute provides to help you understand and fix performance bottlenecks.
This release also includes support for OptiX 7 resource tracking, a new Python interface for reading report data, and improvements to management of baseline reports, font settings, and CLI filters.
Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions to allocate GPU accessible memory. However, there has long been an obstacle with these API functions: they aren’t stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations.
In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.
Stream ordering efficiency
The following code example is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated upfront and sized to the larger of the two sizes, as shown in the second example that follows.
cudaMalloc(&ptrA, sizeA);
kernelA<<<..., stream>>>(ptrA);
cudaFree(ptrA); // Synchronizes the device before freeing memory
cudaMalloc(&ptrB, sizeB);
kernelB<<<..., stream>>>(ptrB);
cudaFree(ptrB);
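A sketch of the more efficient upfront-allocation version described above (the exact sizing expression is illustrative):
// Allocate once upfront, sized to the larger of the two buffers, and free once at the end
cudaMalloc(&ptr, max(sizeA, sizeB));
kernelA<<<..., stream>>>(ptr);
kernelB<<<..., stream>>>(ptr);
cudaFree(ptr);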
This increases code complexity in the application because the memory management code is separated out from the business logic. The problem is exacerbated when other libraries are involved. For example, consider the case where kernelA is launched by a library function instead:
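One possible shape of that pattern (the library internals are shown only for illustration):
libraryFuncA(stream); // The library allocates and frees the memory that kernelA uses
cudaMalloc(&ptrB, sizeB);
kernelB<<<..., stream>>>(ptrB);
cudaFree(ptrB);

void libraryFuncA(cudaStream_t stream) {
    cudaMalloc(&ptrA, sizeA);
    kernelA<<<..., stream>>>(ptrA);
    cudaFree(ptrA); // Synchronizes the device before freeing memory
}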
This is much harder for the application to make efficient because it may not have complete visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when that function is invoked for the first time and never free it until the library is deinitialized. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application from using that memory.
Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.
CUDA 11.2 introduced a stream-ordered memory allocator to solve these types of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations that enable you to compose memory management with GPU work submission. This eliminates the need for synchronizing outstanding GPU work and helps restrict the lifetime of the allocation to the GPU work that accesses it. Consider the following code example:
cudaMallocAsync(&ptrA, sizeA, stream);
kernelA<<<..., stream>>>(ptrA);
cudaFreeAsync(ptrA, stream); // No synchronization necessary
cudaMallocAsync(&ptrB, sizeB, stream); // Can reuse the memory freed previously
kernelB<<<..., stream>>>(ptrB);
cudaFreeAsync(ptrB, stream);
It is now possible to manage memory at function scope, as in the following example of a library function launching kernelA.
libraryFuncA(stream);
cudaMallocAsync(&ptrB, sizeB, stream); // Can reuse the memory freed by the library call
kernelB<<<..., stream>>>(ptrB);
cudaFreeAsync(ptrB, stream);
void libraryFuncA(cudaStream_t stream) {
cudaMallocAsync(&ptrA, sizeA, stream);
kernelA<<<..., stream>>>(ptrA);
cudaFreeAsync(ptrA, stream); // No synchronization necessary
}
Stream-ordered allocation semantics
All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as the kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses on all streams of that memory on the GPU.
In effect, stream-ordered allocation behaves as if allocation and free were kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB in the appropriate stream order.
The following example shows various valid usages.
auto err = cudaMallocAsync(&ptr, size, streamA);
// If cudaMallocAsync completes successfully, ptr is guaranteed to be
// a valid pointer to memory that can be accessed in stream order
assert(err == cudaSuccess);
// Work launched in the same stream can access the memory because
// operations within a stream are serialized by definition
kernel<<<..., streamA>>>(ptr);
// Work launched in another stream can access the memory as long as
// the appropriate dependencies are added
cudaEventRecord(event, streamA);
cudaStreamWaitEvent(streamB, event, 0);
kernel<<<..., streamB>>>(ptr);
// Synchronizing the stream at a point beyond the allocation operation
// also enables any stream to access the memory
cudaEventSynchronize(event);
kernel<<<..., streamC>>>(ptr);
// Deallocation requires joining all the accessing streams. Here,
// streamD will be deallocating.
// Adding an event dependency on streamB ensures that all accesses in
// streamB will be done before the deallocation
cudaEventRecord(event, streamB);
cudaStreamWaitEvent(streamD, event, 0);
// Synchronizing streamC also ensures that all its accesses are done before
// the deallocation
cudaStreamSynchronize(streamC);
cudaFreeAsync(ptr, streamD);
Figure 1 shows the various dependencies specified in the earlier code example. As you can see, all kernels are ordered to execute after the allocation operation and complete before the deallocation operation.
Figure 1. Various ways to insert dependencies between streams to ensure correctness when accessing memory allocated using cudaMallocAsync.
Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur because of a call to cudaMallocAsync or cudaFreeAsync (for example, out of memory) are reported immediately through an error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order.
err = cudaMallocAsync(&ptr, size, stream);
if (err != cudaSuccess) {
return err;
}
// Now you’re guaranteed that ‘ptr’ is valid when the kernel executes on stream
kernel<<<..., stream>>>(ptr);
cudaFreeAsync(ptr, stream);
The CUDA driver uses memory pools to achieve the behavior of returning a pointer immediately.
Memory pools
The stream-ordered memory allocator introduces the concept of memory pools to CUDA. A memory pool is a collection of previously allocated memory that can be reused for future allocations. In CUDA, a pool is represented by a cudaMemPool_t handle. Each device has a notion of a default pool whose handle can be queried using cudaDeviceGetDefaultMemPool.
You can also explicitly create your own pools and either use them directly or set them as the current pool for a device and use them indirectly. Reasons for explicit pool creation include custom configuration, as described later in this post. When no explicitly created pool has been set as the current pool for a device, the default pool acts as the current pool.
When called without an explicit pool argument, each call to cudaMallocAsync infers the device from the specified stream and attempts to allocate memory from that device’s current pool. If the pool has insufficient memory, the CUDA driver calls into the OS to allocate more memory. Each call to cudaFreeAsync returns memory to the pool, which is then available for re-use on subsequent cudaMallocAsync requests. Pools are managed by the CUDA driver, which means that applications can enable pool sharing between multiple libraries without those libraries having to coordinate with each other.
If a memory allocation request made using cudaMallocAsync can’t be serviced due to fragmentation of the corresponding memory pool, the CUDA driver defragments the pool by remapping unused memory in the pool to a contiguous portion of the GPU’s virtual address space. Remapping existing pool memory instead of allocating new memory from the OS also helps keep the application’s memory footprint low.
By default, unused memory accumulated in the pool is returned to the OS during the next synchronization operation on an event, stream, or device, as the following code example shows.
cudaMallocAsync(ptr1, size1, stream); // Allocates new memory into the pool
kernel<<<..., stream>>>(ptr1);
cudaFreeAsync(ptr1, stream); // Frees memory back to the pool
cudaMallocAsync(ptr2, size2, stream); // Allocates existing memory from the pool
kernel<<<..., stream>>>(ptr2);
cudaFreeAsync(ptr2, stream); // Frees memory back to the pool
cudaDeviceSynchronize(); // Frees unused memory accumulated in the pool back to the OS
// Note: cudaStreamSynchronize(stream) achieves the same effect here
Retaining memory in the pool
Returning memory from the pool to the system can affect performance in some cases. Consider the following code example:
for (int i = 0; i < N; i++) { // repeat for N iterations
    cudaMallocAsync(&ptr, size, stream);
    kernel<<<..., stream>>>(ptr);
    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream);
}
By default, stream synchronization causes any pools associated with that stream’s device to release all unused memory back to the system. In this example, that would happen at the end of every iteration. As a result, there is no memory to reuse for the next cudaMallocAsync call and instead memory must be allocated through an expensive system call.
To avoid this expensive reallocation, the application can configure a release threshold to enable unused memory to persist beyond the synchronization operation. The release threshold specifies the maximum amount of memory the pool caches. It releases all excess memory back to the OS during a synchronization operation.
By default, the release threshold of a pool is zero. This means that all unused memory in the pool is released back to the OS during every synchronization operation. The following code example shows how to change the release threshold.
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, device);
uint64_t threshold = UINT64_MAX;
cudaMemPoolSetAttribute(mempool, cudaMemPoolAttrReleaseThreshold, &threshold);
for (int i = 0; i < N; i++) { // repeat for N iterations
    cudaMallocAsync(&ptr, size, stream);
    kernel<<<..., stream>>>(ptr);
    cudaFreeAsync(ptr, stream);
    cudaStreamSynchronize(stream); // Only releases memory down to "threshold" bytes
}
Using a nonzero release threshold enables reusing memory from one iteration to the next. This requires only simple bookkeeping and makes the performance of cudaMallocAsync independent of the size of the allocation, which results in dramatically improved memory allocation performance (Figure 2).
Figure 2. Cost of allocating memory using cudaMallocAsync with and without setting a release threshold (all values relative to performance of 0.4MB with threshold allocation).
The pool threshold is just a hint. Memory in the pool can also be released implicitly by the CUDA driver to enable an unrelated memory allocation request in the same process to succeed. For example, a call to cudaMalloc or cuMemCreate could cause CUDA to free unused memory from any memory pool associated with the device in the same process to serve the request.
This is especially helpful in scenarios where an application makes use of multiple libraries, some of which use cudaMallocAsync and some that do not. By automatically freeing up unused pool memory, those libraries do not have to coordinate with each other to have their respective allocation requests succeed.
There are limitations to when the CUDA driver automatically reassigns memory from a pool to unrelated allocation requests. For example, the application may be using a different interface, like Vulkan or DirectX, to access the GPU, or there may be more than one process using the GPU at the same time. Memory allocation requests in those contexts do not cause automatic freeing of unused pool memory. In such cases, the application may have to explicitly free unused memory in the pool, by invoking cudaMemPoolTrimTo.
The bytesToKeep argument tells the CUDA driver how many bytes it can retain in the pool. Any unused memory that exceeds that size is released back to the OS.
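For example (the retained size here is just an illustrative value):
// Keep at most 64 MB cached in the pool; unused memory beyond that is released to the OS
size_t bytesToKeep = 64ULL * 1024 * 1024;
cudaMemPoolTrimTo(mempool, bytesToKeep);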
Better performance through memory reuse
The stream parameter to cudaMallocAsync and cudaFreeAsync helps CUDA reuse memory efficiently and avoid expensive calls into the OS. Consider the following trivial code example.
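A minimal sketch of such an example (sizes and kernels are placeholders), consistent with the discussion that follows:
cudaMallocAsync(&ptr1, size1, stream);
kernelA<<<..., stream>>>(ptr1);
cudaFreeAsync(ptr1, stream);
cudaMallocAsync(&ptr2, size2, stream); // Can reuse the memory freed for ptr1
kernelB<<<..., stream>>>(ptr2);
cudaFreeAsync(ptr2, stream);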
In this code example, ptr2 is allocated in stream order after ptr1 is freed. The ptr2 allocation could reuse some, or all, of the memory that was used for ptr1 without any synchronization, because kernelA and kernelB are launched in the same stream. So, stream-ordering semantics guarantee that kernelB cannot begin execution and access the memory until kernelA has completed. This way, the CUDA driver can help keep the memory footprint of the application low while also improving allocation performance.
The CUDA driver can also follow dependencies between streams inserted through CUDA events, as shown in the following code example:
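A sketch, assuming event was created earlier with cudaEventCreate:
cudaMallocAsync(&ptr1, size1, streamA);
kernelA<<<..., streamA>>>(ptr1);
cudaFreeAsync(ptr1, streamA);
cudaEventRecord(event, streamA);
cudaStreamWaitEvent(streamB, event, 0);
cudaMallocAsync(&ptr2, size2, streamB); // Can reuse the memory freed in streamA
kernelB<<<..., streamB>>>(ptr2);
cudaFreeAsync(ptr2, streamB);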
Figure 4. Memory reuse across streams with an event dependency between them.
As the CUDA driver is aware of the dependency between streams A and B, it can reuse the memory used by ptr1 for ptr2. The dependency chain between streams A and B can contain any number of streams, as shown in the following code example.
cudaMallocAsync(&ptr1, size1, streamA);
kernelA<<<..., streamA>>>(ptr1);
cudaFreeAsync(ptr1, streamA);
cudaEventRecord(event, streamA);
// The chain between streamA and streamB can pass through any number of intermediate streams
for (int i = 0; i < numStreams; i++) { // numStreams and streams[] are placeholders
    cudaStreamWaitEvent(streams[i], event, 0);
    cudaEventRecord(event, streams[i]);
}
cudaStreamWaitEvent(streamB, event, 0);
cudaMallocAsync(&ptr2, size2, streamB); // Can reuse the memory freed in streamA
kernelB<<<..., streamB>>>(ptr2);
If necessary, the application can disable this feature on a per-pool basis:
int enable = 0;
cudaMemPoolSetAttribute(mempool, cudaMemPoolReuseFollowEventDependencies, &enable);
The CUDA driver can also reuse memory opportunistically in the absence of explicit dependencies specified by the application. While such heuristics may help improve performance or avoid memory allocation failures, they can add nondeterminism to the application and so can be disabled on a per-pool basis. Consider the following code example:
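A sketch of that scenario (no events or synchronization connect the two streams):
cudaMallocAsync(&ptr1, size1, streamA);
kernelA<<<..., streamA>>>(ptr1);
cudaFreeAsync(ptr1, streamA);
cudaMallocAsync(&ptr2, size2, streamB); // May reuse ptr1's memory if kernelA has finished
kernelB<<<..., streamB>>>(ptr2);
cudaFreeAsync(ptr2, streamB);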
In this scenario, there are no explicit dependencies between streamA and streamB. However, the CUDA driver is aware of how far each stream has executed. If, on the second call to cudaMallocAsync in streamB, the CUDA driver determines that kernelA has finished execution on the GPU, then it can reuse some or all of the memory used by ptr1 for ptr2.
Figure 5. Opportunistic memory reuse across streams.
If kernelA has not finished execution, the CUDA driver can add an implicit dependency between the two streams such that kernelB does not begin executing until kernelA finishes.
Figure 6. Memory reuse through internal dependencies.
The application can disable these heuristics as follows:
int enable = 0;
cudaMemPoolSetAttribute(mempool, cudaMemPoolReuseAllowOpportunistic, &enable);
cudaMemPoolSetAttribute(mempool, cudaMemPoolReuseAllowInternalDependencies, &enable);
Summary
In part 1 of this series, we introduced the new API functions cudaMallocAsync and cudaFreeAsync, which enable memory allocation and deallocation to be stream-ordered operations. Use them to avoid expensive calls to the OS through memory pools maintained by the CUDA driver.
In part 2 of this series, we share some benchmark results to show the benefits of stream-ordered memory allocation. We also provide a step-by-step recipe for modifying your existing applications to take full advantage of this advanced CUDA capability.
In part 1 of this series, we introduced new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In this post, we highlight the benefits of this new capability by sharing some big data benchmark results and provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. This all helps you improve performance within your existing applications.
GPU Big Data Benchmark
To measure the performance impact of the new stream-ordered allocator in a real application, here are results from the RAPIDS GPU Big Data Benchmark (gpu-bdb). gpu-bdb is a benchmark of 30 queries representing real-world data science and machine learning workflows at various scale factors: SF1000 is 1 TB of data and SF10000 is 10 TB. Each query is, in fact, a model workflow that can include SQL, user-defined functions, careful subsetting and aggregation, and machine learning.
Figure 1 shows the performance of cudaMallocAsync compared to cudaMalloc for a subset of gpu-bdb queries conducted at SF1000 on an NVIDIA DGX-2 across 16 V100 GPUs. As you can see, thanks to memory reuse and eliminating extraneous synchronization, there’s a 2–5x improvement in end-to-end performance when using cudaMallocAsync.
Figure 1. Speed up of cudaMallocAsync over cudaMalloc for various queries on the RAPIDS GPU Big Data Benchmark.
Interoperability with cudaMalloc and cudaFree
An application can use cudaFreeAsync to free a pointer allocated by cudaMalloc. The underlying memory is not freed until the next synchronization of the stream passed to cudaFreeAsync.
cudaMalloc(&ptr, size);
kernel<<<..., stream>>>(ptr);
cudaFreeAsync(ptr, stream);
cudaStreamSynchronize(stream); // The memory for ptr is freed at this point
Similarly, an application can use cudaFree to free memory allocated using cudaMallocAsync. However, cudaFree does not implicitly synchronize in this case, so the application must insert the appropriate synchronization to ensure that all accesses to the to-be-freed memory are complete. Any application code that may be intentionally or accidentally relying on the implicit synchronization behavior of cudaFree must be updated.
cudaMallocAsync(&ptr, size, stream);
kernel<<<..., stream>>>(ptr);
cudaStreamSynchronize(stream); // Must synchronize first
cudaFree(ptr);
Multi-GPU access
By default, memory allocated using cudaMallocAsync is accessible from the device associated with the specified stream. Accessing the memory from any other device requires enabling access to the entire pool from that other device. It also requires the two devices to be peer capable, as reported by cudaDeviceCanAccessPeer. Unlike cudaMalloc allocations, cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess have no effect on memory allocated from memory pools.
For example, consider enabling device 4 access to the memory pool of device 3:
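A sketch of what that might look like (device ordinals follow the example above):
// Allow device 4 to access memory allocated from device 3's default pool
cudaMemPool_t mempool;
cudaDeviceGetDefaultMemPool(&mempool, 3);

cudaMemAccessDesc accessDesc = {};
accessDesc.location.type = cudaMemLocationTypeDevice;
accessDesc.location.id = 4;
accessDesc.flags = cudaMemAccessFlagsProtReadWrite;
cudaMemPoolSetAccess(mempool, &accessDesc, 1);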
Access from a device other than the device on which the memory pool resides can be revoked by using cudaMemAccessFlagsProtNone when calling cudaMemPoolSetAccess. Access from the memory pool’s own device cannot be revoked.
Interprocess communication support
Memory allocated using the default memory pool associated with a device cannot be shared with other processes. An application must explicitly create its own memory pools to share memory allocated using cudaMallocAsync with other processes. The following code sample shows how to create an explicit memory pool with interprocess communication (IPC) capabilities:
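A sketch of such a pool (deviceId and exportPool are placeholders):
cudaMemPool_t exportPool;
cudaMemPoolProps poolProps = {};
poolProps.allocType = cudaMemAllocationTypePinned;
poolProps.location.type = cudaMemLocationTypeDevice;
poolProps.location.id = deviceId;
poolProps.handleTypes = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolCreate(&exportPool, &poolProps);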
The location type Device and location ID deviceId indicate that the pool memory must be allocated on a specific GPU. The allocation type Pinned indicates that the memory should be non-migratable, also known as non-pageable. The handle type PosixFileDescriptor indicates that the user intends to query a file descriptor for the pool to share it with another process.
The first step to share memory from this pool through IPC is to query the file descriptor that represents the pool:
int fd;
cudaMemAllocationHandleType handleType = cudaMemHandleTypePosixFileDescriptor;
cudaMemPoolExportToShareableHandle(&fd, exportPool, handleType, 0);
The application can then share the file descriptor with another process, for example through a UNIX domain socket. The other process can then import the file descriptor and obtain a process-local pool handle:
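A sketch of the import step (fd is the descriptor received over the socket):
cudaMemPool_t importedPool;
cudaMemPoolImportFromShareableHandle(&importedPool, (void *)(uintptr_t)fd,
                                     cudaMemHandleTypePosixFileDescriptor, 0);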
The exporting process can then allocate memory from this pool with cudaMallocFromPoolAsync(&ptr, size, exportPool, stream). There is also an overloaded version of cudaMallocAsync that takes the same arguments as cudaMallocFromPoolAsync:
cudaMallocAsync(&ptr, size, exportPool, stream);
After memory is allocated from this pool through either of these two APIs, the pointer can then be shared with the importing process. First, the exporting process gets an opaque handle representing the memory allocation:
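A sketch, assuming ptr was allocated from exportPool as shown earlier:
cudaMemPoolPtrExportData exportData;
cudaMemPoolExportPointer(&exportData, ptr);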
This opaque data can then be shared with the importing process through any standard IPC mechanism, such as shared memory or pipes. The importing process then converts the opaque data into a process-local pointer:
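A matching sketch on the importing side (exportData received through the chosen IPC mechanism):
void *importedPtr;
cudaMemPoolImportPointer(&importedPtr, importedPool, &exportData);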
Now both processes share access to the same memory allocation. The memory must be freed in the importing process before it is freed in the exporting process. This is to ensure that the memory does not get reutilized for another cudaMallocAsync request in the exporting process while the importing process is still accessing the previously shared memory allocation, potentially causing undefined behavior.
The existing function cudaIpcGetMemHandle works only with memory allocated through cudaMalloc and cannot be used on any memory allocated through cudaMallocAsync, regardless of whether the memory was allocated from an explicit pool.
Changing a device pool
If the application expects to use an explicit memory pool most of the time, it can consider setting that as the current pool for the device through cudaDeviceSetMemPool. This enables the application to avoid having to specify the pool argument each time that it must allocate memory from that pool.
cudaDeviceSetMemPool(device, pool);
cudaMallocAsync(&ptr, size, stream); // This now allocates from the earlier pool set instead of the device’s default pool.
This has the advantage that any other function allocating with cudaMallocAsync now automatically uses the new pool as its default. The current pool associated with a device can be queried using cudaDeviceGetMemPool.
Library composability
In general, libraries should not change a device’s pool, as doing so affects the entire top-level application. If a library must allocate memory with different properties than those of the default device pool, it may create its own pool and then allocate from that pool using cudaMallocFromPoolAsync. The library could also use the overloaded version of cudaMallocAsync that takes the pool as an argument.
To make interoperability easier for applications, libraries should consider providing APIs for the top-level application to coordinate the pools used. For example, libraries could provide set or get APIs to enable the application to control the pool in a more explicit manner. The library could also take the pool as a parameter to individual APIs.
Code migration guide
When porting an existing application that uses cudaMalloc or cudaFree to the new cudaMallocAsync or cudaFreeAsync APIs, consider the following guidelines.
Guidelines for determining the appropriate pool:
The initial default pool is suitable for many applications.
Today, an explicitly constructed pool is only required to share pool memory across processes with CUDA IPC. This may change with future features.
For convenience, consider making the explicitly created pool the device’s current pool to ensure that all cudaMallocAsync calls within the process use that pool. This must be done by the top-level application and not by libraries, so as to avoid conflicting with the goals of the top-level application.
Guidelines for setting the release threshold for all memory pools:
The choice of release threshold depends on whether and how a device is shared:
Exclusive to a single process: Use the maximum release threshold.
Shared among cooperating processes: Coordinate to use the same pool through IPC or set each process pool to an appropriate value to avoid any one process monopolizing all device memory.
Shared among unknown processes: If known, set the threshold to the working set size of the application. Otherwise, leave it at zero and use a profiler to determine whether allocation performance is a bottleneck before using a nonzero value.
Guidelines for replacing cudaMalloc with cudaMallocAsync:
Ensure that all memory accesses are ordered after the stream-ordered allocation.
If peer access is required, use cudaMemPoolSetAccess, as cudaDeviceEnablePeerAccess and cudaDeviceDisablePeerAccess have no effect on pool memory.
Unlike cudaMalloc allocations, cudaDeviceReset does not implicitly free pool memory, so it must be explicitly freed.
If freeing with cudaFree, ensure that all accesses are complete through appropriate synchronization before freeing, as there is no implicit synchronization in this case. Any subsequent code that relied on the implicit synchronization may also have to be updated.
If memory is shared with another process through IPC, allocate from an explicitly created pool with IPC support and remove all references to cudaIpcGetMemHandle, cudaIpcOpenMemHandle, and cudaIpcCloseMemHandle for that pointer.
If the memory must be used with GPUDirect RDMA, continue to use cudaMalloc for now because memory allocated through cudaMallocAsync currently does not support it. CUDA aims to support this in the future.
Unlike memory allocated with cudaMalloc, memory allocated with cudaMallocAsync is not associated with a CUDA context. This has the following implications:
Calling cuPointerGetAttribute with the attribute CU_POINTER_ATTRIBUTE_CONTEXT returns null for the context.
When calling cudaMemcpy with at least one of source or destination pointers allocated using cudaMallocAsync, that memory must be accessible from the calling thread’s current context/device. If it’s not accessible from that context or device, use cudaMemcpyPeer instead.
Guidelines for replacing cudaFree with cudaFreeAsync:
Ensure that all memory accesses are ordered before the stream-ordered deallocation.
The memory may not be freed back to the system until the next synchronization operation. If the release threshold is set to a nonzero value, the memory may not be freed back to the system until the corresponding pool is explicitly trimmed.
Unlike cudaFree, cudaFreeAsync does not implicitly synchronize the device. Any code relying on this implicit synchronization must be updated to synchronize explicitly.
Conclusion
The stream-ordered allocator and the cudaMallocAsync and cudaFreeAsync API functions added in CUDA 11.2 extend the CUDA stream programming model by introducing memory allocation and deallocation as stream-ordered operations. This enables allocations to be scoped to the kernels that use them, while avoiding the costly device-wide synchronization that can occur with traditional cudaMalloc/cudaFree.
Furthermore, these API functions add the concept of memory pools to CUDA, enabling the reuse of memory to avoid costly system calls and improve performance. Use the guidelines to migrate your existing code and see how much your application performance improves!
I have a self-supervised problem where I have a sequence of web page visits, each of which is denoted by some hash, and I can look up this hash to find a dense vector that represents the page.
I would like to train my model on various self-supervised tasks, such as predicting a target page vector given some context, or predicting the next vector in the sequence.
My question here is what is the best approach for implementation of such a problem? The naive approach would be to prep the data in a separate script.
I feel there must be a better approach, possibly using `tf.data.map` within the pipeline?
Would love to hear some best practices for self-supervision, and the creation of datasets using TF and Keras.
I'm trying to use the MNIST database of handwritten numbers to make a neural network that predicts a number. The issue is that the validation accuracy is very high, around 98%, but when I try to apply the model to numbers outside of the database, the accuracy is very low. I got an accuracy of 50% with a dense network and 60% with a CNN. I'm not sure what the issue is or how I can make the network work more consistently with new data that I give it. Any advice would be appreciated. If needed, I can send the code and test images.
The NVIDIA, Facebook, and TensorFlow recommender teams will be hosting a summit with live Q&A to dive into best practices and insights on how to develop and optimize deep learning recommender systems.
Develop and Optimize Deep Learning Recommender Systems Thursday, July 29 at 10 a.m. PT
By joining this Deep Learning Recommender Summit, you will hear from fellow ML engineers and data scientists from NVIDIA, Facebook, and TensorFlow on best practices, learnings, and insights for building and optimizing highly effective DL recommender systems.
Sessions include:
High-Performance Recommendation Model Training at Facebook
In this talk, we will first analyze how model architecture affects GPU performance and efficiency, and then present the performance optimization techniques we applied to improve GPU utilization, including an optimized PyTorch-based training stack supporting both model and data parallelism, high-performance GPU operators, efficient embedding table sharding, memory hierarchy, and pipelining.
RecSys2021 Challenge: Predicting User Engagements with Deep Learning Recommender Systems
The NVIDIA team, a collaboration of Kaggle Grandmasters and the NVIDIA Merlin team, won the RecSys2021 challenge. It was hosted by Twitter, which provided almost 1 billion tweet-user pairs as a dataset. The team will present their winning solution with a focus on deep learning architectures and how to optimize them.
Revisiting Recommender Systems on GPU
A new era of faster ETL, training, and inference is coming to the RecSys space. This talk will walk through some of the patterns of optimization that guide the tools we are building to make recommenders faster and easier to use on the GPU.
TensorFlow Recommenders
TensorFlow Recommenders is an end-to-end library for recommender system models: from retrieval, through ranking, to post-ranking. In this talk, we describe how TensorFlow Recommenders can be used to fit and safely deploy sophisticated recommender systems at scale.