Categories
Misc

Ferrari of Finance: Accelerated Computing Drives Milan Bank Forward

Banks require more than cash in the vault these days; they also need accelerated computing in the back room. “The boost we’re getting with GPUs not only significantly improved our performance at the same cost, it helped us redefine our business and sharpen our focus on customers,” said Marco Airoldi, who’s been head of financial…


Categories
Misc

Enabling Dynamic Control Flow in CUDA Graphs with Device Graph Launch

CUDA Graphs significantly reduce the overhead of launching a large batch of user operations by defining them as a task graph, which may be launched in a single operation. Knowing the workflow upfront enables the CUDA driver to apply various optimizations, which cannot be performed when launching through a stream model. 

However, this performance comes at the cost of flexibility: if the full workflow is not known in advance, then GPU execution must be interrupted to return to the CPU to make a decision.

CUDA device graph launch solves this problem by enabling a task graph to be performantly launched from a running GPU kernel, based on data that is determined at run time. CUDA device graph launch offers two distinct launch modes (fire and forget, and tail launch) to enable a wide range of applications and uses.

This post demonstrates how to use device graph launch and the two launch modes. It features the example of a device-side work scheduler, which decompresses files for data processing.

Device graph initialization

Executing a task graph involves the four-step process outlined below: 

  1. Create the graph
  2. Instantiate the graph into an executable graph
  3. Upload the executable graph’s work descriptors to the GPU
  4. Launch the executable graph

By separating the launch step from the other steps, CUDA is able to optimize the workflow and keep graph launch as lightweight as possible. As a convenience, CUDA will also combine the upload step with the launch step the first time a graph is launched if the upload step has not been called explicitly.

In order to launch a graph from a CUDA kernel, the graph first must have been initialized for device launch during the instantiation step. Additionally, before it can be launched from the device, the device graph must have been uploaded to the device, either explicitly through a manual upload step or implicitly through a host launch. The code below, which performs the host-side steps to set up the device scheduler example, shows both options:

// This is the signature of our scheduler kernel
// The internals of this kernel will be outlined later
__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph);

void setupAndLaunchScheduler() {
    cudaGraph_t zipGraph, lzwGraph, deflateGraph, schedulerGraph;
    cudaGraphExec_t zipExec, lzwExec, deflateExec, schedulerExec;

    // Create the source graphs for each possible operation we want to perform
    // We pass the currentFileData ptr to this setup, as this ptr is how the scheduler will
    // indicate which file to decompress
    create_zip_graph(&zipGraph, currentFileData);
    create_lzw_graph(&lzwGraph, currentFileData);
    create_deflate_graph(&deflateGraph, currentFileData);

    // Instantiate the graphs for these operations and explicitly upload
    cudaGraphInstantiate(&zipExec, zipGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(zipExec, stream);
    cudaGraphInstantiate(&lzwExec, lzwGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(lzwExec, stream);
    cudaGraphInstantiate(&deflateExec, deflateGraph, cudaGraphInstantiateFlagDeviceLaunch);
    cudaGraphUpload(deflateExec, stream);

    // Create and instantiate the scheduler graph
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    schedulerKernel<<<1, 1, 0, stream>>>(files, numFiles, currentFile, currentFileData, zipExec, lzwExec, deflateExec);
    cudaStreamEndCapture(stream, &schedulerGraph);
    cudaGraphInstantiate(&schedulerExec, schedulerGraph, cudaGraphInstantiateFlagDeviceLaunch);

    // Launch the scheduler graph - this will perform an implicit upload
    cudaGraphLaunch(schedulerExec, stream);
}

It is important to note here that device graphs can be launched either from the host or from the device. Therefore, the same cudaGraphExec_t handles may be passed to the scheduler for launch on the device as for launch on the host.

Fire and forget launch

A scheduler kernel dispatches work based on incoming data. For work dispatch, fire and forget launch is the preferred launch method.

When a graph is launched using fire and forget launch, it is dispatched immediately. It executes independently of both the launching graph and subsequent graphs launched using fire and forget mode. Because the work executes immediately, fire and forget launch is preferable for work dispatched by a scheduler, as it starts running as quickly as possible. CUDA introduces a new device-side named stream, cudaStreamGraphFireAndForget, to perform a fire and forget launch of a graph. See below for an example of a simple dispatcher.

enum compressionType {
    zip     = 1,
    lzw     = 2,
    deflate = 3
};

struct fileData {
    compressionType comprType;
    void *data; 
};

__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph)
{
    // Set the data ptr to the current file so the dispatched graph
    // is operating on the correct file data
    *currentFileData = files[*currentFile].data;

    switch (files[*currentFile].comprType) {
        case zip:
            cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
            break;
        case lzw:
            cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
            break;
        case deflate:
            cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
            break;
        default:
            break;
    }
}

It is also important to note that graph launches can be nested and recursive, so additional device graphs can be dispatched from fire and forget launches. Although not shown in this example, the graphs that are decompressing the file data could dispatch more graphs to do further processing on that data once it is fully decompressed (image processing, for example). Device graph flow is hierarchical, just like graphs themselves.
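
To make this concrete, the hypothetical kernel below could sit at the end of one of the decompression graphs and use the same fire and forget mechanism to dispatch a follow-on processing graph. The kernel name and the imageProcGraph handle are illustrative only; they are not part of the scheduler example above.

// Hypothetical kernel placed at the end of a decompression graph (launched with
// a single thread, like the scheduler). Once the file data is fully decompressed,
// it dispatches a further processing graph, illustrating nested device launches.
__global__ void dispatchPostProcessing(void **currentFileData, cudaGraphExec_t imageProcGraph)
{
    if (*currentFileData != nullptr) {
        cudaGraphLaunch(imageProcGraph, cudaStreamGraphFireAndForget);
    }
}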

Tail launch

CUDA work is launched asynchronously to the GPU, which means the launching thread must explicitly wait for the work to complete before consuming any result or output. This is typically done from a CPU thread using a synchronization operation such as cudaDeviceSynchronize or cudaStreamSynchronize.

It is not possible for a launching thread on the GPU to synchronize on device graph launches through traditional methods such as cudaDeviceSynchronize. Instead, when operation ordering is desired, tail launch should be used.

When a graph is submitted for tail launch, it does not execute immediately, but rather upon completion of the launching graph. CUDA encapsulates all dynamically generated work as part of the parent graph, so a tail launch will also wait for all generated fire and forget work before executing.

This is true whether the tail launch was issued before or after any fire and forget launches. Tail launches themselves execute in the order in which they are enqueued. A special case is self-relaunch, where the currently running device graph is enqueued to relaunch through tail launch. Only one pending self-relaunch is permitted at a time.

Using tail launch, you can upgrade the previous dispatcher to become a full scheduler kernel by having it relaunch itself repeatedly, effectively creating a loop in the execution flow:

__global__ void schedulerKernel(
    fileData *files,
    int numFiles,
    int *currentFile,
    void **currentFileData,
    cudaGraphExec_t zipGraph,
    cudaGraphExec_t lzwGraph,
    cudaGraphExec_t deflateGraph)
{
    // Set the data ptr to the current file so the dispatched graph
    // is operating on the correct file data
    *currentFileData = files[*currentFile].data;

    switch (files[*currentFile].comprType) {
        case zip:
            cudaGraphLaunch(zipGraph, cudaStreamGraphFireAndForget);
            break;
        case lzw:
            cudaGraphLaunch(lzwGraph, cudaStreamGraphFireAndForget);
            break;
        case deflate:
            cudaGraphLaunch(deflateGraph, cudaStreamGraphFireAndForget);
            break;
        default:
            break;
    }

    // If we have not finished iterating over all the files, relaunch
    if (*currentFile < numFiles - 1) {
        // Advance to the next file and relaunch this graph through tail launch
        (*currentFile)++;
        cudaGraphLaunch(cudaGetCurrentGraphExec(), cudaStreamGraphTailLaunch);
    }
}

Notice how the relaunch operation uses cudaGetCurrentGraphExec to retrieve a handle to the currently executing graph. It can relaunch itself without needing a handle to its own executable graph.

Use of tail launch for the self-relaunch has the added effect of synchronizing on (waiting for) the dispatched fire and forget work before the next scheduler kernel relaunch begins. A device graph can only have one pending launch at a time (plus one self-relaunch). In order to relaunch the graph that was just dispatched, you need to make sure that the previous launch completed first. Performing a self-relaunch accomplishes this goal, so that you can dispatch whatever graph is needed for the next iteration.

Device compared to host launch performance

How would this example fare against a host-launched graph? Figure 1 compares fire and forget launch, tail launch, and host launch latencies for various topologies.

A chart comparing the two device launch modes and host launch for three topologies. The topologies are a straight-line graph, a graph which forks and joins repeatedly, and a graph which forks once into parallel straight-line sections.
Figure 1. A comparison of device and host launch latencies for various topologies

This chart shows that device-side launch latency is not only more than 2x lower than host launch latency, but also unaffected by graph structure; the latency is identical for each of the given topologies.

Device launch also scales much better to the width of the graph, as shown in Figure 2.

A line-graph comparing fire and forget, tail, and host launch latencies for graphs containing 1, 2, 4, and 8 parallel straight-line sections.
Figure 2. A comparison of device and host launch latencies for graphs containing variable amounts of parallel straight-line segments

By comparison with host launch, device launch latency stays almost constant regardless of how much parallelism is in the graph.

Conclusion

CUDA device graph launch offers a performant way to enable dynamic control flow within CUDA kernels. While the example presented in this post provides a means of getting started with the feature, it is but a small representation of the ways this feature can be used.

For more documentation, see the device graph launch section of the programming guide. To try device graph launch, download CUDA Toolkit 12.0.

Categories
Misc

Explainer: What Is Computer Vision?

Computer vision is achieved with convolutional neural networks that can use images and video to perform segmentation, classification and detection for many applications.

Categories
Misc

CUDA Toolkit 12.0 Released for General Availability

NVIDIA announces the newest CUDA Toolkit software release, 12.0. This release is the first major release in many years and it focuses on new programming models and CUDA application acceleration through new hardware capabilities.

You can now target architecture-specific features and instructions in the NVIDIA Hopper and NVIDIA Ada Lovelace architectures with CUDA custom code, enhanced libraries, and developer tools.

CUDA 12.0 includes many changes, both major and minor. Not all changes are listed here, but this post offers an overview of the key capabilities.

Overview

  • Support for new NVIDIA Hopper and NVIDIA Ada Lovelace architecture features with additional programming model enhancements for all GPUs, including new PTX instructions and exposure through higher-level C and C++ APIs
  • Support for revamped CUDA dynamic parallelism APIs, offering substantial performance improvements compared to the legacy APIs
  • Enhancements to the CUDA Graphs API:
    • You can now schedule graph launches from GPU device-side kernels by calling built-in functions. With this ability, user code in kernels can dynamically schedule graph launches, greatly increasing the flexibility of CUDA Graphs.
    • The cudaGraphInstantiate API has been refactored to remove unused parameters (a brief sketch of this change follows this list).
  • Support for the GCC 12 host compiler
  • Support for C++20
  • New nvJitLink library in the CUDA Toolkit for JIT LTO
  • Library optimizations and performance improvements
  • Updates to Nsight Compute and Nsight Systems Developer Tools
  • Updated support for the latest Linux versions
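
As a rough illustration of the cudaGraphInstantiate change called out in the list above, the sketch below contrasts the CUDA 11.x signature with the simplified 12.0 form. It assumes graph and stream were created elsewhere and omits error checking; treat it as a sketch rather than a complete program.

// Instantiating an executable graph before and after the CUDA 12.0 refactor.
cudaGraphExec_t graphExec;

#if CUDART_VERSION < 12000
// CUDA 11.x: error-node and log-buffer parameters, which were rarely used
cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);
#else
// CUDA 12.0: unused parameters removed; the last argument is now a flags field
cudaGraphInstantiate(&graphExec, graph, 0);
#endif

cudaGraphLaunch(graphExec, stream);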

For more information, see CUDA Toolkit 12.0 Release Notes. CUDA Toolkit 12.0 is available to download.

NVIDIA Hopper and NVIDIA Ada Lovelace architecture support

CUDA applications can immediately benefit from increased streaming multiprocessor (SM) counts, higher memory bandwidth, and higher clock rates in new GPU families. CUDA and the CUDA libraries expose new performance optimizations based on GPU hardware architecture enhancements.

CUDA 12.0 exposes programmable functionality for many features of the NVIDIA Hopper and NVIDIA Ada Lovelace architectures:

  • Many tensor operations are now available through public PTX:
    • TMA operations
    • TMA bulk operations
    • 32x Ultra xMMA (including FP8 and FP16)
  • Launch parameters control membar domains in NVIDIA Hopper GPUs
  • Support for the smem sync unit PTX and C++ API
  • Support for C intrinsics for cooperative grid array (CGA) relaxed barriers
  • Support for programmatic L2 Cache to SM multicast (NVIDIA Hopper GPUs only)
  • Support for public PTX for SIMT collectives: elect_one
  • Genomics and DPX instructions are now available for NVIDIA Hopper GPUs to provide faster combined-math arithmetic operations (three-way max, fused add+max, and so on).

Lazy loading

Lazy loading is a technique for delaying the loading of both kernels and CPU-side modules until they are required by the application; by default, all modules are loaded preemptively the first time a library is initialized. Lazy loading can result in significant savings, not only in device and host memory, but also in the end-to-end execution time of your algorithms.

Lazy loading has been part of CUDA since the 11.7 release. Subsequent CUDA releases have continued to augment and extend it. From the application development perspective, nothing specific is required to opt into lazy loading. Your existing applications work with lazy loading as-is.

If you have operations that are particularly latency-sensitive, you may want to profile your applications. The tradeoff with lazy loading is a small amount of latency at the point in the application where each function is first loaded, but the total latency remains lower overall than without lazy loading.

Metric                         Baseline   CUDA 11.7   CUDA 11.8+   Improvement
End-to-end runtime [s]         2.9        1.7         0.7          4x
Binary load time [s]           1.6        0.8         0.01         118x
Device memory footprint [MB]   1245       435         435          3x
Host memory footprint [MB]     1866       1229        60           31x
Table 1. Example application speedup with lazy loading

All libraries used with lazy loading must be built with 11.7+ to be eligible.

Lazy loading is not enabled in the CUDA stack by default in this release. To evaluate it for your application, run with the environment variable CUDA_MODULE_LOADING=LAZY set.
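
To confirm which mode a process is actually running in, you can query the driver. The following minimal sketch assumes the driver API cuModuleGetLoadingMode that accompanied the lazy loading feature; launch the resulting binary with CUDA_MODULE_LOADING=LAZY set in the environment to exercise the lazy path.

// Minimal sketch: report whether lazy loading is active for this process.
#include <cuda.h>
#include <stdio.h>

int main() {
    cuInit(0);

    CUmoduleLoadingMode mode;
    if (cuModuleGetLoadingMode(&mode) == CUDA_SUCCESS) {
        printf("Module loading mode: %s\n",
               mode == CU_MODULE_LAZY_LOADING ? "LAZY" : "EAGER");
    }
    return 0;
}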

Compatibility

CUDA minor version compatibility is a feature introduced in 11.x that gives you the flexibility to dynamically link your application against any minor version of the CUDA Toolkit within the same major release. Compile your code one time, and you can dynamically link against libraries, the CUDA runtime, and the user-mode driver from any minor version within the same major version of CUDA Toolkit.

For example, 11.6 applications can link against the 11.8 runtime and the reverse. This is accomplished through API or ABI consistency within the library files. For more information, see CUDA Compatibility.

Minor version compatibility continues into CUDA 12.x. However, because 12.0 is a new major release, the compatibility guarantees are reset. Applications that used minor version compatibility in 11.x may have issues when linking against 12.0. Either recompile your application against 12.0 or statically link to the needed libraries within 11.x to ensure continuity of your development.

Likewise, applications recompiled or built in 12.0 will link to future versions of 12.x but are not guaranteed to work in CUDA Toolkit 11.x.

JIT LTO support

The CUDA 12.0 Toolkit introduces a new nvJitLink library for JIT LTO support. NVIDIA is deprecating support for the driver version of this feature. For more information, see Deprecated Features.

C++20 compiler support

CUDA Toolkit 12.0 adds support for the C++20 standard. C++20 is enabled for the following host compilers and their minimal versions:

  • GCC 10
  • Clang 11
  • MSVC 2022
  • NVC++ 22.x
  • Arm C/C++ 22.x

For more information about features, see the corresponding host compiler documentation.

While the majority of C++20 features are available in both host and device code, some are restricted.

Module support

Modules are introduced in C++20 as a new way to import and export entities across translation units.

Because modules require complex interaction between the CUDA device compiler and the host compiler, they are not supported in CUDA C++, in either host or device code. Uses of the module, export, and import keywords are diagnosed as errors.

Coroutine support

Coroutines are resumable functions. Execution can be suspended, in which case control is returned to the caller. Subsequent invocations of the coroutine resume at the point where it was suspended.

Coroutines are supported in host code but are not supported in device code. Uses of the co_await, co_yield, and co_return keywords in the scope of a device function are diagnosed as errors during device compilation.

Three-way comparison operator

The three-way comparison operator is a new kind of relational operator that enables the compiler to synthesize the other relational operators.

Because it is tightly coupled with utility functions from the Standard Template Library, its use is restricted in device code whenever a host function is implicitly called.

Uses where the operator is called directly and no implicit calls are required are supported in device code.

Nsight Developer Tools

Nsight Developer Tools are receiving updates coinciding with CUDA Toolkit 12.0.

NVIDIA Nsight Systems 2022.5 introduces a preview of InfiniBand switch metrics sampling. NVIDIA Quantum InfiniBand switches offer high-bandwidth, low-latency communication. Viewing switch metrics on the Nsight Systems timeline enables you to better understand your application’s network usage. You can use this information to optimize the application’s performance.

Screenshot of InfiniBand Switch network usage visualized in the Nsight Systems timeline. Throughput can be expanded to view bytes received and bytes sent.
Figure 1. InfiniBand Switch metrics in Nsight Systems

Nsight tools are built to be used collaboratively. Performance analysis in Nsight Systems often informs a deeper dive into kernel activity in Nsight Compute.

To streamline this process, Nsight Compute 2022.4 introduces Nsight Systems integration. This feature enables you to launch system trace activity and view the report in the Nsight Compute interface. You can then inspect the report and initiate kernel profiling from within the context menu.

With this workflow, you don’t have to run two different applications: it can all be done within one.

Screenshot of a system trace being viewed in Nsight Compute. Right-clicking on the selected kernel opens the context menu, and kernel profiling can be launched.
Figure 2. Nsight Systems integration in Nsight Compute

Nsight Compute 2022.4 also introduces a new inline function table that provides performance metrics split out for multiple inlined instances of a function. This heavily requested feature enables you to understand whether a function is suffering from performance issues in general or only in specific inlined cases.

It also enables you to understand where inlining is occurring, which can often lead to confusion when this level of detail is not available. The main source view continues to show the aggregation of metrics at a per-line level while the table lists the multiple locations where the function was inlined and the performance metrics for each location.

Screenshot of the source code viewer showing per-line metrics alongside a breakdown of where the function was inlined and its performance in those cases.
Figure 3. Inline function table in Nsight Compute

The Acceleration Structure viewer has also received a variety of optimizations and improvements, including support for NVIDIA OptiX curve profiling.

For more information, see NVIDIA Nsight Compute, NVIDIA Nsight Systems, and Nsight Visual Studio Code Edition.

Math library updates

All optimizations and features added to a library come at a cost, usually in the form of binary size, and the binary size of each library has slowly increased over its lifespan. NVIDIA has made significant efforts to shrink these binaries without sacrificing performance. cuFFT saw the largest reduction, with its binary size shrinking by over 50% between CUDA Toolkit 11.8 and 12.0.

There are also a few library-specific features worth calling out.

cuBLAS

cuBLASLt exposes mixed-precision multiplication operations with the new FP8 data types. These operations also support BF16 and FP16 bias fusions, as well as FP16 bias with GELU activation fusions for GEMMs with FP8 input and output data types.

Regarding performance, FP8 GEMMs can be up to 3x and 4.5x faster on H100 PCIe and SXM, respectively, compared to BF16 on A100. The CUDA Math API provides FP8 conversions to facilitate the use of the new FP8 matrix multiplication operations.

cuBLAS 12.0 extends the API to support 64-bit integer problem sizes, leading dimensions, and vector increments. These new functions have the same API as their 32-bit integer counterparts except that they have the _64 suffix in the name and declare the corresponding parameters as int64_t.

cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result);

The 64-bit integer counterpart is as follows:

cublasStatus_t cublasIsamax_64(cublasHandle_t handle, int64_t n, const float *x, int64_t incx, int64_t *result);

Performance remains the focus for cuBLAS: when the arguments passed to the 64-bit integer API fit into the 32-bit range, the library uses the same kernels as the 32-bit integer API. Thanks to the C/C++ autoconversion from int32_t values to int64_t, trying the new API should be as simple as adding the _64 suffix to cuBLAS function calls.
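
As a quick illustration of the migration, the minimal sketch below calls cublasIsamax_64 with a 64-bit problem size, increment, and result; aside from the _64 suffix and the int64_t types, it is identical to the 32-bit call. Error checking is omitted for brevity.

// Minimal sketch of the 64-bit integer cuBLAS API using the cublasIsamax_64
// signature shown above.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    const int64_t n = 1 << 20;
    std::vector<float> hostX(n);
    for (int64_t i = 0; i < n; ++i) hostX[i] = static_cast<float>(i % 1000);

    float *devX;
    cudaMalloc(&devX, n * sizeof(float));
    cudaMemcpy(devX, hostX.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Same call pattern as cublasIsamax, but with 64-bit arguments
    int64_t resultIdx = 0;
    cublasIsamax_64(handle, n, devX, 1, &resultIdx);
    printf("1-based index of max |x|: %lld\n", (long long)resultIdx);

    cublasDestroy(handle);
    cudaFree(devX);
    return 0;
}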

cuFFT

During plan initialization, cuFFT conducts a series of steps, including heuristics, to determine which kernels are used as well as kernel module loads.

Starting with CUDA 12.0, cuFFT delivers a larger portion of kernels using the CUDA Parallel Thread eXecution (PTX) assembly form, instead of the binary form.

The PTX code of cuFFT kernels is loaded and compiled further to binary code by the CUDA device driver at runtime when a cuFFT plan is initialized. The first improvement enabled by the new implementation is many new accelerated kernels for the NVIDIA Maxwell, NVIDIA Pascal, NVIDIA Volta, and NVIDIA Turing architectures.
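
In practice, this means the JIT compilation cost is paid once, when the plan is created, rather than on every transform. The sketch below is a generic cuFFT example (not specific to CUDA 12.0) that marks where plan initialization, and therefore kernel selection and loading, takes place; the transform size is an arbitrary placeholder.

// Minimal cuFFT sketch: kernels backing this plan are selected and loaded
// (and, starting with CUDA 12.0, JIT-compiled from PTX) during plan creation.
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int nx = 1 << 16;

    cufftComplex *data;
    cudaMalloc(&data, nx * sizeof(cufftComplex));

    cufftHandle plan;
    cufftPlan1d(&plan, nx, CUFFT_C2C, 1);           // plan initialization happens here

    cufftExecC2C(plan, data, data, CUFFT_FORWARD);  // execution reuses the prepared kernels
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}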

cuSPARSE

To reduce the amount of required workspace for sparse-sparse matrix multiplication (SpGEMM), NVIDIA is releasing two new algorithms with lower memory usage. The first algorithm computes a strict bound on the number of intermediate products, while the second one enables partitioning the computation in chunks. These new algorithms are beneficial for customers on devices with smaller memory storage.

INT8 support has been added to cusparseGather, cusparseScatter, and cusparseCsr2cscEx2.

Finally, for SpSV and SpSM, the preprocessing time is improved by an average factor of 2.5x. For the execution phase, SpSV is improved by an average factor of 1.1x, while SpSM is improved by an average factor of 3.0x.

Math API

The new NVIDIA Hopper architecture comes with new Genomics and DPX instructions for faster means of computing combined arithmetic operations like three-way max, fused add+max, and so on.

New DPX instructions accelerate dynamic programming algorithms by up to 7x over the A100 GPU. Dynamic programming is an algorithmic technique for solving a complex recursive problem by breaking it down into simpler sub-problems. For a better user experience, these instructions are now exposed through the Math API.

An example would be a three-way max + ReLU operation, max(max(max(a, b), c), 0).

int __vimax3_s32_relu ( const int  a, const int  b, const int  c )
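
As a brief illustration, the kernel below applies this intrinsic elementwise to three input arrays. The kernel name, array names, and launch configuration are illustrative only.

// Illustrative kernel: elementwise three-way max + ReLU via the Math API intrinsic.
// On NVIDIA Hopper GPUs, this maps to the DPX instructions described above.
__global__ void vimax3Relu(const int *a, const int *b, const int *c, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __vimax3_s32_relu(a[i], b[i], c[i]);  // max(max(max(a, b), c), 0)
    }
}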

For more information, see Boosting Dynamic Programming Performance Using NVIDIA Hopper GPU DPX Instructions.

Image processing updates: nvJPEG

nvJPEG now has an improved implementation that significantly reduces the GPU memory footprint. This is accomplished by using zero-copy memory operations, fusing kernels, and in-place color space conversion.

Summary

We continue to focus on helping researchers, scientists, and developers solve the world’s most complicated AI/ML and data sciences challenges through simplified programming models. 

This CUDA 12.0 release is the first major release in many years and is foundational to help accelerate applications through the use of next-generation NVIDIA GPUs. New architecture-specific features and instructions in the NVIDIA Hopper and NVIDIA Ada Lovelace architectures are now targetable with CUDA custom code, enhanced libraries, and developer tools.

With the CUDA Toolkit, you can develop, optimize, and deploy your applications on GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms, and HPC supercomputers. The toolkit includes GPU-accelerated libraries, debugging and optimization tools, a C/C++ compiler, a runtime library, and access to many advanced C/C++ and Python libraries.

For more information, see the following resources:

Categories
Misc

CUDA Context-Independent Module Loading

Most CUDA developers are familiar with the cuModuleLoad API and its counterparts for loading a module containing device code into a CUDA context. In most cases, you want to load identical device code on all devices. This requires loading device code into each CUDA context explicitly. Moreover, libraries and frameworks that do not control context creation and destruction must keep track of them to explicitly load and unload modules. 

This post discusses context-independent loading introduced in CUDA 12.0, which solves these problems.

Context-dependent loading

Traditionally, module loading has always been associated with a CUDA context. The following code example shows the traditional way of loading identical device code into two devices and then launching kernels on them.

// Device 0
cuDeviceGet(&device0, 0);
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuModuleLoad(&module0, "myModule.cubin");
// Device 1
cuDeviceGet(&device1, 1);
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuModuleLoad(&module1, "myModule.cubin");

Launching a kernel on each of the devices requires you to retrieve a per-module CUfunction as shown in the following code example:

// Device 0
cuModuleGetFunction(&function0, module0, "myKernel");
cuLaunchKernel(function0, …);
// Device 1
cuModuleGetFunction(&function1, module1, "myKernel");
cuLaunchKernel(function1, …);

This increases code complexity in the application as you must retrieve and track the per-context and per-module types. You also have to unload each module explicitly by using the cuModuleUnload API.

The problem is exacerbated when libraries or frameworks primarily use CUDA driver APIs for loading their own modules. They may not have complete control over the lifetime of contexts owned by the application.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx1, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  std::map<CUcontext, CUmodule> moduleContextMap;
}

libraryFunc() {
  cuCtxGetCurrent(&ctx);
  if (!moduleContextMap.contains(ctx)){
    cuModuleLoad(&module, "myModule.cubin");
    moduleContextMap[ctx] = module;
  }
  else {
    module = moduleContextMap[ctx];
  }

  cuModuleGetFunction(&function, module, "myKernel");
  cuLaunchKernel(function, …);
}

libraryDeinitialize() {
  moduleContextMap.clear();
}

In the code example, the library must check for new contexts and load modules into them explicitly. It also must maintain state to check if the module is already loaded into the context. 

Ideally, the state can be freed after the context is destroyed. However, this is not possible if the library has no control over the lifetime of contexts. 

This means that the freeing of resources must be delayed until library deinitialization. This not only increases code complexity, but it also causes the library to hold on to resources longer than it must, potentially denying another portion of the application from using that memory.

Another alternative is for libraries and frameworks to force additional constraints on the users to ensure that they have sufficient control over resource allocation and cleanup.

Context-independent loading

CUDA 12.0 introduces context-independent loading with the addition of the cuLibrary* and cuKernel* APIs, which solve these problems. With context-independent loading, the loading and unloading of modules into each CUDA context is done automatically by the CUDA driver as contexts are created and destroyed by the application.

// Load library
cuLibraryLoadFromFile(&library, "myModule.cubin", …);
cuLibraryGetKernel(&kernel, library, "myKernel");

// Launch kernel on the primary context of device 0
cuDevicePrimaryCtxRetain(&ctx0, device0);
cuLaunchKernel((CUkernel)kernel, …);

// Launch kernel on the primary context of device 1
cuDevicePrimaryCtxRetain(&ctx1, device1);
cuLaunchKernel((CUkernel)kernel, …);

// Unload library
cuLibraryUnload(library);

As shown in the code example, the cuLibraryLoadFromFile API takes care of loading the module when a context is created or initialized. In the example, this is done during cuDevicePrimaryCtxRetain.

Moreover, you can now launch kernels using the context-independent handle CUkernel, rather than having to maintain a per-context CUfunction. cuLibraryGetKernel retrieves a context-independent handle to the device function myKernel. The device function can then be launched with cuLaunchKernel by specifying the context-independent handle CUkernel. The CUDA driver takes care of launching the device function in the appropriate context based on the context that is active at that point.

Libraries and frameworks can now simply load and unload modules one time during initialization and deinitialization, respectively.

// Application code

libraryInitialize();
cuDevicePrimaryCtxRetain(&ctx0, device0);
libraryFunc();
cuDevicePrimaryCtxRetain(&ctx1, device1);
libraryFunc();
libraryDeinitialize();

// Library code

libraryInitialize() {
  cuLibraryLoadFromFile(&library, "myModule.cubin", …);
  cuLibraryGetKernel(&kernel, library, "myKernel");
}

libraryFunc() {
  cuLaunchKernel((CUkernel)kernel, …);
}

libraryDeinitialize() {
  cuLibraryUnload(library);
}

The library does not have to maintain and track per-context states anymore. The design of context-independent loading enables the CUDA driver to track modules and contexts and carry out the work of loading and unloading modules.

Accessing __managed__ variables

Managed variables can be referenced from both device and host code. For example, the address of a managed variable can be queried, or it can be read or written directly from a device or host function. Unlike __device__ variables, which have the lifetime of the CUDA context in which they are created, __managed__ variables belonging to a module point to the same memory across all CUDA contexts, or even devices.

Before CUDA 12.0, there was no way to retrieve a handle through the driver API to a managed variable that would be unique across CUDA contexts. CUDA 12.0 introduces a new driver API cuLibraryGetManaged, which makes it possible to get a unique handle across CUDA contexts.
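
As a rough sketch of how this might be used, the snippet below assumes a module myModule.cubin that defines a __managed__ variable named myManagedVar, and that cuLibraryGetManaged returns the device pointer and size for the named variable; the names are illustrative and error checking is omitted.

// Illustrative sketch: retrieve a handle to a __managed__ variable that is
// unique across CUDA contexts. Assumes myModule.cubin defines:
//     __managed__ int myManagedVar;
CUlibrary library;
CUdeviceptr ptr;
size_t bytes;

cuLibraryLoadFromFile(&library, "myModule.cubin", …);
cuLibraryGetManaged(&ptr, &bytes, library, "myManagedVar");

// Managed memory is host-accessible on supported systems, so the variable can
// be read or written directly through the returned pointer.
*reinterpret_cast<int *>(ptr) = 42;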

Get started with context-independent loading

In this post, we introduced new CUDA driver APIs that provide the ability to load device code independent of a CUDA context. We also discussed context-independent handles to launch kernels. Together, they provide a simpler way to load and execute code on the GPU in comparison to the traditional loading mechanisms, reducing code complexity and avoiding the need for maintaining per-context states. 

To start using these APIs, download the CUDA Driver and Toolkit version 12 or higher. For more information about the cuLibrary* and cuKernel* APIs, see the CUDA Driver API documentation.

Categories
Misc

Upcoming Workshop: Computer Vision for Industrial Inspection

Learn how to create an end-to-end hardware-accelerated industrial inspection pipeline to automate defect detection in this workshop on January 18 (CET).

Categories
Misc

Improving Robot Motion Generation with Motion Policy Networks

Collision-free motion generation in unknown environments is a core building block for robotic applications. Generating such motions is challenging. The motion generator must be fast enough for real-time performance and reliable enough for practical deployment. 

Many methods addressing these challenges have been proposed, ranging from using local controllers to global planners. However, these traditional motion planning solutions are unable to overcome shortcomings when the environment is unknown and dynamic. They also require complex visual processing procedures, such as SLAM, to generate obstacle representations by aggregating camera observations from multiple viewpoints. These representations ultimately require costly updates when the objects move and the environment changes.

Motion Policy Networks (MπNets), pronounced “M Pi Nets,” is a new end-to-end neural policy developed by the NVIDIA Robotics research team. MπNets generates collision-free, smooth motion in real time, by using a continuous stream of data coming from a single static camera. The technology is able to circumvent the challenges of traditional motion planning and is flexible enough to be applied in unknown environments.

We will be presenting this work on December 18 at the Conference on Robot Learning (CoRL) 2022 in New Zealand.

Large-scale synthetic data generation

To train the MπNets neural policy, we first needed to create a large-scale dataset for learning and benchmarking. We turned to simulation for synthetically generating vast amounts of robot trajectories and camera point cloud data. 

The expert trajectories are generated using a motion planner that creates consistent motion around complex obstacles while accounting for a robot’s physical and geometric constraints. It consists of a pipeline of geometric fabrics from NVIDIA Omniverse, an AIT* global planner, and spline-based temporal resampling. 

MπNets was trained with more than 3 million expert trajectories and 700 million point clouds rendered in 500,000 simulated environments. Training the neural policy on large-scale data was crucial for generalizing to unknown environments in the real world. 

The image shows a variety of examples of MπNets training in a variety of simulated environments.
Figure 1. MπNets is trained with a large-scale dataset consisting of 3.27 million trajectories across 575 K procedurally generated environments

End-to-end architecture for motion planning

An end-to-end neural network policy, MπNets maps directly from camera point cloud observations to robot joint positions. The policy jointly encodes three inputs: a single-view point cloud camera observation of the scene, the robot’s current state configuration, and desired target pose that the user commands the robot to achieve. 

It outputs joint positions to achieve the specified target pose, which we then execute on the robot’s low-level controller.

A workflow of the stages: point cloud image data, encoding, calculating latent space, decoding, and planning
Figure 2. The MπNets workflow, which is an end-to-end neural policy for motion planning in unknown environments

The input point cloud is automatically labeled with three classes: the robot, the obstacles, and the specified target pose of the robot. The target pose is represented as a point cloud of the robot’s gripper.

Sim2Real Transfer to the Real World

MπNets generalizes well to a real robot system with a single, static depth camera. The policy directly transfers to the real world without needing real data, due to the low domain gap in point cloud observations (vis-a-vis RGB images). 

As shown in Figure 3, it reaches into tightly confined spaces without colliding with obstacles such as the plates and mug, scenarios commonplace in human spaces. With its end-to-end policy architecture, MπNets can also be executed in a closed-loop real robot system running at 9 Hz, and react immediately to dynamic scenes, as shown in Figure 3.

Fast, global, and avoids local optima

MπNets solution time is much shorter than a state-of-the-art sampling-based planner. It is 46% more likely to find a solution than MPNets, despite not requiring a collision checker. MπNets is less likely to get stuck in challenging situations, such as tightly confined spaces, because it is learned from long-term global planning information. 

Figure 4. Local controllers (right) oftentimes get stuck in local optima. MπNets (left) avoids local optima, as it is trained with trajectories, which have global information

In Figure 4 both STORM and geometric fabrics are stuck in the first drawer because they can’t figure out how to retract and go into the second drawer. Neither reaches the final target pose. 

Getting started with MπNets

When trained on a large dataset of simulated scenes, MπNets is faster than traditional planners, more successful than other local controllers, and transfers well to a real robot system even in dynamic and partially observed scenes.

To help you get started with MπNets, our paper is published on arXiv and the source code is available on the Motion Policy Networks GitHub. You can also load our pretrained weights and experiment with them using our ROS RViz user interface.

Learn more about neural motion planning, in the context of robot benchmarking, at the Benchmarking workshop during CoRL on December 15.

Categories
Misc

Explainer: What Is an Autonomous Truck?

Autonomous trucks are commercial vehicles that use AI to automate everything from shipping yard operations to long-haul deliveries.

Categories
Misc

Just Released: CUDA Toolkit 12.0

CUDA Toolkit 12.0 supports NVIDIA Hopper architecture and many new features to help developers maximize performance on NVIDIA GPU-based products.

Categories
Misc

Predict Protein Structures and Properties with Biomolecular Large Language Models

The NVIDIA BioNeMo service is now available for early access. At GTC Fall 2022, NVIDIA unveiled BioNeMo, a domain-specific framework and service for training and serving biomolecular large language models (LLMs) for chemistry and biology at supercomputing scale across billions of parameters. 

The BioNeMo service is domain-optimized for chemical, proteomic, and genomic applications, designed to support molecular data represented in the SMILES notation for chemical structures, and FASTA for amino acid and nucleic acid sequences for proteins, DNA, and RNA.

With the BioNeMo service, scientists and researchers now have access to pretrained biomolecular LLMs through a cloud API, enabling them to predict protein structures, develop workflows, and fit downstream task models from LLM embeddings.

The BioNeMo service is a turnkey cloud solution for AI drug discovery pipelines that can be used in your browser or through API endpoints. The service API endpoints offer scientists the ability to get started quickly with AI drug discovery workflows based on large language model architectures. It also provides a UI Playground to easily and quickly try these models through an API, which can be integrated into your applications.

The BioNeMo service contains the following features:

  • Fully managed, browser-based service with API endpoints for protein LLMs
  • Accelerated OpenFold model for fast 3D protein structure predictions
  • ESM-1nv LLM for protein embeddings for downstream tasks
  • Interactive inference and visualization of protein structures through a graphic user interface (GUI)
  • Programmatic access to pretrained models through the API

About the models

ESM-1nv, based on Meta AI’s state-of-the-art ESM-1b, is a large language model for the evolutionary-scale modeling of proteins. It is based on the BERT architecture and trained on millions of protein sequences with a masked language modeling objective. ESM-1nv learns the patterns and dependencies between amino acids that ultimately give rise to protein structure and function.

Embeddings from ESM-1nv can be used to fit downstream task models for protein properties of interest such as subcellular location, thermostability, and protein structure. This is accomplished by training a typically much smaller model with a supervised learning objective to infer a property from ESM-1nv embeddings of protein sequences. Using embeddings from ESM-1nv typically results in far superior accuracy in the final model.

OpenFold is a faithful reproduction of DeepMind’s AlphaFold-2 model for 3D protein structure prediction from a primary amino acid sequence. This long-standing grand challenge in structural biology reached a significant milestone at CASP14, where AlphaFold-2 achieved nearly experimental accuracy for predicted structures. While AlphaFold was developed for a JAX workflow, OpenFold bases its code on PyTorch. 

OpenFold in BioNeMo is also trainable, meaning variants may be created for specialized research. OpenFold achieves accuracy similar to the original model, predicting backbone structures at a median accuracy of 0.96 Å RMSD95, and it is up to 6x faster due to changes made in the MSA generation step. This means that drug discovery researchers get 3D protein structure predictions very quickly.

Get early access to the BioNeMo service

Apply for early access to the BioNeMo service. You’ll be asked to join the NVIDIA Developer Program and fill out a short questionnaire to gain your early access.