Speeding up Numerical Computing in C++ with a Python-like Syntax in NVIDIA MatX

MatX is an experimental library that allows you to write high-performance GPU code in C++, with high-level syntax and a common data type across all functions.

Rob Smallshire once said, “You can write faster code in C++, but write code faster in Python.” Since its release more than a decade ago, CUDA has given C and C++ programmers the ability to maximize the performance of their code on NVIDIA GPUs.

More recently, libraries such as CuPy and PyTorch allowed developers of interpreted languages to leverage the speed of the optimized CUDA libraries from other languages. These interpreted languages have many excellent properties, including easy-to-read syntax, automatic memory management, and common types across all functions.

However, sometimes having these features means paying a performance penalty due to memory management and other factors outside your control. The decrease in performance is often worth it to save development time. Still, it may ultimately require rewriting portions of the application later when performance becomes an issue.

What if you could still achieve the maximum performance using C++ while still reaping all the benefits from the interpreted languages?

MatX overview

MatX is an experimental, GPU-accelerated, numerical computing C++ library aimed at bridging the gap between users wanting the highest performance possible, with the same easy syntax and types across all CUDA libraries. Using the C++17 support added in CUDA 11.0, MatX allows you to write the same natural algebraic expressions that you would in a high-level language like Python without the performance penalty that may come from it.

Tensor types

MatX includes interfaces to many of the popular math libraries, such as cuBLAS, CUTLASS, cuFFT, and CUB, but uses a common data type (tensor_t) across all these libraries. This greatly simplifies the API to these libraries by deducing information that it knows about the tensor type and calling the correct APIs based on that.

The following code examples show an FFT-based resampler.


N = min(ns, ns_resamp)
nyq = N // 2 + 1

# Create an empty vector
sv = np.empty(ns)

# Real to complex FFT
svc = np.fft.rfft(sv)

# Slice
sv = svc[0:nyq]

# Complex to real IFFT
rsv = np.fft.irfft(sv, ns_resamp)


uint32_t N = std::min(ns, ns_resamp);  
uint32_t nyq = N / 2 + 1;

auto sv  = make_tensor({ns});  
auto svc = make_tensor({ns / 2 + 1});  
auto rv  = make_tensor({ns_resamp});

// Real to complex FFT
fft(svc, sv, stream);

// Slice the vector
auto sv = svc.Slice({0}, {nyq});

// Complex to real IFFT
ifft(rsv, sv, stream);

While the code length and readability are similar, the MatX version on an A100 runs about 2100x faster than the NumPy version running on CPU. The MatX version also has many hidden benefits over directly using the CUDA libraries, such as type checking, input and output size checking, and slicing a tensor without pointer manipulation.

The tensor types are not limited to FFTs, though, and the same variables can be used inside of other libraries and expressions. For example, if you wanted to perform a GEMM using CUTLASS on the resampler output, you could write the following:

matmul(resampOut, resampView, B, stream);

In this code, resampOut and B are appropriately sized tensors for the GEMM operation. As in the FFT sample preceding, types, sizes, batches, and strides are all inferred by the tensor metadata. Using a strongly typed C++ API also means that many runtime and compile-time errors can be caught without additional debugging.

In addition to supporting the optimized CUDA libraries as backends, these same tensor types can be used in algebraic expressions to perform element-wise operations:

(C = A * B + (D / 5.0) + cos(E)).run(stream);

Lazy evaluation

MatX uses lazy evaluation to create a GPU kernel at compile time representing the expression in parentheses. Only when the run function is called on the expression does the operation execute on the GPU. Over 40 different types of operators are supported and can be mixed and matched across different size and type tensors with compatible parameters. If you look at the earlier expression written as a CUDA kernel, it would look something like this:

__global__ void Expression( float *C, 
                            const float *A,
                            const float *B,
                            const float *D,
                            const float *E,
                            int length)
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; 

While the earlier code is not complicated, it’s hiding several problems:

  • The data types are hard-coded to floats. To change to another type, you must edit the kernel signature. Astute readers would say to use templates and let the compiler deduce types for you. While that may work for some types, it won’t work for all types you may want to use. For example, cosf is not defined for half precision types, so you must use compile-time conditionals to handle different types.
  • Any small change to the function signature needs a completely different function. For example, what if you wanted to add a tensor F in some cases, but still retain this original signature? That would be two functions to maintain that are nearly identical.
  • While a grid-stride loop is good practice and is used to handle different sizes of blocks and grids, you must still have code ensuring that during kernel launch there are enough threads to keep the GPU busy.
  • All inputs are assumed to be 1D vectors; higher dimensions could break with non-unity strides.

There are numerous other deficiencies not listed, including the inability to broadcast different-sized tensors, no checking on sizes, requiring contiguous memory layouts, and more.

Obviously, this code only works under specific conditions, while the MatX version solves all these issues and more while typically maintaining the same performance as writing the kernel directly.

Additional MatX features

Other key features of MatX include the following:

  • Creating zero-copy tensor views by slicing, cloning, and permuting existing tensors.
  • Supporting arbitrary-dimension tensors.
  • Generators for generating data on-the-fly without storing in memory. Common examples would be to create a linearly spaced vector, hamming window, or a diagonal matrix.
  • Supports almost every type used in CUDA, including half precision (both FP16 and BF16) and complex numbers (both full and half precision).
  • Linear solver functions through cuSolver, sorting and scanning using CUB, random number generation using cuRAND, reductions, and more


MatX is open-sourced under the BSDv3 license. For more information, see the following resources:

Leave a Reply

Your email address will not be published.