DataBloom - Part 364

Misc

Is tensorflow a workable solution for my side project?

Post date February 27, 2022

Hello,

I am learning TensorFlow and wondering whether TF is a workable solution for what I am trying to achieve. My side project is a chess website where users submit their chess ratings, and the site uses that data to compare ratings across different chess websites and organizations. My data set currently has around 7,500 rows and looks like this:

7500 rows that look about like this

My backend is a Python API hosted on Heroku. Once a player enters their ratings, I would like to use machine learning/TensorFlow to predict every rating they left empty (each null value for that player). Is that doable with TF in a backend hosted on Heroku?

Also, if anyone has any tips to point me in the right direction, they are most welcome. I should note that I suspect this might not be the most appropriate use of TF, or that TF might not be the best solution, but I am using this side project to grow and demonstrate skills I take an interest in.
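For context, here is a minimal non-TensorFlow baseline for the "fill in the empty ratings" idea: fit a straight line between two rating columns using players who have both, then use it to fill the gaps. It is only a sketch. The column names `site_a` and `site_b` and the sample numbers are hypothetical, and a real data set with many rating sources would need one such fit per pair of columns or a proper multi-feature model.

```python
from statistics import mean

# Hypothetical sample data: one dict per player, None marks a missing rating.
players = [
    {"site_a": 1500, "site_b": 1620},
    {"site_a": 1800, "site_b": 1890},
    {"site_a": 1200, "site_b": 1330},
    {"site_a": 1650, "site_b": None},
]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def impute(rows, known="site_a", missing="site_b"):
    """Return copies of rows with missing values predicted from the known column."""
    complete = [r for r in rows if r[known] is not None and r[missing] is not None]
    slope, intercept = fit_line(
        [r[known] for r in complete], [r[missing] for r in complete]
    )
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows stay untouched
        if r[missing] is None and r[known] is not None:
            r[missing] = round(slope * r[known] + intercept)
        filled.append(r)
    return filled


filled = impute(players)
print(filled[-1])
```

A baseline like this runs comfortably inside a small Python API process, and it gives a reference error to beat before reaching for a neural network.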

submitted by /u/DavidDoesChess

Misc

Issue in train step of custom keras model when passing data from a generator

Post date February 27, 2022

I found a very similar issue to the one I am having on Stack Overflow here, but I’m posting here because no one figured it out. That post pretty much explains the issue, except that I am using an ImageDataGenerator instead. I can’t seem to get it to work within the train_step of a custom Keras model. Any help is appreciated.

submitted by /u/proxygonn

Misc

Closest pair of points problem optimized with tensorflow?

Post date February 26, 2022

I am a novice when it comes to tensorflow and AI.

I first wrote a solution to the problem with NumPy, then optimized it with Numba. When it comes to tensors, I have no idea how to even start the optimization. Are there any easy-to-follow tutorials out there? How do I even tell TensorFlow the rules and the goal of the optimization?
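As a reference point, the closest pair of points can be found exactly by checking every pair, with no learning involved. The sketch below is a plain brute-force baseline in standard-library Python, not the code from the linked repository. The sample points are made up, and the quadratic cost makes it suitable only for checking faster variants on small inputs.

```python
from itertools import combinations
from math import dist


def closest_pair(points):
    """Return the two points with the smallest Euclidean distance (O(n^2))."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    return min(combinations(points, 2), key=lambda pair: dist(*pair))


# Hypothetical sample points.
points = [(0.0, 0.0), (5.0, 4.0), (1.0, 1.0), (9.0, 9.0), (5.5, 4.2)]
a, b = closest_pair(points)
print(a, b, dist(a, b))
```

Any NumPy, Numba, or tensor-based version can be validated by comparing its answer against this function on random inputs.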

https://github.com/XeonPhobia/Pixel-replacement-algorithm-based-on-RMSE/blob/main/test.py

submitted by /u/Xeonfobia

Misc

Predict function takes super long

Post date February 26, 2022

Hey (obligatory: I am pretty much a noob). I’m training a neural network with TensorFlow (duh), and during training the network takes about 300 microseconds per sample. That is with dropout, layer normalization, and of course backpropagation, on 4 threads. During prediction, however, I can only predict one sample at a time (due to external needs). I would expect this to take about 1 ms or even less, but when I actually time it I get more like 12 ms. Are there any ways to speed this up, and how much could a less complex neural network gain me?

submitted by /u/davy123457

Misc

Variational Autoencoder for CIFAR-10

Post date February 25, 2022

You can read about it here. It is implemented in TensorFlow 2.8 and trained with the tf.GradientTape() API.

submitted by /u/grid_world

Misc

Can I use tensorflow to extend a 3D array in a direction of my choice?

Post date February 25, 2022

This 3D array (https://imgur.com/a/BCuCJNM) is a cross-section of a forest. I would like to predict additional forest slices moving in the x or y direction. I found this tutorial online, but this tutorial is for a 1D data set (https://www.codespeedy.com/predict-next-sequence-using-deep-learning-in-python/).

submitted by /u/yhl3051

Misc

Failed to load model while model exists in folder, and success to load in IPython

Post date February 25, 2022

I’ve a model saved in a folder containing save_model.pb and keras.metadata.pb.

In IPython notebook, I could load the model successfully with:

model_path = "/file_path/"
model = keras.models.load_model(model_path)

However, when I use the exact same code in a Python file, it shows this error:

SavedModel file does not exist at: /file_path/{saved_model.pbtxt|saved_model.pb}

I checked twice to make sure the model file exists in the file_path, and the same code works fine in the Jupyter notebook.

What might cause this difference?
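One thing worth ruling out is that the notebook and the script resolve the path differently, for example because they start in different working directories or the file name differs slightly from what the loader expects. The sketch below is a small standard-library diagnostic, not a fix. The function name is made up for illustration, `/file_path/` is the placeholder from the post, and the two file names it looks for are the ones listed in the error message.

```python
import os
from pathlib import Path


def describe_saved_model_dir(model_path):
    """Report how a model path resolves and whether the expected files are there."""
    resolved = Path(model_path).expanduser().resolve()
    is_dir = resolved.is_dir()
    return {
        "cwd": os.getcwd(),
        "resolved": str(resolved),
        "is_dir": is_dir,
        "has_saved_model_pb": (resolved / "saved_model.pb").is_file(),
        "has_saved_model_pbtxt": (resolved / "saved_model.pbtxt").is_file(),
        "contents": sorted(c.name for c in resolved.iterdir()) if is_dir else [],
    }


if __name__ == "__main__":
    # "/file_path/" is the placeholder path used in the post.
    for key, value in describe_saved_model_dir("/file_path/").items():
        print(f"{key}: {value}")
```

Running it from both the notebook and the script and comparing the output shows whether the two environments are really looking at the same directory.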

Thank you for any help!

submitted by /u/Laurence-Lin

Misc

Accelerating Cloud Networking the Right Way

Post date February 24, 2022

NVIDIA delivers industry-leading SDN performance benchmark results

The NVIDIA BlueField-2 data processing unit (DPU) delivers unmatched software-defined networking (SDN) performance, programmability, and scalability. It integrates eight Arm CPU cores, the secure and advanced ConnectX-6 Dx cloud network interface, and hardware accelerators that together offload, accelerate, and isolate SDN functions, performing connection tracking, flow matching, and advanced packet processing.

This post outlines the basic tenets of an accurate SDN performance benchmark and demonstrates the actual results achievable on the NVIDIA ConnectX-6 Dx with accelerators enabled. The BlueField-2 and next-generation BlueField-3 DPUs include additional acceleration capabilities and offer higher performance for a broader range of use cases.

SDN performance benchmark best practices

Any SDN performance evaluation of the BlueField DPUs or ConnectX SmartNICs should leverage the full power of the hardware accelerators. BlueField-2’s packet processing actions are programmable through the NVIDIA ASAP² (accelerated switching and packet processing) engine. The SDN accelerators featured on both the BlueField DPUs and ConnectX SmartNICs rely on ASAP² and other programmable hardware accelerators to achieve line-rate networking performance.

NVIDIA ASAP² support has been integrated into the upstream Linux Kernel and the Data Plane Development Kit (DPDK) framework and is readily available in a range of Linux OS distributions and cloud management platforms.

Connection tracking acceleration is available starting with Linux Kernel 5.6. The best practice is to use a modern enterprise Linux OS, for example, Ubuntu 20.04, Red Hat Enterprise Linux 8.4, and so on. These newer kernels include inbox support for SDN with connection tracking acceleration with ConnectX-6 Dx SmartNICs and BlueField-2 DPUs. Benchmarking SDN with connection tracking based on a Linux system with an outdated kernel would be misleading.
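As a rough illustration of the version requirement above, the following sketch compares a kernel release string against the 5.6 threshold. It is only a heuristic. The helper names are made up for this example, and enterprise distributions often backport features into kernels with older version numbers, so a `False` result does not prove that connection tracking acceleration is missing.

```python
import platform
import re


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=(5, 6)):
    """True if the upstream version number is at least `minimum` (default 5.6)."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__" and platform.system() == "Linux":
    release = platform.release()
    verdict = "meets" if kernel_at_least(release) else "is below"
    print(f"Kernel {release} {verdict} the upstream 5.6 threshold")
```

Before benchmarking, confirm the result against the distribution's release notes instead of relying on the version number alone.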

Finally, for any SDN benchmark to be effective, it must be representative of SDN pipelines implemented in real-world cloud data centers, where hundreds of thousands of connections are the norm. Both ConnectX-6 Dx SmartNICs and BlueField-2 DPUs are designed for, and deployed in, hyperscale environments, and deliver breakthrough network performance at cloud scale.

Accelerated SDN performance

The following benchmarks show the throughput and latency of an SDN pipeline on the NVIDIA ConnectX-6 Dx with connection tracking hardware acceleration enabled. We ran the tests using a system setup, testing tools, and procedures similar to those of other reported results. We ran Open vSwitch (OVS) DPDK to seamlessly enable connection tracking acceleration on the ConnectX-6 Dx SmartNIC.

The following charts describe the observed SDN performance by using the iperf3 tool for 4 and 16 iperf instances with one flow per instance.

*Figure 1. Observed SDN performance with the iperf3 tool for 4 instances.* The chart shows ConnectX-6 Dx throughput outperforming other offerings across packet sizes.

*Figure 2. Observed SDN performance with 16 iperf instances.* The chart shows ConnectX-6 Dx throughput outperforming other offerings across packet sizes.

Key findings:

ConnectX-6 Dx provides higher throughput for all packet sizes tested: up to 120% higher with 4 instances and up to 150% higher with 16 instances.
ConnectX-6 Dx achieves >90% of line rate for packets as small as 1 KB, whereas the other offerings need 8-KB packets to reach the same level.
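Percentages like these are easy to misread, since "120% higher" is not the same as "1.2x". The small sketch below converts between the two forms, assuming "N% higher" means the baseline plus N% of the baseline. The function names are made up for this example.

```python
def percent_higher_to_factor(percent_higher):
    """'120% higher' means 1 + 1.2 = 2.2 times the baseline."""
    return 1.0 + percent_higher / 100.0


def factor_to_percent_higher(factor):
    """A factor of 4 (4x the baseline) is 300% higher than the baseline."""
    return (factor - 1.0) * 100.0


for pct in (120, 150):
    factor = percent_higher_to_factor(pct)
    print(f"{pct}% higher throughput = {factor:.1f}x the baseline")
```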

The following chart shows the observed performance for an SDN pipeline with 32 instances on the same system setup. The results show that ConnectX-6 Dx provides much better scaling as the number of flows increases and up to 4x higher throughput.

*Figure 3. Observed SDN performance with 32 iperf instances.* The graph shows the higher performance of ConnectX-6 Dx compared to other offerings for an SDN pipeline with 32 instances on the same system setup.

The following benchmark measures latency using sockperf. The results indicate that ConnectX-6 Dx provides ~20-30% lower latency compared to other offerings for all packet sizes that were tested.

*Figure 4. Observed one-way latency for an SDN pipeline with connection tracking.* The graph shows the lower latency of ConnectX-6 Dx compared to other offerings.

Non-accelerated connection tracking implementations create bottlenecks on the host CPU. Offloading connection tracking to the on-chip accelerators means the performance achieved in these benchmarks is not strongly dependent on the host CPU or its ability to drive the test bench. These results are also indicative of the performance achievable on the BlueField-2 DPU, which integrates ConnectX-6 Dx.

BlueField-3 supports higher performance levels

NVIDIA welcomes the opportunity to test and showcase the performance of ConnectX-6 Dx and BlueField-2 while adhering to industry best practices and operating standards. The data shown in this post compares the performance benchmark results for ConnectX-6 Dx to results reported elsewhere. ConnectX-6 Dx provides up to 4x higher throughput and up to 30% lower latency compared to other offerings. These benchmark results demonstrate NVIDIA's leadership position in SDN acceleration technologies.

BlueField-3 is the next-generation NVIDIA DPU and integrates the advanced ConnectX-7 adapter and additional acceleration engines. Providing 400 Gb/s networking, more powerful Arm CPU cores, and a highly programmable Datapath Accelerator (DPA), BlueField-3 delivers even higher levels of performance and programmability to address the most demanding workloads in massive-scale data centers. Existing DPU-accelerated SDN applications built on BlueField-2 using DOCA will benefit from the performance enhancements that the BlueField-3 brings, without any code changes.

Learn more about modernizing your data center infrastructure with BlueField DPUs. Stay tuned for even higher SDN performance with BlueField-3 arriving in 2022.

Misc

Using Semaphore and Memory Sharing Extensions for Vulkan Interop with NVIDIA OpenCL

Post date February 24, 2022

Learn about new OpenCL support for Vulkan interoperability using semaphores and memory sharing extensions.

Developers often use OpenCL for compute together with other APIs, such as OpenGL, to access functionality including graphics rendering. OpenCL has long enabled the sharing of implicit buffer and image objects with OpenGL, OpenGL ES, EGL, Direct3D 10, and Direct3D 11 through extensions:

cl_khr_gl_sharing
cl_khr_gl_event
cl_khr_egl_image
cl_khr_egl_event
cl_khr_d3d10_sharing
cl_khr_d3d11_sharing

Download sample code now.

New generation GPU APIs such as Vulkan use explicit references to external memory together with semaphores to coordinate access to shared resources. Until now, there have been no OpenCL extensions to enable external memory and semaphore sharing with this new class of API.

Interop between OpenCL and Vulkan has been in strong demand for both mobile and desktop platforms. NVIDIA has closely worked with the Khronos OpenCL Working Group to release a set of provisional cross-vendor KHR extensions. The extensions enable applications to efficiently share data between OpenCL and APIs such as Vulkan, with significantly increased flexibility compared to previous-generation interop APIs using implicit resources.

This set of new external memory and semaphore sharing extensions provide a generic framework that enables OpenCL to import external memory and semaphore handles exported by external APIs, using a methodology that will be familiar to Vulkan developers. OpenCL then uses those semaphores to synchronize the external runtime, coordinating the use of shared memory.

Diagram shows how OpenCL imports memory and semaphore handles from Vulkan, and uses semaphores to synchronize memory ownership and access. — *Figure 1. Interoperability relationship between OpenCL and Vulkan software*

External API-specific interop extensions can then be added to handle the details of interacting with specific APIs. Vulkan interop is available today, and additional APIs, such as DirectX 12, are planned.

The new OpenCL external semaphore and memory sharing functionality consists of separate sets of carefully structured extensions.

Semaphore extensions

This set of extensions adds the ability to create OpenCL semaphore objects from OS-specific semaphore handles.

cl_khr_semaphore—Represents semaphores with wait and signal. This is a new class of OpenCL objects.
cl_khr_external_semaphore—Extends cl_khr_semaphore with mechanisms for importing and exporting external semaphores, similar to VK_KHR_external_semaphore.

The following extensions extend cl_khr_external_semaphore with handle-type-specific behavior:

cl_khr_external_semaphore_opaque_fd—Shares external semaphores using Linux fd handles with reference transference, similar to VK_KHR_external_semaphore_fd.
cl_khr_external_semaphore_win32—Shares external semaphores using win32 NT and KMT handles with reference transference, similar to VK_KHR_external_semaphore_win32.

Memory extensions

These extensions add the ability to create OpenCL memory objects from OS-specific memory handles. They have a similar design to the Vulkan external memory extension VK_KHR_external_memory.

cl_khr_external_memory—Imports external memory from other APIs.

The following extensions extend cl_khr_external_memory with handle-type-specific behavior:

cl_khr_external_memory_opaque_fd—Shares external memory using Linux fd handles, similar to VK_KHR_external_memory_fd.
cl_khr_external_memory_win32—Shares external memory using win32 NT and KMT handles, similar to VK_KHR_external_memory_win32.

Using OpenCL

The typical interop use case consists of the following steps.

Check if the required support is available:

Check if the required extensions cl_khr_external_semaphore and cl_khr_external_memory are supported by the underlying OpenCL platform and devices with clGetPlatformInfo and clGetDeviceInfo.
To use Win32 semaphore and memory handles, check that the cl_khr_external_semaphore_win32 and cl_khr_external_memory_win32 extensions are present.
To use FD semaphore and memory handles, check that the cl_khr_external_semaphore_opaque_fd and cl_khr_external_memory_opaque_fd extensions are present. This can also be checked by querying the supported handle types.
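The checks above can be sketched as a small string-matching routine. OpenCL reports extensions as a space-separated string (for example, from clGetDeviceInfo with CL_DEVICE_EXTENSIONS), so the sketch works on such a string directly and does not call OpenCL. It uses the extension names as listed in the Semaphore extensions and Memory extensions sections. The sample string and function name are hypothetical.

```python
# Base extensions required for any external semaphore/memory interop.
REQUIRED = {"cl_khr_external_semaphore", "cl_khr_external_memory"}

# Handle-type-specific extensions, grouped by the kind of OS handle they share.
HANDLE_TYPES = {
    "opaque_fd": {
        "cl_khr_external_semaphore_opaque_fd",
        "cl_khr_external_memory_opaque_fd",
    },
    "win32": {
        "cl_khr_external_semaphore_win32",
        "cl_khr_external_memory_win32",
    },
}


def usable_handle_types(extension_string):
    """Return the handle types whose extensions are all present, or [] if the
    base extensions are missing."""
    available = set(extension_string.split())
    if not REQUIRED <= available:
        return []
    return [name for name, needed in HANDLE_TYPES.items() if needed <= available]


# Hypothetical extension string, shaped like what a Linux device might report.
sample = (
    "cl_khr_fp64 cl_khr_external_semaphore cl_khr_external_memory "
    "cl_khr_external_semaphore_opaque_fd cl_khr_external_memory_opaque_fd"
)
print(usable_handle_types(sample))
```

Splitting into a set before matching avoids false positives from substring checks, since `cl_khr_external_memory` is a prefix of the handle-specific names.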

Importing external semaphores requires cl_khr_external_semaphore. If cl_khr_external_semaphore_opaque_fd is supported, you can import external semaphores exported by Vulkan using FD handles in OpenCL with clCreateSemaphoreWithPropertiesKHR.

// Get cl_devices of the platform. 
clGetDeviceIDs(..., &devices, &deviceCount);
// Create cl_context with just first device 
clCreateContext(..., 1, devices, ...);
// Obtain fd/win32 or similar handle for external semaphore to be imported from the other API. 
int fd = getFdForExternalSemaphore();
// Create clSema of type cl_semaphore_khr usable on the only available device assuming the semaphore was imported from the same device.
cl_semaphore_properties_khr sema_props[] = 
        {(cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_KHR, 
         (cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_BINARY_KHR, 
         (cl_semaphore_properties_khr)CL_SEMAPHORE_HANDLE_OPAQUE_FD_KHR, 
         (cl_semaphore_properties_khr)fd, 0}; 
int errcode_ret = 0; 
cl_semaphore_khr clSema = clCreateSemaphoreWithPropertiesKHR(context, 
                                                             sema_props, 
                                                             &errcode_ret);

If cl_khr_external_semaphore_win32 is supported, you can likewise import external semaphores exported by Vulkan using Win32 handles in OpenCL with clCreateSemaphoreWithPropertiesKHR. (Importing images, shown further below, requires cl_khr_external_memory and support for images.)

// Get cl_devices of the platform. 
clGetDeviceIDs(..., &devices, &deviceCount);
// Create cl_context with just first device 
clCreateContext(..., 1, devices, ...);
// Obtain fd/win32 or similar handle for external semaphore to be imported from the other API. 
void *handle = getWin32HandleForExternalSemaphore(); 
// Create clSema of type cl_semaphore_khr usable on the only available device assuming the semaphore was imported from the same device. 
cl_semaphore_properties_khr sema_props[] = 
        {(cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_KHR, 
         (cl_semaphore_properties_khr)CL_SEMAPHORE_TYPE_BINARY_KHR, 
         (cl_semaphore_properties_khr)CL_SEMAPHORE_HANDLE_OPAQUE_WIN32_KHR, 
         (cl_semaphore_properties_khr)handle, 0}; 
int errcode_ret = 0; 
cl_semaphore_khr clSema = clCreateSemaphoreWithPropertiesKHR(context, 
                                                             sema_props, 
                                                             &errcode_ret);

In OpenCL, import external memory exported by Vulkan using the FD handle as buffer memory with clCreateBufferWithProperties.

// Get cl_devices of the platform. 
clGetDeviceIDs(..., &devices, &deviceCount); 
 
// Create cl_context with just first device 
clCreateContext(..., 1, devices, ...); 
 
// Obtain fd/win32 or similar handle for external memory to be imported from other API. 
int fd = getFdForExternalMemory(); 
 
// Create extMemBuffer of type cl_mem from fd. 
cl_mem_properties_khr extMemProperties[] =
{   (cl_mem_properties_khr)CL_EXTERNAL_MEMORY_HANDLE_OPAQUE_FD_KHR,
    (cl_mem_properties_khr)fd,
    0
};
cl_mem extMemBuffer = clCreateBufferWithProperties(/*context*/          clContext, 
                                                   /*properties*/       extMemProperties, 
                                                   /*flags*/            0, 
                                                   /*size*/             size, 
                                                   /*host_ptr*/         NULL, 
                                                   /*errcode_ret*/      &errcode_ret);

In OpenCL, import external memory exported by Vulkan as image memory using clCreateImageWithProperties.

// Create img of type cl_mem. Obtain fd/win32 or similar handle for external memory to be imported from other API.
int fd = getFdForExternalMemory();

// Set cl_image_format based on external image info
cl_image_format clImgFormat = { };
clImgFormat.image_channel_order = CL_RGBA;
clImgFormat.image_channel_data_type = CL_UNORM_INT8;

// Set cl_image_desc based on external image info
size_t clImageFormatSize;
cl_image_desc image_desc = { };
image_desc.image_type = CL_MEM_OBJECT_IMAGE2D_ARRAY;
image_desc.image_width = width;
image_desc.image_height = height;
image_desc.image_depth = depth;

// Describe the external memory handle to import
cl_mem_properties_khr extMemProperties[] =
{   (cl_mem_properties_khr)CL_EXTERNAL_MEMORY_HANDLE_OPAQUE_FD_KHR,
    (cl_mem_properties_khr)fd,
    0
};
cl_mem img = clCreateImageWithProperties(/*context*/        clContext,
                                         /*properties*/     extMemProperties,
                                         /*flags*/          0,
                                         /*image_format*/   &clImgFormat,
                                         /*image_desc*/     &image_desc,
                                         /*errcode_ret*/    &errcode_ret);

Synchronize between OpenCL and Vulkan using semaphore wait and signal.

// Create clSema using one of the above examples of external semaphore creation. 
int errcode_ret = 0; 
cl_semaphore_khr clSema = clCreateSemaphoreWithPropertiesKHR(context, 
                                                             sema_props, 
                                                             &errcode_ret); 
while (true) { 
    // (not shown) Signal the semaphore from the other API,
    // Wait for the semaphore in OpenCL 
    clEnqueueWaitSemaphoresKHR(  /*command_queue*/           command_queue, 
                                 /*num_sema_objects*/        1, 
                                 /*sema_objects*/            &clSema, 
                                 /*num_events_in_wait_list*/ 0, 
                                 /*event_wait_list*/         NULL, 
                                 /*event*/                   NULL); 
    clEnqueueNDRangeKernel(command_queue, ...); 
    clEnqueueSignalSemaphoresKHR(/*command_queue*/           command_queue, 
                                 /*num_sema_objects*/        1, 
                                 /*sema_objects*/            &clSema, 
                                 /*num_events_in_wait_list*/ 0, 
                                 /*event_wait_list*/         NULL, 
                                 /*event*/                   NULL); 
    // (not shown) Launch work in the other API that waits on 'clSema'
}
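The wait, work, signal pattern in the loop above can be illustrated without a GPU. The sketch below is only an analogy in standard-library Python. Two threads stand in for the two APIs, and it uses two `threading.Semaphore` objects (one per direction) as a simplification of the single shared `clSema` shown above. No OpenCL or Vulkan calls are involved, and all names are hypothetical.

```python
import threading

FRAMES = 3
log = []

# Signaled by the "other API" when the shared resource is ready for the consumer.
producer_done = threading.Semaphore(0)
# Signaled by the consumer when it is finished; starts at 1 so the producer goes first.
consumer_done = threading.Semaphore(1)


def producer():
    """Stands in for the external API that writes the shared resource."""
    for frame in range(FRAMES):
        consumer_done.acquire()   # wait until the consumer released the resource
        log.append(f"producer wrote frame {frame}")
        producer_done.release()   # signal: resource is ready


def consumer():
    """Stands in for the OpenCL side: wait, do work, signal."""
    for frame in range(FRAMES):
        producer_done.acquire()   # wait (like clEnqueueWaitSemaphoresKHR)
        log.append(f"consumer read frame {frame}")
        consumer_done.release()   # signal (like clEnqueueSignalSemaphoresKHR)


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("\n".join(log))
```

Because each side blocks until the other has signaled, the log strictly alternates between producer and consumer, which is the ownership handoff the semaphores exist to guarantee.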

Again, for more information, download the vk_ocl_interop_samples.zip samples file.

Try it out today!

Try out the NVIDIA OpenCL implementation of Vulkan interop by downloading the R510 (or later) drivers.

For more information, see Khronos Releases OpenCL 3.0 Extensions for Neural Network Inferencing and OpenCL/Vulkan Interop.

Misc

Speeding up Numerical Computing in C++ with a Python-like Syntax in NVIDIA MatX

Post date February 24, 2022

MatX is an experimental library that allows you to write high-performance GPU code in C++, with high-level syntax and a common data type across all functions.

Rob Smallshire once said, “You can write faster code in C++, but write code faster in Python.” Since its release more than a decade ago, CUDA has given C and C++ programmers the ability to maximize the performance of their code on NVIDIA GPUs.

More recently, libraries such as CuPy and PyTorch allowed developers of interpreted languages to leverage the speed of the optimized CUDA libraries from other languages. These interpreted languages have many excellent properties, including easy-to-read syntax, automatic memory management, and common types across all functions.

However, sometimes having these features means paying a performance penalty due to memory management and other factors outside your control. The decrease in performance is often worth it to save development time. Still, it may ultimately require rewriting portions of the application later when performance becomes an issue.

What if you could achieve maximum performance in C++ while still reaping all the benefits of the interpreted languages?

MatX overview

MatX is an experimental, GPU-accelerated, numerical computing C++ library that aims to bridge the gap between users who want the highest possible performance and those who want the same easy syntax and types across all CUDA libraries. Using the C++17 support added in CUDA 11.0, MatX lets you write the same natural algebraic expressions that you would in a high-level language like Python, without the performance penalty that may come with it.

Tensor types

MatX includes interfaces to many of the popular math libraries, such as cuBLAS, CUTLASS, cuFFT, and CUB, but uses a common data type (tensor_t) across all these libraries. This greatly simplifies the API to these libraries by deducing information that it knows about the tensor type and calling the correct APIs based on that.

The following code examples show an FFT-based resampler.

Python

N = min(ns, ns_resamp)
nyq = N // 2 + 1

# Create an empty vector
sv = np.empty(ns)

# Real to complex FFT
svc = np.fft.rfft(sv)

# Slice
sv = svc[0:nyq]

# Complex to real IFFT
rsv = np.fft.irfft(sv, ns_resamp)

MatX

uint32_t N = std::min(ns, ns_resamp);
uint32_t nyq = N / 2 + 1;

// Element types are assumed here (single precision); adjust to your data.
auto sv  = make_tensor<float>({ns});
auto svc = make_tensor<cuda::std::complex<float>>({ns / 2 + 1});
auto rsv = make_tensor<float>({ns_resamp});

// Real to complex FFT
fft(svc, sv, stream);

// Slice the vector
auto svs = svc.Slice({0}, {nyq});

// Complex to real IFFT
ifft(rsv, svs, stream);

While the code length and readability are similar, the MatX version on an A100 runs about 2100x faster than the NumPy version running on CPU. The MatX version also has many hidden benefits over directly using the CUDA libraries, such as type checking, input and output size checking, and slicing a tensor without pointer manipulation.

The tensor types are not limited to FFTs, though, and the same variables can be used inside of other libraries and expressions. For example, if you wanted to perform a GEMM using CUTLASS on the resampler output, you could write the following:

matmul(resampOut, resampView, B, stream);

In this code, resampOut and B are appropriately sized tensors for the GEMM operation. As in the preceding FFT sample, types, sizes, batches, and strides are all inferred from the tensor metadata. Using a strongly typed C++ API also means that many errors can be caught at compile time or at runtime without additional debugging.

In addition to supporting the optimized CUDA libraries as backends, these same tensor types can be used in algebraic expressions to perform element-wise operations:

(C = A * B + (D / 5.0) + cos(E)).run(stream);
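To make the `(expression).run(stream)` pattern more concrete, here is a small Python sketch of deferred element-wise evaluation. It is an analogy only. MatX builds its expression in C++ at compile time and runs it on the GPU, whereas this sketch builds a tree of objects at runtime and evaluates it on the CPU. All class and function names are made up for illustration, and `assign` stands in for `=`, which Python cannot overload.

```python
import math
import operator


class Expr:
    """Base class: arithmetic operators build a tree instead of computing values."""

    def __add__(self, other):
        return BinOp(operator.add, self, wrap(other))

    def __mul__(self, other):
        return BinOp(operator.mul, self, wrap(other))

    def __truediv__(self, other):
        return BinOp(operator.truediv, self, wrap(other))


class Const(Expr):
    def __init__(self, value):
        self.value = value

    def at(self, i):
        return self.value  # a scalar broadcasts to every index


class Vec(Expr):
    def __init__(self, data):
        self.data = [float(x) for x in data]

    def __len__(self):
        return len(self.data)

    def at(self, i):
        return self.data[i]

    def assign(self, expr):
        return Assignment(self, wrap(expr))  # nothing is computed yet


class BinOp(Expr):
    def __init__(self, fn, left, right):
        self.fn, self.left, self.right = fn, left, right

    def at(self, i):
        return self.fn(self.left.at(i), self.right.at(i))


class UnaryOp(Expr):
    def __init__(self, fn, arg):
        self.fn, self.arg = fn, arg

    def at(self, i):
        return self.fn(self.arg.at(i))


class Assignment:
    def __init__(self, target, expr):
        self.target, self.expr = target, expr

    def run(self):
        # Only here is the whole expression evaluated, one element at a time.
        for i in range(len(self.target)):
            self.target.data[i] = self.expr.at(i)


def wrap(value):
    return value if isinstance(value, Expr) else Const(value)


def cos(expr):
    return UnaryOp(math.cos, wrap(expr))


A, B, D, E = Vec([1, 2]), Vec([3, 4]), Vec([5, 10]), Vec([0, 0])
C = Vec([0, 0])
pending = C.assign(A * B + (D / 5.0) + cos(E))
before_run = list(C.data)  # still zeros: building the expression did no work
pending.run()
print(before_run, C.data)
```

The single loop in `run` plays the role of the fused kernel: the whole right-hand side is evaluated per element without materializing intermediate vectors.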

Lazy evaluation

MatX uses lazy evaluation to create a GPU kernel at compile time representing the expression in parentheses. Only when the run function is called on the expression does the operation execute on the GPU. Over 40 different types of operators are supported and can be mixed and matched across different size and type tensors with compatible parameters. If you look at the earlier expression written as a CUDA kernel, it would look something like this:

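__global__ void Expression( float *C,
                            const float *A,
                            const float *B,
                            const float *D,
                            const float *E,
                            int length)
{
    // Grid-stride loop: each thread handles every (blockDim.x * gridDim.x)-th element
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < length;
         idx += blockDim.x * gridDim.x) {
        C[idx] = A[idx] * B[idx] + (D[idx] / 5.0f) + cosf(E[idx]);
    }
}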

While the earlier code is not complicated, it’s hiding several problems:

The data types are hard-coded to floats. To change to another type, you must edit the kernel signature. Astute readers would say to use templates and let the compiler deduce types for you. While that may work for some types, it won’t work for all types you may want to use. For example, cosf is not defined for half precision types, so you must use compile-time conditionals to handle different types.
Any small change to the function signature needs a completely different function. For example, what if you wanted to add a tensor F in some cases, but still retain this original signature? That would be two functions to maintain that are nearly identical.
While a grid-stride loop is good practice and is used to handle different sizes of blocks and grids, you must still have code ensuring that during kernel launch there are enough threads to keep the GPU busy.
All inputs are assumed to be 1D vectors; higher dimensions could break with non-unity strides.

There are numerous other deficiencies not listed, including the inability to broadcast different-sized tensors, no checking on sizes, requiring contiguous memory layouts, and more.

Obviously, this code only works under specific conditions, while the MatX version solves all these issues and more while typically maintaining the same performance as writing the kernel directly.
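The grid-stride loop mentioned in the list above can be checked with a small CPU simulation. The sketch below enumerates, in plain Python, the indices each simulated thread would visit and counts the visits per element. The function names and launch sizes are made up for illustration, and the model says nothing about performance, only about coverage.

```python
def grid_stride_indices(block_idx, thread_idx, block_dim, grid_dim, length):
    """Indices one thread visits: start at its global id, step by the grid size."""
    start = block_idx * block_dim + thread_idx
    stride = block_dim * grid_dim
    return list(range(start, length, stride))


def simulate_launch(grid_dim, block_dim, length):
    """Count how many times each element is visited across all threads."""
    visits = [0] * length
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            for idx in grid_stride_indices(
                block_idx, thread_idx, block_dim, grid_dim, length
            ):
                visits[idx] += 1
    return visits


# Hypothetical launch shapes, including fewer threads than elements and more.
for grid_dim, block_dim, length in [(2, 4, 19), (1, 1, 5), (8, 32, 10)]:
    visits = simulate_launch(grid_dim, block_dim, length)
    print(grid_dim, block_dim, length, all(v == 1 for v in visits))
```

Every element is visited exactly once regardless of the launch shape, which is why the loop is correct for any grid. Picking a shape that keeps the GPU busy remains the caller's job.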

Additional MatX features

Other key features of MatX include the following:

Creating zero-copy tensor views by slicing, cloning, and permuting existing tensors.
Supporting arbitrary-dimension tensors.
Generating data on the fly with generators, without storing it in memory. Common examples are a linearly spaced vector, a Hamming window, or a diagonal matrix.
Supporting almost every type used in CUDA, including half precision (both FP16 and BF16) and complex numbers (both full and half precision).
Providing linear solver functions through cuSolver, sorting and scanning through CUB, random number generation through cuRAND, reductions, and more.
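The generator idea from the list above, where values are computed when indexed instead of being stored, can be sketched in plain Python. This is not the MatX API. The class names are made up for illustration, and the Hamming window uses the common 0.54 and 0.46 coefficients.

```python
import math


class Linspace:
    """Linearly spaced values computed on demand; no array is stored."""

    def __init__(self, first, last, count):
        if count < 1:
            raise ValueError("count must be at least 1")
        self.first, self.last, self.count = first, last, count

    def __len__(self):
        return self.count

    def __getitem__(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        if self.count == 1:
            return float(self.first)
        return self.first + (self.last - self.first) * i / (self.count - 1)


class Hamming:
    """Hamming window w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1)), computed on demand."""

    def __init__(self, count):
        if count < 1:
            raise ValueError("count must be at least 1")
        self.count = count

    def __len__(self):
        return self.count

    def __getitem__(self, n):
        if not 0 <= n < self.count:
            raise IndexError(n)
        if self.count == 1:
            return 1.0
        return 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (self.count - 1))


ramp = Linspace(0.0, 1.0, 5)
window = Hamming(5)
# Element-wise product evaluated lazily; only one value exists at a time.
windowed = (ramp[i] * window[i] for i in range(len(ramp)))
print(list(windowed))
```

Both objects use constant memory no matter how long the sequence is, which is the property that makes generators attractive as operands in larger expressions.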

Summary

MatX is open source under the BSD 3-Clause license. For more information, see the following resources: