
# N Ways to SAXPY: Demonstrating the Breadth of GPU Programming Options

Back in 2012, NVIDIAN Mark Harris wrote Six Ways to Saxpy, demonstrating how to perform the SAXPY operation on a GPU in multiple ways, using different languages and libraries. Since then, programming paradigms have evolved and so has the NVIDIA HPC SDK.

In this post, I demonstrate five ways to implement a simple SAXPY computation using NVIDIA GPUs. Why is this interesting? Because it demonstrates the breadth of options that you have today for programming NVIDIA GPUs. It also covers the four main approaches to GPU computing:

• GPU-accelerated libraries
• Compiler directives
• Standard language parallelism
• GPU programming languages

SAXPY stands for Single-Precision A·X Plus Y,  a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like the following:

```
void saxpy_cpu(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Given this basic example code, I can now show you five ways to SAXPY on GPUs. I chose SAXPY because it is a short and simple code, but it shows enough of the syntax of each programming approach to compare them. Because it does relatively little computation, SAXPY isn’t that useful for demonstrating the difference in performance between the different programming models, but that’s not my intent here. My goal is to demonstrate multiple ways to program on the NVIDIA platform today, rather than to recommend one over another. That would require taking other factors into account and is beyond the scope of this post.

I discuss implementations of SAXPY in the following models:

• CUDA C++—A C++ language extension to support the CUDA programming model and allow C++ code to be executed on NVIDIA GPUs.
• cuBLAS—A GPU-accelerated implementation of the basic linear algebra subroutines (BLAS) optimized for NVIDIA GPUs.
• OpenACC—Using compiler directives to tell the compiler that a given portion of the code can be parallelized and letting the compiler figure out how to do it.
• Standard C++—Using the NVC++ compiler and the parallel execution policies added to the standard library in C++17.
• Thrust—A high-level, GPU-accelerated parallel algorithms library.

After going through all the implementations, I show what performance looks like when SAXPY is accelerated through these approaches.

## CUDA C++ SAXPY

```
__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    unsigned int t_id = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int stride = blockDim.x * gridDim.x;
    for (int i = t_id; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}

// ... allocate dev_x and dev_y with cudaMalloc, then launch:
saxpy_cuda<<<num_blocks, block_size>>>(n, 2.0f, dev_x, dev_y);
```

CUDA C++ is a GPU programming language that extends C++ to express parallel computation. Device functions, called kernels, are declared with the `__global__` specifier: they execute on the GPU and are launched from host code. Device memory to hold the float vectors is allocated using `cudaMalloc`. The kernel is then called with an execution configuration that specifies the number of thread blocks and the number of threads per block:

`saxpy_cuda<<<num_blocks, block_size>>>(n, 2.0f, dev_x, dev_y);`

Each thread that is launched executes the kernel, using the built-in variables `threadIdx`, `blockDim`, and `blockIdx`. These variables are assigned by the device for each thread and block and are used to compute the indices of the elements that each thread is responsible for, so that each thread performs the multiply-add operation on a limited number of elements of the vector. When the number of threads is smaller than the size of the vector, each thread advances by a stride to operate on multiple elements so that the entire vector is covered (Figure 1).

Figure 1. How GPU threads operate across a large array using a grid-stride when the number of threads is fewer than the number of elements. The stride is the number of threads in the grid, so the kernel loops over the data array one grid-size at a time.

## cuBLAS SAXPY

```
cublasHandle_t handle;
cublasCreate(&handle);

unsigned int n = 1UL << 20;
float alpha = 2.0f;
float *dev_x, *dev_y;
cudaMalloc(&dev_x, n * sizeof(float));
cudaMalloc(&dev_y, n * sizeof(float));
// ... initialize dev_x and dev_y ...

cublasSaxpy(handle, n, &alpha, dev_x, 1, dev_y, 1);

cublasDestroy(handle);
cudaFree(dev_x);
cudaFree(dev_y);
```

SAXPY, being a BLAS operation, has an implementation in the NVIDIA cuBLAS library. It involves initializing a cuBLAS library context by passing a handle to `cublasCreate`, allocating memory for the vectors, and then calling the library function `cublasSaxpy` while passing in the vector and scalar values. Finally, `cublasDestroy` and `cudaFree` are used to release the resources associated with the cuBLAS library context and device memory allocated for the vectors, respectively.

## OpenACC C++ SAXPY

```
void saxpy(int n, float a, float *restrict x, float *restrict y)
{
#pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

OpenACC is a directive-based programming model that uses `#pragma` compiler directives to mark regions of code that can be parallelized. The compiler analyzes those regions and automatically generates code for the GPU. OpenACC also provides options for fine-tuning launch configurations in those instances where the automatically generated code is not optimal.

Compilers that support NVIDIA GPUs, such as nvc++, can offload this computation to the GPU, using unified memory to move data between the host and device transparently. Adding `#pragma acc kernels` tells the compiler to generate a kernel for the following `for` loop. Because x and y were allocated on the host with the `malloc` function, the compiler generates instructions to move the two vectors into device memory before the computation, perform a fused multiply-add on each element, and copy the result back to the host afterward.

## std::par C++ SAXPY

```
void saxpy(int N, float a, float *restrict x, float *restrict y)
{
    std::transform(std::execution::par_unseq, x, x + N, y, y,
                   [=](float xi, float yi) { return a * xi + yi; });
}

float alpha = 2.0f;
unsigned int n = 1UL << 20;
// ... allocate and initialize x and y, then:
saxpy(n, alpha, x, y);
```

With the NVIDIA NVC++ compiler, you can use GPU acceleration in standard C++ with no language extensions, pragmas, directives, or libraries, other than the C++ standard library. The code, being standard C++, is portable to other compilers and systems and can be accelerated on NVIDIA GPUs or multicore CPUs using NVC++.

With the parallel algorithms and execution policies introduced in C++17, many standard library algorithms, such as `std::transform` and `std::reduce`, gained overloads that take an execution policy as their first parameter. You can thus pass `std::execution::par_unseq` to `std::transform` along with a lambda that captures by value and performs the SAXPY operation. When you compile with the `-stdpar` command-line option, the compiler offloads standard algorithms called with a parallel execution policy to NVIDIA GPUs.

## Thrust SAXPY

```
struct saxpy_functor
{
    const float a;
    saxpy_functor(float _a) : a(_a) {}
    __host__ __device__ float operator()(const float &x, const float &y) const
    {
        return a * x + y;
    }
};

float alpha = 2.0f;
unsigned int n = 1UL << 20;
thrust::device_vector<float> x(n);
thrust::device_vector<float> y(n);
thrust::fill(x.begin(), x.end(), 1.0f);
thrust::fill(y.begin(), y.end(), 1.0f);
thrust::transform(x.begin(), x.end(), y.begin(), y.begin(), saxpy_functor(alpha));
```

Thrust is a parallel algorithms library that resembles the C++ Standard Template Library (STL). It provides parallel building blocks to develop fast, portable algorithms. Interoperability with established technologies like CUDA and TBB, along with its modular design, allows you to focus on the algorithms instead of the platform-specific implementations.

Here, you allocate x and y in device memory on the NVIDIA GPU using `thrust::device_vector` and initialize them with the `fill` function. Finally, you use the Thrust `transform` algorithm with the defined functor `saxpy_functor` to apply the `y = a*x + y` operation to each pair of elements of x and y.

## SAXPY performance

While SAXPY is a bandwidth-bound operation and not computationally complex, its highly parallel nature means that it still benefits from GPU acceleration if the problem size is large enough. Compared to a dual-socket AMD EPYC 7742 system with 128 cores and 256 threads, an NVIDIA A100 GPU was 23x faster, executing more than 3,000 SAXPY operations in the time that the CPU took to do 140. Furthermore, all the GPU-accelerated implementations delivered similar performance, with cuBLAS edging out the rest by a slight margin (Figure 2).

Figure 2. SAXPY performance on a dual-socket AMD EPYC 7742 system with 128 cores and 256 threads, compared to GPU-accelerated implementations on A100.

## Accelerating your code with NVIDIA GPUs

The NVIDIA HPC SDK is a comprehensive suite of compilers, libraries, and tools enabling you to choose the programming model that works best for you and still get excellent performance by accelerating your code using NVIDIA GPUs. Learn more and get started today:

Try NVIDIA GPU acceleration for your code from any major cloud service provider: A100 is available in the cloud at Google Cloud Platform, Amazon Web Services, and Alibaba Cloud.