Jorge Cardoso wears many hats, and that’s appropriate given he has so many brains. A hundred thousand of them to be exact. Cardoso is a teacher, a CTO, an entrepreneur, a founding member of the MONAI open source consortium and a researcher in AI for medical imaging. In that last role, Cardoso and his team Read article >
Researchers grappling with today’s grand challenges are getting traction with accelerated computing, as showcased at ISC, Europe’s annual gathering of supercomputing experts. Some are building digital twins to simulate new energy sources. Some use AI+HPC to peer deep into the human brain. Others are taking HPC to the edge with highly sensitive instruments or accelerating Read article >
Atos, Dell Technologies, GIGABYTE, Hewlett Packard Enterprise, Inspur, Lenovo and Supermicro Join First Wave Planning NVIDIA Grace-Powered HGX Systems for HPC and AISANTA CLARA, Calif., May …
Highly accurate digital representations of physical objects or systems, or “digital twins,” will enable the next era of industrial virtualization and AI, executives from NVIDIA and BMW said Tuesday. Kicking off the ISC 2022 conference in Hamburg, Germany, NVIDIA’s Rev Lebaredian (left), vice president for Omniverse and simulation technology, was joined by Michele Melchiorre, senior Read article >
We present lessons learned from refactoring a Fortran application to use modern do concurrent loops in place of OpenACC for GPU acceleration.
This is the fourth post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing:
Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, do concurrent loops and array math intrinsics in Fortran.
Using standard language features has many advantages to offer, the chief advantage being future-proofness. As Fortran’s do concurrent is a standard language feature, the chances of support being lost in the future is slim.
This feature is also relatively simple to use in initial code development and it adds portability and parallelism. Using do concurrent for initial code development has the mental advantage of encouraging you to think about parallelism from the start as you write and implement loops.
For initial code development, do concurrent is a great way to add GPU support without having to learn directives. However, even code that has already been GPU-accelerated through the use of directives like OpenACC and OpenMP can benefit from refactoring to standard parallelism for the following reasons:
Cleaning up the code for those who do not know directives, or removing the large numbers of directives that make the source code distracting.
Increasing the portability of the code in terms of vendor support and longevity of support.
Future-proofing the code, as ISO-standard languages have a proven track record for stability and portability.
Replacing directives on multi-core CPUs and GPUs
POT3D is a Fortran code that computes potential field solutions to approximate the solar coronal magnetic field using surface field observations as input. It continues to be used for numerous studies of coronal structure and dynamics.
We recently refactored another code example with do concurrentat WACCPD 2021. The results showed that you could replace directives with do concurrent without losing performance on multicore CPUs and GPUs. However, that code was somewhat simple in that there is no MPI.
Now, we want to explore replacing directives in more complicated code. POT3D contains nontrivial features for standard Fortran parallelism to handle: reductions, atomics, CUDA-aware MPI, and local stack arrays. We want to see if do concurrentcan replace directives and retain the same performance.
To establish a performance baseline for refactoring the code to do concurrent, first review the initial timings of the original code in Figure 1. The CPU result was run on 64 MPI ranks (32 per socket) on a dual-socket AMD EPYC 7742 server, while the GPU result was run on one MPI rank on an NVIDIA A100 (40GB). The GPU code relies on data movement directives for data transfers (we do not use managed memory here) and is compiled with -acc=gpu -gpu=cc80,cuda11.5. The runtimes are an average over four runs.
The following highlighted text shows the number of lines of code and directives for the current version of the code. You can see that there are 80 directives, but we hope to reduce this number by refactoring with do concurrent.
POT3D (Original)
Fortran
3,487
Comments
3,452
OpenACC Directives
80
Total
7,019
Table 1. Code facts for POT3D including Fortran, comments, OpenACC Directives, and total lines.
Do concurrent and OpenACC
Here are some examples of do concurrentcompared to OpenACC from the code POT3D, such as a triple-nested OpenACC parallelized loop:
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
…
!$acc parallel loop default(present) collapse(3) async(1)
do k=1,np
do j=1,nt
do i=1,nrm1
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo
enddo
enddo
…
!$acc wait
!$acc exit data delete(phi,dr_i,br)
As mentioned earlier, this OpenACC code is compiled with the flag -acc=gpu -gpu=cc80,cuda11.5 to run on an NVIDIA GPU.
You can parallelize this same loop with do concurrentand rely on NVIDIA CUDA Unified Memory for data movement instead of directives. This results in the following code:
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo
As you can see, the loop has been condensed from 12 lines to three, and CPU portability and GPU parallelism are retained with the nvfortran compiler from the NVIDIA HPC SDK.
This reduction in the number of lines is thanks to collapsing multiple loops into one loop and relying on managed memory, which removes all data movement directives. Compile this code for the GPU with -stdpar=gpu -gpu=cc80,cuda11.5.
For nvfortran, activating standard parallelism (-stdpar=gpu) automatically activates managed memory. To use OpenACC directives to control data movement along with do concurrent, use the following flags: -acc=gpu -gpu=nomanaged.
The nvfortran implementation of do concurrentalso allows for locality of variables to be defined:
do concurrent (k=1:N, j(i)>0) local(M) shared(J,K)
M = mod(K(i), J(i))
K(i) = K(i)- M
enddo
This may be necessary for some code. For POT3D, the default locality of variables performs as needed. The default locality is the same as OpenACC with nvfortran.
do concurrent CPU performance and GPU implementation
Replacing all the OpenACC loops with do concurrentand relying on managed memory for data movement leads you to code with zero directives and fewer lines. We removed 80 directives and 66 lines of Fortran.
Figure 2 shows that this do concurrentversion of the code has nearly the same performance on the CPU as the original GitHub code. This means that you haven’t broken CPU compatibility by using do concurrent. Instead, multi-core parallelism has also been added, which can be used by compiling with the flag -stdpar=multicore.
Unlike the case of the CPU, to be able to run POT3D on a GPU, you must add a couple of directives.
First, to take advantage of multiple GPU with MPI, you need a directive to specify the GPU device number. Otherwise, all MPI ranks would use the same GPU.
!$acc set device_num(mpi_shared_rank_num)
In this example, mpi_shared_rank_num is the MPI rank within a node. It’s assumed that the code is launched such that the number of MPI ranks per node is the same as the number of GPUs per node. This can also be accomplished by setting CUDA_VISIBLE_DEVICES for each MPI rank, but we prefer doing this programmatically.
When using managed memory with multiple GPUs, make sure that the device selection (such as !$acc set device_num(N)) is done before any data is allocated. Otherwise, an additional CUDA context is created, introducing additional overhead.
Currently, the nvfortran compiler does not support array reductions on concurrent loops, which are required in two places in the code. Fortunately, an OpenACC atomic directive can be used in place of an array reduction:
do concurrent (k=2:npm1,i=1:nr)
!$acc atomic
sum0(i)=sum0(i)+x(i,2,k)*dph(k )*pl_i
enddo
After adding this directive, change the compiler options to enable OpenACC explicitly by using -stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5. This allows you to use only three OpenACC directives. This is the closest this code can come to not having directives at this time.
All the data movement directives are unnecessary, since CUDA managed memory is used for all data structures. Table 2 shows the number of directives and lines of code needed for this version of POT3D.
POT3D (Original)
POT3D (Do Concurrent)
Difference
Fortran
3487
3421
(-66)
Comments
3452
3448
(-4)
OpenACC Directives
80
3
(-77)
Total
7019
6872
(-147)
Table 2. Number of lines of code for the GPU compatible do concurrent version of POT3D with a breakdown of the number of Fortran lines, directives, and comments lines.
For the reduction loops in POT3D, you relied on implicit reductions, but that may not always work. Recently, nvfortran has added the upcoming Fortran 202X reduce clause, which can be used on reduction loops as follows:
do concurrent (k=1:N) reduce(+:cgdot)
cgdot=cgdot+x(i)*y(i)
enddo
GPU performance, unified memory, and data movement
You’ve developed code with the minimal number of OpenACC directives and do concurrentthat relies on managed memory for data movement. This is the closest directive-free code that you can have at this time.
Figure 3 shows that this code version takes a small performance hit of ~10% when compared to the original OpenACC GPU code. The cause of this could be do concurrent, managed memory, or a combination.
To see if managed memory causes the small performance hit, compile the original GitHub code with managed memory turned on. This is done by using the compile flag -gpu=managed in addition to the standard OpenACC flags used before for the GPU.
Figure 4 shows that the GitHub code now performs similar to the minimum directives code with managed memory. This means that the culprit for the small performance loss is unified memory.
To regain the performance of the original code with the minimal directives code, you must add the data movement directives back in. This mixture of do concurrent and data movement directives would look like the following code example:
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
This results in the code having 41 directives, with 38 being responsible for data movement. To compile the code and rely on the data movement directives, run the following command:
-stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5,nomanaged
nomanaged turns off managed memory and -acc=gpu turns on the directive recognition.
Figure 5 shows that there is nearly the same performance as the original GitHub code. This code has 50% fewer directives than the original code and gives the same performance!
MPI + DO CONCURRENT scaling
Figure 7 shows the timing results using multiple GPUs. The primary takeaway is that do concurrent works with MPI over multiple GPUs.
Looking at the codes with managed memory turned on (blue lines), you can see that the original code and the minimal directives code gave nearly the same performance as multiple GPUs were used.
Looking at the codes with managed memory turned off (green lines), you can again see the same scaling between the original GitHub code and the do concurrent version of the code. This indicates that do concurrent works with MPI and has no impact on the scaling you should see.
What you might also notice is that managed memory causes an overhead as the GPUs are scaled. The managed memory runs (blue lines) and data directive lines (green lines) are parallel to each other, meaning the overhead scales with the number of GPUs.
Fortran standard parallel programming review
You may be wondering, “Standard Fortran sounds too good to be true, what is the catch?”
Fortran standard parallel programming enables cleaner looking code and increases the future proofness of your code by relying on ISO language standards. Using the latest nvfortran compiler, you gain all the benefits mentioned earlier.
Although you lose the current GCC OpenACC/MP GPU support when you transition to do concurrent, we expect to gain more GPU support in the future as other vendors add support of do concurrent on GPUs. Given the track record of ISO language standards, we believe that this support will come.
Using do concurrent does currently come with a small number of limitations, namely the lack of support for atomics, device selection, asynchrony, or optimizing data movement. As we have shown, however, each of these limitations can be easily worked around using compiler directives. Far fewer directives are required thanks to the native parallel language features in Fortran.
This work was supported by the National Science Foundation, NASA, and the Air Force Office of Scientific Research. Computational resources were provided by the Computational Science Resource Center at San Diego State University.
Learn how RAPIDS cuDF accelerates data science with the help of GPUDirect Storage. Dive into the techniques that minimize the time to upload data to the GPU
If you work in data analytics, you know that data ingest is often the bottleneck of data preprocessing workflows. Getting data from storage and decoding it can often be one of the most time-consuming steps in the workflow because of the data volume and the complexity of commonly used formats. Optimizing data ingest can greatly reduce this bottleneck for data scientists working on large data sets.
RAPIDS cuDF greatly speeds up data decode by implementing CUDA-accelerated readers for prevalent formats in data science.
In addition, Magnum IO GPUDirect Storage (GDS) enables cuDF to speed up I/O by loading the data directly from storage into the device (GPU) memory. By providing a direct data path across the PCIe bus between the GPU and compatible storage (for example, Non-Volatile Memory Express (NVMe) drive), GDS can enable up to 3–4x higher cuDF read throughput, with an average of 30–50% higher throughput across a variety of data profiles.
In this post, we provide an overview of GPUDirect Storage and how it’s integrated into cuDF. We introduce the suite of benchmarks that we used to evaluate I/O performance. Then, we walk through the techniques that cuDF implements to optimize GDS reads. We conclude with the benchmark results and identify cases where use of GDS is the most beneficial for cuDF.
What is GPUDirect Storage?
GPUDirect Storage is a new technology that enables direct data transfer between local or remote storage (as a block device or through file systems interface) and GPU memory.
In other words, a direct memory access (DMA) engine can now quickly move data on a direct path from the storage to GPU memory, without increasing latency and burdening the CPU with extra copying through a bounce buffer.
Figure 1 shows the data flow patterns without and with GDS. When using the system memory bounce buffer, throughput is bottlenecked by the bandwidth from system memory to GPUs. On systems with multiple GPUs and NVMe drives, this bottleneck is even more pronounced.
However, with the ability to directly interface GPUs with storage devices, the intermediary CPU is taken out of the data path, and all CPU resources can be used for feature engineering or other preprocessing tasks.
The GDS cuFile library enables applications and frameworks to leverage GDS technology to increase bandwidth and achieve lower latency. cuFile is available as part of CUDA Toolkit since version 11.4.
How does RAPIDS cuDF use GPUDirect Storage?
To increase the end-to-end read throughput, cuDF uses the cuFile APIs in its data ingest interfaces, like read_parquet and read_orc. As cuDF performs nearly all parsing on the GPU, most of the data is not needed in system memory and can be directly transferred to GPU memory.
As a result, only the metadata portion of input files is accessed by the CPU, and the metadata constitutes only a small part of the total file size. This permits the cuDF data ingest API to make efficient use of GDS technology.
Since cuDF 22.02, we have enabled the use of GDS by default. If you have cuFile installed and request to read data from a supported storage location, cuDF automatically performs direct reads through the cuFile API. If GDS is not applicable to the input file, or if the user disables GDS, data follows the bounce buffer path and is copied through a paged system memory buffer.
Benchmarking data ingest in cuDF
In response to the diversity of data science datasets, we assembled a benchmark suite that targets key data and file properties. Previous examples of I/O benchmarking used data samples such as the New York Taxi trip record, the Yelp reviews dataset, or Zillow housing data to represent general use cases. However, we discovered that benchmarking specific combinations of data and file properties have been crucial for performance analysis and troubleshooting.
Covered combinations of data and file properties
Table 1 shows the parameters that we varied in our benchmarks for binary formats. We generated pseudo-random data across the range of supported data types and collected benchmarks for each group of related types. We also independently varied the run-length and cardinality of the data to exercise run-length encoding and dictionary encoding of the target formats.
Finally, we varied the compression type in the generated files, for all values of the properties preceding. We currently support two options: Use Snappy or keep the data uncompressed.
Table 1. Details on the data properties studied in the cuDF Parquet and ORC benchmark suite
With all of the properties that change independently, we have a large number of varied cases: 48 cases for each file format (ORC and Parquet). To ensure fair comparison between cases, we targeted a fixed in-memory dataframe size and column count, with a default size of 512 MiB. For the purpose of this post, we introduced an additional parameter to control the target in-memory dataframe size from 64 to 4,096 MiB.
The full cuDF benchmark suite is available in the open-source cuDF repository. To focus on the impact of I/O, this post only includes the inputs benchmarks in cuDF rather than reader options or row/column selection benchmarks.
Optimizing cuDF+GDS file read throughput
Data in formats like ORC and Parquet are stored in independent chunks, which we read in separate operations. We found that the read bandwidth is not saturated when these operations are performed in a single thread, leading to suboptimal performance regardless of GDS use.
Issuing multiple GDS read calls in parallel enables overlapping of multiple storage-to-GPU copies, raising the throughput and potentially saturating the read bandwidth. As of version 1.2.1.4, cuFile calls are synchronous, so parallelism requires multithreading from downstream users.
Controlling the level of parallelism
As we did not control the file layout and number of data chunks, creating a separate thread for each read operation could have generated an excessive number of threads, leading to a performance overhead. On the other hand, if we only had a few large read calls, we might not have had enough threads to effectively saturate the read bandwidth of the storage hardware.
To control the level of parallelism, we used a thread pool and sliced larger read calls into smaller reads that could be performed in parallel.
When we begin reading a file using GDS, we create a data ingest pipeline based on the thread-pool work of Barak Shoshany from Brock University. As Figure 2 shows, we split each file read operation (read_async call) in the cuDF reader into cuFile calls of fixed size, except for, in most cases, the last slice.
Instead of directly calling cuFile to read the data, a task is created for each slice and placed in a queue. The read_async call returns a task that waits for all of the sliced tasks to be completed. Callers can perform other work and wait on the aggregate task after the requested data is required to continue. When available, threads from the pool dequeue the tasks and make the synchronous cuFile calls.
Upon completion, a thread sets the task as completed and becomes available to execute the next task in the queue. As most tasks read the same amount of data, cuFile reads are effectively load-balanced between the threads. After all tasks from a given read_async call are completed, the aggregate task is complete and the caller code can continue, as the requested data range has been uploaded to the GPU memory.
Optimal level of parallelism
The size of the thread pool and the read slices greatly impacts the number of GDS read calls executed at the same time. To maximize throughput, we experimented with a range of different values for these parameters and identified the configuration that gives the highest throughput.
Figure 2 shows a performance optimization study on cuDF+GDS data ingest with tunable parameters of thread count and slice size. For this study, we defined the end-to-end read throughput as the binary format file size divided by the total ingest time (I/O plus parsing and decode).
The throughput shown in Figure 2 is averaged over the entire benchmark suite and based on a dataframe size of 512 MiB. We found that Parquet and ORC binary formats share the same optimization behavior across the parameter range, also shown in Figure 2. Small slice sizes cause higher overhead due to the large number of read tasks, and large slice sizes lead to poor utilization of the thread pool.
In addition, lower thread counts do not provide enough parallelism to saturate the NVMe read bandwidth, and higher thread counts lead to thread overhead. We discovered that the highest end-to-end read throughput was for 8-32 threads and 1-16 MiB slice size.
Depending on the use case, you may find optimal performance with different values. This is why, as of 22.04, the default thread pool size of 16 and slice size of 4 MiB are configurable through environment variables, as described in GPUDirect Storage Integration.
Impact of GDS on cuDF data ingest
We evaluated the effect that GDS use had on the performance of a few cuDF data ingest APIs using the benchmark suite described earlier. To isolate the impact of GDS, we ran the same set of benchmarks with default configuration (uses GDS) and with GDS manually disabled through an environment variable.
For accurate I/O benchmarking with cuDF benchmarks, file system caches must be cleared before each new read operation. Benchmarks measure the total time to execute a read_orc or read_parquet call (read time) and these are the values we used for comparison. GDS speedup was computed for each benchmark case file as the read time through the bounce buffer divided by the total ingest time through the direct data path.
For the GDS performance comparison, we used a NVIDIA DGX-2 with V100 GPUs and NVMe SSDs connected in a RAID 0 configuration. All benchmarks use a single GPU and read from a single RAID 0 of two 3.84-TB NVMe SSDs.
Larger files, higher speedups
Figure 3 shows the GDS speedup for a variety of files with in-memory dataframe size of 4,096 MiB. We observed that the performance improvement varies widely, between modest 10% and impressive 270% increase over the host path.
Some of the variance is due to differences in the data decode. For example, list types show smaller speedups due to the higher parsing complexity, and decimal types in ORC show smaller speedups due to overhead in fixed point decoding compared to Parquet.
However, the decode process cannot account for the majority of variance in the results. With high correlation between the result for ORC and Parquet, you can reasonably suspect that properties like run-length and cardinality play a major role. Both properties affect how well data can be compressed/encoded, so here’s how the file size relates to these properties and, by proxy, to the GDS speedup.
Figure 4 shows the extent in which intrinsic properties of the data impact the binary file size needed to store a fixed in-memory data size. For 4,096 MiB of in-memory data, file sizes range from 0.1 to 5.8 GB across the benchmarks cases without Snappy compression. As expected, the largest file sizes generally correspond to shorter run-length and higher cardinality because data with this profile is difficult to efficiently encode.
Figures 3 and 4 show strong correlation between file size and the magnitude of GDS performance benefit. We hypothesize that this is because, when reading an efficiently encoded file, I/O takes less time compared to decode so there’s little room for improvement from GDS. On the other hand, the results show that reading of files that encode poorly is bottlenecked on the I/O.
Larger data, higher speedups
We applied the full benchmark suite described in Table 1 to inputs with in-memory sizes in the range of 64 to 4,096 MiB. Taking the mean speedup over the benchmark suite, we discovered that speedup from GDS generally increases with the data size. We measure consistent 30-50% speedup for 512 MiB and greater in-memory data sizes. For the largest datasets, we found that the Parquet reader benefits more from GDS than the ORC reader.
Summary
GPUDirect Storage provides a direct data path to the GPU, reducing latency and increasing throughput for I/O operations. RAPIDS cuDF leverages GDS in its data ingest APIs to unlock the full read bandwidth of your storage hardware.
GDS enables you to speed up cuDF data ingest workloads up to 3x, significantly improving the end-to-end performance of your workflows.
Apply your knowledge
If you have not tried out cuDF for your data processing workloads, we encourage you to test our latest 22.04 release. We provide Docker containers for our releases as well as our nightly builds. Conda packages are also available to make testing and deployment easier. If you want to get started with RAPIDS cuDF, you can do so.
If you are already using cuDF, we encourage you to give GPUDirect Storage a try. It is easier than ever to take advantage of high capacity and high-performance NVMe drives. If you already have the hardware, you are a few configuration steps away from unlocking peak performance in data ingest. Stay tuned for upcoming releases of cuDF.
For more information about storage I/O acceleration, see the following resources:
SANTA CLARA, Calif., May 27, 2022 — NVIDIA today announced that it recently became aware of an unsolicited “mini-tender” offer by Tutanota LLC to purchase up to 215,000 …