We present lessons learned from refactoring a Fortran application to use modern do concurrent loops in place of OpenACC for GPU acceleration.
This is the fourth post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing:
- Developing Accelerated Code with Standard Language Parallelism
- Multi-GPU Programming with Standard Parallel C++, Part 1
- Multi-GPU Programming with Standard Parallel C++, Part 2
Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, do concurrent loops and array math intrinsics in Fortran.
Using standard language features has many advantages to offer, the chief advantage being future-proofing. Because Fortran’s do concurrent is a standard language feature, the chances of support being lost in the future are slim.
This feature is also relatively simple to use in initial code development, and it adds portability and parallelism. Using do concurrent for initial code development has the mental advantage of encouraging you to think about parallelism from the start as you write and implement loops.
For initial code development, do concurrent is a great way to add GPU support without having to learn directives (see the short sketch after the following list). However, even code that has already been GPU-accelerated through directives like OpenACC and OpenMP can benefit from refactoring to standard parallelism for the following reasons:
- Cleaning up the code for developers who do not know directives, and removing the large number of directives that clutter the source code.
- Increasing the portability of the code in terms of vendor support and longevity of support.
- Future-proofing the code, as ISO-standard languages have a proven track record for stability and portability.
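Before diving into the refactor, here is a minimal sketch of what such a standard-parallel loop looks like (our own toy example, not taken from POT3D). With nvfortran it can target a GPU with -stdpar=gpu or a multicore CPU with -stdpar=multicore:
program scale_vector
  implicit none
  integer, parameter :: n = 1000000
  integer :: i
  real, allocatable :: a(:)
  allocate(a(n))
  a = 1.0
  ! Standard Fortran parallel loop: no directives required.
  do concurrent (i=1:n)
    a(i) = 2.0*a(i)
  enddo
  print *, 'a(1) =', a(1), ' a(n) =', a(n)
  deallocate(a)
end program scale_vector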
Replacing directives on multi-core CPUs and GPUs
POT3D is a Fortran code that computes potential field solutions to approximate the solar coronal magnetic field using surface field observations as input. It continues to be used for numerous studies of coronal structure and dynamics.
The code is highly parallelized using MPI and is GPU-accelerated using MPI with OpenACC. It is open source and available on GitHub. It is also a part of the SpecHPC 2021 Benchmark Suites.
We recently refactored another code example with do concurrent and presented the results at WACCPD 2021. The results showed that you could replace directives with do concurrent without losing performance on multicore CPUs and GPUs. However, that code was somewhat simple in that it did not use MPI.
Now, we want to explore replacing directives in more complicated code. POT3D contains nontrivial features for standard Fortran parallelism to handle: reductions, atomics, CUDA-aware MPI, and local stack arrays. We want to see whether do concurrent can replace directives and retain the same performance.
To establish a performance baseline for refactoring the code to do concurrent, first review the initial timings of the original code in Figure 1. The CPU result was run on 64 MPI ranks (32 per socket) on a dual-socket AMD EPYC 7742 server, while the GPU result was run on one MPI rank on an NVIDIA A100 (40 GB). The GPU code relies on data movement directives for data transfers (we do not use managed memory here) and is compiled with -acc=gpu -gpu=cc80,cuda11.5. The runtimes are an average over four runs.
The following table shows the number of lines of code and directives for the current version of the code. You can see that there are 80 directives, but we hope to reduce this number by refactoring with do concurrent.
| | POT3D (Original) |
| --- | --- |
| Fortran | 3,487 |
| Comments | 3,452 |
| OpenACC Directives | 80 |
| Total | 7,019 |
Do concurrent and OpenACC
Here are some examples from POT3D comparing do concurrent to OpenACC, starting with a triple-nested loop parallelized with OpenACC:
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
…
!$acc parallel loop default(present) collapse(3) async(1)
do k=1,np
do j=1,nt
do i=1,nrm1
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo
enddo
enddo
…
!$acc wait
!$acc exit data delete(phi,dr_i,br)
As mentioned earlier, this OpenACC code is compiled with the flags -acc=gpu -gpu=cc80,cuda11.5 to run on an NVIDIA GPU.
You can parallelize this same loop with do concurrent and rely on NVIDIA CUDA Unified Memory for data movement instead of directives. This results in the following code:
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo
As you can see, the loop has been condensed from 12 lines to three, and CPU portability and GPU parallelism are retained with the nvfortran compiler from the NVIDIA HPC SDK.
This reduction in the number of lines is thanks to collapsing multiple loops into one loop and relying on managed memory, which removes all data movement directives. Compile this code for the GPU with -stdpar=gpu -gpu=cc80,cuda11.5.
For nvfortran, activating standard parallelism (-stdpar=gpu) automatically activates managed memory. To use OpenACC directives to control data movement along with do concurrent, use the following flags: -acc=gpu -gpu=nomanaged.
The nvfortran implementation of do concurrent also allows the locality of variables to be defined:
do concurrent (i=1:N, j(i)>0) local(M) shared(J,K)
  M = mod(K(i), J(i))
  K(i) = K(i) - M
enddo
This may be necessary for some code. For POT3D, the default locality of variables behaves as needed; with nvfortran, the default locality is the same as OpenACC.
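As a concrete (hypothetical) illustration of those defaults, consider a variant of the earlier loop that uses a scalar temporary. Under nvfortran’s OpenACC-like defaults, the scalar is treated as private to each iteration, so no explicit local() clause is needed:
do concurrent (k=1:np, j=1:nt, i=1:nrm1)
  ! dphi is a scalar temporary; the compiler privatizes it per iteration.
  dphi = phi(i+1,j,k) - phi(i,j,k)
  br(i,j,k) = dphi*dr_i(i)
enddo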
do concurrent CPU performance and GPU implementation
Replacing all the OpenACC loops with do concurrent and relying on managed memory for data movement leads to code with zero directives and fewer lines. We removed 80 directives and 66 lines of Fortran.
Figure 2 shows that this do concurrent version of the code has nearly the same performance on the CPU as the original GitHub code. This means that you haven’t broken CPU compatibility by using do concurrent. Instead, multicore parallelism has also been added, which can be used by compiling with the flag -stdpar=multicore.
Unlike on the CPU, running POT3D on a GPU requires adding a couple of directives.
First, to take advantage of multiple GPUs with MPI, you need a directive to specify the GPU device number. Otherwise, all MPI ranks would use the same GPU.
!$acc set device_num(mpi_shared_rank_num)
In this example, mpi_shared_rank_num is the MPI rank within a node. It’s assumed that the code is launched such that the number of MPI ranks per node is the same as the number of GPUs per node. This can also be accomplished by setting CUDA_VISIBLE_DEVICES for each MPI rank, but we prefer doing this programmatically.
When using managed memory with multiple GPUs, make sure that the device selection (such as !$acc set device_num(N)) is done before any data is allocated. Otherwise, an additional CUDA context is created, introducing additional overhead.
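As an illustration, the following sketch shows one common way to derive the per-node rank with MPI_Comm_split_type and use it as the device number before any device data is allocated. The program structure and variable names here are ours, not POT3D’s:
program select_device
  use mpi
  implicit none
  integer :: ierr, node_comm, mpi_shared_rank_num
  real, allocatable :: phi(:,:,:)
  call MPI_Init(ierr)
  ! Split MPI_COMM_WORLD into per-node (shared-memory) communicators and take
  ! the rank within the node as the GPU device number.
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm, ierr)
  call MPI_Comm_rank(node_comm, mpi_shared_rank_num, ierr)
  ! Select the GPU before any allocations so no extra CUDA context is created.
  !$acc set device_num(mpi_shared_rank_num)
  allocate(phi(64,64,64))   ! allocations come after device selection
  ! ... do concurrent compute loops ...
  deallocate(phi)
  call MPI_Finalize(ierr)
end program select_device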
Currently, the nvfortran compiler does not support array reductions in do concurrent loops, which are required in two places in the code. Fortunately, an OpenACC atomic directive can be used in place of an array reduction:
do concurrent (k=2:npm1,i=1:nr)
!$acc atomic
sum0(i)=sum0(i)+x(i,2,k)*dph(k)*pl_i
enddo
After adding this directive, change the compiler options to enable OpenACC explicitly: -stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5. With this approach, only three OpenACC directives are needed, which is the closest this code can come to being directive-free at this time.
All the data movement directives are unnecessary, since CUDA managed memory is used for all data structures. Table 2 shows the number of directives and lines of code needed for this version of POT3D.
| | POT3D (Original) | POT3D (Do Concurrent) | Difference |
| --- | --- | --- | --- |
| Fortran | 3,487 | 3,421 | (-66) |
| Comments | 3,452 | 3,448 | (-4) |
| OpenACC Directives | 80 | 3 | (-77) |
| Total | 7,019 | 6,872 | (-147) |
For the reduction loops in POT3D, you relied on implicit reductions, but that may not always work. Recently, nvfortran has added the upcoming Fortran 202X reduce clause, which can be used on reduction loops as follows:
do concurrent (i=1:N) reduce(+:cgdot)
cgdot=cgdot+x(i)*y(i)
enddo
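For comparison, the same dot-product reduction written with directives would use an OpenACC reduction clause. A brief sketch (not taken from POT3D) looks like the following:
!$acc parallel loop reduction(+:cgdot)
do i=1,N
  cgdot=cgdot+x(i)*y(i)
enddo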
GPU performance, unified memory, and data movement
You’ve developed code with the minimal number of OpenACC directives and do concurrent loops that rely on managed memory for data movement. This is the closest to directive-free code that you can have at this time.
Figure 3 shows that this code version takes a small performance hit of ~10% when compared to the original OpenACC GPU code. The cause of this could be do concurrent, managed memory, or a combination of the two.
To see if managed memory causes the small performance hit, compile the original GitHub code with managed memory turned on. This is done by using the compile flag -gpu=managed in addition to the standard OpenACC GPU flags used before.
Figure 4 shows that the GitHub code now performs similarly to the minimal-directives code with managed memory. This means that the culprit for the small performance loss is unified memory.
To regain the performance of the original code with the minimal-directives code, you must add the data movement directives back in. This mixture of do concurrent and data movement directives would look like the following code example:
!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k ))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)
This results in the code having 41 directives, with 38 being responsible for data movement. To compile the code and rely on the data movement directives, use the following flags:
-stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5,nomanaged
nomanaged turns off managed memory and -acc=gpu enables recognition of the OpenACC directives.
Figure 5 shows nearly the same performance as the original GitHub code. This version has about 50% fewer directives than the original (41 versus 80) and gives the same performance!
MPI + DO CONCURRENT scaling
Figure 7 shows the timing results using multiple GPUs. The primary takeaway is that do concurrent works with MPI over multiple GPUs.
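To make the MPI + do concurrent combination concrete, here is a minimal, self-contained sketch (our own toy example, not POT3D code) in which each rank computes a partial dot product with a do concurrent reduce loop and the ranks combine their results with MPI_Allreduce. It assumes a recent nvfortran with reduce support and an MPI wrapper around the compiler:
program mpi_do_concurrent
  use mpi
  implicit none
  integer, parameter :: n = 1000000
  integer :: ierr, rank, i
  real(8), allocatable :: x(:), y(:)
  real(8) :: cgdot, cgdot_global
  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  allocate(x(n), y(n))
  x = 1.0d0
  y = 2.0d0
  ! Local partial dot product, parallelized with standard Fortran.
  cgdot = 0.0d0
  do concurrent (i=1:n) reduce(+:cgdot)
    cgdot = cgdot + x(i)*y(i)
  enddo
  ! Combine the per-rank results across all ranks/GPUs with MPI.
  call MPI_Allreduce(cgdot, cgdot_global, 1, MPI_DOUBLE_PRECISION, &
                     MPI_SUM, MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'global dot product =', cgdot_global
  deallocate(x, y)
  call MPI_Finalize(ierr)
end program mpi_do_concurrent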
Looking at the codes with managed memory turned on (blue lines), you can see that the original code and the minimal-directives code give nearly the same performance as more GPUs are used.
Looking at the codes with managed memory turned off (green lines), you can again see the same scaling between the original GitHub code and the do concurrent version of the code. This indicates that do concurrent works with MPI and has no impact on the scaling you should see.
You might also notice that managed memory causes an overhead as the number of GPUs is scaled. The managed-memory runs (blue lines) and the data-directive runs (green lines) are parallel to each other, meaning the overhead scales with the number of GPUs.
Fortran standard parallel programming review
You may be wondering, “Standard Fortran sounds too good to be true, what is the catch?”
Fortran standard parallel programming enables cleaner-looking code and makes your code more future-proof by relying on ISO language standards. Using the latest nvfortran compiler, you gain all the benefits mentioned earlier.
Although you lose the current GCC OpenACC/OpenMP GPU support when you transition to do concurrent, we expect more GPU support in the future as other vendors add support for do concurrent on GPUs. Given the track record of ISO language standards, we believe that this support will come.
Using do concurrent currently comes with a small number of limitations, namely the lack of support for atomics, device selection, asynchrony, and optimized data movement. As we have shown, however, each of these limitations can easily be worked around using compiler directives, and far fewer directives are required thanks to the native parallel language features in Fortran.
Ready to get started? Download the free NVIDIA HPC SDK, and start testing! If you are also interested in more details on our findings, see the From Directives to DO CONCURRENT: A Case Study in Standard Parallelism GTC session. For more information about standard language parallelism, see Developing Accelerated Code with Standard Language Parallelism.
Acknowledgments
This work was supported by the National Science Foundation, NASA, and the Air Force Office of Scientific Research. Computational resources were provided by the Computational Science Resource Center at San Diego State University.