
Multi-GPU Programming with Standard Parallel C++, Part 2

By developing applications using MPI and standard C++ language features, it is possible to program for GPUs without sacrificing portability or performance.

This is the third post in the Standard Parallel Programming series, about the advantages of using parallelism in standard languages for accelerated computing.

In part 1, we explained:

  • The basics of C++ parallel programming
  • The lattice Boltzmann method (LBM)
  • The first steps taken toward refactoring the Palabos library to run efficiently on GPUs using standard C++

In this post, we continue by optimizing the performance of the ISO C++ algorithms and then use MPI to scale the application across multiple GPUs.

Strive for optimal performance

It may seem natural to expect the performance of a CPU-to-GPU port to fall below that of a dedicated HPC code. After all, you are limited by the constraints of the software architecture, the established API, and the need to account for sophisticated extra features expected by the user base. On top of that, the simple programming model of C++ standard parallelism allows for less manual fine-tuning than a dedicated language like CUDA.

In reality, it is often possible to control and limit this loss of performance to the point where it becomes negligible. The key is to analyze the performance metrics of individual code portions and to eliminate performance bottlenecks that do not reflect an actual need of the software framework.

A good practice is to maintain a separate proof-of-principle code for the core components of your numerical algorithm. Its performance can be optimized more freely and compared with that of the full, complex software framework (like the STLBM library in the case of Palabos). Additionally, a GPU-capable profiler like nvprof can efficiently highlight the origin of performance bottlenecks.

Typical performance issues and their solutions are highlighted in the following recommendations:

  • Do not touch data on the CPU
  • Know your algorithms
  • Establish a performance model

Do not touch data on the CPU

A frequent source of performance loss is hidden data transfers between CPU and GPU memory, which can be exceedingly slow. With the CUDA unified memory model, this type of transfer occurs whenever you access GPU data from the CPU. Touching a single byte of data can result in a catastrophic performance loss because entire memory pages are transferred at a time.

The obvious solution is to manipulate your data exclusively on the GPU whenever possible. This requires searching your code carefully for all accesses to the data and then wrapping them into parallel algorithm calls. Although this process is somewhat tedious, it is needed for even the simplest operations.

Obvious places to look are post-processing operations and intermediate evaluations of data statistics. Another classical performance bottleneck is found in the MPI communication layer, because you must remember to carry out data packing and unpacking operations on the GPU.
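An intermediate statistic, for instance, can be kept on the GPU by phrasing it as a transform_reduce call. The following is only a minimal sketch, not Palabos code; the function, the array name, and the quantity being averaged are made up for illustration:

#include <execution>
#include <numeric>

// Hypothetical example: average the squared velocity norm entirely on the GPU,
// so that the unified-memory pages holding the data never migrate to the CPU.
double meanSquaredVelocity(const double* velocityNorm, long numberOfCells)
{
    double sum = std::transform_reduce(
        std::execution::par_unseq,
        velocityNorm, velocityNorm + numberOfCells,
        0.0,                                  // initial value of the reduction
        std::plus<double>{},                  // how partial sums are combined
        [](double u) { return u * u; });      // transform: square each value
    return sum / numberOfCells;
}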

Expressing an algorithm on the GPU is easier said than done, as the formalism of for_each and transform_reduce is mostly suitable for regular, evenly structured memory accesses.

In the case of irregular data structures, it would be painful to avoid race conditions and guarantee coalesced memory accesses with these two algorithms alone. In such a case, you should follow the next recommendation and familiarize yourself with the family of parallel algorithms provided in the C++ STL.

Know your algorithms

Up to this point, the parallel STL appears as little more than a way to express parallel for loops with a fancy functional syntax. In reality, the STL offers a large set of algorithms beyond for_each and transform_reduce that are useful for expressing your numerical method, including sorting and searching algorithms.

The exclusive_scan algorithm computes cumulative sums and deserves particular mention, as it proves generally useful for reindexing operations of unstructured data. For example, consider a packing algorithm for MPI communication, in which the number of variables contributed to the communication buffer by every grid node is unknown in advance. In this case, global communication among threads is required to determine the index at which every grid node writes into the buffer.

The following code example shows how to solve this type of problem with good parallel efficiency on GPU using parallel algorithms:

// Step 1: Compute the number of variables contributed by every node.
// (Requires <algorithm>, <execution>, and <numeric>; "using namespace std" is assumed.)
int* numValuesPtr = allocateMemory(numberOfCells);
for_each(execution::par_unseq, numValuesPtr,
         numValuesPtr + numberOfCells, [=](int& numValues)
{
    int i = &numValues - numValuesPtr;
    // Compute the number of variables contributed by the current node.
    numValues = computeNumValues(i);
} );
// Step 2: Compute the buffer index for every node (cumulative sum of the counts).
int* indexPtr = allocateMemory(numberOfCells);
exclusive_scan(execution::par_unseq, numValuesPtr,
         numValuesPtr + numberOfCells, indexPtr, 0);
// Step 3: Pack the data into the buffer.
for_each(execution::par_unseq, indexPtr,
         indexPtr + numberOfCells, [=](int& index)
{
    int i = &index - indexPtr;
    packCellData(i, index);
} );

This example lets you enjoy the expressive power of the algorithm-based approach to GPU programming: the code requires no synchronization directives or any other low-level constructs.

Establish a performance model

A performance model establishes an upper bound for the performance of your algorithm through a bottleneck analysis. This typically considers the peak processor performance (measured in FLOPS) and the peak memory bandwidth as the principal limiting hardware characteristics.

As discussed in the Example: Lattice Boltzmann software Palabos section in the previous post, LBM codes have a low ratio of computations to memory accesses and are entirely memory-bound on modern GPUs, at least as long as you use single-precision arithmetic or a GPU that is optimized for double-precision arithmetic.

The peak performance is then simply expressed as the ratio between the memory bandwidth of the GPU and the number of bytes accessed per grid node in the code. As a direct consequence, switching an LBM code from double- to single-precision arithmetic doubles the performance.
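As a rough sanity check of the numbers quoted below, a D3Q19 update reads and writes 19 populations per grid node, so a node update moves about 2 × 19 × 8 bytes in double precision and 2 × 19 × 4 bytes in single precision. The following back-of-the-envelope sketch rests on two simplifying assumptions: an effective A100 memory bandwidth of roughly 1.5 TB/s and populations being the only memory traffic.

#include <cstdio>

int main()
{
    const double bandwidth = 1.5e12;    // assumed effective bandwidth [bytes/s]
    const int populations = 19;         // D3Q19: populations per grid node
    for (int bytesPerValue : {8, 4}) {  // double precision, then single precision
        double bytesPerNode = 2.0 * populations * bytesPerValue;  // one read + one write
        double peakMLUPS = bandwidth / bytesPerNode / 1.0e6;
        std::printf("%d-byte values: ~%.0f MLUPS peak\n", bytesPerValue, peakMLUPS);
    }
    return 0;
}

This lands at roughly 4900 MLUPS in double precision and 9900 MLUPS in single precision, in the same range as the theoretical peak values quoted in Figure 1.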

Figure 1 shows the performance of the GPU port of Palabos obtained on an NVIDIA A100 (40 GB) GPU for single- and double-precision floats.

Bar chart of Palabos performance as MLUPS and as a percentage of the theoretical peak: in double precision, Palabos reaches 3610 MLUPS, 73% of the 4921 MLUPS theoretical peak; in single precision, it reaches 7050 MLUPS, 74% of the 9482 MLUPS theoretical peak.
Figure 1. Palabos performance for the 3D lid-driven cavity (600³ grid nodes) on an A100 (40 GB) GPU in single and double precision. Model: TRT, D3Q19

The executed test case, a flow in a lid-driven cavity in a turbulent regime, has a simple cubic geometry. However, this case includes boundary conditions and exhibits a complex flow pattern. The performance is measured in million lattice-node updates per second (MLUPS, more is better) and compared against a theoretical peak value obtained under the assumption that GPU memory is exploited at peak capacity.

The code reaches 73% of peak performance in double-precision and 74% in single-precision. Such performance metrics are common in state-of-the-art implementations of LB models, independently of the language or library used.

Although some implementations may gain a few percentage points and reach a value closer to 80%, it is clear that we are closing in on the hard limit implied by the performance model. From a big-picture standpoint, the single-GPU performance of the code is as good as it gets.

Reuse the existing MPI backend to get a multi-GPU code

Because C++ parallel algorithms integrate seamlessly into an existing software project to accelerate critical code portions, nothing prevents you from reusing the project’s communication backend to reach multi-GPU performance. However, you will want to keep an eye on the communication buffer and make sure that it does not take any detours through CPU memory, which would result in costly page faults.

Our first attempt to run the GPU-ported version of Palabos on multiple GPUs produced technically correct results but did not exhibit acceptable performance. Instead of a speedup, the switch from one to two GPUs delivered a slowdown of an order of magnitude. The issue could be traced to the packing and unpacking of the communicated data, which in the original backend was carried out on the CPU, and to other instances of unnecessary data access in CPU memory, such as the resizing of the communication buffer.

Such issues can be spotted with the help of the profiler, which highlights all occurrences of page faults in unified memory, and they are fixed by moving the corresponding code portions into a parallel algorithm. The Know your algorithms section explained how to pack and unpack the communication buffer when the data follows an irregular pattern.

At this point, using standard C++ without any extensions other than MPI, you can get a hybrid CPU/GPU software project with state-of-the-art performance on a single GPU and solid parallel performance on multiple GPUs.

Unfortunately, the multi-GPU performance remains below expectations due to current limitations of the language specification and the corresponding GPU implementations. Pending future improvements to the fairly young technology of C++ standard parallelism, we provide some workarounds in this post based on techniques outside the C++ standard.

Coordinate the multi-CPU and multi-GPU code execution

While this post focuses on hybrid CPU and GPU programming, we can’t avoid addressing the issue of hybrid parallelism (MPI or multithreading) in the CPU parts at some point.

The original version of Palabos, for example, is non-hybrid and uses the MPI communication layer to distribute work among the cores of a CPU as well as across the network. After porting to GPU, the resulting multi-CPU and multi-GPU code spontaneously groups a single CPU core with a full GPU in every MPI task, leaving the CPU relatively underpowered.

This leads to a performance bottleneck whenever it is necessary or convenient to keep a computationally intensive task on the CPU. In fluid dynamics, this is often the case in the preprocessing stage, such as in geometry processing or mesh generation.

The obvious solution involves the use of multithreading to access multiple CPU cores from within an MPI task. The shared memory space of these threads can then be directly shared with the GPU through the CUDA unified memory formalism.

However, C++ parallel algorithms cannot be reused to serve both purposes of GPU and multi-core CPU execution. This is because C++ does not allow choosing the target platform of parallel algorithms from within the language.

While C++ threads do provide a way to solve this problem natively, we found that OpenMP offered the most convenient and least intrusive solution. An OpenMP annotation of a for loop was sufficient in this case to distribute the grid portions assigned to the current MPI task over multiple threads.
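As a minimal sketch of what such an annotation looks like (the function, types, and names below are made up for illustration and are not Palabos code), the pre-processing loop over the grid portions owned by the current MPI task can be written as follows and compiled with OpenMP enabled (the -mp flag for nvc++):

#include <vector>

// Hypothetical block type and CPU-side worker; the real work would be
// geometry processing or mesh generation.
struct GridBlock { /* ... */ };
void processBlockOnCpu(GridBlock& block);

void preprocessLocalBlocks(std::vector<GridBlock>& localBlocks)
{
    // Distribute the blocks assigned to this MPI task over the CPU cores
    // available to it; each OpenMP thread works on a different block.
    #pragma omp parallel for schedule(dynamic)
    for (int blockId = 0; blockId < (int) localBlocks.size(); ++blockId) {
        processBlockOnCpu(localBlocks[blockId]);
    }
}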

Communicate through pinned memory

With current versions of the HPC SDK, the CUDA unified memory model exhibits another performance issue in combination with MPI.

As the MPI communication layer expects data with a fixed hardware address (so-called pinned memory), any buffer that resides in the managed memory area is first implicitly copied into a pinned memory buffer on the host CPU. Due to the transfers between the GPU and CPU, this operation can end up being rather costly.

Communication buffers should therefore be explicitly pinned to a GPU memory address. With the nvc++ compiler, this is achieved by allocating the communication buffer with cudaMalloc:

// Allocate the communication buffer with cudaMalloc instead of a host-side vector:
// std::vector<double> buffer(N);
// double* bufferPtr = buffer.data();
double* buffer;
cudaMalloc((void**)&buffer, N * sizeof(double));
for_each(execution::par_unseq, buffer, buffer + N, … // Proceed with data packing

Another solution is to replace the STL vector with a thrust::device_vector from the Thrust library, which uses pinned GPU memory by default.
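A minimal sketch of that alternative is shown below, where N denotes the buffer size as in the previous snippet, and a CUDA-aware MPI installation is assumed so that the device pointer can be handed to MPI directly:

#include <thrust/device_vector.h>

// The communication buffer lives in GPU memory owned by Thrust, so it has a
// fixed device address and never detours through pageable host memory.
thrust::device_vector<double> buffer(N);
double* bufferPtr = thrust::raw_pointer_cast(buffer.data());
// Proceed with data packing into bufferPtr, then hand bufferPtr to MPI.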

In the near future, the HPC SDK is expected to handle these cases more efficiently and automatically, so that users no longer have to reach for cudaMalloc or thrust::device_vector. So, stay tuned!

After the various improvements listed in this post, the Palabos library was tested on a DGX A100 (40-GB) workstation with four GPUs, again for the benchmark case of a lid-driven cavity. The obtained performance is shown in Figure 2 and compared to the performance achieved on a 48-core Xeon Gold 6240R CPU:

Bar chart of Palabos performance on CPU and GPU: 421 MLUPS on the Xeon Gold CPU and 7050 MLUPS on a single A100 GPU. On the four-GPU DGX Station (ideal parallel scaling: 28200 MLUPS), Palabos achieves 17270 MLUPS without pinned memory (61% strong-scaling efficiency) and 23030 MLUPS with pinned memory (82% strong-scaling efficiency).
Figure 2. Palabos performance for the 3D lid-driven cavity (600³ grid nodes) on a 48-core Xeon Gold 6240R CPU and on a DGX A100 (40 GB) workstation, once using a single GPU and once with all four GPUs. Model: TRT, D3Q19, single precision

For the Xeon Gold, the original implementation of Palabos proved to be more efficient and was used with 48 MPI tasks, while the single-GPU and four-GPU executions used the parallel algorithms backend, compiled with nvc++.

The performance figures show a 3.27-fold speedup of the 4-GPU execution compared to the single-GPU one. This amounts to a quite satisfactory parallel efficiency of 82% in a strong scaling regime, with equal total domain size in both executions. In weak scaling, using a 4x larger problem size for the four-GPU execution, the speedup increases to 3.72 (93% efficiency).

Figure 2 also shows that when an unpinned communication buffer is used (that is, when the MPI communication buffer is not allocated with cudaMalloc), the parallel efficiency drops from 82% to 61%.

In the end, the four-GPU DGX workstation runs 55x faster than the Xeon Gold CPU. While the direct comparison may not be fair due to the different scope of the two machines, it provides a sense of the acceleration obtained by porting a code to GPU. The DGX is a desktop workstation connected to a common power plug, yet it delivers a performance that, on a CPU cluster, could only be obtained with thousands of CPU cores.

Conclusion

You’ve seen that C++ standard language parallelism can be used to port a library like Palabos to GPU with an astounding increase in the code’s performance.

  • For end users of the Palabos library, this performance gain is obtained with a single-line change to switch from the CPU to the GPU backend.
  • For Palabos library developers, some work was required to develop the corresponding GPU backend.

However, this work didn’t require learning a new domain-specific language nor did it depend on a detailed knowledge of the GPU architecture. 

This two-part post has provided you with guidelines that you can apply to achieve similar results with your own code. For more information, we encourage you to check out the following resources:


I'm new to TensorFlow and I need some help, please

I have a dataset that I downloaded from Kaggle, and it's in 4 folders: trainA, trainB, testA, testB. I would like to put these image folders into a tf.data.Dataset called DATA, for example, and then call DATA[‘trainA’] to get the first folder. Thank you.

submitted by /u/Jaded-Historian7036


CUDA Toolkit Nsight Visual Studio Edition Failed

I’m trying to install the necessary components to get the GPU version of TensorFlow working. I am at the part where I am trying to install the CUDA Toolkit and keep getting “Nsight Visual Studio Edition failed to install” with no additional information as to why it failed. Any ideas why it may be failing or how I could check? I understand Visual Studio is needed to use the C++ compiler. I have multiple instances of Visual Studio since I’m on the computer I’ve used for programming for many years. Could that be part of the problem? Otherwise, how else would it know which version of Visual Studio to use? Also, some more background info: I did start by uninstalling everything from NVIDIA and deleting all NVIDIA folders as recommended. Any advice on how to proceed here is appreciated. Thanks.

submitted by /u/WubDubClub


Negative gradients when calculating GradCAM heatmap

I have a segmentation network model trained for 2 classes and am able to see accurate results. But when using Grad-CAM for the heatmap, I am able to see good results for the last convolution layer for both classes, but have issues when trying to generate a heatmap for the second-to-last convolution layer for one of the classes (the other class’s heatmap is working fine).

**Last 5 layers**

convolution_layer(filters:8, kernel:3*3)
convolution_transpose_layer(filters:2, kernel:2*2)
convolution_layer(filters:2, kernel:3*3)
convolution_layer(filters:10, kernel:1*1)
activation_layer(softmax)

The heatmap is empty because all the pooled gradients are negative (the mean is taken over all the negative gradients with respect to the conv layer), resulting in negative values in pooled_grads * convolution_output, on which ReLU is applied, giving all zeros.

What does it mean for GradCAM to be all negative?

Why is it that all channels in the convolution lead to a “negative” contribution to the true output class?

I am following this paper for heatmaps for segmentation models: https://arxiv.org/pdf/2002.11434.pdf

submitted by /u/Ash_real


custom accuracy metric

Hi, I have been trying to implement a custom masked accuracy metric which does not consider the pad tokens (similar to the masked loss shown here: https://www.tensorflow.org/text/tutorials/nmt_with_attention). I have been trying to create a new subclass of tf.keras.metrics.Metric, but I am very confused. Any leads would be really appreciated.

submitted by /u/Hour-Tie-7471


Neural Network getting 0 accuracy, and always predicting very high values

When I printed out the predictions and the actual values side by side:

[0.9785253]  6
[0.97852457] 5
[0.9785253]  6
[0.9785253]  5
[0.97848856] 6
[0.9785253]  7
[0.9785253]  5
[0.9785253]  7

that's what it looked like; the value in brackets is the prediction, the other is the actual value!

I have no clue why it is doing this, can you help please!!

submitted by /u/ME_I_AM_me


Gradients do not exist warning

I’ve tried to implement the YOLOv3 network with tf.keras, building it layer by layer. Then I take the outputs of layers 82, 94, and 106 and pass them (along with three training inputs holding ground-truth bounding boxes for every network stride) into a Lambda layer to evaluate the loss of the net. However, when I try to train the network, I receive the warning: “WARNING:tensorflow:Gradients do not exist for variables [‘Layer_Conv_81/kernel:0’, ‘Layer_Conv_91/kernel:0’, ‘Layer_Batch_81/gamma:0’, ‘Layer_Batch_81/beta:0’, ‘Layer_Batch_91/gamma:0’, ‘Layer_Batch_91/beta:0’, ‘Output_1/kernel:0’, ‘Output_2/kernel:0’] when minimizing the loss. If you’re using `model.compile()`, did you forget to provide a `loss` argument?”

I’ve checked the sequence of layers: there are no unconnected ones, and I do have a loss function. What else could go wrong?

Brief version of code here:

def MakeYoloMainStructure():
    inputImage = Input(shape=(IMAGE_SIDES[0], IMAGE_SIDES[1], 3), name='Main_Input')
    # Start placing layers
    layer1_1 = Conv2D(32, (3,3), strides=(1,1), use_bias=False, padding='same', name='Layer_Conv_1')(inputImage)
    layer1_2 = BatchNormalization(epsilon=eps, name='Layer_Batch_1')(layer1_1)
    layer1_3 = LeakyReLU(alpha=alp, name='Layer_Leaky_1')(layer1_2)
    # Start placing adding layers
    # Layer 1 - 64/1
    layer2_1 = ZeroPadding2D(((1,0),(1,0)), name='Layer_ZeroPad_2')(layer1_3)
    layer2_2 = Conv2D(64, (3,3), strides=(2,2), use_bias=False, padding='valid', name='Layer_Conv_2')(layer2_1)
    layer2_3 = BatchNormalization(epsilon=eps, name='Layer_Batch_2')(layer2_2)
    layer2_4 = LeakyReLU(alpha=alp, name='Layer_Leaky_2')(layer2_3)
    ...
    layer80_2 = BatchNormalization(epsilon=eps, name='Layer_Batch_80')(layer80_1)
    layer80_3 = LeakyReLU(alpha=alp, name='Layer_Leaky_80')(layer80_2)
    layer81_1 = Conv2D(1024, (3,3), strides=(1,1), use_bias=False, padding='same', name='Layer_Conv_81')(layer80_3)  # From this layer we make fork for first output (!)
    layer81_2 = BatchNormalization(epsilon=eps, name='Layer_Batch_81')(layer81_1)
    layer81_3 = LeakyReLU(alpha=alp, name='Layer_Leaky_81')(layer81_2)
    layer82_1 = Conv2D(3*6, (1,1), strides=(1,1), use_bias=False, padding='same', name='Output_1')(layer81_3)  # FIRST output layer (!)
    layer84_1 = layer80_3
    layer85_1 = Conv2D(256, (1,1), strides=(1,1), use_bias=False, padding='same', name='Layer_Conv_83')(layer84_1)
    .....
    layer106_1 = Conv2D(3*6, (1,1), strides=(1,1), use_bias=False, padding='same', name='Output_3')(layer105_3)  # THIRD output layer (!)
    # Net structure is completed
    yoloBoneModel = Model(inputImage, [layer82_1, layer94_1, layer106_1])
    return yoloBoneModel

def MakeYoloTrainStructure(yoloBoneModel):
    gridInput_all = [Input(shape=(GRID_SIDES[1], GRID_SIDES[1], 3, 6), name='Grid_Input_1'),
                     Input(shape=(GRID_SIDES[2], GRID_SIDES[2], 3, 6), name='Grid_Input_2'),
                     Input(shape=(GRID_SIDES[3], GRID_SIDES[3], 3, 6), name='Grid_Input_3')]
    layer_loss = Lambda(GetLoss, output_shape=(1,), name='GetLoss', arguments={'threshold': thresh})([*yoloBoneModel.output, *gridInput_all])
    yoloTrainModel = Model([yoloBoneModel.input, *gridInput_all], layer_loss)
    return yoloTrainModel

def GetLoss(args, threshold=0.5):
    modelOutputs = args[:3]
    checkInputs = args[3:]
    # ......
    # Numerous manipulations to get loss of objects detection
    # ......
    return loss

def GetDataGenerator(batches):
    # Here I get image and ground truth Bounding Boxes data
    yield [imageData, *trueBoxes], np.zeros(batches)

def main():
    boneModel = MakeYoloMainStructure()
    trainModel = MakeYoloTrainStructure(boneModel)
    trainModel.compile(optimizer=Adam(lr=1e-3), loss={'GetLoss': lambda gridInput_all, y_pred: y_pred}, run_eagerly=True)
    batchSize = 32
    trainModel.fit(GetDataGenerator(batchSize), steps_per_epoch=2000//batchSize, epochs=50, initial_epoch=0)

submitted by /u/Nymfeparakit


How to load a model after it’s been saved?


Tried using this for reference but it wasn’t working. Saving the model seems to work perfectly fine. Screenshots go in this order: 1. Code that trains and saves the model. 2. What the saved model looks like in the file directory. 3. Code that should load the model back and then test it. 4. Error message received when the load/test code is run. How do I properly load my model back so it can be evaluated? Or is it that I’m improperly saving it?

1. Train and Save Code

2. File Directory (Saving every 100 epochs)

3. Load and Test Code

4. Console Error from 3.

submitted by /u/Kapooshi


using ROCm for tensorflow on RX 6000 series radeon cards.

Sorry if this was asked recently. I used Google to search for this and pretty much only got old information. I know RDNA 2 is now supposed to work with ROCm, but no one has benchmarked the results.

I am looking to buy a new laptop soon and was hoping to get something in the “ultra portable” category with good battery life, something with a 6800U or 6900HS. I had not been planning on getting one with a discrete GPU, but the new Asus Zephyrus 14 has a 6800S, is fairly small considering, and gets 8 to 10 hours on battery. I might get a Lenovo Z13 if the price is right when it releases.

I mostly just SSH into my desktop at home, so I am not buying the laptop specifically for deep learning, but I can’t find anything I quite want in the laptop space with a 3060/3070. Obviously I would only do small test-type work if I ever have to run it on the laptop, and CPU training is possible, but would a 6800S GPU or maybe even the 680M onboard graphics train faster than the CPU?

submitted by /u/ObsidianAvenger


Training TensorFlow models with HUGE datasets

submitted by /u/AM_DS