Categories
Misc

Welcome ‘In the NVIDIA Studio’: A Weekly Celebration of Extraordinary Artists, Their Inspiring Art and Innovative Techniques

Creating content is no longer tethered to using paint and stone as mediums, nor being in massive studios. Visual art can now be created anywhere, anytime. But being creative is still challenging and time-consuming. NVIDIA is making artistic workflows easier and faster by giving creators tools that enable them to remain in their flow state. Read article >

The post Welcome ‘In the NVIDIA Studio’: A Weekly Celebration of Extraordinary Artists, Their Inspiring Art and Innovative Techniques appeared first on NVIDIA Blog.

Categories
Misc

Trouble with TensorFlow Lite on Debian 9 (I’m new to this)

Trouble with TensorFlow Lite on Debian 9 (I'm new to this)

I’m completely new to TensorFlow and Python (coming from PHP) but got my first image classification running. On my development system with Ubuntu 20.04 works everything fine, and the predictions are pretty accurate.

So I wanted it on my Smarthome System, which is a small embedded system with an AMD G Series CPU, and only 16GB SSD Space, running Debian 9.

In my first attempt I was also installing the whole Tensorflow Package, and tried to run my model. But that used almost all of my available space, and still didn’t worked. Keras could not load the model.

So now I wanted to switch to TensorFlow Lite. Removed the normal TensorFlow and installed the lite version with apt (as recommended by the documentation).

I have stared to rewrite my check script, but I get errors already at import tflite_runtime.interpreter as tflite

https://preview.redd.it/u1q78csjtgu81.png?width=660&format=png&auto=webp&s=b98441b2eca0747a284c82b7a8ea84d081f6d690

I have absolutely no clue what the reason for this error is, or how I get rid of it. Google and Documentation wasn’t helpful so far.

My overall goal is to get an image via http from the local network, and give the detection back as a JSON, so that my smarthome system can run the script, and process the result.

I put my code into a github repo. The check_quiet.py is what I want to run on my smarthome device. check.py is the original script that works fine on my development system.

https://github.com/dr2okevin/tf-driveway

submitted by /u/dr2okevin
[visit reddit] [comments]

Categories
Misc

deploy an ssd mobilenet model trained with tf2 on opencv

I’ve been trying to deploy an SSD-mobilenet model on OpenCV so I can use it raspberry pi for the past week with no solution. After deploying the model with tensorflow 2. I tried to freeze it and deploy it on OpenCV But it gives an error : permIds.size() == net.node_size(). After so much digging I realized it was due to eager execution caused by TF2 that gives “StatefulPartitionedCall ” which is not supported by openCV. I tried converting to onnx but still got error due ti the same thing. Now my only hope resides on this community. I’m thankfull to any kind of help. the github repo contains the saved model along with all important details Thank you again The github repo of savedmodel along with important infos:github

submitted by /u/reyloll
[visit reddit] [comments]

Categories
Misc

Tensorflow model to return array of probabilities rather than single value for processing in LIME

I have a text classification model that I have developed based off the tutorial located here: https://www.tensorflow.org/tutorials/keras/text_classification

I am using my own dataset. Each file contains metadata for a file of interest. I am trying to get the model to classify it as either a target or as ignorable.

Currently, the model returns a single value array probability score. If score > .5 then the file is a target. If it is less than .5 it is ignorable. I want to explain predictions using the Explainable AI module LIME, but lime needs model.predict() to return a set of probabilities for each potential category, as such I need an array [x,y] where x = ignorable probability and y = target probability and x +y = 1.

Can someone please explain how to modify the tutorial code to return the probabilities of both target and ignorable? Any assistance would be most appreciated 😀

submitted by /u/an_anonymous_mate
[visit reddit] [comments]

Categories
Misc

Multi-GPU Programming with Standard Parallel C++, Part 1

Four panels vertically laid out each showing a simulation with a black backgroundBy developing applications using MPI and standard C++ language features, it is possible to program for GPUs without sacrificing portability or performance.Four panels vertically laid out each showing a simulation with a black background

This is the second post in the Standard Parallel Programming series, about the advantages of using parallelism in standard languages for accelerated computing.

The difficulty of porting an application to GPUs varies from one case to another. In the best-case scenario, you can accelerate critical code sections by calling into an existing GPU-optimized library. This is, for example, when the building blocks of your simulation software consist of BLAS linear algebra functions, which can be accelerated using cuBLAS.

But in many codes, you can’t get around doing some manual work. In these scenarios, you may consider using a domain-specific language like CUDA to target a specific accelerator. Alternatively, you can use a directive-based approach like OpenMP or OpenACC to stay within the original language and target both the host and various types of devices with the same code.

Standard parallelism

With the advent of native forms of parallelism in modern versions of the C++, Fortran, and Python programming languages, you can now take advantage of a similar high-level approach without resorting to language extensions.

Standard parallelism in C++

Our focus is on the C++ language which, as of the C++17 standard, offers parallel versions of numerous algorithms in its standard library. The underlying programming model is a mixture of the two approaches mentioned earlier. It partly works like a library-based approach, because C++ offers parallel algorithms for common tasks like sorting, searching, and cumulative sums, and may add support for additional domain-specific algorithms in upcoming versions. Furthermore, a formalism for parallel, hand-written loops is offered in the form of the generic for_each and transform_reduce algorithms.

C++ parallel algorithms express parallelism through the native syntax of the language in place of nonstandard extensions. In this way, they guarantee the long-term compatibility and portability of the developed software. This post also shows that the obtained performance is typically comparable to the one obtained with traditional approaches like CUDA C++.

What is novel about this approach is that it integrates seamlessly into the existing code base. This method allows you to preserve the software architecture and selectively accelerate the performance of critical components.

Porting a C++ project to GPU could be as simple as replacing all for loops by a call to for_each, or transform_reduce if it includes a reduction.

We walk you through typical refactoring steps that are required to overcome incompatibility issues of current C++ compilers. The post includes a list of modifications required for performance reasons, which are more universal and, in principle, independent of the programming formalism. This includes the requirement to restructure data in a way that enables coalesced memory accesses.

With current compilers, C++ parallel algorithms target single GPUs only and explicit MPI parallelism is needed to target multiple GPUs. It is straightforward to reuse the MPI backend of your existing parallel CPU code for this purpose and we will present a few guidelines to achieve state-of-the-art performance.

We discuss this list of implementation rules, guidelines, and best practices, illustrating them with the case of Palabos, a software library for computational fluid dynamics based on the lattice Boltzmann method (LBM). Palabos was ported to multi-GPU hardware in 2021 with only a few months of work and illustrates the need for the suggested refactoring steps quite nicely as the original code adapts poorly to GPUs. The original code adapts poorly to GPUs is due to the extensive use of object-oriented data structures and coding mechanisms.

You learn that the use of C++ standard parallelism allows a hybrid approach in which some algorithms are executed on GPU but some are kept on CPU. This is all dependent on their suitability for a GPU execution or simply the state of progress of their GPU port. This feature stands out as one of the major advantages of C++ standard parallelism, on par with the capability to keep the global architecture and most of the code base of the original software project intact. 

GPU programming using C++ standard parallelism

If this is your first time hearing about C++ parallel algorithms, you may want to read Developing Accelerated Code with Standard Language Parallelism, which introduces the topic of standard language parallelism in C++, Fortran, and Python.

The basic concept is strikingly simple. You can get many of the standard C++ algorithms to run in parallel on the host or on a device with the help of an execution policy, which is provided as an extra argument to the algorithm. In this post, we use the par_unseq execution policy, which expresses that the computations on the different elements are completely independent.

The following code example executes a parallel operation to multiply all elements of a std::vector by two:

for_each(execution::par_unseq, begin(v), end(v), [](double& x) {
    x *= 2.0;
});

Compiled with the nvc++ compiler and the -stdpar option, this algorithm is executed on GPU. Depending on the compiler, the compiler options, and the implementation of the parallel algorithms, you can also get a multithreaded execution on a multi-core CPU or on other types of accelerators.

This example uses the general for_each algorithm that applies any element-wise operation to the vector v in the form of a function object. In this case, it’s an inline lambda expression. An additional reduction operation can be specified by using the algorithm transform_reduce instead of for_each.

In a for_each algorithm call, the lambda function gets called with a reference to successive container elements. But sometimes, you also have to know the index of the element to access external data arrays or to implement non-local stencils.

This can be done by iterating over a counting_iterator offered by the Thrust library in C++17 (included with the NVIDIA HPC SDK) and by std::ranges::views::iota in C++20 or newer. In C++17, the simplest solution deduces the index from the address of the current element.

Jacobi example using C++ standard parallelism

To illustrate these concepts, here is a code example that uses the parallel STL to compute a non-local stencil operation and a reduction operation for the error estimate. It executes a Jacobi iteration, in which the average value of the four closest neighbors of every matrix element is evaluated:

void jacobi_iteration(vector const& v, vector& tmp) {
    double const* vptr = v.data();
    double *tmp_ptr = tmp.data();
    double l2_error = transform_reduce(execution::par_unseq,
        begin(v), end(v), 0., plus, [=](double& x)
    {
        int i = &x - vptr; // Compute index of x from its address.
    auto [iX, iY] = split(i);
        double avg = 0.25 * (
        vptr[fuse(iX-1, iY)] + vptr[fuse(iX+1, iY)] +
        vptr[fuse(iX, iY-1)] + vptr[fuse(iX, iY+1)] );
        tmp_ptr[i] = avg;
        return (avg – x) * (avg – x);
    } );
)

Here, split expresses the decomposition of the linear index i into an x- and y-coordinate, and fuse does the inverse. They are defined as follows if the domain is a uniform nx-by-ny matrix, with the Y-index running sequentially in memory:

fuse = [ny](int iX, int iY) { return iY + ny * iX; }
split = [ny](int i) { return make_tuple(i / ny, i % ny); }

The use of a temporary vector to store the computed average guarantees a deterministic result when the algorithm is executed concurrently.

A complete and generalized version of this code is available from the gitlab.com/unigehpfs/paralg GitLab repository. The repository also includes a hybrid version (C++ standard parallelism and MPI), which is built around the advice provided in this post and runs with good efficiency on multiple GPUs.

You have probably noticed the lack of explicit statements to transfer data from the host to the device and back. The C++ standard does not actually provide any such statements, and any implementation of parallel algorithms on a device must rely on automatic memory transfers instead. With the NVIDIA HPC SDK, this is achieved through CUDA unified memory, a single memory address space accessible from both the CPU and the GPU. If your code accesses data in this address space on the CPU and then on the GPU, memory pages are migrated to the accessing processor automatically.

For accelerating CPU applications with GPUs, CUDA unified memory is particularly useful because it enables you to focus on porting the algorithms of your application incrementally, function by function, without worrying about memory management.

The flipside of the coin is that the performance overhead of hidden data transfers can easily cancel the performance benefits of the GPU. As a rule, data produced on the GPU should be kept in GPU memory whenever possible by expressing all of its manipulations through parallel algorithm calls. This includes data post-processing, such as computation of data statistics and visualization. As shown in Part 2 of this post, it also includes data packing and unpacking for MPI communication.

By following the recommendations of this post, porting a code to GPU becomes as simple as replacing all time-critical loops and related data accesses by calls to parallel algorithms. It is good to remember, though, that the GPU typically possesses substantially more cores than the CPU and should be exposed to higher levels of parallelism. In the fluid dynamics problem presented in the next section, for example, the fluid domain is covered by homogeneous, matrix-like grid portions (Figure 1).

Shows the simulation domain covered by matrix-like grid portions and how they get processed by CPU or GPU execution. For CPU, one thread per matrix. For GPU, one thread per matrix element.
Figure 1. Subdivision of the fluid domain into matrix-like grid portions in Palabos and the respective parallelization strategy on CPU and GPU

In the original CPU code, each CPU core processes one or several of these grid portions sequentially, which is shown in the top part of Figure 1. As for the GPU, the loop over the elements of the grid portions should be parallelized to fully occupy the GPU cores.

Example: Lattice Boltzmann software and Palabos

The LBM solves the fluid flow equations with an explicit time-stepping scheme, covering a wide range of applications. This includes flows past complex geometries such as porous media, multi-phase flows, compressible supersonic flows, and more.

The LBM most often allocates more variables on every node of the numerical mesh than other solvers. Where a classical, incompressible, Navier-Stokes solver represents the state with just three variables for the velocity components, plus one for the temporary pressure term, the LBM approach typically requires 19 variables, called populations. The memory footprint of the LBM is therefore 5-6x higher. The actual mileage may vary if the Navier-Stokes solver uses temporary variables or if further physics, such as density and temperature, are added to the system.

As a result, abundant memory accesses and low arithmetic intensity characterize the LBM. On cluster-level GPUs, like the NVIDIA V100 and NVIDIA A100, the performance is entirely limited by the memory accesses, even for computationally intensive and sophisticated LBM schemes.

Take the NVIDIA A100 40 GB GPU, for example, which has a memory bandwidth of 1555 GB/s. At every explicit time step, each node accesses 19 variables or populations, which occupy eight bytes each in double-precision. These are counted twice: one time for the data transport from GPU memory to the GPU cores, and again for the write back to GPU memory after the compute operation.

Assuming a perfect memory subsystem and maximal data-reuse, the LBM has a peak throughput performance of processing 1555 / (19 * 8 * 2) = 5.11 billion grid nodes per second. It is common practice in LBM terminology to speak in terms of Giga Lattice-node Updates Per Second (GLUPS), such as 5.11 GLUPS.

In a real-life application, though, each node additionally reads some information to manage domain irregularities. In Palabos, that is a 32-bit integer for the node tag and a 64-bit index into an extra data array, effectively reducing the peak performance to 4.92 GLUPS. 

This model provides a simple way to estimate the best peak performance that LBM codes could achieve, since sufficiently large grids do not fit in the caches. We use this model throughout this post to demonstrate that the performance obtained with C++ parallel algorithms is as good as it gets. Outside a margin of a few percentage points, neither CUDA, OpenMP, nor any other GPU formalism can do better.

The LBM distinguishes neatly between computations, expressed by the local collision step, and memory transfer operations that are encapsulated in the streaming step. The following code example shows a typical time iteration in the case of a structured mesh with matrix-like topology:

for (int i = 0; i 



Like the Jacobi iteration in the previous section, this function writes the computed data to a temporary array ftmp to avoid race conditions during a multi-threaded execution, making it an ideal candidate to demonstrate the concepts of this post. There exist alternative in-place algorithms that avoid memory duplication. However, these are more complex and therefore less suitable for illustrative purposes. 

An introduction to the LBM is provided in the Simulation and modeling of natural processes course. For more information about how to develop LBM codes for GPUs using C++ parallel algorithms, see Cross-platform programming model for many-core lattice Boltzmann simulations.

In this post, we use the open-source LBM library Palabos to show how to port an existing C++ library to multi-GPU with parallel algorithms. At first sight, Palabos would seem unfit for a GPU port due to its strong reliance on object-oriented mechanisms. However, we walk through several workarounds that, in the case of Palabos, lead to state-of-the-art performance with only superficial changes to the code architecture.

Switch from object-oriented to data-oriented design

To serve a large community, Palabos emphasizes polymorphism and other object-oriented techniques. An object that holds both the data (the populations) and the method (the local collision model), represents every grid node. This offers a convenient API for the development of new models and a flexible mechanism to adjust the physical behavior or numerical aspects of the model from one cell to another.

This object-oriented approach is, however, poorly adapted for execution on a GPU due to an inefficient data layout, a convoluted execution path, and reliance on virtual function calls. The following sections will teach you how to refactor the code in a GPU-friendly way by adopting a development model, which we refer to under the umbrella term as data-oriented programming.

Get rid of class-based polymorphism

The left part of Figure 2 illustrates the typical code execution chain for the collision model on a grid node in Palabos. Different components of the algorithm are stacked up and invoked through virtual function calls, including the underlying numerical LBM algorithm (RR), additional physics (“Smagorinsky”), and additional numerical aspects (Left Boundary).

The path of a single integer tag used to identify a chain of virtual function calls: Left Boundary -> Smagorinsky -> Collision Model -> Grid node. These get extracted into a single integer tag to be used by a switch case in the merged code.
Figure 2. Extraction of a single integer tag to identify a chain of virtual function calls

This textbook case for object-oriented design turns into a liability for the GPU port of the code. This issue arises because current versions of the HPC SDK do not support virtual function calls on GPUs in C++ parallel algorithms. Generally, this type of design should be avoided on GPUs for performance reasons, as it limits the predictability of the execution path.

The simple workaround is to gather the execution chains into single functions, which explicitly call the individual components in sequence, and identify them with a unique tag.

In Palabos, this unique tag is generated with the help of a serialization mechanism which was originally developed to support checkpointing and network communication of dynamically adaptive simulations. This shows that most of the refactoring work of the GPU port is achieved automatically, if the architecture of the refactored software project is flexible enough.

You can now provide each grid node with a tag identifying the code for the full collision step, and express the collision step by a large switch statement:

switch(tag) {
    case rr_les: fun_rr_les(f_local); break;
    case rr_les_BCleft: fun_rr_les_BCleft(f_local); break;
    …
}

As the switch statement grows large, it can run into performance issues due to the space occupied by the generated kernel in GPU memory.

Another issue is the maintainability of the software project. Currently, it is impossible to contribute new collision models without modifying this switch statement, which is part of the core of the library. A solution to both problems consists in generating a switch statement with a selected number of cases in the end-user application at compilation time using C++ template mechanisms. This technique is detailed on the Palabos-GPU resource page.

Rearrange memory to encourage coalesced memory access

The object-oriented design also leads to a memory layout that cannot be processed efficiently on the many-core architecture of a GPU. As every node groups the 19 local populations of the LBM scheme, the data ends up in an array-of-structure (AoS) arrangement (Figure 3). For the sake of simplicity, only four populations per node are shown.

Two memory layouts are shown: Array-of-Structure and Structure-of-Array. In Array-of-Structure, the populations of each node are stored linearly in memory, i.e., the populations of node i-1 precede the populations of node i. Structure-of-Arrays stores the first population of all cells, then the second population of all cells, and so on.
Figure 3. While an array-of-structure alignment is more natural in an object-oriented design, structure-of-array improves coalesced memory access during the LBM streaming step

The AoS data layout leads to poor performance because it prevents coalesced memory accesses during the streaming step which, due to the non-local stencil, is the most critical part of the algorithm in terms of memory access.

The data should be aligned in a structure-of-array (SoA) manner in which all populations of a given type that communicate during the streaming step, are aligned at consecutive memory addresses. After this rearrangement, even a relatively straightforward LBM algorithm gets close to 80% of the memory bandwidth of a typical GPU.

Data-oriented design means that you attribute central importance to the structure and the layout of your data and build the data processing algorithms around this structure. An object-oriented approach typically adopts the reverse path.

The first step of a GPU port should be to know the ideal data layout for your application. In the case of the LBM, the superiority of the SoA layout on GPU is a well-known fact. The specifics of the memory layout and memory traversal algorithm were tested in a prior case study published as the open-source STLBM code.

Conclusion

In this post, we discussed the fundamental techniques involved in writing GPU applications using C++ standard parallel programming. We also provided background information on the lattice Boltzmann method and the Palabos application, which we used in our case study. Finally, we discussed two ways that you can refactor your source code to be more amenable to running on a GPU.

In the next post, we continue with this application and discuss how to obtain high performance in your C++ application when running on NVIDIA GPUs. We also demonstrate how to scale your application to use multiple GPUs with MPI.

Categories
Misc

Multi-GPU Programming with Standard Parallel C++, Part 2

Four panels vertically laid out each showing a simulation with a black backgroundBy developing applications using MPI and standard C++ language features, it is possible to program for GPUs without sacrificing portability or performance. Four panels vertically laid out each showing a simulation with a black background

This is the third post in the Standard Parallel Programming series, about the advantages of using parallelism in standard languages for accelerated computing.

In part 1, we explained:

  • The basics of C++ parallel programming
  • The lattice Boltzmann method (LBM)
  • Took the first steps towards refactoring the Palabos library to run efficiently on GPUs using standard C++.

In this post, we continue by optimizing the performance of the ISO C++ algorithms and then use MPI to scale the application across multiple GPUs.

Strive for optimal performance

It may seem natural to expect that the performance of your CPU-to-GPU port will range below that of a dedicated HPC code. After all, you are limited by the constraints of the software architecture, the established API, and the need to account for sophisticated extra features expected by the user base. Not only that, the simplistic programming model of C++ standard parallelism allows for less manual fine-tuning than a dedicated language like CUDA.

In reality, it is often possible to control and limit this loss of performance to an extent that it becomes negligible. The key is to analyze the performance metrics of individual code portions and eliminate performance bottlenecks that do not reflect an actual need of the software framework.

A good practice consists in maintaining a separate proof-of-principle code for core components of your numerical algorithm. The performance of this approach can be more freely optimized and compared with the one of the full, complex software frameworks (like the STLBM library in the case of Palabos). Additionally, a GPU-capable profiler like nvprof can highlight the origin of performance bottlenecks efficiently.

Typical performance issues and their solutions are highlighted in the following recommendations:

  • Do not touch data on the CPU
  • Know your algorithms
  •  Establish a performance model

Do not touch data on the CPU

A frequent source of performance losses are hidden data transfers between CPU and GPU memory, which can be exceedingly slow. With the CUDA unified memory model, this type of transfer occurs whenever you access GPU data from the CPU. Touching a single byte of data can result in a catastrophic performance loss because entire memory pages are transferred at one time.

The obvious solution is to manipulate your data exclusively on the GPU whenever possible. This requires searching your code carefully for all accesses to the data and then wrapping them into a parallel algorithm call. Although this is somewhat robust, this process is needed for even the simplest operations.

Obvious places to look for are post-processing operations or intermediate evaluations of data statistics. Another classical performance bottleneck is found in the MPI communication layer, because you must remember to carry out data packing and unpacking operations on the GPU.

Expressing an algorithm on GPU is easier said than done, as the formalism of for_each and transform_reduce is mostly suitable for evenly structured memory accesses.

In the case of irregular data structures, it would be painful to avoid race conditions and guarantee coalesced memory accesses with these two algorithms. In such a case, you should follow up with the next recommendation and familiarize yourself with the family of parallelized algorithms provided in the C++ STL.

Know your algorithms

Up to this point, the parallel STL appears as little more than a way to express parallel for loops with a fancy functional syntax. In reality, the STL offers a large set of algorithms beyond for_each and transform_reduce that are useful to express your numerical method, including sorting and searching algorithms.

The exclusive_scan algorithm computes cumulative sums and deserves particular mention, as it proves generally useful for reindexing operations of unstructured data. For example, consider a packing algorithm for MPI communication, in which the number of variables contributed to the communication buffer by every grid node is unknown in advance. In this case, global communication among threads is required to determine the index at which every grid node writes into the buffer.

The following code example shows how to solve this type of problem with good parallel efficiency on GPU using parallel algorithms:

// Step 1: compute the number of variables contributed by every node.
int* numValuesPtr = allocateMemory(numberOfCells);
for_each(execution::par_unseq, numValuesPtr,
         numValuesPtrl + numberOfCells, [=](int& numValues)
{
    int i = &numValues - numValuesPtr;
    // Compute number of variables contributed by current node.
    numValues = computeNumValues(i);
} );
// 2. Compute the buffer index for every node.
int* indexPtr = allocateMemory(numberOfCells);
exclusive_scan(execution::par_unseq, numValuesPtr,
         numValuesPtr + numberOfCells, indexPtr, 0);
// 3. Pack the data into the buffer.
for_each(execution::par_unseq, indexPtr,
         indexPtr + numberOfCells, [=](int& index)
{
    int i = &index - indexPtr;
    packCellData(i, index);
} );

This example lets you enjoy the expressive power of the algorithm-based approach to GPU programming: the code requires no synchronization directives or any other low-level constructs.

Establish a performance model

A performance model establishes an upper bound for the performance of your algorithm through a bottleneck analysis. This typically considers the peak processor performance (measured in FLOPS) and the peak memory bandwidth as the principal limiting hardware characteristics.

As discussed in the Example: Lattice Boltzmann software Palabos section in the previous post, LBM codes have a low ratio of computations to memory accesses and are entirely memory-bound on modern GPUs. That is, at least if you use single-precision arithmetics or a GPU that is optimized for double-precision arithmetics.

The peak performance is simply expressed as a ratio between the memory bandwidth of the GPU and the number of memory accesses performed in the code. As a direct consequence, switching an LBM code from double– to single-precision arithmetics doubles the performance.

Figure 1 shows the performance of the GPU port of Palabos obtained on an NVIDIA A100 (40 GB) GPU for single– and double-precision floats.

Palabos performance results as MLUPS and percentages of peak capacity. Left column shows double precision results: 100% theoretical peak of the GPU is 4921 MLUPS. Palabos achieves 73% peak and 3610 MLUPs. The right column shows the same results for single precision: 100% of theoretical peak is 9482 MLUPs and Palabos achieves 74% of theoretical peak at 7050 MLUPS.
Figure 1. Palabos performance for the 3D lid-driven cavity (6003 grid nodes) on an A100 (40 GB) GPU in single and double precision. Model: TRT, D3Q19

The executed test case, a flow in a lid-driven cavity in a turbulent regime, has a simple cubic geometry. However, this case includes boundary conditions and exhibits a complex flow pattern. The performance is measured in million lattice-node updates per second (MLUPS, more is better) and compared against a theoretical peak value obtained under the assumption that GPU memory is exploited at peak capacity.

The code reaches 73% of peak performance in double-precision and 74% in single-precision. Such performance metrics are common in state-of-the-art implementations of LB models, independently of the language or library used.

Although some implementations may gain a few percentage points and reach a value closer to 80%, it is clear that we are closing in on the hard limit implied by the performance model. From a big-picture standpoint, the single-GPU performance of the code is as good as it gets.

Reuse the existing MPI backend to get a multi-GPU code

As C++ parallel algorithms integrate into an existing software project seamlessly to accelerate critical code portions, nothing prevents you from reusing the project’s communication backend to reach multi-GPU performance. However, you will want to keep an eye on the communication buffer and make sure that it does not take any detours through CPU memory, which would result in costly page faults.

Our first attempt to run the GPU-ported version of Palabos on multiple GPUs, although producing technically correct results, did not exhibit acceptable performance. Instead of a speedup, the switch from one to two GPUs delivered a decrease in speed by an order of magnitude. The issue could be traced to the packing and unpacking of the communicated data. In the original backend, this was carried out on CPUs, and to other instances of unnecessary data access in CPU memory, such as resizing of the communication buffer.

Such issues can be spotted with help of the profiler. The profiler highlights all occurrences of page faults in unified memory, and are fixed by moving the corresponding code portions to a parallel algorithm. The Know your algorithms section explained how to pack and unpack the communication buffer if the data follows an irregular pattern.

At this point, using standard C++ without any extensions other than MPI, you can get a hybrid CPU/GPU software project with state-of-the-art performance on single-GPU and solid parallel performance on multi-GPU.

Unfortunately, the multi-GPU performance remains below the expectations due to the current limitations of the language specifications and corresponding GPU implementations. Pending future improvements to the fairly young technology of C++ standard parallelism, we provide some workarounds in this post based on techniques outside the C++ standard.

Coordinate the multi-CPU and multi-GPU code execution

While this post focuses on hybrid CPU and GPU programming, we can’t avoid addressing the issue of hybrid parallelism (MPI or multithreading) in the CPU parts at some point.

The original version of Palabos, for example, is non-hybrid and uses the MPI communication layer to distribute work among the cores of a CPU as well as across the network. After porting to GPU, the resulting multi-CPU and multi-GPU code spontaneously groups a single CPU core with a full GPU in every MPI task, leaving the CPU relatively underpowered.

This leads to a performance bottleneck whenever it is necessary or convenient to keep a computationally intensive task on the CPU. In fluid dynamics, this is often the case in the preprocessing stage, such as in geometry processing or mesh generation.

The obvious solution involves the use of multithreading to access multiple CPU cores from within an MPI task. The shared memory space of these threads can then be directly shared with the GPU through the CUDA unified memory formalism.

However, C++ parallel algorithms cannot be reused to serve both purposes of GPU and multi-core CPU execution. This is because C++ does not allow choosing the target platform of parallel algorithms from within the language.

While C++ threads do provide a way to solve this problem natively, we found that OpenMP offered the most convenient and least intrusive solution. An OpenMP annotation of a for loop was sufficient in this case to distribute the grid portions assigned to the current MPI task over multiple threads.

Communicate through pinned memory

With current versions of the HPC SDK, the CUDA unified memory model exhibits another performance issue in combination with MPI.

As the MPI communication layer expects data with a fixed hardware address (so-called pinned memory), any buffer that resides in the managed memory area is first copied into a pinned memory buffer on the host CPU implicitly. Due to the transfers between the GPU and CPU, this operation can end up being rather costly.

Communication buffers should therefore be explicitly pinned to a GPU memory address.  With the nvc++ compiler, this is achieved by allocating the communication buffer with cudaMalloc:

// Allocate the communication buffer
// vector buffer(N);
// double* buffer = buffer.data();
double* buffer; cudaMalloc((void**)&buffer, N * sizeof(double));
for_each(buffer, buffer + N, … // Proceed with data packing

Another solution is to replace the STL vector with a thrust::device_vector from the Thrust library, which uses pinned GPU memory by default.

In the near future, the HPC SDK will handle these cases more efficiently and automatically for users. This is so that they do not have to reach for cudaMalloc or thrust::device_vector. So, stay tuned!

After the various improvements listed in this post, the Palabos library was tested on a DGX A100 (40-GB) workstation with four GPUs, again for the benchmark case of a lid-driven cavity. The obtained performance is shown in Figure 2 and compared to the performance achieved on a 48-core Xeon Gold 6240R CPU:

Palabos performance tested on a DGX A100 workstation diagram: This figure shows three columns. The left-most column shows performance of a Xeon Gold CPU: 421 MLUPs. The middle column shows the achieved performance of Palabos on an A100 GPU: 7050 MLUPs. The rightmost column shows a strong scaling to a DGX-Station 4x A100 GPU system in which 28200 MLUPs is the ideal parallel scaling achievable, and two results for Palabos are presented. Without pinned memory Palabos achieves 61% strong scaling efficiency at 17270 MLUPs, and with pinned memory an 82% strong scaling efficiency at 23030 MLUPs.
Figure 2. Palabos performance for the 3D lid-driven cavity (6003 grid nodes) on a 48-core Xeon Gold 6240R CPU and on a DGX A100 (40 GB) workstation, once using a single GPU and once with all four GPUs. Model: TRT, D3Q19, single precision

For the Xeon Gold, the original implementation of Palabos proved to be more efficient and was used with 48 MPI tasks, while the single-GPU and the four-GPU execution used the parallel algorithms backend, compiled with nvc++

The performance figures show a 3.27-fold speedup of the 4-GPU execution compared to the single-GPU one. This amounts to a quite satisfactory parallel efficiency of 82% in a strong scaling regime, with equal total domain size in both executions. In weak scaling, using a 4x larger problem size for the four-GPU execution, the speedup increases to 3.72 (93% efficiency).

Figure 2 also shows that when using an unpinned communication buffer, such as when the MPI communication buffer is not allocated with cudaMalloc, the parallel efficiency drops from 82% to 61%.

In the end, the four-GPU DGX workstation runs 55x faster than the Xeon Gold CPU. While the direct comparison may not be fair due to the different scope of the two machines, it provides a sense of the acceleration obtained by porting a code to GPU. The DGX is a desktop workstation connected to a common power plug, yet it delivers a performance that, on a CPU cluster, could only be obtained with thousands of CPU cores.

Conclusion

You’ve seen that C++ standard language parallelism can be used to port a library like Palabos to GPU with an astounding increase in the code’s performance.

  • For end users of the Palabos library, this performance gain is obtained with a single-line change to switch from the CPU to the GPU backend.
  • For Palabos library developers, some work was required to develop the corresponding GPU backend.

However, this work didn’t require learning a new domain-specific language nor did it depend on a detailed knowledge of the GPU architecture. 

This two-part post has provided you with guidelines that you can apply to achieve similar results with your own code. For more information, we encourage you to check out the following resources:

Categories
Misc

I m new to tensorflow and i need some help pls

I have a dataset that i downloaded from kaggle and it s in 4 folders trainA,trainB,testA,testB And i would like to put these image folders into a tf.data.dataset called DATA for example and i would like to call DATA[‘trainA’] for example to get the first folder thank you

submitted by /u/Jaded-Historian7036
[visit reddit] [comments]

Categories
Misc

CUDA Toolkit Nsight Visual Studio Edition Failed

I’m trying to install the necessary components to get GPU version of TensorFlow working. I am at the part where I am trying to install the CUDA toolkit and keep getting Nsight Visual Studio Edition Failed to install with no additional information as to why it failed. Any ideas why it may be failing or how I could check? I understand Visual Studio is needed to use the C++ compiler. I have multiple instances of Visual Studio since I’m on the computer I’ve used for programming for many years. Could that be part of the problem? Otherwise how else would it know what version of visual studio to use? Also, some more background info, I did start by uninstalling everything from NVidia and deleting all NVidia folders as recommended. Any advice how to proceed here is appreciated. Thanks.

submitted by /u/WubDubClub
[visit reddit] [comments]

Categories
Misc

Negative gradients when calculating GradCAM heatmap

I have a Segmentation network model trained for 2 classes and am able to see accurate results. But when using grad-cam for the heatmap, I am able to see good results for the last convolution layer for both the classes but have issues when trying to generate a heatmap for the second last convolution layer for one of the classes (the other class’s heatmap is working fine).

**Last 5 layers** convolution_layer(filters:8, kernel:3*3) convolution_transpose_layer(filters:2, kernel:2*2) convolution_layer(filters:2, kernel:3*3) convolution_layer(filters:10, kernel:1*1) activation_layer(softmax) 

The heatmap is empty because of all negative pooled gradients(due to mean from all the -ve gradients wrt Conv layer), resulting in negative values in pooled_grads*convolution_output on which relu is applied, giving all zeros.

What does it mean for GradCAM to be all negative?

Why is it that all channels in the convolution lead to a “negative” contribution to the true output class?

https://arxiv.org/pdf/2002.11434.pdf following this paper for heatmap for segmentation models.

submitted by /u/Ash_real
[visit reddit] [comments]

Categories
Misc

custom accuracy metric

Hi, I have been trying to implement a custom masked accuracy metric which does not consider the pad tokens (similar to the masked loss as shown here https://www.tensorflow.org/text/tutorials/nmt_with_attention). I have been trying to create a new subclass using the tf.keras. metrics.Metric, but I am very confused. Any leads would be really appreciated.

submitted by /u/Hour-Tie-7471
[visit reddit] [comments]