Categories
Misc

Is anyone interested in this project?

Is anyone interested in this project? submitted by /u/7NoteDancing
[visit reddit] [comments]
Categories
Misc

ImportError: cannot import name ‘dtensor’ from ‘tensorflow.compat.v2.experimental’

ImportError: cannot import name 'dtensor' from 'tensorflow.compat.v2.experimental'

submitted by /u/Ok_Wish4469
[visit reddit] [comments]

Categories
Misc

Using Fortran Standard Parallel Programming for GPU Acceleration

We present lessons learned from refactoring a Fortran application to use modern do concurrent loops in place of OpenACC for GPU acceleration.

This is the fourth post in the Standard Parallel Programming series, which aims to instruct developers on the advantages of using parallelism in standard languages for accelerated computing:

Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, do concurrent loops and array math intrinsics in Fortran.

Using standard language features offers many advantages, the chief one being future-proofing. Because do concurrent is part of the Fortran standard, the chances of support being dropped in the future are slim.

This feature is also relatively simple to use in initial code development, and it adds portability and parallelism. Writing loops with do concurrent from the start has the added mental advantage of encouraging you to think about parallelism as you design and implement them.

For initial code development, do concurrent is a great way to add GPU support without having to learn directives. However, even code that has already been GPU-accelerated through the use of directives like OpenACC and OpenMP can benefit from refactoring to standard parallelism for the following reasons:

  • Cleaning up the code for developers who do not know directives, and removing the large number of directives that clutter the source code.
  • Increasing the portability of the code in terms of vendor support and longevity of support.
  • Future-proofing the code, as ISO-standard languages have a proven track record for stability and portability.

Replacing directives on multi-core CPUs and GPUs

POT3D is a Fortran code that computes potential field solutions to approximate the solar coronal magnetic field using surface field observations as input. It continues to be used for numerous studies of coronal structure and dynamics.

The code is highly parallelized using MPI and is GPU-accelerated using MPI with OpenACC. It is open source and available on GitHub. It is also a part of the SpecHPC 2021 Benchmark Suites.

We recently refactored another code example with do concurrent at WACCPD 2021. The results showed that you can replace directives with do concurrent without losing performance on multicore CPUs and GPUs. However, that code was somewhat simple in that it used no MPI.

Now, we want to explore replacing directives in more complicated code. POT3D contains nontrivial features for standard Fortran parallelism to handle: reductions, atomics, CUDA-aware MPI, and local stack arrays. We want to see if do concurrent can replace directives and retain the same performance.

To establish a performance baseline for refactoring the code to do concurrent, first review the initial timings of the original code in Figure 1. The CPU result was run on 64 MPI ranks (32 per socket) on a dual-socket AMD EPYC 7742 server, while the GPU result was run on one MPI rank on an NVIDIA A100 (40GB). The GPU code relies on data movement directives for data transfers (we do not use managed memory here) and is compiled with -acc=gpu -gpu=cc80,cuda11.5. The runtimes are an average over four runs.

Table 1 shows the number of lines of code and directives for the current version of the code. There are 80 directives, a number we hope to reduce by refactoring with do concurrent.

Figure 1. CPU and GPU timings for the original version of the POT3D code: 6,717.4 seconds on a 64-core CPU compared to 1,481.3 seconds using OpenACC on an NVIDIA A100 40GB GPU, a substantial improvement in wall clock time
                     POT3D (Original)
Fortran              3,487
Comments             3,452
OpenACC Directives   80
Total                7,019
Table 1. Code facts for POT3D including Fortran, comments, OpenACC Directives, and total lines.

Do concurrent and OpenACC

Here are some examples comparing do concurrent and OpenACC in POT3D, starting with a triple-nested, OpenACC-parallelized loop:

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
…
!$acc parallel loop default(present) collapse(3) async(1)
do k=1,np
 do j=1,nt
  do i=1,nrm1
   br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
  enddo
 enddo
enddo
…
!$acc wait
!$acc exit data delete(phi,dr_i,br)

As mentioned earlier, this OpenACC code is compiled with the flag -acc=gpu -gpu=cc80,cuda11.5 to run on an NVIDIA GPU.

You can parallelize this same loop with do concurrent and rely on NVIDIA CUDA Unified Memory for data movement instead of directives. This results in the following code:

do concurrent (k=1:np,j=1:nt,i=1:nrm1)
 br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo

As you can see, the loop has been condensed from 12 lines to three, and CPU portability and GPU parallelism are retained with the nvfortran compiler from the NVIDIA HPC SDK.

This reduction in the number of lines is thanks to collapsing multiple loops into one loop and relying on managed memory, which removes all data movement directives. Compile this code for the GPU with -stdpar=gpu -gpu=cc80,cuda11.5.

For nvfortran, activating standard parallelism (-stdpar=gpu) automatically activates managed memory. To use OpenACC directives to control data movement along with do concurrent, use the following flags: -acc=gpu -gpu=nomanaged.

The nvfortran implementation of do concurrent also allows for locality of variables to be defined:

do concurrent (i=1:N, j(i)>0) local(M) shared(J,K)
 M = mod(K(i), J(i))
 K(i) = K(i) - M
enddo

This may be necessary for some code. For POT3D, the default locality of variables behaves as needed; with nvfortran, the default locality is the same as in OpenACC.

do concurrent CPU performance and GPU implementation

Replacing all the OpenACC loops with do concurrent and relying on managed memory for data movement leads you to code with zero directives and fewer lines. We removed 80 directives and 66 lines of Fortran.

Figure 2 shows that this do concurrent version of the code has nearly the same performance on the CPU as the original GitHub code. This means that you haven’t broken CPU compatibility by using do concurrent. Instead, multi-core parallelism has also been added, which can be used by compiling with the flag -stdpar=multicore.

Figure 2. CPU timings using MPI (32 ranks per socket) for the original (6,717.4 seconds) and do concurrent (6,767.9 seconds) versions of the POT3D code on 64 CPU cores

Unlike on the CPU, running POT3D on a GPU requires adding a couple of directives.

First, to take advantage of multiple GPUs with MPI, you need a directive to specify the GPU device number. Otherwise, all MPI ranks would use the same GPU.

!$acc set device_num(mpi_shared_rank_num)

In this example, mpi_shared_rank_num is the MPI rank within a node. It’s assumed that the code is launched such that the number of MPI ranks per node is the same as the number of GPUs per node. This can also be accomplished by setting CUDA_VISIBLE_DEVICES for each MPI rank, but we prefer doing this programmatically.

When using managed memory with multiple GPUs, make sure that the device selection (such as !$acc set device_num(N)) is done before any data is allocated. Otherwise, an additional CUDA context is created, introducing additional overhead.

Currently, the nvfortran compiler does not support array reductions on concurrent loops, which are required in two places in the code. Fortunately, an OpenACC atomic directive can be used in place of an array reduction:

do concurrent (k=2:npm1,i=1:nr)
!$acc atomic
 sum0(i)=sum0(i)+x(i,2,k)*dph(k)*pl_i
enddo

After adding this directive, change the compiler options to enable OpenACC explicitly: -stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5. With this change, the code needs only three OpenACC directives, which is the closest it can come to being directive-free at this time.

All the data movement directives are unnecessary, since CUDA managed memory is used for all data structures. Table 2 shows the number of directives and lines of code needed for this version of POT3D.

                     POT3D (Original)   POT3D (do concurrent)   Difference
Fortran              3,487              3,421                   (-66)
Comments             3,452              3,448                   (-4)
OpenACC Directives   80                 3                       (-77)
Total                7,019              6,872                   (-147)
Table 2. Number of lines of code for the GPU compatible do concurrent version of POT3D with a breakdown of the number of Fortran lines, directives, and comments lines.

For the reduction loops in POT3D, you relied on implicit reductions, but that may not always work. Recently, nvfortran has added the upcoming Fortran 202X reduce clause, which can be used on reduction loops as follows:

do concurrent (i=1:N) reduce(+:cgdot)
 cgdot=cgdot+x(i)*y(i)
enddo

GPU performance, unified memory, and data movement

You’ve developed code with the minimal number of OpenACC directives and do concurrent that relies on managed memory for data movement. This is the closest directive-free code that you can have at this time.

Figure 3 shows that this code version takes a small performance hit of ~10% when compared to the original OpenACC GPU code. The cause of this could be do concurrent, managed memory, or a combination.

Figure 3. GPU timings on one NVIDIA A100 40GB GPU for the original GitHub (1,481.3 seconds) and standard parallelism (STDPAR) with minimal OpenACC (1,644.5 seconds) versions of the POT3D code.

To see if managed memory causes the small performance hit, compile the original GitHub code with managed memory turned on. This is done by using the compile flag -gpu=managed in addition to the standard OpenACC flags used before for the GPU.

Figure 4 shows that the GitHub code with managed memory now performs similarly to the minimal directives code. This means that the culprit for the small performance loss is unified (managed) memory.

Figure 4. GPU timings for the GitHub (no managed: 1,481.3 seconds; managed: 1,634.3 seconds) and STDPAR with minimal OpenACC (1,644.5 seconds) versions of the POT3D code.

To regain the performance of the original code with the minimal directives code, you must add the data movement directives back in. This mixture of do concurrent and data movement directives would look like the following code example:

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
 br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)

This results in the code having 41 directives, with 38 being responsible for data movement. To compile the code and rely on the data movement directives, run the following command:

-stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5,nomanaged

nomanaged turns off managed memory, and -acc=gpu enables directive recognition.

Figure 5 shows that there is nearly the same performance as the original GitHub code. This code has 50% fewer directives than the original code and gives the same performance!

Figure 5. GPU timings for the GitHub (no managed: 1,481.3 seconds; managed: 1,634.3 seconds), STDPAR + minimal OpenACC (managed: 1,644.5 seconds), and STDPAR + OpenACC (no managed: 1,486.2 seconds) versions of the POT3D code.

MPI + DO CONCURRENT scaling

Figure 6 shows the timing results using multiple GPUs. The primary takeaway is that do concurrent works with MPI over multiple GPUs.

Looking at the codes with managed memory turned on (blue lines), you can see that the original code and the minimal directives code gave nearly the same performance as multiple GPUs were used.

Looking at the codes with managed memory turned off (green lines), you can again see the same scaling between the original GitHub code and the do concurrent version of the code. This indicates that do concurrent works with MPI and has no impact on the scaling you should see.

What you might also notice is that managed memory introduces an overhead as the number of GPUs is scaled. The managed memory runs (blue lines) and data directive runs (green lines) remain parallel to each other, meaning the overhead scales with the number of GPUs.

Figure 6. GPU strong scaling across 1, 2, 4, and 8 GPUs for the GitHub (managed and no managed), STDPAR + minimal OpenACC (managed), and STDPAR + OpenACC (no managed) versions of the POT3D code. The two versions using managed memory scale similarly, as do the two versions using OpenACC data directives.

Fortran standard parallel programming review

You may be wondering, “Standard Fortran sounds too good to be true, what is the catch?”

Fortran standard parallel programming enables cleaner-looking code and future-proofs it by relying on ISO language standards. Using the latest nvfortran compiler, you gain all the benefits mentioned earlier.

Although you lose the current GCC OpenACC/OpenMP GPU support when you transition to do concurrent, we expect GPU support to grow in the future as other vendors add support for do concurrent on GPUs. Given the track record of ISO language standards, we believe that this support will come.

Using do concurrent does currently come with a small number of limitations, namely the lack of support for atomics, device selection, asynchrony, or optimizing data movement. As we have shown, however, each of these limitations can be easily worked around using compiler directives. Far fewer directives are required thanks to the native parallel language features in Fortran.

Ready to get started? Download the free NVIDIA HPC SDK, and start testing! If you are also interested in more details on our findings, see the From Directives to DO CONCURRENT: A Case Study in Standard Parallelism GTC session. For more information about standard language parallelism, see Developing Accelerated Code with Standard Language Parallelism.

Acknowledgments

This work was supported by the National Science Foundation, NASA, and the Air Force Office of Scientific Research. Computational resources were provided by the Computational Science Resource Center at San Diego State University.

Categories
Misc

Boosting Data Ingest Throughput with GPUDirect Storage and RAPIDS cuDF

Learn how RAPIDS cuDF accelerates data science with the help of GPUDirect Storage. Dive into the techniques that minimize the time to upload data to the GPU.

If you work in data analytics, you know that data ingest is often the bottleneck of data preprocessing workflows. Getting data from storage and decoding it can often be one of the most time-consuming steps in the workflow because of the data volume and the complexity of commonly used formats. Optimizing data ingest can greatly reduce this bottleneck for data scientists working on large data sets.

RAPIDS cuDF greatly speeds up data decode by implementing CUDA-accelerated readers for prevalent formats in data science.

In addition, Magnum IO GPUDirect Storage (GDS) enables cuDF to speed up I/O by loading the data directly from storage into the device (GPU) memory. By providing a direct data path across the PCIe bus between the GPU and compatible storage (for example, Non-Volatile Memory Express (NVMe) drive), GDS can enable up to 3–4x higher cuDF read throughput, with an average of 30–50% higher throughput across a variety of data profiles.

In this post, we provide an overview of GPUDirect Storage and how it’s integrated into cuDF. We introduce the suite of benchmarks that we used to evaluate I/O performance. Then, we walk through the techniques that cuDF implements to optimize GDS reads. We conclude with the benchmark results and identify cases where use of GDS is the most beneficial for cuDF.

What is GPUDirect Storage?

GPUDirect Storage is a new technology that enables direct data transfer between local or remote storage (as a block device or through a file system interface) and GPU memory.

In other words, a direct memory access (DMA) engine can now quickly move data on a direct path from the storage to GPU memory, without increasing latency and burdening the CPU with extra copying through a bounce buffer.

Figure 1 shows the data flow patterns without and with GDS. When using the system memory bounce buffer, throughput is bottlenecked by the bandwidth from system memory to GPUs. On systems with multiple GPUs and NVMe drives, this bottleneck is even more pronounced.

However, with the ability to directly interface GPUs with storage devices, the intermediary CPU is taken out of the data path, and all CPU resources can be used for feature engineering or other preprocessing tasks.

Figure 1. Schematic (left) shows memory access patterns in a system where a system memory bounce buffer is used to read data from NVMe drives into GPU memory. A second schematic (right) shows the direct data path between NVMe drives and GPU memory over the PCIe bus.

The GDS cuFile library enables applications and frameworks to leverage GDS technology to increase bandwidth and achieve lower latency. cuFile has been available as part of the CUDA Toolkit since version 11.4.

How does RAPIDS cuDF use GPUDirect Storage?

To increase the end-to-end read throughput, cuDF uses the cuFile APIs in its data ingest interfaces, like read_parquet and read_orc. As cuDF performs nearly all parsing on the GPU, most of the data is not needed in system memory and can be directly transferred to GPU memory.

As a result, only the metadata portion of input files is accessed by the CPU, and the metadata constitutes only a small part of the total file size. This permits the cuDF data ingest API to make efficient use of GDS technology.

Since cuDF 22.02, we have enabled the use of GDS by default. If you have cuFile installed and request to read data from a supported storage location, cuDF automatically performs direct reads through the cuFile API. If GDS is not applicable to the input file, or if the user disables GDS, data follows the bounce buffer path and is copied through a paged system memory buffer.
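As a quick illustration, here is a minimal sketch of reading a Parquet file with cuDF and of toggling the direct read path through an environment variable (LIBCUDF_CUFILE_POLICY is the variable documented in GPUDirect Storage Integration; the file path is an assumption):

import os

# Must be set before cudf is imported. "GDS" (the default since 22.02) uses cuFile
# when the file and system support it; "OFF" forces the bounce-buffer path.
os.environ["LIBCUDF_CUFILE_POLICY"] = "GDS"

import cudf

# Column data is read directly into GPU memory when GDS is available.
df = cudf.read_parquet("/mnt/nvme/data.parquet")
print(df.head())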

Benchmarking data ingest in cuDF

In response to the diversity of data science datasets, we assembled a benchmark suite that targets key data and file properties. Previous examples of I/O benchmarking used data samples such as the New York Taxi trip record data, the Yelp reviews dataset, or Zillow housing data to represent general use cases. However, we discovered that benchmarking specific combinations of data and file properties has been crucial for performance analysis and troubleshooting.

Covered combinations of data and file properties

Table 1 shows the parameters that we varied in our benchmarks for binary formats. We generated pseudo-random data across the range of supported data types and collected benchmarks for each group of related types. We also independently varied the run-length and cardinality of the data to exercise run-length encoding and dictionary encoding of the target formats.

Finally, we varied the compression type in the generated files, for all values of the properties preceding. We currently support two options: Use Snappy or keep the data uncompressed. 

Data property   Description
File format     Parquet, ORC
Data type       Integer (signed and unsigned); Float (32-, 64-bit); String (90% ASCII, 10% Unicode);
                Timestamp (day, s, ms, us, ns); Decimal (32-, 64-, 128-bit); List (nesting depth 2, int32 elements)
Run-length      Average data value repetition (1x or 32x)
Cardinality     Unique data values (unlimited or 1000)
Compression     Lossless compression type (Snappy or none)
Table 1. Details on the data properties studied in the cuDF Parquet and ORC benchmark suite

With all of the properties that change independently, we have a large number of varied cases: 48 cases for each file format (ORC and Parquet). To ensure fair comparison between cases, we targeted a fixed in-memory dataframe size and column count, with a default size of 512 MiB. For the purpose of this post, we introduced an additional parameter to control the target in-memory dataframe size from 64 to 4,096 MiB.

The full cuDF benchmark suite is available in the open-source cuDF repository. To focus on the impact of I/O, this post only includes the inputs benchmarks in cuDF rather than reader options or row/column selection benchmarks.

Optimizing cuDF+GDS file read throughput

Data in formats like ORC and Parquet are stored in independent chunks, which we read in separate operations. We found that the read bandwidth is not saturated when these operations are performed in a single thread, leading to suboptimal performance regardless of GDS use.

Issuing multiple GDS read calls in parallel enables overlapping of multiple storage-to-GPU copies, raising the throughput and potentially saturating the read bandwidth. As of version 1.2.1.4, cuFile calls are synchronous, so parallelism requires multithreading from downstream users.

Controlling the level of parallelism

As we did not control the file layout and number of data chunks, creating a separate thread for each read operation could have generated an excessive number of threads, leading to a performance overhead. On the other hand, if we only had a few large read calls, we might not have had enough threads to effectively saturate the read bandwidth of the storage hardware.

To control the level of parallelism, we used a thread pool and sliced larger read calls into smaller reads that could be performed in parallel.

When we begin reading a file using GDS, we create a data ingest pipeline based on the thread-pool work of Barak Shoshany from Brock University. As Figure 2 shows, we split each file read operation (read_async call) in the cuDF reader into fixed-size cuFile calls; in most cases, only the last slice is smaller.

Instead of directly calling cuFile to read the data, a task is created for each slice and placed in a queue. The read_async call returns a task that waits for all of the sliced tasks to be completed. Callers can perform other work and then wait on the aggregate task when the requested data is required to continue. When available, threads from the pool dequeue the tasks and make the synchronous cuFile calls.

Upon completion, a thread sets the task as completed and becomes available to execute the next task in the queue. As most tasks read the same amount of data, cuFile reads are effectively load-balanced between the threads. After all tasks from a given read_async call are completed, the aggregate task is complete and the caller code can continue, as the requested data range has been uploaded to the GPU memory.
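The actual pipeline is implemented in libcudf's C++ layer; purely to illustrate the slicing pattern described above, here is a Python sketch built on a standard thread pool (the slice size, file name, and read function are assumptions, not cuDF internals):

import concurrent.futures

SLICE_SIZE = 4 * 1024 * 1024  # fixed slice size; the last slice of a read may be smaller

def read_slice(path, offset, size):
    # stand-in for one synchronous cuFile read of a single slice
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def read_async(pool, path, offset, size):
    # split one logical read into fixed-size slices and queue a task per slice
    futures = []
    for start in range(offset, offset + size, SLICE_SIZE):
        length = min(SLICE_SIZE, offset + size - start)
        futures.append(pool.submit(read_slice, path, start, length))
    return futures  # the caller waits on these only when the data is needed

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    tasks = read_async(pool, "data.parquet", 0, 64 * 1024 * 1024)
    data = b"".join(task.result() for task in tasks)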

Figure 2. (left) Execution diagram of asynchronous read calls in cuDF: each read_async call is split into tasks that are completed by threads in the thread pool. (right) Impact of thread count and slice size on the average end-to-end read throughput across the benchmark suite; the values suggest an optimal slice size of around 4 MiB and an optimal thread count of 16.

Optimal level of parallelism

The size of the thread pool and the read slices greatly impacts the number of GDS read calls executed at the same time. To maximize throughput, we experimented with a range of different values for these parameters and identified the configuration that gives the highest throughput. 

Figure 2 shows a performance optimization study on cuDF+GDS data ingest with tunable parameters of thread count and slice size. For this study, we defined the end-to-end read throughput as the binary format file size divided by the total ingest time (I/O plus parsing and decode).

The throughput shown in Figure 2 is averaged over the entire benchmark suite and based on a dataframe size of 512 MiB. We found that Parquet and ORC binary formats share the same optimization behavior across the parameter range, also shown in Figure 2. Small slice sizes cause higher overhead due to the large number of read tasks, and large slice sizes lead to poor utilization of the thread pool.

In addition, lower thread counts do not provide enough parallelism to saturate the NVMe read bandwidth, and higher thread counts lead to thread overhead. We discovered that the highest end-to-end read throughput was for 8-32 threads and 1-16 MiB slice size.

Depending on the use case, you may find optimal performance with different values. This is why, as of 22.04, the default thread pool size of 16 and slice size of 4 MiB are configurable through environment variables, as described in GPUDirect Storage Integration.
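For example, the thread count and slice size can be overridden before importing cuDF (variable names as listed in the GPUDirect Storage Integration documentation; verify them against your cuDF version):

import os

# Values shown are the 22.04 defaults described above; set before importing cudf.
os.environ["LIBCUDF_CUFILE_THREAD_COUNT"] = "16"
os.environ["LIBCUDF_CUFILE_SLICE_SIZE"] = str(4 * 1024 * 1024)  # in bytes

import cudf
df = cudf.read_parquet("/mnt/nvme/data.parquet")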

Impact of GDS on cuDF data ingest

We evaluated the effect that GDS use had on the performance of a few cuDF data ingest APIs using the benchmark suite described earlier. To isolate the impact of GDS, we ran the same set of benchmarks with default configuration (uses GDS) and with GDS manually disabled through an environment variable.

For accurate I/O benchmarking with cuDF benchmarks, file system caches must be cleared before each new read operation. Benchmarks measure the total time to execute a read_orc or read_parquet call (read time) and these are the values we used for comparison. GDS speedup was computed for each benchmark case file as the read time through the bounce buffer divided by the total ingest time through the direct data path.

For the GDS performance comparison, we used an NVIDIA DGX-2 with V100 GPUs and NVMe SSDs connected in a RAID 0 configuration. All benchmarks use a single GPU and read from a single RAID 0 array of two 3.84-TB NVMe SSDs.

Larger files, higher speedups

Figure 3 shows the GDS speedup for a variety of files with an in-memory dataframe size of 4,096 MiB. We observed that the performance improvement varies widely, between a modest 10% and an impressive 270% increase over the host path.

Some of the variance is due to differences in the data decode. For example, list types show smaller speedups due to the higher parsing complexity, and decimal types in ORC show smaller speedups due to overhead in fixed point decoding compared to Parquet.

Figure 3. GDS speedup for data ingest with an in-memory data size of 4,096 MiB, where the Y-axis shows the data type and the X-axis shows the run-length (RL) and cardinality (C) for (left) Parquet and (right) ORC binary formats. GDS speedup values are shown for the uncompressed subset of the benchmark suite and range from 1.1x to 3.7x.

However, the decode process cannot account for the majority of the variance in the results. Given the high correlation between the results for ORC and Parquet, you can reasonably suspect that properties like run-length and cardinality play a major role. Both properties affect how well data can be compressed and encoded, so let's look at how the file size relates to these properties and, by proxy, to the GDS speedup.

Figure 4 shows the extent to which intrinsic properties of the data impact the binary file size needed to store a fixed in-memory data size. For 4,096 MiB of in-memory data, file sizes range from 0.1 to 5.8 GB across the benchmark cases without Snappy compression. As expected, the largest file sizes generally correspond to shorter run-length and higher cardinality because data with this profile is difficult to encode efficiently.

Figure 4. File size on disk for 4,096 MiB of in-memory data, where the Y-axis shows the data types and the X-axis shows the run-length (RL) and cardinality (C) for (left) Parquet and (right) ORC binary formats. File sizes are shown for the uncompressed subset of the benchmark suite and, depending on the data properties, vary by an order of magnitude.

Figures 3 and 4 show strong correlation between file size and the magnitude of GDS performance benefit. We hypothesize that this is because, when reading an efficiently encoded file, I/O takes less time compared to decode so there’s little room for improvement from GDS. On the other hand, the results show that reading of files that encode poorly is bottlenecked on the I/O.

Larger data, higher speedups

We applied the full benchmark suite described in Table 1 to inputs with in-memory sizes in the range of 64 to 4,096 MiB. Taking the mean speedup over the benchmark suite, we discovered that speedup from GDS generally increases with the data size. We measure consistent 30-50% speedup for 512 MiB and greater in-memory data sizes. For the largest datasets, we found that the Parquet reader benefits more from GDS than the ORC reader.

Figure 5. (left) Mean total ingest time with host reads and GDS reads for the ORC and Parquet binary formats; ingest time grows nearly linearly with data size. (right) Mean GDS speedup across the binary format cuDF benchmark suite, showing >40% speedup for >500 MB in-memory data sizes, with larger speedups at larger data sizes.

Summary

GPUDirect Storage provides a direct data path to the GPU, reducing latency and increasing throughput for I/O operations. RAPIDS cuDF leverages GDS in its data ingest APIs to unlock the full read bandwidth of your storage hardware.

GDS enables you to speed up cuDF data ingest workloads up to 3x, significantly improving the end-to-end performance of your workflows.

Apply your knowledge

If you have not tried out cuDF for your data processing workloads, we encourage you to test our latest 22.04 release. We provide Docker containers for our releases as well as our nightly builds. Conda packages are also available to make testing and deployment easier, so you can get started with RAPIDS cuDF right away.

If you are already using cuDF, we encourage you to give GPUDirect Storage a try. It is easier than ever to take advantage of high capacity and high-performance NVMe drives. If you already have the hardware, you are a few configuration steps away from unlocking peak performance in data ingest. Stay tuned for upcoming releases of cuDF.

For more information about storage I/O acceleration, see the Magnum IO GPUDirect Storage documentation.

Categories
Misc

NVIDIA Recommends Stockholders Reject ‘Mini-Tender’ Offer by Tutanota LLC

SANTA CLARA, Calif., May 27, 2022 — NVIDIA today announced that it recently became aware of an unsolicited “mini-tender” offer by Tutanota LLC to purchase up to 215,000 …

Categories
Misc

Prototyping Faster with the Newest UDF Enhancements in the NVIDIA cuDF API

This post highlights helpful new cuDF features that allow you to think about a single row of data and write code faster.

Over the past few releases, the NVIDIA cuDF team has added several new features to user-defined functions (UDFs) that can streamline the development process while improving overall performance. In this post, I walk through the new UDF enhancements and show how you can take advantage of them within your own applications:

  • The cuDF Series.apply API and how to use it
  • The cuDF DataFrame.apply API and how to write a UDF in terms of “rows”
  • Enhanced support for missing data using both apply APIs
  • A real-world use case example with timing
  • Practical considerations, limitations, and future plans

apply API for cuDF series

If you’re not familiar with pandas, series apply is the main entry point used for mapping an arbitrary Python function onto a single series of data. For example, you might want to convert temperature in Celsius to Fahrenheit using a formula already written as a Python function.

Here is a quick refresher followed by the output of running this code:
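A minimal sketch of such a conversion (the function name f and the sample values are assumptions; the printed output is shown as comments):

import pandas as pd

# convert a temperature from Celsius to Fahrenheit
def f(c):
    return 9 / 5 * c + 32

s = pd.Series([0.0, 25.0, 100.0])
print(s.apply(f))
# 0     32.0
# 1     77.0
# 2    212.0
# dtype: float64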


Technically, you can write any valid Python code within the function f, and pandas runs the function in a loop over the series. This makes apply extremely flexible in the context of pandas, as any UDF can be applied as long as it can handle all of the input data, even UDFs that rely on external libraries or that expect or return arbitrary Python objects.

But, this flexibility comes at the cost of performance. Running a Python function in a long loop is not known for being an efficient strategy for a variety of reasons (for example, Python being interpreted from the outset). As a result, this performance constraint can be frustrating if your UDFs are simpler, such as those composed of purely mathematical operations on scalar values.

Luckily, these use cases are what cuDF was built for. Recent cuDF improvements within the scope of UDFs have motivated the introduction of the equivalent apply API:
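In place of the original ctof_cudf.ipynb notebook, here is a minimal sketch, assuming the same f and sample values as in the pandas refresher above:

import cudf

def f(c):
    return 9 / 5 * c + 32

s = cudf.Series([0.0, 25.0, 100.0])
print(s.apply(f))  # f is compiled with Numba and executed on the GPU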


If you are familiar with pandas, this produces the same results as pandas does for numeric values. The only notable difference is that the resulting data always has a cuDF dtype rather than object, which is often the case in pandas.

The function f can contain any Python UDF that is composed of pure math or Python operations. cuDF deduces an appropriate return dtype based on the inspection of the function through Numba and compiles and runs an equivalent function on the GPU.

Functions can also be written to accept an arbitrary number of scalar arguments. In the following code example, you can see that args= is supported:
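In place of the cudf_args.ipynb notebook, a minimal sketch (the function and the scalar arguments are assumptions):

import cudf

# extra scalar arguments are forwarded to the UDF through args=
def scaled_add(x, scale, shift):
    return x * scale + shift

s = cudf.Series([1.0, 2.0, 3.0])
print(s.apply(scaled_add, args=(2.0, 0.5)))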


While there are other ways of accomplishing the same goal in cuDF using custom kernels and other methods, this method of writing UDFs helps to abstract the GPU away from the process, which can cut down on development time for data scientists working on fast-paced, real-world projects.

So far, I’ve covered only the case of Series-based data. That is, I’ve shown you how to write a UDF with a single input and output. Many use cases require multi-column input, however, and this requires slightly different thinking.

DataFrame UDFs and thinking in terms of rows

UDFs that expect multiple columns as input and produce a single column as output are the set of functions supported by the pandas DataFrame apply API.

In these cases, the first function argument represents a row of data rather than just one value from a single input column. By row, I mean some kind of data structure that is keyable to obtain values, where the keys are the column names and the values are the scalars corresponding to the values of those columns in that row. It is conceptually what you get when you use iloc in pandas:
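In place of the pd_iloc.ipynb notebook, a minimal sketch (the column names a and b are assumptions):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
row = df.iloc[0]            # a Series keyed by column name
print(row["a"], row["b"])   # 1 4.0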


The following code example shows how you would write and use a UDF in pandas that consumes this kind of row object:
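A minimal sketch of such a row-wise UDF in pandas (the function body and column names are assumptions):

import pandas as pd

# the UDF receives one row at a time and indexes it by column name
def f(row):
    return row["a"] + row["b"]

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.apply(f, axis=1))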


cuDF now enables you to do the exact same thing without rewriting your UDF.
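A minimal sketch, reusing the same assumed function and column names:

import cudf

def f(row):
    return row["a"] + row["b"]

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df.apply(f, axis=1))  # compiled to a CUDA kernel; no Python loop is run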


When applying these functions, it is important to note that even though the cuDF API expects you to write the functions in terms of rows, no rows are actually involved when it comes to the execution of this function.

cuDF avoids the use of a for-loop and instead executes CUDA kernels that “pretend” rows of data exist. With a little magic, Numba knows how to write a proper kernel to get the same result as pandas. Because there is no loop, you should see higher performance when executing functions through this API.

Support for missing values using the series and DataFrame apply

Historically, UDFs in cuDF have not provided full support for missing values. This is due to architectural choices inside cuDF that relate to the way cuDF records which elements are null, specifically its use of a null mask to conserve memory.

The looping design of pandas apply APIs just works if the data contains null values. If a null is encountered in the data, the UDF receives the special value pd.NA. As a result, if the special value does not trigger an error, the execution proceeds as normal. However, cuDF does not work this way, and it requires a little extra machinery to support the same functionality. If you use the cuDF apply API, you should find that your UDFs treat null values in a natural manner:
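A minimal sketch of this behavior (the function and data are assumptions):

import cudf

def f(x):
    return x + 1

s = cudf.Series([1, None, 3])
print(s.apply(f))  # the missing element stays missing (<NA>) in the result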


You can even condition on the cudf.NA singleton and get the expected answer, or return it directly from the function:
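A minimal sketch, assuming a small integer series:

import cudf

def f(x):
    if x is cudf.NA:    # test for a missing value
        return 42
    elif x > 2:
        return cudf.NA  # or return a missing value directly
    else:
        return x

s = cudf.Series([1, None, 3])
print(s.apply(f))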


Evidently, the same thing is true here as is the case with rows: cuDF does not actually run the Python function as pandas does. Instead, it uses more Numba magic to translate this class of functions into an equivalent CUDA kernel and then returns the result of that instead.

In the next section, I look at a real-world example and perform some rough timing.

Real-world example using apply

Consider this scenario: An online streaming service is investigating which segments of its subscribers tend to hold their subscriptions the longest. Additionally, leadership has requested a specific segmentation scheme that breaks subscribers up by age:

  • 18–19
  • 20–29
  • 30–39
  • 40–49
  • 50–59
  • 60–69
  • 69+

The provided data only has two fields: age and days_subscribed.

Here’s how a UDF can solve the problem. First, write the row-wise custom function that applies the grouping. Next, take the results, group by the group ID, and average over the number of renewals.
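A minimal sketch of this workflow (the column names age and days_subscribed come from the scenario above; the group IDs, the randomly generated data, and the use of days_subscribed as the metric are assumptions):

import cudf
import numpy as np

# map an age onto the segmentation scheme requested by leadership
def f(row):
    age = row["age"]
    if age < 20:
        return 0  # 18-19
    elif age < 30:
        return 1  # 20-29
    elif age < 40:
        return 2  # 30-39
    elif age < 50:
        return 3  # 40-49
    elif age < 60:
        return 4  # 50-59
    elif age < 70:
        return 5  # 60-69
    else:
        return 6  # 69+

# randomly generated sample data
df = cudf.DataFrame({
    "age": np.random.randint(18, 90, 10_000),
    "days_subscribed": np.random.randint(1, 1_000, 10_000),
})

df["group"] = df.apply(f, axis=1)
print(df.groupby("group")["days_subscribed"].mean())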


In this code example, the data is randomly generated so your mileage may vary on the actual answer. However, it demonstrates the process. Timing the UDF section of the code involves creating a pandas copy through pdf = df.to_pandas(), and accomplishing a rough comparison using IPython:

%timeit df.apply(f, axis=1)

# 1.64 ms ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit pdf.apply(f, axis=1)
# 19.2 s ± 63.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Although this is not an official benchmark, the CUDA kernel is over four orders of magnitude faster on average in this particular case, which was run on a 32 GB V100 GPU.

Practical considerations, limitations, and the future

While these cuDF improvements represent significantly broader capabilities than previous iterations, there is always room to grow. Here is a list of key items to consider when writing UDFs for apply in cuDF:

  • JIT compilation. The first time a function is executed against a cuDF object, you encounter overhead effects of compiling the correct CUDA kernel. Subsequent uses of the function do not require recompilation, unless the dtypes of the target dataset change.
  • dtype support. So far, only numeric dtypes are supported in apply. However, support for additional types is on the roadmap, starting with strings.
  • External libraries. A common pattern is performing data prep in pandas and then using an external library for processing inside the UDF for each row. Because you cannot map external code onto the GPU arbitrarily, this is not currently supported.

Summary

UDFs are an easy way of solving particular problems quickly. They help you think in terms of a single datum when designing the logic of your pipeline. With these new cuDF UDF enhancements, the aim is to expedite the development of workflows involving cuDF and allow you to quickly prototype solutions, as well as reuse existing business logic. In addition, null support lets you be explicit about how to handle missing values without needing extra processing steps.

As a reminder, UDFs are an area of active development in cuDF and updates are ongoing. If you choose to try these new UDF enhancements out, as always, I’d love to hear about your experience in the comments section.

Categories
Misc

Can anyone help me with this code? requests.post is not working. When I'm checking in Postman, the page keeps loading as if it's still sending the request.

from flask import Flask, request
import requests

app = Flask(__name__)

@app.route('/myapp/detectObjects', methods=['POST'])
def detect():
    imge = request.files.getlist("image")
    img = imge[0].filename
    req = requests.post('http://127.0.0.1:5000/give', json={"image": img})
    return req.json()

submitted by /u/Savings-Stop-3300
[visit reddit] [comments]

Categories
Misc

Jetson nano Ubuntu 18.04 AARCH64 architecture

I’m a beginner so please be patient with my ignorance.

I’m trying to install TensorFlow, but there doesn’t seem to be an install method that is supported by the aarch64 architecture.

Has anyone dealt with and solved this issue? If so…. Wtf do I do.. I’m about 25 hours into attempting this install

submitted by /u/DamnitName
[visit reddit] [comments]

Categories
Misc

Cats and Dogs Kaggle dataset corrupt jpeg images

I’m just trying to train the Kaggle Cats and Dogs dataset but I keep getting the error

tensorflow.python.framework.errors_impl.InvalidArgumentError: Input size should match (header_size + row_size * abs_height) but they differ by 2

[[{{node decode_image/DecodeImage}}]] [Op:IteratorGetNext]

Along with:

Corrupt JPEG data: 242 extraneous bytes before marker 0xd9

Corrupt JPEG data: 217 extraneous bytes before marker 0xd9

Corrupt JPEG data: 133 extraneous bytes before marker 0xd9

This happens even though I implemented 3 different methods to check for and remove corrupt image files, so I have no idea why I’m still getting this error. Please help!

CODE:

# check for corrupted files (method 1)
import os
import glob
import imghdr
from pathlib import Path
from PIL import Image
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

classes = ['Cat', 'Dog']
image_extensions = [".png", ".jpg"]  # add there all your images file extensions
for item in classes:
    data_dir = f'kagglecatsanddogs_5340/PetImages/{item}'
    img_type_accepted_by_tf = ["bmp", "gif", "jpeg", "png"]
    for filepath in Path(data_dir).rglob("*"):
        if filepath.suffix.lower() in image_extensions:
            img_type = imghdr.what(filepath)
            if img_type is None:
                print(f"{filepath} is not an image")
                # remove image
                os.remove(filepath)
                print(f"successfully removed {filepath}")
            elif img_type not in img_type_accepted_by_tf:
                print(f"{filepath} is a {img_type}, not accepted by TensorFlow")
            try:
                img = Image.open(filepath)  # open the image file
                img.verify()  # verify that it is, in fact an image
                # print('valid image')
            except Exception:
                print('Bad file:', filepath)

# method 2
img_paths = glob.glob(os.path.join('kagglecatsanddogs_5340/PetImages/Dog', '*/*.*'))  # assuming you point to the directory containing the label folders.
bad_paths = []
for image_path in img_paths:
    try:
        img_bytes = tf.io.read_file(image_path)
        decoded_img = tf.io.decode_image(img_bytes)
    except Exception:
        print(f"Found bad path {image_path}")
        bad_paths.append(image_path)
    else:
        print(f"{image_path}: OK")

print("BAD PATHS:")
for bad_path in bad_paths:
    print(f"{bad_path}")

# method 3
num_skipped = 0
for folder_name in ("Cat", "Dog"):
    folder_path = os.path.join("kagglecatsanddogs_5340/PetImages", folder_name)
    for fname in os.listdir(folder_path):
        fpath = os.path.join(folder_path, fname)
        try:
            fobj = open(fpath, "rb")
            is_jfif = tf.compat.as_bytes("JFIF") in fobj.peek(10)
        finally:
            fobj.close()
        if not is_jfif:
            num_skipped += 1
            # Delete corrupted image
            os.remove(fpath)
print("Deleted %d images" % num_skipped)


def normalize(x, y):
    x = tf.cast(x, tf.float32) / 255.0
    return x, y


def convert_to_categorical(input):
    if input == 1:
        return "Dog"
    else:
        return "Cat"


def to_list(ds):
    ds_list = []
    for sample in ds:
        image, label = sample
        ds_list.append((image, label))
    return ds_list


# load dataset
directory = 'archive/PetImages'
ds_train = tf.keras.utils.image_dataset_from_directory(
    directory,
    labels='inferred',
    label_mode='binary',
    color_mode='rgb',
    batch_size=1,
    shuffle=False,
    validation_split=0.3,
    subset='training',
    image_size=(180, 180)
)
ds_test = tf.keras.utils.image_dataset_from_directory(
    directory,
    labels='inferred',
    label_mode='binary',
    color_mode='rgb',
    batch_size=1,
    shuffle=False,
    validation_split=0.3,
    subset='validation',
    image_size=(180, 180)
)

# normalize data (map returns a new dataset, so the result must be assigned)
ds_train = ds_train.map(normalize)
ds_test = ds_test.map(normalize)

# plot 10 random images from training set
num = len(ds_train)
ds_train_list = to_list(ds_train)
for i in range(1, 11):
    random_index = np.random.randint(num)
    img, label = ds_train_list[random_index]
    label = convert_to_categorical(np.array(label))
    img = np.reshape(img, (180, 180, 3))  # match image_size above
    plt.subplot(2, 5, i)
    plt.imshow(img)
    plt.title(label)
plt.savefig('figures/example_images.png')

submitted by /u/berimbolo21
[visit reddit] [comments]

Categories
Misc

Plotting example images from TensorFlow dataset not working

I’m trying to plot a few example images using Matplotlib from a cats and dogs Kaggle dataset that I loaded into the script using TensorFlow’s image_dataset_from_directory, but the images aren’t displaying correctly. The Matplotlib plots are either empty or contain some speckled blue or yellow dots… Does anyone know how to fix this? (code below)

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt


def normalize(x, y):
    x = tf.cast(x, tf.float32) / 255.0
    return x, y


def convert_to_categorical(input):
    if input == 1:
        return "Dog"
    else:
        return "Cat"


def to_list(ds):
    ds_list = []
    for sample in ds:
        image, label = sample
        ds_list.append((image, label))
    return ds_list


# load dataset
directory = 'train'
ds_train = tf.keras.utils.image_dataset_from_directory(
    directory,
    labels='inferred',
    label_mode='binary',
    batch_size=1,
    shuffle=False,
    validation_split=0.3,
    subset='training',
    image_size=(300, 300)
)
ds_test = tf.keras.utils.image_dataset_from_directory(
    directory,
    labels='inferred',
    label_mode='binary',
    batch_size=1,
    shuffle=False,
    validation_split=0.3,
    subset='validation',
    image_size=(300, 300)
)

# normalize data (map returns a new dataset, so the result must be assigned)
ds_train = ds_train.map(normalize)
ds_test = ds_test.map(normalize)

# plot 10 random images from training set
num = len(ds_train)
ds_train_list = to_list(ds_train)
for i in range(1, 11):
    random_index = np.random.randint(num)
    img, label = ds_train_list[random_index]
    label = convert_to_categorical(np.array(label))
    img = np.reshape(img, (300, 300, 3))
    plt.subplot(2, 5, i)
    plt.imshow(img)
    plt.title(label)
plt.savefig('figures/example_images.png')

submitted by /u/berimbolo21
[visit reddit] [comments]