Categories
Misc

The Man With 100,000 Brains: AI’s Big Donation to Science

Jorge Cardoso wears many hats, and that’s appropriate given he has so many brains. A hundred thousand of them, to be exact. Cardoso is a teacher, a CTO, an entrepreneur, a founding member of the MONAI open source consortium, and a researcher in AI for medical imaging. In that last role, Cardoso and his team…

The post The Man With 100,000 Brains: AI’s Big Donation to Science appeared first on NVIDIA Blog.


NVIDIA Accelerates AI, Digital Twins, Quantum Computing and Edge HPC at ISC 2022

Researchers grappling with today’s grand challenges are getting traction with accelerated computing, as showcased at ISC, Europe’s annual gathering of supercomputing experts. Some are building digital twins to simulate new energy sources. Some use AI+HPC to peer deep into the human brain. Others are taking HPC to the edge with highly sensitive instruments or accelerating…

The post NVIDIA Accelerates AI, Digital Twins, Quantum Computing and Edge HPC at ISC 2022 appeared first on NVIDIA Blog.


Top Global Systems Makers Accelerate Adoption of NVIDIA Grace and Grace Hopper

Atos, Dell Technologies, GIGABYTE, Hewlett Packard Enterprise, Inspur, Lenovo and Supermicro Join First Wave Planning NVIDIA Grace-Powered HGX Systems for HPC and AISANTA CLARA, Calif., May …


Why are some models in .lite format and others in .tflite format?

submitted by /u/MatsudaYagami


Hyperscale Digital Twins to Give Us “Amazing Superpowers,” NVIDIA Exec Says at ISC 2022

Highly accurate digital representations of physical objects or systems, or “digital twins,” will enable the next era of industrial virtualization and AI, executives from NVIDIA and BMW said Tuesday. Kicking off the ISC 2022 conference in Hamburg, Germany, NVIDIA’s Rev Lebaredian (left), vice president for Omniverse and simulation technology, was joined by Michele Melchiorre, senior…

The post Hyperscale Digital Twins to Give Us “Amazing Superpowers,” NVIDIA Exec Says at ISC 2022 appeared first on NVIDIA Blog.


Is anyone interested in this project?

submitted by /u/7NoteDancing

ImportError: cannot import name ‘dtensor’ from ‘tensorflow.compat.v2.experimental’


submitted by /u/Ok_Wish4469


Using Fortran Standard Parallel Programming for GPU Acceleration

We present lessons learned from refactoring a Fortran application to use modern do concurrent loops in place of OpenACC for GPU acceleration.

This is the fourth post in the Standard Parallel Programming series, which aims to show developers the advantages of using parallelism in standard languages for accelerated computing.

Standard languages have begun adding features that compilers can use for accelerated GPU and CPU parallel programming, for instance, do concurrent loops and array math intrinsics in Fortran.

Using standard language features offers many advantages, the chief one being future-proofing. Because Fortran’s do concurrent is a standard language feature, the chances of support being lost in the future are slim.

This feature is also relatively simple to use in initial code development, and it adds portability and parallelism. Using do concurrent from the start also encourages you to think about parallelism as you write and implement each loop.

For initial code development, do concurrent is a great way to add GPU support without having to learn directives. However, even code that has already been GPU-accelerated through the use of directives like OpenACC and OpenMP can benefit from refactoring to standard parallelism for the following reasons:

  • Cleaning up the code for those who do not know directives, or removing the large number of directives cluttering the source code.
  • Increasing the portability of the code in terms of vendor support and longevity of support.
  • Future-proofing the code, as ISO-standard languages have a proven track record for stability and portability.

Replacing directives on multi-core CPUs and GPUs

POT3D is a Fortran code that computes potential field solutions to approximate the solar coronal magnetic field using surface field observations as input. It continues to be used for numerous studies of coronal structure and dynamics.

The code is highly parallelized using MPI and is GPU-accelerated with MPI + OpenACC. It is open source and available on GitHub. It is also part of the SpecHPC 2021 Benchmark Suites.

We recently refactored another code example with do concurrent at WACCPD 2021. The results showed that you could replace directives with do concurrent without losing performance on multicore CPUs and GPUs. However, that code was somewhat simple in that it used no MPI.

Now, we want to explore replacing directives in more complicated code. POT3D contains nontrivial features for standard Fortran parallelism to handle: reductions, atomics, CUDA-aware MPI, and local stack arrays. We want to see if do concurrent can replace directives and retain the same performance.

To establish a performance baseline for refactoring the code to do concurrent, first review the initial timings of the original code in Figure 1. The CPU result was run on 64 MPI ranks (32 per socket) on a dual-socket AMD EPYC 7742 server, while the GPU result was run on one MPI rank on an NVIDIA A100 (40GB). The GPU code relies on data movement directives for data transfers (we do not use managed memory here) and is compiled with -acc=gpu -gpu=cc80,cuda11.5. The runtimes are an average over four runs.

Table 1 shows the number of lines of code and directives for the current version of the code. There are 80 directives, a number we hope to reduce by refactoring with do concurrent.

Figure 1. CPU and GPU timings for the original version of the POT3D code: 6,717.4 seconds on a 64-core CPU compared to 1,481.3 seconds using OpenACC on an NVIDIA A100 40GB GPU
  POT3D (Original)
Fortran 3,487
Comments 3,452
OpenACC Directives 80
Total 7,019
Table 1. Code facts for POT3D including Fortran, comments, OpenACC Directives, and total lines.

Do concurrent and OpenACC

Here are some examples comparing do concurrent to OpenACC in POT3D, starting with a triple-nested loop parallelized with OpenACC:

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
…
!$acc parallel loop default(present) collapse(3) async(1)
do k=1,np
 do j=1,nt
  do i=1,nrm1
   br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
  enddo
 enddo
enddo
…
!$acc wait
!$acc exit data delete(phi,dr_i,br)

As mentioned earlier, this OpenACC code is compiled with the flag -acc=gpu -gpu=cc80,cuda11.5 to run on an NVIDIA GPU.

You can parallelize this same loop with do concurrent and rely on NVIDIA CUDA Unified Memory for data movement instead of directives. This results in the following code:

do concurrent (k=1:np,j=1:nt,i=1:nrm1)
 br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo

As you can see, the loop has been condensed from 12 lines to three, and CPU portability and GPU parallelism are retained with the nvfortran compiler from the NVIDIA HPC SDK.

This reduction in the number of lines is thanks to collapsing multiple loops into one loop and relying on managed memory, which removes all data movement directives. Compile this code for the GPU with -stdpar=gpu -gpu=cc80,cuda11.5.

For nvfortran, activating standard parallelism (-stdpar=gpu) automatically activates managed memory. To use OpenACC directives to control data movement along with do concurrent, use the following flags: -acc=gpu -gpu=nomanaged.

The nvfortran implementation of do concurrent also allows for locality of variables to be defined:

do concurrent (i=1:N, j(i)>0) local(M) shared(J,K)
 M = mod(K(i), J(i))
 K(i) = K(i) - M
enddo

This may be necessary for some code. For POT3D, the default locality of variables performs as needed; with nvfortran, the default locality rules are the same as in OpenACC.

do concurrent CPU performance and GPU implementation

Replacing all the OpenACC loops with do concurrent and relying on managed memory for data movement yields code with zero directives and fewer lines: we removed 80 directives and 66 lines of Fortran.

Figure 2 shows that this do concurrent version of the code has nearly the same CPU performance as the original GitHub code, meaning CPU compatibility has not been broken by using do concurrent. In addition, multicore parallelism is now available by compiling with the flag -stdpar=multicore.

Figure 2. CPU timings using MPI (32 ranks per socket, 64 cores) for the original (6,717.4 seconds) and do concurrent (6,767.9 seconds) versions of the POT3D code

Unlike on the CPU, running POT3D on a GPU requires adding a couple of directives.

First, to take advantage of multiple GPUs with MPI, you need a directive to specify the GPU device number. Otherwise, all MPI ranks would use the same GPU.

!$acc set device_num(mpi_shared_rank_num)

In this example, mpi_shared_rank_num is the MPI rank within a node. It is assumed that the code is launched with as many MPI ranks per node as there are GPUs per node. The same effect can be achieved by setting CUDA_VISIBLE_DEVICES for each MPI rank, but we prefer doing this programmatically.
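The node-local rank computation can be illustrated outside of Fortran. The following Python sketch is purely illustrative (the function name and host list are ours; a real code would typically derive the shared rank from an MPI shared-memory communicator):

```python
def shared_rank_num(rank, hostnames):
    """Index of `rank` among the ranks running on the same host.

    rank      -- this process's global MPI rank
    hostnames -- hostnames[r] gives the host of global rank r
    """
    host = hostnames[rank]
    # Ranks on the same node, in global-rank order.
    local_ranks = [r for r, h in enumerate(hostnames) if h == host]
    return local_ranks.index(rank)

# Two nodes with two GPUs each: ranks 0-1 on node-a, ranks 2-3 on node-b.
hosts = ["node-a", "node-a", "node-b", "node-b"]
device_nums = [shared_rank_num(r, hosts) for r in range(4)]
print(device_nums)  # [0, 1, 0, 1]
```

Each node’s ranks map onto devices 0 and 1, so no two ranks on a node share a GPU.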

When using managed memory with multiple GPUs, make sure that the device selection (such as !$acc set device_num(N)) is done before any data is allocated. Otherwise, an additional CUDA context is created, introducing additional overhead.

Currently, the nvfortran compiler does not support array reductions on concurrent loops, which are required in two places in the code. Fortunately, an OpenACC atomic directive can be used in place of an array reduction:

do concurrent (k=2:npm1,i=1:nr)
!$acc atomic
 sum0(i)=sum0(i)+x(i,2,k)*dph(k)*pl_i
enddo

After adding this directive, change the compiler options to enable OpenACC explicitly by using -stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5. This allows you to use only three OpenACC directives. This is the closest this code can come to not having directives at this time.

All the data movement directives are unnecessary, since CUDA managed memory is used for all data structures. Table 2 shows the number of directives and lines of code needed for this version of POT3D.

  POT3D (Original) POT3D (Do Concurrent) Difference
Fortran 3,487 3,421 (-66)
Comments 3,452 3,448 (-4)
OpenACC Directives 80 3 (-77)
Total 7,019 6,872 (-147)
Table 2. Number of lines of code for the GPU compatible do concurrent version of POT3D with a breakdown of the number of Fortran lines, directives, and comments lines.

For the reduction loops in POT3D, we relied on implicit reductions, but that may not always work. Recently, nvfortran added the upcoming Fortran 202X reduce clause, which can be used on reduction loops as follows:

do concurrent (i=1:N) reduce(+:cgdot)
 cgdot=cgdot+x(i)*y(i)
enddo

GPU performance, unified memory, and data movement

You’ve developed code with the minimal number of OpenACC directives and do concurrent that relies on managed memory for data movement. This is as close to directive-free as the code can get at this time.

Figure 3 shows that this code version takes a small performance hit of ~10% when compared to the original OpenACC GPU code. The cause of this could be do concurrent, managed memory, or a combination.

Figure 3. GPU timings for the GitHub (1,481.3 seconds) and standard parallelism (STDPAR) with minimal OpenACC (1,644.5 seconds) versions of the POT3D code on one NVIDIA A100 40GB GPU

To see if managed memory causes the small performance hit, compile the original GitHub code with managed memory turned on. This is done by using the compile flag -gpu=managed in addition to the standard OpenACC flags used before for the GPU.

Figure 4 shows that the GitHub code now performs similarly to the minimal-directives code with managed memory. This means that the culprit for the small performance loss is managed memory.

Figure 4. GPU timings for the GitHub (data directives: 1,481.3 seconds; managed memory: 1,634.3 seconds) and STDPAR with minimal OpenACC (1,644.5 seconds) versions of the POT3D code

To regain the performance of the original code with the minimal directives code, you must add the data movement directives back in. This mixture of do concurrent and data movement directives would look like the following code example:

!$acc enter data copyin(phi,dr_i)
!$acc enter data create(br)
do concurrent (k=1:np,j=1:nt,i=1:nrm1)
 br(i,j,k)=(phi(i+1,j,k)-phi(i,j,k))*dr_i(i)
enddo
!$acc exit data delete(phi,dr_i,br)

This results in the code having 41 directives, with 38 being responsible for data movement. To compile the code and rely on the data movement directives, run the following command:

-stdpar=gpu -acc=gpu -gpu=cc80,cuda11.5,nomanaged

nomanaged turns off managed memory, and -acc=gpu turns on recognition of the OpenACC directives.

Figure 5 shows nearly the same performance as the original GitHub code, with 50% fewer directives!

Figure 5. GPU timings for the GitHub (no managed: 1,481.3 seconds; managed: 1,634.3 seconds), STDPAR + minimal OpenACC (managed: 1,644.5 seconds), and STDPAR + OpenACC (no managed: 1,486.2 seconds) versions of the POT3D code

MPI + DO CONCURRENT scaling

Figure 6 shows the timing results using multiple GPUs. The primary takeaway is that do concurrent works with MPI across multiple GPUs.

Looking at the codes with managed memory turned on (blue lines), you can see that the original code and the minimal-directives code give nearly the same performance as more GPUs are used.

Looking at the codes with managed memory turned off (green lines), you can again see the same scaling between the original GitHub code and the do concurrent version of the code. This indicates that do concurrent works with MPI and has no impact on the scaling you should see.

What you might also notice is that managed memory incurs an overhead that grows as the GPUs are scaled. The managed memory runs (blue lines) and the data directive runs (green lines) remain parallel to each other, meaning the overhead scales with the number of GPUs.

Figure 6. GPU strong scaling across 1, 2, 4, and 8 GPUs for the GitHub (managed and no managed), STDPAR + minimal OpenACC (managed), and STDPAR + OpenACC (no managed) versions of the POT3D code. The two versions using managed memory scale similarly, as do the two versions using OpenACC data directives.

Fortran standard parallel programming review

You may be wondering, “Standard Fortran sounds too good to be true, what is the catch?”

Fortran standard parallel programming enables cleaner-looking code and increases its future-proofing by relying on ISO language standards. With the latest nvfortran compiler, you gain all the benefits mentioned earlier.

Although you lose the current GCC OpenACC/MP GPU support when you transition to do concurrent, we expect more GPU support in the future as other vendors add support for do concurrent on GPUs. Given the track record of ISO language standards, we believe that this support will come.

Using do concurrent currently comes with a small number of limitations, namely the lack of support for atomics, device selection, asynchrony, and fine-grained control of data movement. As we have shown, however, each of these limitations can easily be worked around with compiler directives, and far fewer directives are required thanks to the native parallel language features in Fortran.

Ready to get started? Download the free NVIDIA HPC SDK, and start testing! If you are also interested in more details on our findings, see the From Directives to DO CONCURRENT: A Case Study in Standard Parallelism GTC session. For more information about standard language parallelism, see Developing Accelerated Code with Standard Language Parallelism.

Acknowledgments

This work was supported by the National Science Foundation, NASA, and the Air Force Office of Scientific Research. Computational resources were provided by the Computational Science Resource Center at San Diego State University.


Boosting Data Ingest Throughput with GPUDirect Storage and RAPIDS cuDF

Learn how RAPIDS cuDF accelerates data science with the help of GPUDirect Storage. Dive into the techniques that minimize the time to upload data to the GPU.

If you work in data analytics, you know that data ingest is often the bottleneck of data preprocessing workflows. Getting data from storage and decoding it can often be one of the most time-consuming steps in the workflow because of the data volume and the complexity of commonly used formats. Optimizing data ingest can greatly reduce this bottleneck for data scientists working on large data sets.

RAPIDS cuDF greatly speeds up data decode by implementing CUDA-accelerated readers for prevalent formats in data science.

In addition, Magnum IO GPUDirect Storage (GDS) enables cuDF to speed up I/O by loading the data directly from storage into device (GPU) memory. By providing a direct data path across the PCIe bus between the GPU and compatible storage (for example, a Non-Volatile Memory Express (NVMe) drive), GDS can enable up to 3–4x higher cuDF read throughput, with an average of 30–50% higher throughput across a variety of data profiles.

In this post, we provide an overview of GPUDirect Storage and how it’s integrated into cuDF. We introduce the suite of benchmarks that we used to evaluate I/O performance. Then, we walk through the techniques that cuDF implements to optimize GDS reads. We conclude with the benchmark results and identify cases where use of GDS is the most beneficial for cuDF.

What is GPUDirect Storage?

GPUDirect Storage is a new technology that enables direct data transfer between local or remote storage (as a block device or through a file system interface) and GPU memory.

In other words, a direct memory access (DMA) engine can now quickly move data on a direct path from the storage to GPU memory, without increasing latency and burdening the CPU with extra copying through a bounce buffer.

Figure 1 shows the data flow patterns without and with GDS. When using the system memory bounce buffer, throughput is bottlenecked by the bandwidth from system memory to GPUs. On systems with multiple GPUs and NVMe drives, this bottleneck is even more pronounced.

However, with the ability to directly interface GPUs with storage devices, the intermediary CPU is taken out of the data path, and all CPU resources can be used for feature engineering or other preprocessing tasks.

Figure 1. Schematic (left) shows memory access patterns in a system where a system memory bounce buffer is used to read data from NVMe drives into GPU memory. A second schematic (right) shows the direct data path between NVMe drives and GPU memory over the PCIe bus.

The GDS cuFile library enables applications and frameworks to leverage GDS technology for higher bandwidth and lower latency. cuFile has been available as part of the CUDA Toolkit since version 11.4.

How does RAPIDS cuDF use GPUDirect Storage?

To increase the end-to-end read throughput, cuDF uses the cuFile APIs in its data ingest interfaces, like read_parquet and read_orc. As cuDF performs nearly all parsing on the GPU, most of the data is not needed in system memory and can be directly transferred to GPU memory.

As a result, only the metadata portion of input files is accessed by the CPU, and the metadata constitutes only a small part of the total file size. This permits the cuDF data ingest API to make efficient use of GDS technology.

Since cuDF 22.02, we have enabled the use of GDS by default. If you have cuFile installed and request to read data from a supported storage location, cuDF automatically performs direct reads through the cuFile API. If GDS is not applicable to the input file, or if the user disables GDS, data follows the bounce buffer path and is copied through a paged system memory buffer.

Benchmarking data ingest in cuDF

In response to the diversity of data science datasets, we assembled a benchmark suite that targets key data and file properties. Previous examples of I/O benchmarking used data samples such as the New York Taxi trip records, the Yelp reviews dataset, or Zillow housing data to represent general use cases. However, we discovered that benchmarking specific combinations of data and file properties has been crucial for performance analysis and troubleshooting.

Covered combinations of data and file properties

Table 1 shows the parameters that we varied in our benchmarks for binary formats. We generated pseudo-random data across the range of supported data types and collected benchmarks for each group of related types. We also independently varied the run-length and cardinality of the data to exercise run-length encoding and dictionary encoding of the target formats.

Finally, we varied the compression type in the generated files for all values of the preceding properties. We currently support two options: use Snappy or keep the data uncompressed.
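As a rough illustration of how such data can be produced (this is our own sketch, not the cuDF benchmark generator), an integer column with a target run-length and cardinality might be built along these lines:

```python
import random

def gen_int_column(num_rows, run_length=1, cardinality=None, seed=0):
    """Generate integers with a given average run-length and value-pool size.

    run_length  -- each drawn value repeats this many times (exercises
                   run-length encoding in the target format)
    cardinality -- number of distinct values to draw from (None = unlimited,
                   which defeats dictionary encoding)
    """
    rng = random.Random(seed)
    out = []
    while len(out) < num_rows:
        if cardinality is None:
            value = rng.randint(-2**31, 2**31 - 1)  # effectively unique values
        else:
            value = rng.randrange(cardinality)      # small, dictionary-friendly pool
        out.extend([value] * run_length)
    return out[:num_rows]

col = gen_int_column(1000, run_length=32, cardinality=1000)
print(len(col), len(set(col)) <= 1000)  # 1000 True
```

Short run-lengths and unlimited cardinality produce data that encodes poorly, while long runs over a small value pool compress very well, which is exactly the spread the benchmark suite is after.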

Data Property Description
File format   Parquet, ORC
Data type     Integer (signed and unsigned); Float (32-, 64-bit); String (90% ASCII, 10% Unicode); Timestamp (day, s, ms, us, ns); Decimal (32-, 64-, 128-bit); List (nesting depth 2, int32 elements)
Run-length    Average data value repetition (1x or 32x)
Cardinality   Number of unique data values (unlimited or 1,000)
Compression   Lossless compression type (Snappy or none)
Table 1. Details on the data properties studied in the cuDF Parquet and ORC benchmark suite

With all of these properties varying independently, we have a large number of cases: 48 for each file format (ORC and Parquet). To ensure a fair comparison between cases, we targeted a fixed in-memory dataframe size and column count, with a default size of 512 MiB. For the purpose of this post, we introduced an additional parameter to control the target in-memory dataframe size from 64 to 4,096 MiB.
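As a quick sanity check on that case count, the parameter axes of Table 1 can be enumerated directly (the type-group names here are our own shorthand):

```python
from itertools import product

# Parameter axes from Table 1: six data-type groups, two run-lengths,
# two cardinalities, and two compression settings.
type_groups = ["integral", "float", "string", "timestamp", "decimal", "list"]
run_lengths = [1, 32]          # average value repetition
cardinalities = [None, 1000]   # None = unlimited distinct values
compressions = ["SNAPPY", "NONE"]

cases = list(product(type_groups, run_lengths, cardinalities, compressions))
print(len(cases))  # 48 cases per file format (Parquet and ORC)
```

Six type groups times two run-lengths times two cardinalities times two compression settings gives the 48 cases per format.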

The full cuDF benchmark suite is available in the open source cuDF repository. To focus on the impact of I/O, this post only includes the input benchmarks in cuDF, rather than the reader-option or row/column-selection benchmarks.

Optimizing cuDF+GDS file read throughput

Data in formats like ORC and Parquet are stored in independent chunks, which we read in separate operations. We found that the read bandwidth is not saturated when these operations are performed in a single thread, leading to suboptimal performance regardless of GDS use.

Issuing multiple GDS read calls in parallel enables overlapping of multiple storage-to-GPU copies, raising the throughput and potentially saturating the read bandwidth. As of version 1.2.1.4, cuFile calls are synchronous, so parallelism requires multithreading from downstream users.

Controlling the level of parallelism

As we did not control the file layout and number of data chunks, creating a separate thread for each read operation could generate an excessive number of threads, leading to performance overhead. On the other hand, with only a few large read calls, there might not be enough threads to effectively saturate the read bandwidth of the storage hardware.

To control the level of parallelism, we used a thread pool and sliced larger read calls into smaller reads that could be performed in parallel.

When we begin reading a file using GDS, we create a data ingest pipeline based on the thread-pool work of Barak Shoshany of Brock University. As Figure 2 shows, we split each file read operation (read_async call) in the cuDF reader into fixed-size cuFile calls, except for the last slice, which is usually smaller.

Instead of calling cuFile directly to read the data, a task is created for each slice and placed in a queue. The read_async call returns a task that waits for all of the slice tasks to complete. Callers can perform other work and then wait on the aggregate task when the requested data is needed to continue. When available, threads from the pool dequeue tasks and make the synchronous cuFile calls.

Upon completion, a thread sets the task as completed and becomes available to execute the next task in the queue. As most tasks read the same amount of data, cuFile reads are effectively load-balanced between the threads. After all tasks from a given read_async call are completed, the aggregate task is complete and the caller code can continue, as the requested data range has been uploaded to the GPU memory.
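The pipeline described above can be sketched in a few lines of Python using a standard thread pool. This is an illustrative stand-in, not the cuDF implementation: read_slice plays the role of one synchronous cuFile read, and the 4-byte slice size is shrunk from cuDF’s 4 MiB default so the demo file stays tiny.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

SLICE_SIZE = 4  # tiny slices for this demo; cuDF's 22.04 default is 4 MiB

def read_slice(path, offset, size):
    # Stand-in for one synchronous cuFile read of a single slice.
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(size)

def read_async(pool, path, offset, size):
    """Split one logical read into fixed-size slice tasks; return a waiter."""
    futures = [
        pool.submit(read_slice, path, start,
                    min(SLICE_SIZE, offset + size - start))
        for start in range(offset, offset + size, SLICE_SIZE)
    ]
    # The aggregate task: waiting on it joins all slice reads in order.
    return lambda: b"".join(f.result() for f in futures)

# Demo: write 26 bytes, then read 10 bytes starting at offset 8
# through the sliced, thread-pooled pipeline.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abcdefghijklmnopqrstuvwxyz")
with ThreadPoolExecutor(max_workers=16) as pool:
    wait = read_async(pool, tmp.name, offset=8, size=10)
    data = wait()  # the caller could do other work before this point
os.unlink(tmp.name)
print(data)  # b'ijklmnopqr'
```

Because every slice task reads the same amount of data (except the last), work is naturally load-balanced across the pool, mirroring the behavior described above.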

Figure 2. (left) Execution diagram of asynchronous read calls in cuDF: each read_async call is split into slice tasks completed by threads in the thread pool. (right) Impact of thread count and slice size on the average end-to-end read throughput across the benchmark suite; the values suggest an optimal slice size of about 4 MiB and an optimal thread count of 16.

Optimal level of parallelism

The size of the thread pool and the read slices greatly impacts the number of GDS read calls executed at the same time. To maximize throughput, we experimented with a range of different values for these parameters and identified the configuration that gives the highest throughput. 

Figure 2 shows a performance optimization study on cuDF+GDS data ingest with tunable parameters of thread count and slice size. For this study, we defined the end-to-end read throughput as the binary format file size divided by the total ingest time (I/O plus parsing and decode).

The throughput shown in Figure 2 is averaged over the entire benchmark suite and based on a dataframe size of 512 MiB. We found that Parquet and ORC binary formats share the same optimization behavior across the parameter range, also shown in Figure 2. Small slice sizes cause higher overhead due to the large number of read tasks, and large slice sizes lead to poor utilization of the thread pool.

In addition, lower thread counts do not provide enough parallelism to saturate the NVMe read bandwidth, while higher thread counts add threading overhead. We discovered that the highest end-to-end read throughput occurred with 8-32 threads and 1-16 MiB slice sizes.

Depending on the use case, you may find optimal performance with different values. This is why, as of 22.04, the default thread pool size of 16 and slice size of 4 MiB are configurable through environment variables, as described in GPUDirect Storage Integration.

Impact of GDS on cuDF data ingest

We evaluated the effect that GDS use had on the performance of a few cuDF data ingest APIs using the benchmark suite described earlier. To isolate the impact of GDS, we ran the same set of benchmarks with default configuration (uses GDS) and with GDS manually disabled through an environment variable.

For accurate I/O benchmarking with cuDF, file system caches must be cleared before each new read operation. The benchmarks measure the total time to execute a read_orc or read_parquet call (read time), and these are the values we used for comparison. GDS speedup was computed for each benchmark case as the read time through the bounce buffer divided by the read time through the direct data path.
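Both metrics are simple ratios. A minimal sketch, with purely illustrative numbers rather than measurements:

```python
def end_to_end_throughput(file_size_bytes, ingest_time_s):
    """End-to-end read throughput: file size over total ingest time
    (I/O plus parsing and decode)."""
    return file_size_bytes / ingest_time_s

def gds_speedup(bounce_buffer_time_s, direct_path_time_s):
    """Speedup: bounce-buffer read time over direct-path (GDS) read time."""
    return bounce_buffer_time_s / direct_path_time_s

# Illustrative numbers only: a 2 GiB file read in 4.2 s through the
# bounce buffer vs 3.0 s through GDS.
throughput = end_to_end_throughput(2 * 1024**3, 3.0)  # bytes per second
print(round(gds_speedup(4.2, 3.0), 2))  # 1.4
```

A speedup of 1.4 in this hypothetical case would correspond to the 30-50% range reported for mid-size inputs below.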

For the GDS performance comparison, we used an NVIDIA DGX-2 with V100 GPUs and NVMe SSDs connected in a RAID 0 configuration. All benchmarks use a single GPU and read from a single RAID 0 array of two 3.84-TB NVMe SSDs.

Larger files, higher speedups

Figure 3 shows the GDS speedup for a variety of files with an in-memory dataframe size of 4,096 MiB. We observed that the performance improvement varies widely, between a modest 10% and an impressive 270% increase over the host path.

Some of the variance is due to differences in the data decode. For example, list types show smaller speedups due to the higher parsing complexity, and decimal types in ORC show smaller speedups due to overhead in fixed point decoding compared to Parquet.

Figure 3. GDS speedup for data ingest with an in-memory data size of 4,096 MiB, where the Y-axis shows the data type and the X-axis shows the run-length (RL) and cardinality (C) for (left) Parquet and (right) ORC binary formats. Speedups vary between 1.1x and 3.7x; values are shown for the uncompressed subset of the benchmark suite.

However, the decode process cannot account for the majority of the variance in the results. Given the high correlation between the ORC and Parquet results, you can reasonably suspect that properties like run-length and cardinality play a major role. Both properties affect how well data can be compressed and encoded, so next we look at how the file size relates to these properties and, by proxy, to the GDS speedup.

Figure 4 shows the extent to which intrinsic properties of the data impact the binary file size needed to store a fixed in-memory data size. For 4,096 MiB of in-memory data, file sizes range from 0.1 to 5.8 GB across the benchmark cases without Snappy compression. As expected, the largest file sizes generally correspond to shorter run-lengths and higher cardinality, because data with this profile is difficult to encode efficiently.

Figure 4. File size on disk for 4,096 MiB of in-memory data, where the Y-axis shows the data types and the X-axis shows the run-length (RL) and cardinality (C) for (left) Parquet and (right) ORC binary formats. Depending on the data properties, file size can vary by an order of magnitude; sizes are shown for the uncompressed subset of the benchmark suite.

Figures 3 and 4 show strong correlation between file size and the magnitude of the GDS performance benefit. We hypothesize that this is because, when reading an efficiently encoded file, I/O takes little time compared to decode, so there is little room for improvement from GDS. On the other hand, the results show that reading files that encode poorly is bottlenecked on I/O.

Larger data, higher speedups

We applied the full benchmark suite described in Table 1 to inputs with in-memory sizes in the range of 64 to 4,096 MiB. Taking the mean speedup over the benchmark suite, we discovered that the speedup from GDS generally increases with the data size. We measured a consistent 30-50% speedup for in-memory data sizes of 512 MiB and greater. For the largest datasets, we found that the Parquet reader benefits more from GDS than the ORC reader.

Figure 5. (left) Mean total ingest time with host reads and GDS reads for the ORC and Parquet binary formats, showing ingest times growing nearly linearly with data size. (right) Mean GDS speedup across the binary format cuDF benchmark suite, showing >40% speedup for >500 MB in-memory data sizes.

Summary

GPUDirect Storage provides a direct data path to the GPU, reducing latency and increasing throughput for I/O operations. RAPIDS cuDF leverages GDS in its data ingest APIs to unlock the full read bandwidth of your storage hardware.

GDS enables you to speed up cuDF data ingest workloads up to 3x, significantly improving the end-to-end performance of your workflows.

Apply your knowledge

If you have not tried cuDF for your data processing workloads, we encourage you to test our latest 22.04 release. We provide Docker containers for our releases as well as our nightly builds. Conda packages are also available to make testing, deployment, and getting started with RAPIDS cuDF easier.

If you are already using cuDF, we encourage you to give GPUDirect Storage a try. It is easier than ever to take advantage of high capacity and high-performance NVMe drives. If you already have the hardware, you are a few configuration steps away from unlocking peak performance in data ingest. Stay tuned for upcoming releases of cuDF.

For more information about storage I/O acceleration, see the following resources:


NVIDIA Recommends Stockholders Reject ‘Mini-Tender’ Offer by Tutanota LLC

SANTA CLARA, Calif., May 27, 2022 — NVIDIA today announced that it recently became aware of an unsolicited “mini-tender” offer by Tutanota LLC to purchase up to 215,000 …