This post discusses the pros and cons of distinct memory layouts, as well as memory pools for asynchronous memory allocation that enable zero-copy functionality.
Introduction
Efficient pipeline design is crucial for data scientists. When composing complex end-to-end workflows, you may choose from a wide variety of building blocks, each of them specialized for a dedicated task. Unfortunately, repeatedly converting between data formats is an error-prone and performance-degrading endeavor. Let’s change that!
Figure 1: Interoperability between data science and machine learning frameworks.
In this post series, we discuss different aspects of efficient framework interoperability:
We start with this post, which discusses the pros and cons of distinct memory layouts as well as memory pools for asynchronous memory allocation to enable zero-copy functionality.
In the second post, we highlight bottlenecks occurring during data loading/transfers and how to mitigate them using Remote Direct Memory Access (RDMA) technology.
In the third post, we dive into the implementation of an end-to-end pipeline demonstrating the discussed techniques for optimal data transfer across data science frameworks.
Zero-copy functionality is a crucial technique for efficiently sharing data across GPU-accelerated data science frameworks: TensorFlow, PyTorch, MXNet, cuDF, CuPy, Numba, and JAX (see Figure 2). In the following, we show you how to achieve that in a systematic manner. If you are only here to look up the commands for transferring data from one framework to another, you might want to have a look at this conversion table.
Figure 2: Conversion paths between data science and machine learning frameworks.
Memory layouts, data formats and memory pools
Memory Layouts
Before we start talking about how to copy data efficiently, let’s discuss how to store tabular data. In practice, all data formats inherit from one of the two major memory layouts known to computer scientists (see Figure 3):
Array of Structures (AoS): A sequence of one or more data points x, y, z, … of potentially distinct types is represented as a structure S. Several instances of those data points are allocated as an array s of the new data type S. The original list of points x, y, z, … of the k-th instance is then accessed through the members s[k].x, s[k].y, s[k].z, … of the struct instance s[k].
Structure of Arrays (SoA): Several instances of the data points x, y, z, … are stored in separate arrays s_x, s_y, s_z, … The original points x, y, z, … of the k-th instance are then accessed as s_x[k], s_y[k], s_z[k], … Finally, these arrays can be interpreted as a single instance of a (merely virtually existing) structure, hence the name SoA.
Figure 3: Comparison of AoS (left) and SoA (right) memory layouts. White arrows denote the read order in linear memory. Note that AoS and SoA are isomorphic through transposition.
While the AoS layout looks more structured (pun intended) than SoA from a programmatic and abstraction point of view, it tends to be less suited for massively parallel algorithms in terms of achievable performance. This can be explained by a less efficient utilization of cache lines when consistently accessing only a subset of the structure members, for example, during the reduction of values along one coordinate axis. You can even find cases in the literature where on-the-fly AoS-to-SoA conversion significantly improves performance compared to plain processing in an AoS memory layout.
The SoA memory layout exhibits further advantages when copying coordinate slices of the data. Assume you want to transfer all the x-coordinates at once: you can then access the corresponding array directly, without the time-consuming slicing of members required in the AoS layout. Even better, you can avoid allocating auxiliary memory when transferring data by simply exposing the address of the array in memory, without copying a single byte. Apache Arrow is built on top of this methodology: storing data of distinct data types in different arrays for the discussed reasons (see Figure 4). Note that mainstream data science frameworks treat the entries of an array in the SoA layout as if they were stored in columns instead of rows, as depicted in Figure 3. However, this is merely a convention, since virtually all memory is linearly ordered.
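To make the difference concrete, here is a minimal NumPy sketch (the array names and sizes are illustrative, not from this post) contrasting an AoS-style structured array with SoA-style separate arrays:

import numpy as np

n = 1_000_000

# AoS: one array of records; x, y, z are interleaved in memory
aos = np.zeros(n, dtype=[("x", np.float32), ("y", np.float32), ("z", np.float32)])

# SoA: one contiguous array per member
soa_x = np.zeros(n, dtype=np.float32)
soa_y = np.zeros(n, dtype=np.float32)
soa_z = np.zeros(n, dtype=np.float32)

# extracting all x-coordinates
x_from_aos = aos["x"]   # strided view over interleaved records; a compact copy is needed to share it
x_from_soa = soa_x      # already contiguous; its address can be exposed without copying a byte

print(x_from_aos.flags["C_CONTIGUOUS"])  # False
print(x_from_soa.flags["C_CONTIGUOUS"])  # True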
Figure 4: Comparison of row-wise (AoS, left) and column-wise (SoA, right) memory layouts for the same table shown on top. SoA is ideally suited for massively parallel processing on GPUs.
Data formats and zero-copy mechanism
In recent years, different libraries have been developed to address different needs. At the same time, data science pipelines have become more and more complex, requiring multiple libraries to accomplish a wide variety of tasks. Unfortunately, when these libraries were designed, interoperability between frameworks was not a top priority. As a result, there was a lack of standardized data formats suitable for data science tasks. Some people were already concerned about data standards at that time, such as Wes McKinney, the creator of the pandas project. In 2011, he published this post about a future roadmap for rich scientific data structures in Python.
Since every library implemented its custom in-memory data layout and file formats, expensive copy-and-convert operations had to be performed when these libraries needed to collaborate. It was quite common that a significant portion of the total execution time was invested in meaningless copy-and-convert operations.
In October 2016, the Apache Software Foundation released Arrow, a language-independent columnar data format specification designed to handle flat and hierarchical data efficiently on both CPUs and GPUs. Since then, many different frameworks have adopted it, facilitating zero-copy data exchange among them. Other key features of the Apache Arrow columnar data format include:
O(1) (constant-time) random access
SIMD and vectorization-friendly
Data adjacency for sequential access (scans)
Relocatable without “pointer swizzling”, allowing for true zero-copy access in shared memory
Figure 5: Comparison of traditional framework interoperability to a zero-copy approach using Apache Arrow where all frameworks agree on the same memory layout.
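To make the columnar idea concrete, here is a small sketch using pyarrow (the column names and values are illustrative):

import pyarrow as pa

# each column is stored as its own contiguous Arrow array (SoA)
table = pa.table({"x": [1.0, 2.0, 3.0], "y": [4.0, 5.0, 6.0]})

x = table.column("x").chunk(0)       # first (and here only) chunk of column "x"
validity, values = x.buffers()       # raw buffers backing the column; validity may be None
print(values.address, values.size)   # this buffer can be shared with other consumers without copying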
Zero-copy mechanisms avoid unnecessary data transfers, drastically reducing your application's execution time. Data science frameworks have added support for one or more of the following data formats: DLPack, CUDA Array Interface, and NumPy Array Interface.
DLPack is an open in-memory tensor structure for sharing tensors among frameworks. The CUDA Array Interface and the NumPy Array Interface are the de facto standards for exchanging GPU and CPU array-like objects, respectively.
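As a minimal sketch (assuming recent CuPy and PyTorch versions on a CUDA-capable machine), a GPU array can be exchanged between the two frameworks without copying the underlying buffer:

import cupy as cp
import torch

a = cp.arange(10, dtype=cp.float32)   # allocated on the GPU by CuPy

# zero-copy hand-off via DLPack
# (older PyTorch versions use torch.utils.dlpack.from_dlpack instead)
t = torch.from_dlpack(a)
t += 1                                # in-place update through PyTorch ...
print(a)                              # ... is visible in the CuPy array

# zero-copy in the other direction via the CUDA Array Interface
b = cp.asarray(t)
print(b.data.ptr == a.data.ptr)       # same device pointer, no bytes moved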
Table 1: Data Formats Support Matrix.
Please note that some libraries, such as cuDF and CuPy, run exclusively on GPU devices. Although it is possible to convert a NumPy array into a cuDF or CuPy object, we have marked this support as n/a because it requires data movement between host memory (CPU) and device memory (GPU).
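For illustration (assuming CuPy is installed), moving a NumPy array to the GPU necessarily copies the bytes from host to device rather than sharing them:

import numpy as np
import cupy as cp

h = np.arange(10, dtype=np.float32)   # resides in host (CPU) memory
d = cp.asarray(h)                     # host-to-device copy; cannot be zero-copy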
In the following, we address memory layouts of the associated data objects in various frameworks, the efficient conversion of data objects using zero-copy, as well as the usage of a joint memory pool when mixing frameworks.
Memory pools
Memory allocations are expensive. They often impose global barriers, which block the remaining operations until the allocation is completed. Hence, repeatedly allocating memory in tight for-loops, such as during the training of neural networks, is prohibitive from a performance point of view. Modern data science and deep learning frameworks address this with dedicated memory pools: they either preallocate a large chunk of memory at the beginning of the program (for example, TensorFlow) or incrementally grow the pool using a few infrequent allocations (for example, PyTorch). The preallocated memory is then reused in a smart way by asynchronously assigning and retracting subsets of this memory range to and from whoever requests it. As an example, the RAPIDS Memory Manager (RMM) is a memory pool originally written for the RAPIDS data science framework. RMM facilitates blazingly fast host and device memory allocations. Mark Harris quantified the impact of RMM in this post: “We centralized memory management in cuDF by replacing all calls to cudaMalloc and cudaFree with RMM allocations. This was a lot of work, but it paid off. RMM calls are on the order of 1,000 times faster than cudaMalloc and cudaFree. The result was a 10x speedup for the mortgage demo.”
Several library-specific memory pools might compete for the same video RAM when combining distinct data science libraries. A straightforward workaround would be to limit the capacity of each memory pool to a fixed partition of the available memory. A better solution would be to use the same memory pool for all frameworks. Note that this does not necessarily mean that all frameworks must agree on the same memory pool implementation being shipped in their vanilla release. It is sufficient that all vendors agree to use an External Allocator Interface (EAI) for requesting and freeing memory in their frameworks.
Further advantages of an EAI are straightforward logging functionality, memory leak checking, and rate- or resource-limiting capabilities. For instance, the RAPIDS Memory Manager leverages unified memory to transparently oversubscribe GPU memory, which significantly reduces the chances of running into an out-of-memory error when working with huge datasets that do not fit into GPU memory.
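As a minimal sketch (assuming a recent RMM release), the pool can be initialized with managed memory so that allocations may oversubscribe physical GPU memory:

import rmm

# use a pooled allocator backed by managed (unified) memory so that
# allocations larger than physical GPU memory can be oversubscribed
rmm.reinitialize(pool_allocator=True, managed_memory=True)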
The good news is that you can use RMM with CuPy and Numba by simply importing RAPIDS cuDF before importing everything else.
import cudf  # import cuDF before CuPy and Numba so that its RMM memory pool is set up first
Alternatively, you can combine Numba and RMM without using RAPIDS cuDF.
import rmm
from numba import cuda

# make Numba allocate device memory through RMM's memory pool
cuda.set_memory_manager(rmm.RMMNumbaManager)
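Similarly, CuPy can be pointed at the same RMM pool explicitly. The snippet below is a minimal sketch; note that newer RMM releases expose the allocator under rmm.allocators.cupy instead, so the import path may differ depending on your version:

import cupy
import rmm

# route CuPy's device allocations through RMM's memory pool
cupy.cuda.set_allocator(rmm.rmm_cupy_allocator)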
Conclusion
In this post of our framework interoperability series, you have learned about distinct memory layouts and how the Apache Arrow format can significantly speed up data transfers across distinct data science and machine learning frameworks such as TensorFlow, PyTorch, MXNet, cuDF, CuPy, Numba, and JAX. We have also discussed how asynchronous memory allocation facilitated by memory pools is crucial to avoid overheads as big as 90% of the overall runtime of your pipeline.
In the second part of the series, you will learn how Remote Direct Memory Access (RDMA) can be exploited to further accelerate data loading and data transfers in a multiple GPU setting.
Hora is an approximate nearest neighbor search algorithm (wiki) library. All of the code is implemented in Rust 🦀 for reliability, a high level of abstraction, and speeds comparable to C++.
Hora, 「ほら」 in Japanese, sounds like [hōlə] and means Wow, you see! or Look at that!. The name is inspired by the famous Japanese song 「小さな恋のうた」.
import numpy as np
from horapy import HNSWIndex

dimension = 50
n = 1000

# init index instance
index = HNSWIndex(dimension, "usize")

samples = np.float32(np.random.rand(n, dimension))
for i in range(0, len(samples)):
    # add node
    index.add(np.float32(samples[i]), i)

index.build("euclidean")  # build index

target = np.random.randint(0, n)
# 410 in Hora ANNIndex <HNSWIndexUsize> (dimension: 50, dtype: usize, max_item: 1000000,
# n_neigh: 32, n_neigh0: 64, ef_build: 20, ef_search: 500, has_deletion: False)
# has neighbors: [410, 736, 65, 36, 631, 83, 111, 254, 990, 161]
print("{} in {}\nhas neighbors: {}".format(
    target, index, index.search(samples[target], 10)))  # search
We are glad to have you participate; any contributions are welcome, including documentation and tests. We use GitHub Issues for tracking suggestions and bugs. You can open an issue or submit a pull request on GitHub, and we will review it as soon as possible.
Hi, I wanted to build a deep learning model to color white blood cells and platelets without actually staining them. For instance, I input a blood smear image without any colors added, and the model should output the WBCs and platelets colored as if the sample had been stained by hand. What would be the best way to build such a model?
I have followed everything from https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/install.html#protobuf-installation-compilation up to the protobuf installation part.
Even if I add protoc to PATH, cmd.exe recognizes it, but the Anaconda prompt with my environment does not.
If I try to run it from cmd.exe, I then cannot use conda activate tf_2.5. It tells me my shell is not properly configured. Please help!
How can I get the conda prompt to recognize the protoc command so that I can install object_detection?
I'm trying to create and save a basic TFLite model backed by a HashTable. Following that page, I was able to create the table, but I don't really know enough about TensorFlow to turn it into a model. My goal is to create and save something I can later load with tf.lite.Interpreter; does anyone know how I can accomplish this?
It would be really great to create a model that takes in two values, looks each up in its own table, and returns both results. Or even, for the sake of example, one that fetches the values from the hashtable and then returns the average or something like that.
Edit: I was able to get this to work, but the HashTable op isn't available as a builtin op, despite the doc here claiming that it is. Is the doc correct? Is there any plan to add HashTable to the builtin ops?
I have a convolutional network that performs pretty well! But I've found it quite strange that the accuracy and precision curves are very smooth, while the loss curve is jittery and shows increasing overfitting. I found this behavior rather peculiar and was wondering if anyone here could offer insight into what is happening.