This post highlights helpful new cuDF features that allow you to think about a single row of data and write code faster.
Over the past few releases, the NVIDIA cuDF team has added several new features to user-defined functions (UDFs) that can streamline the development process while improving overall performance. In this post, I walk through the new UDF enhancements and show how you can take advantage of them within your own applications:
- The cuDF Series.apply API and how to use it
- The cuDF DataFrame.apply API and how to write a UDF in terms of “rows”
- Enhanced support for missing data using both apply APIs
- A real-world use case example with timing
- Practical considerations, limitations, and future plans
apply API for cuDF series
If you’re not familiar with pandas, series apply is the main entry point used for mapping an arbitrary Python function onto a single series of data. For example, you might want to convert temperature in Celsius to Fahrenheit using a formula already written as a Python function.
Here is a quick refresher followed by the output of running this code:
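A minimal sketch of such a refresher, assuming the Celsius-to-Fahrenheit conversion mentioned above as the function f (the input values are illustrative):

```python
import pandas as pd


def f(x):
    # Convert a single Celsius reading to Fahrenheit.
    return x * 9 / 5 + 32


celsius = pd.Series([0.0, 25.0, 100.0])

# pandas calls f once per element and collects the results in a new series.
fahrenheit = celsius.apply(f)
print(fahrenheit)
```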
Technically, you can write any valid Python code within the function f and pandas runs the function in a loop over the series. This makes apply extremely flexible in the context of pandas, as any UDF can be applied as long as it can handle all of the input data, even UDFs that rely on external libraries or ones that expect or return arbitrary Python objects.
But this flexibility comes at the cost of performance. Running a Python function in a long loop is not an efficient strategy, for a variety of reasons (for example, Python is an interpreted language). This performance constraint can be frustrating if your UDFs are simple, such as those composed of purely mathematical operations on scalar values.
Luckily, these use cases are what cuDF was built for. Recent cuDF improvements within the scope of UDFs have motivated the introduction of the equivalent Series.apply API in cuDF.
If you are familiar with pandas, you can produce the same results as you did using pandas for numeric values. The only notable difference is that the resulting data is always a cuDF dtype and not an object, which is usually the case in pandas.
f can contain any Python UDF that is composed of pure math or Python operations. cuDF deduces an appropriate return dtype based on the inspection of the function through Numba and compiles and runs an equivalent function on the GPU.
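As a rough sketch of the cuDF side, the same kind of scalar function can be passed to Series.apply unchanged. The try/except import is my own assumption, added only so the snippet still runs with pandas on a machine without cuDF or a GPU; the function and values are illustrative.

```python
try:
    import cudf as xdf  # GPU path: requires cuDF and a CUDA-capable device
except ImportError:
    import pandas as xdf  # assumption: CPU fallback so the sketch still runs


def f(x):
    # Pure scalar arithmetic, the kind of UDF Numba can compile for the GPU.
    return x * 9 / 5 + 32


celsius = xdf.Series([0.0, 25.0, 100.0])
fahrenheit = celsius.apply(f)

# The result carries a concrete numeric dtype rather than object.
print(fahrenheit.dtype)
print(fahrenheit)
```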
Functions can also be written to accept an arbitrary number of scalar arguments. In the following code example, you can see that args= is supported:
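A minimal sketch of passing extra scalar arguments through args=. The function g and its scale and offset parameters are hypothetical names chosen for illustration, and the pandas fallback import is an assumption so the snippet runs without a GPU.

```python
try:
    import cudf as xdf  # GPU path: requires cuDF and a CUDA-capable device
except ImportError:
    import pandas as xdf  # assumption: CPU fallback so the sketch still runs


def g(x, scale, offset):
    # x is the series element; scale and offset arrive through args=.
    return x * scale + offset


sr = xdf.Series([1, 2, 3])

# The tuple is unpacked into the positional parameters after x.
result = sr.apply(g, args=(10, 5))
print(result)
```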
While there are other ways of accomplishing the same goal in cuDF using custom kernels and other methods, this method of writing UDFs helps to abstract the GPU away from the process, which can cut down on development time for data scientists working on fast-paced, real-world projects.
So far, I’ve covered only the case of Series-based data. That is, I’ve shown you how to write a UDF with a single input and output. Many use cases require multi-column input, however, and this requires slightly different thinking.
DataFrame UDFs and thinking in terms of rows
UDFs that expect multiple columns as input and produce a single column as output are the set of functions supported by the pandas DataFrame apply API.
In these cases, the first function argument represents a row of data rather than just one value from a single input column. By row, I mean some kind of data structure that is keyable to obtain values, where the keys are the column names and the values are the scalars corresponding to the values of those columns in that row. It is conceptually what you get when you use iloc in pandas:
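For instance, with a small illustrative DataFrame (the column names a and b are placeholders), selecting one row with iloc gives an object keyed by column name:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A single row comes back as a Series whose index is the column names.
row = df.iloc[0]
print(list(row.index))
print(row["a"], row["b"])
```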
The following code example shows how you would write and use a UDF in pandas that consumes this kind of row object:
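A minimal pandas sketch of a row-wise UDF, using placeholder columns a and b:

```python
import pandas as pd


def f(row):
    # row behaves like a mapping from column name to that row's scalar value.
    return row["a"] + row["b"]


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# axis=1 tells pandas to pass each row, not each column, to f.
result = df.apply(f, axis=1)
print(result)
```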
cuDF now enables you to do exactly the same thing without rewriting your UDF.
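One way to picture this is to define the function once and hand it to both libraries. This is a sketch under the assumption that cuDF may not be installed, in which case only the pandas path runs; the column names are placeholders.

```python
import pandas as pd

try:
    import cudf  # GPU path: requires cuDF and a CUDA-capable device
except ImportError:
    cudf = None  # assumption: no GPU stack available, pandas path only


def f(row):
    # The same row-wise function is passed to pandas and cuDF unchanged.
    return row["a"] + row["b"]


pdf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
expected = pdf.apply(f, axis=1)
print(expected)

if cudf is not None:
    gdf = cudf.from_pandas(pdf)
    # Copy the GPU result back to the host to compare it with pandas.
    print(gdf.apply(f, axis=1).to_pandas())
```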
When applying these functions, it is important to note that even though the cuDF API expects you to write the functions in terms of rows, no rows are actually involved when it comes to the execution of this function.
cuDF avoids the use of a for-loop and instead executes CUDA kernels that “pretend” rows of data exist. With a little magic, Numba knows how to write a proper kernel to get the same result as pandas. Because there is no loop, you should see higher performance when executing functions through this API.
Support for missing values using the series and DataFrame apply
Historically, UDFs in cuDF have not provided full support for missing values. This is due to architectural choices inside cuDF that relate to the way cuDF records which elements are null, specifically its use of a null mask to conserve memory.
The looping design of pandas apply APIs just works if the data contains null values. If a null is encountered in the data, the UDF receives the special value pd.NA. As a result, if the special value does not trigger an error, the execution proceeds as normal. However, cuDF does not work this way, and it requires a little extra machinery to support the same functionality. If you use the cuDF apply API, you should find that your UDFs treat null values in a natural manner:
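A small sketch of null propagation through a UDF. The data is illustrative, and the pandas fallback import is an assumption so the snippet runs without a GPU.

```python
try:
    import cudf as xdf  # GPU path: requires cuDF and a CUDA-capable device
except ImportError:
    import pandas as xdf  # assumption: CPU fallback so the sketch still runs


def f(x):
    # Arithmetic on a null element yields a null result.
    return x + 1


sr = xdf.Series([1, xdf.NA, 3])
result = sr.apply(f)
print(result)
```

The middle element stays null while the other two are incremented.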
You can even condition on the cudf.NA singleton and get the expected answer, or return it directly from the function:
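A sketch of both patterns. The function names and the replacement value 0 are hypothetical, and the pandas fallback (where the singleton is pd.NA) is an assumption so the snippet runs without a GPU.

```python
try:
    import cudf as xdf  # GPU path: requires cuDF and a CUDA-capable device
except ImportError:
    import pandas as xdf  # assumption: CPU fallback so the sketch still runs


def fill_nulls(x):
    # Branch on the NA singleton and substitute a value for nulls.
    if x is xdf.NA:
        return 0
    else:
        return x + 1


def mask_twos(x):
    # Return the NA singleton directly to mark an element as missing.
    if x == 2:
        return xdf.NA
    else:
        return x


filled = xdf.Series([1, xdf.NA, 3]).apply(fill_nulls)
masked = xdf.Series([1, 2, 3]).apply(mask_twos)
print(filled)
print(masked)
```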
As with rows, cuDF does not actually run the Python function the way pandas does. Instead, it uses more Numba magic to translate this class of functions into an equivalent CUDA kernel and then returns the result of that kernel.
In the next section, I look at a real-world example and perform some rough timing.
Real-world example using apply
Consider this scenario: An online streaming service is investigating which segments of its subscribers tend to hold their subscriptions the longest. Additionally, leadership has requested a specific segmentation scheme that breaks subscribers up by age:
The provided data only has two fields:
Here’s how a UDF can solve the problem. First, write the row-wise custom function that applies the grouping. Next, take the results, group by the group ID, and average over the number of renewals.
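The two steps can be sketched as follows. Everything specific here is a placeholder of mine: the column names age and renewals, the age brackets, and the random data are hypothetical stand-ins, not the actual segmentation scheme or dataset. The pandas fallback import is likewise an assumption so the snippet runs without a GPU.

```python
import numpy as np

try:
    import cudf as xdf  # GPU path: requires cuDF and a CUDA-capable device
except ImportError:
    import pandas as xdf  # assumption: CPU fallback so the sketch still runs


def f(row):
    # Hypothetical age brackets, for illustration only.
    if row["age"] < 25:
        return 0
    elif row["age"] < 40:
        return 1
    elif row["age"] < 60:
        return 2
    else:
        return 3


# Randomly generated stand-in data with hypothetical column names.
rng = np.random.default_rng(0)
n = 1_000
df = xdf.DataFrame(
    {
        "age": rng.integers(18, 90, size=n),
        "renewals": rng.integers(0, 24, size=n),
    }
)

# Step 1: assign a group ID to every subscriber with the row-wise UDF.
df["group"] = df.apply(f, axis=1)

# Step 2: average the number of renewals within each group.
result = df.groupby("group")["renewals"].mean()
print(result)
```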
In this code example, the data is randomly generated, so your mileage may vary on the actual answer. However, it demonstrates the process. To time the UDF section of the code, create a pandas copy of the data with pdf = df.to_pandas() and make a rough comparison using IPython:
%timeit df.apply(f, axis=1)
# 1.64 ms ± 34.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit pdf.apply(f, axis=1)
# 19.2 s ± 63.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Although this is not an official benchmark, the cuDF version is roughly 11,700 times faster (just over four orders of magnitude) in this particular case, which was run on a 32 GB V100 GPU.
Practical considerations, limitations, and the future
While these cuDF improvements represent significantly broader capabilities than previous iterations, there is always room to grow. Here is a list of key items to consider when writing UDFs for apply in cuDF:
- JIT compilation. The first time a function is executed against a cuDF object, you encounter the overhead of compiling the correct CUDA kernel. Subsequent uses of the function do not require recompilation, unless the dtypes of the target dataset change.
- dtype support. So far, only numeric dtypes are supported in apply. However, support for additional types is on the roadmap, starting with strings.
- External libraries. A common pattern is performing data prep in pandas and then using an external library for processing inside the UDF for each row. Because you cannot map external code onto the GPU arbitrarily, this is not currently supported.
UDFs are an easy way of solving particular problems quickly. They help you think in terms of a single datum when designing the logic of your pipeline. With these new cuDF UDF enhancements, the aim is to expedite the development of workflows involving cuDF and allow you to quickly prototype solutions, as well as reuse existing business logic. In addition, null support lets you be explicit about how to handle missing values without needing extra processing steps.
As a reminder, UDFs are an area of active development in cuDF and updates are ongoing. If you choose to try these new UDF enhancements out, as always, I’d love to hear about your experience in the comments section.