Jupyter Notebook Best Practices for Using RAPIDS
A leading global retailer has invested heavily in becoming one of the most competitive technology companies around.
Accurate and timely demand forecasting for millions of item-by-store combinations is critical to serving their millions of weekly customers. Key to their success in forecasting is RAPIDS, an open-source suite of GPU-accelerated libraries. RAPIDS helps them tear through their massive-scale data and has improved forecasting accuracy by several percentage points – it now runs orders of magnitude faster on a reduced infrastructure GPU footprint. This enables them to respond in real-time to shopper trends and have more of the right products on the shelves, fewer out-of-stock situations, and increased sales.
With RAPIDS, data practitioners can accelerate pipelines on NVIDIA GPUs, reducing data operations, including data loading, processing, and training, from days to minutes. RAPIDS abstracts the complexities of accelerated data science by building on and integrating with popular analytics ecosystems like PyData and Apache Spark, enabling users to see benefits immediately. Compared to similar CPU-based implementations, RAPIDS delivers 50x performance improvements for classical data analytics and machine learning (ML) processes at scale, which drastically reduces the total cost of ownership (TCO) for large data science operations.
To learn and solve complex data science and AI challenges, leaders in retail often leverage what are called ‘Kaggle competitions’. Kaggle is a platform that brings together data scientists and other developers to solve challenging and interesting problems posted by companies. In fact, there have been over 20 competitions for solving retail challenges within the past year.
Leveraging RAPIDS and best practices for a forecasting competition, NVIDIA Kaggle Grandmaster Kazuki Onodera won 2nd place in the Instacart Market Basket Analysis Kaggle competition using complex feature engineering, gradient boosted tree models, and special modeling of the competition’s F1 evaluation metric. Along the way, we documented best practices for ETL, feature engineering, and building and customizing models for an AI-based retail forecasting solution.
This blog post will walk readers through the components of a Kaggle competition to explain data science best practices for improving forecasting in retail. Specifically, the blog post explains the Instacart Market Basket Analysis Kaggle competition goals, introduces RAPIDS, then offers a workflow to show how to explore the data visually, develop features, train the model, and run a forecasting prediction. Then, the post will dive into some advanced techniques for feature engineering with model explainability and hyperparameter optimization (HPO).
The Forecasting Challenge
The Instacart Market Basket Analysis competition challenged Kagglers to predict which grocery products a consumer will purchase again and when. Imagine, for example, having milk ready to be added to your cart right when you run out, or knowing that it’s time to stock up again on your favorite ice cream.
This focus on understanding temporal behavior patterns makes the problem fairly different from standard item recommendation, where user needs and preferences are often assumed to be relatively constant across short windows of time. Whereas Netflix might be fine assuming you want to watch another movie like the one you just watched, it’s less clear that you’ll want to reorder a fresh batch of almond butter or toilet paper if you bought them yesterday.
Problem Overview
The goal of this competition was to predict grocery reorders: given a user’s purchase history (a set of orders, and the products purchased within each order), which of their previously purchased products will they repurchase in their next order?
The problem is a little different from the general recommendation problem, where we often face a cold start issue of making predictions for new users and new items that we’ve never seen before. For example, a movie site may need to recommend new movies and make recommendations for new users.
The sequential and time-based nature of the problem also makes it interesting: how do we take the time since a user last purchased an item into account? Do users have specific purchase patterns and do they buy different kinds of items at different times of the day?
To get started, we’ll first load some of the modules we’ll be using in this notebook and set the random seed for any random number generator we’ll be using.
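As a minimal sketch (the exact modules and seed value in the notebook may differ), the setup cell might look like this:

import random

import numpy as np
import cudf
import xgboost as xgb

# Fix the seeds of the random number generators we may use, for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)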
RAPIDS Overview
Data scientists typically work with two types of data: unstructured and structured. Unstructured data often comes in the form of text, images, or videos. Structured data – as the name suggests – comes in a structured form, often represented by a table or CSV. We’ll focus the majority of this tutorial on working with structured, tabular data.
There are many tools in the Python ecosystem for structured, tabular data but few are as widely used as pandas. pandas represents data in a table and allows a data scientist to manipulate the data to perform a number of useful operations such as filtering, transforming, aggregating, merging, visualizing, and many more.
For more information on pandas, check out the excellent documentation here: http://pandas.pydata.org/pandas-docs/stable/
pandas is fantastic for working with small datasets that fit into your system’s memory. However, as datasets grow larger and data scientists work with increasingly complex workloads, the need for accelerated compute arises.
cuDF is a package within the RAPIDS ecosystem that allows data scientists to easily migrate their existing pandas workflows from CPU to GPU, where computations can leverage the immense parallelization that GPUs provide.
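As an illustration of that migration (a sketch, not the notebook’s code; the file name and column come from the Kaggle dataset), the same groupby runs in pandas on the CPU and in cuDF on the GPU with only the import changing:

import pandas as pd
import cudf

# CPU: pandas
pdf = pd.read_csv('orders.csv')
cpu_counts = pdf.groupby('order_dow').size()

# GPU: cuDF exposes a near-identical API, so the code barely changes
gdf = cudf.read_csv('orders.csv')
gpu_counts = gdf.groupby('order_dow').size()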
Getting familiar with the data
The dataset for this competition contains several files capturing orders from Instacart users over time, with the goal of the competition being to predict whether a user will re-order a product and, specifically, which products those customers will re-order. From the Kaggle data description (https://www.kaggle.com/c/instacart-market-basket-analysis/data), we see that we have over three million grocery orders from a customer base of over 200,000 Instacart users. For each user, we are provided between 4 and 100 of their orders, with the sequence of products purchased in each order, as well as the day of week and hour of day each order was placed and a relative measure of time between orders.
Our products, aisles, and departments datasets are composed of metadata about our products, aisles, and departments respectively. Each dataset (products, aisles, departments, orders, etc.) has a unique identifier for each entity in that dataset, e.g. order_id represents a unique order within the orders dataset and product_id represents a unique product within the products dataset. We’ll use these unique identifiers later to combine all of these separate datasets into one coherent view for exploratory data analysis, feature engineering, and modeling.
Below, we will read in our data and inspect our different tables using cuDF.
Additionally, we’ll read in our orders datasets. The first indicates to which set (prior, train, or test) an order belongs; additional files specify which products were purchased in each order. Again, from the Kaggle description of the data, we see that order_products__prior.csv contains previous order contents for all customers, and that the column ‘reordered’ indicates the customer has a previous order that contains the product. We are informed that some orders will have no reordered items.
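A hedged sketch of the loading step (assuming the Kaggle CSV files sit in the working directory) might look like the following:

import cudf

# Metadata tables
products = cudf.read_csv('products.csv')
aisles = cudf.read_csv('aisles.csv')
departments = cudf.read_csv('departments.csv')

# Orders and their contents
orders = cudf.read_csv('orders.csv')
order_products_prior = cudf.read_csv('order_products__prior.csv')
order_products_train = cudf.read_csv('order_products__train.csv')

# Quick inspection of each table
for name, df in [('orders', orders), ('products', products), ('order_products_prior', order_products_prior)]:
    print(name, df.shape)
    print(df.head())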
Exploring the data
When we think about our data science workflow, one of the most important steps is Exploratory Data Analysis. This is where we examine our data and look for clues and insights into which features we can use (or need to create) to feed our model. There are many ways to explore the data and each Exploratory Data Analysis is different for each problem – however, it still remains incredibly important as it informs our feature engineering process, ultimately determining how accurate our model will be.
In the notebook, we look at a couple of different cross sections of the data (a sketch of a few of these computations appears after the list below). Specifically, we examine the distribution of the order counts, the days of the week and times of day customers typically place orders, the distribution of the number of days since the last order, and the most popular items across all orders and across unique customers (de-duplicating so as to ignore customers who have a “favorite” item that they place repeated orders for).
From this we see respectively that:
- No user has fewer than 4 orders, and the number of orders per user is capped at 100.
- Order volume is highest on Saturday and Sunday (days 0 and 1) and lowest on Wednesday.
- The majority of orders are made during the daytime, and customers primarily order once a week or once a month (see the peaks at days 7 and 30).
Similar exploratory analysis for product popularity is provided in the notebook.
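As one hedged example of this kind of exploration (column names follow the Kaggle schema; the exact plots in the notebook differ), a few of these distributions can be computed directly on the GPU with cuDF:

# Distribution of orders per user (between 4 and 100 in this dataset)
orders_per_user = orders.groupby('user_id')['order_id'].count()
print(orders_per_user.describe())

# Orders by day of week (days 0 and 1 correspond to the weekend)
print(orders.groupby('order_dow')['order_id'].count().sort_index())

# Days since a user's previous order; peaks at 7 and 30 suggest weekly and monthly habits
print(orders['days_since_prior_order'].value_counts().sort_index().to_pandas().head(31))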
Feature Engineering
If Exploratory Data Analysis is the most important part of our data science workflow, Feature Engineering is a close second. This is where we identify which features should be fed into the model and create features where we believe they might be able to help the model do a better job of predicting.
We start by identifying our unique User X Item combinations and sorting them. We’ll create a dataset where each user maps to their most recent order number, day of week and hour, and how many days it’s been since that order. We’ll then extend our dataset, creating labels and features to be used later in our machine learning model (a brief sketch follows the list below), such as:
- How many kinds of products has the user ordered?
- How many products has the user ordered within one cart?
- From which departments has the user ordered products?
- At what times (days of the week) has the user ordered products?
- Has this user ordered this product at least once before?
- How many orders has the user placed that included this item?
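As a hedged sketch of a few of these features (column names follow the Kaggle schema; the feature names here are illustrative rather than the notebook’s exact ones), such statistics can be built with cuDF merges and groupby aggregations:

# Attach order metadata to the prior order contents
prior = order_products_prior.merge(orders, on='order_id', how='left')

# How many orders has each user placed that included each product?
user_product = (
    prior.groupby(['user_id', 'product_id'])
         .agg({'order_id': 'count', 'reordered': 'sum'})
         .reset_index()
         .rename(columns={'order_id': 'user_product_size'})
)

# How many distinct products has each user ever ordered?
user_n_products = (
    prior.groupby('user_id')['product_id']
         .nunique()
         .reset_index()
         .rename(columns={'product_id': 'user_unique_products'})
)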
Solving for the business problem (train and predict)
The mathematical operations underlying many machine learning algorithms are often matrix multiplications. These types of operations are highly parallelizable and can be greatly accelerated using a GPU. RAPIDS makes it easy to build machine learning models in an accelerated fashion while still using a nearly identical interface to Scikit-Learn and XGBoost.
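As an illustration of that near drop-in interface (not part of this notebook’s pipeline; the synthetic data is purely for demonstration), cuML mirrors the Scikit-Learn estimator API:

from cuml.datasets import make_classification
from cuml.ensemble import RandomForestClassifier

# Synthetic classification data generated on the GPU
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
y = y.astype('int32')  # random forest classification expects integer labels

# Same fit/predict pattern as Scikit-Learn, executed on the GPU
clf = RandomForestClassifier(n_estimators=100, max_depth=8)
clf.fit(X, y)
preds = clf.predict(X)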
There are many ways to create a model – one can use Linear Regression models, SVMs, tree-based models like Random Forest and XGBoost, or even Neural Networks. In general, tree-based models tend to work better with tabular data for forecasting than Neural Networks. Neural Networks work by mapping the input (feature space) to another complex boundary space and determining what values should belong to those points within that boundary space (regression, classification). Tree-based models on the other hand work by taking the data, identifying a column, and then finding a split point in that column to map a value to, all the while optimizing the accuracy. We can create multiple trees using different columns, and even different columns within each tree.
For a more detailed description of tree-based models and XGBoost, see this fantastic documentation: https://xgboost.readthedocs.io/en/latest/tutorials/model.html
In addition to their strong accuracy, tree-based models are easy to interpret, which is important when predictions (or the decisions resulting from them) must be explained and justified, for example for compliance and legal reasons in finance, insurance, or healthcare. Tree-based models are also robust and work well even when there are only a small number of data points.
In the section below, we’ll set the different parameters for our XGBoost model and train five different models – each on a different subset of users to avoid overfitting to a particular set of users.
import xgboost as xgb

NFOLD = 5

PARAMS = {
    'max_depth': 8,
    'eta': 0.1,
    'colsample_bytree': 0.4,
    'subsample': 0.75,
    'silent': 1,
    'nthread': 40,
    'eval_metric': 'logloss',
    'objective': 'binary:logistic',
    'tree_method': 'gpu_hist'
}

models = []
for i in range(NFOLD):
    # Split users into train/validation folds by user_id, so each model
    # is trained on a different subset of users
    train_ = train[train.user_id % NFOLD != i]
    valid_ = train[train.user_id % NFOLD == i]

    dtrain = xgb.DMatrix(train_.drop(['user_id', 'product_id', 'label'], axis=1), train_['label'])
    dvalid = xgb.DMatrix(valid_.drop(['user_id', 'product_id', 'label'], axis=1), valid_['label'])

    # Train with early stopping on the validation fold
    model = xgb.train(PARAMS, dtrain, 9999, [(dtrain, 'train'), (dvalid, 'valid')],
                      early_stopping_rounds=50, verbose_eval=5)
    models.append(model)
There are several parameters that should be set before XGBoost can be run.
- General parameters relate to which booster we are using to do boosting, commonly tree or linear model.
- Booster parameters depend on which booster you have chosen.
- Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.
For more information on the configurable parameters within the XGBoost module, see the documentation here: https://xgboost.readthedocs.io/en/latest/parameter.html
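To close the train-and-predict loop, predictions from the fold models can be averaged over a held-out feature table. This is a hedged sketch: ‘test’ is assumed to be a DataFrame with the same feature columns as ‘train’, minus the label.

import numpy as np

# Build a DMatrix from the held-out features ('test' is assumed, not defined above)
dtest = xgb.DMatrix(test.drop(['user_id', 'product_id'], axis=1))

# Average the predicted reorder probability across the fold models
test['pred'] = np.mean([m.predict(dtest) for m in models], axis=0)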
Feature Importance
Once we’ve trained our models, we might want to look at their internal workings and understand which of the features we’ve crafted are contributing the most to the predictions. This is called Feature Importance. One of the advantages of tree-based models for forecasting is that understanding the differing importance of our features is very easy.
By understanding how our features contribute to the model’s accuracy, we can choose to remove features that aren’t important, or try to iterate and create new features, re-train, and re-assess whether those new features are more important. Ultimately, being able to iterate quickly and try new things in this workflow leads to the most accurate model and the greatest ROI (for forecasting, oftentimes cost savings from reduced out-of-stock and poorly placed inventory). Iteration can traditionally take a significant amount of time due to computational intensity. RAPIDS, with NVIDIA accelerated computing, lets users churn through model iterations quickly and determine the best-performing model.
In the Feature Importance section of the notebook, we define convenience code to access the importance of the features in each model. We then pass in our list of models that we trained, iterate over them one by one, and average the importance of each variable across all the models. Lastly, we visualize feature importance using a horizontal bar chart.
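A minimal sketch of that convenience code (the notebook’s exact implementation may differ) could look like this:

import pandas as pd
import matplotlib.pyplot as plt

def average_importance(models):
    # Collect each model's default XGBoost feature importance, keyed by feature name
    scores = [pd.Series(m.get_score()) for m in models]
    # Average across models, treating features absent from a model as zero
    return pd.concat(scores, axis=1).fillna(0).mean(axis=1).sort_values()

imp = average_importance(models)
imp.plot.barh(figsize=(8, 10))
plt.title('Average feature importance across folds')
plt.show()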
We see specifically that three of our features are contributing the most to our predictions:
- user_product_size – How many orders has the user placed that included this item?
- user_product_t-1 – Has this user ordered this product at least once before?
- order_number – The number of orders the user has placed.
All of this makes sense and aligns with our understanding of the problem. Customers who have placed an order for an item before are more likely to repeat an order for that product, and users who place multiple orders of that product are even more likely to re-order. Additionally, the number of orders a customer has created correlates with their likelihood of re-ordering.
The code uses the default XGBoost implementation of feature importance – but we are free to choose any implementation or technique. A wonderful technique (also developed by an NVIDIA Kaggle Grandmaster, Ahmet Erdem) is called LOFO.
From the LOFO GitHub page, we see that LOFO (Leave One Feature Out) Importance calculates the importance of a set of features, for a model and evaluation metric of choice, by iteratively removing each feature from the set and evaluating the performance of the model with a validation scheme of choice. LOFO first evaluates the performance of the model with all the input features included, then removes one feature at a time, retrains the model, and evaluates its performance on a validation set.
This methodology allows us to effectively determine which features are important for the model. LOFO has several advantages compared to other importance types:
- It does not favor granular features.
- It generalizes well to unseen test sets.
- It is model agnostic.
- It gives negative importance to features that hurt performance upon inclusion.
For more information on LOFO, see here: https://github.com/aerdem4/lofo-importance
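As a hedged illustration of the package (API per the lofo-importance README; ‘df’ and the column names below are placeholders for the feature table built earlier):

from sklearn.model_selection import KFold
from lofo import LOFOImportance, Dataset, plot_importance

# df: a pandas DataFrame holding the engineered features plus a 'label' column (placeholder)
features = [c for c in df.columns if c not in ('user_id', 'product_id', 'label')]
dataset = Dataset(df=df, target='label', features=features)

lofo_imp = LOFOImportance(dataset,
                          cv=KFold(n_splits=4, shuffle=True, random_state=0),
                          scoring='neg_log_loss')
importance_df = lofo_imp.get_importance()
plot_importance(importance_df, figsize=(8, 10))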
Hyperparameter Optimization (HPO)
When we trained our XGBoost models, we used the following parameters:
PARAMS = {
    'max_depth': 8,
    'eta': 0.1,
    'colsample_bytree': 0.4,
    'subsample': 0.75,
    'silent': 1,
    'nthread': 40,
    'eval_metric': 'logloss',
    'objective': 'binary:logistic',
    'tree_method': 'gpu_hist'
}
Of these, only a few affect the accuracy of our model when changed: max_depth, eta, colsample_bytree, and subsample. However, the values we used may not be the optimal ones. The art and science of identifying and training models with the optimal hyperparameters is called hyperparameter optimization.
While there is no magic button one can press to automatically identify the optimal hyperparameters, there are techniques that allow you to explore the range of possible hyperparameter values, quickly test candidate configurations, and find the values that come closest to optimal.
A full exploration of these techniques is beyond the scope of this notebook. However, RAPIDS is integrated into many Cloud ML Frameworks for doing HPO as well as with many of the different open source tools. And being able to use the incredible speedups from RAPIDS allows you to go through your ETL, feature engineering, and model training workflow very quickly for each possible experiment – ultimately resulting in fast HPO explorations through large hyperparameter spaces and a significant reduction in Total Cost of Ownership (TCO).
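As one simple, hedged example of such an exploration (a plain random search rather than any specific HPO framework, reusing the PARAMS dictionary and the dtrain/dvalid matrices from the last training fold above), the tunable XGBoost parameters can be sampled and scored on the validation fold:

import random

random.seed(0)
best_score, best_params = float('inf'), None

for _ in range(20):
    # Sample candidate values for the tunable parameters, keeping the rest of PARAMS fixed
    trial = dict(PARAMS,
                 max_depth=random.choice([6, 8, 10, 12]),
                 eta=random.choice([0.05, 0.1, 0.2]),
                 colsample_bytree=random.uniform(0.3, 0.8),
                 subsample=random.uniform(0.6, 1.0))
    model = xgb.train(trial, dtrain, 500, [(dvalid, 'valid')],
                      early_stopping_rounds=50, verbose_eval=False)
    # best_score is the validation logloss at the best iteration
    if model.best_score < best_score:
        best_score, best_params = model.best_score, trial

print(best_score, best_params)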
Conclusion
In this blog, we walked through the components of a Kaggle competition to explain data science best practices for improving forecasting in retail. Specifically, the blog post explained the Instacart Market Basket Analysis Kaggle competition goals, introduced RAPIDS, then offered a workflow to show how to explore the data visually, develop features, train the model, and run a forecasting prediction. The post then reviewed techniques for feature engineering with model explainability and hyperparameter optimization (HPO).
To learn more, be sure to:
- See this Jupyter notebook on forecasting, where we show best practices for GPU-accelerated forecasting within the context of the Instacart Market Basket Analysis Kaggle competition, in which NVIDIA Kaggle Grandmaster Kazuki Onodera won 2nd place using complex feature engineering, gradient boosted tree models, and special modeling of the competition’s F1 evaluation metric.
- Join Paul Hendricks at NVIDIA GTC 2021, on Best Practices for ETL, Feature Engineering, and Model Development for Retail Forecasting Using NVIDIA RAPIDS Data Science Libraries.
- Read Kazuki Onodera’s detailed interview at Medium.com.
- And go to the Rapids.ai open-source website.