Machine learning (ML) is increasingly used across industries. Fraud detection, demand sensing, and credit underwriting are a few examples of specific use cases.
These machine learning models make decisions that affect everyday lives, so it is imperative that their predictions are fair, unbiased, and nondiscriminatory. Fair and accurate predictions become especially vital in high-risk applications where transparency and trust are crucial.
One way to ensure fairness in AI is to analyze the predictions obtained from a machine learning model. This analysis exposes disparities and provides the opportunity to diagnose and correct the underlying causes.
Explainable AI (XAI) is a field of Responsible AI dedicated to studying techniques that explain how a machine learning model makes predictions. These explanations are human-understandable, enabling all stakeholders to make sense of the model’s output and make the necessary decisions. SHAP is one such technique used widely in industry to evaluate and explain a model’s prediction.
This post explains how you can train an XGBoost model, implement the SHAP technique in Python using a CPU and GPU, and finally compare results between the two. By the end of the post, you should be able to answer the following questions:
- Why is it crucial to explain machine learning models, especially in high-stakes decisions?
- How do we differentiate between interpretable and explainable techniques?
- What is the SHAP technique, and how is it used to explain a model’s predictions?
- What is the advantage of GPU-accelerated SHAP?
Explainability versus interpretability
In the context of artificial intelligence and machine learning, it is helpful to distinguish explainability from interpretability. The terms have distinct meanings but are often used interchangeably.
In the seminal paper Psychological Foundations of Explainability and Interpretability in Artificial Intelligence, researchers at the US National Institute of Standards and Technology (NIST) propose the following definitions of explainability and interpretability:
Explainability
Explainability is a low-level, detailed mental representation that seeks to describe some complex processes. An explanation describes how some model mechanism or output came to be.
Interpretability
Interpretability is a high-level, meaningful mental representation that contextualizes a stimulus and leverages human background knowledge. An interpretable model should provide users with a description of what a data point or model output means in context.
In addition, according to Explanation in Artificial Intelligence: Insights from the Social Sciences, interpretability refers to the degree to which humans can understand and trust an ML model’s predictions.
Overview of explanation methods
The approaches to explaining model predictions can be broadly divided into model-specific and post-hoc techniques.
Model-specific
Algorithms like generalized linear models, decision trees, and generalized additive models are designed to be interpretable. These are called glassbox models because it is possible to trace and reason about how a prediction was made. The techniques used to explain such models are model-specific because each relies on a particular model's internals. For instance, interpreting the weights of a linear model is a model-specific explanation.
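As a minimal sketch of a model-specific explanation (the dataset and model below are illustrative assumptions, not part of this post's workflow), the learned weights of a fitted linear model can be read off directly:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Fit a simple linear model; its coefficients are the explanation.
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression().fit(X, y)

# Each coefficient is the change in the prediction for a one-unit
# increase in that feature, holding the other features constant.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")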
Post-hoc
Post-hoc explainability techniques, as the name suggests, are applied after a model has been trained. Well-known post-hoc techniques include SHAP, LIME, and partial dependence plots. They are model-agnostic: they treat the model as a black box and assume access only to the model's inputs and outputs. This makes them especially useful for complex algorithms, such as boosted trees and deep neural networks, which are not explainable through model-specific techniques.
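For instance, SHAP's model-agnostic KernelExplainer only needs a prediction function and a background sample; it never inspects the model's internals. A minimal sketch, where the model and dataset are assumptions chosen purely for illustration:
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Any fitted model works here; the explainer only calls its prediction function.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier().fit(X, y)

# Model-agnostic: pass the prediction function and a small background sample.
explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X, 50))
shap_values = explainer.shap_values(X.iloc[:2])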
This post focuses on SHAP, a post-hoc technique for explaining model predictions.
Using the SHAP technique to explain models
SHAP is an acronym for SHapley Additive exPlanations. It is one of the most commonly used post-hoc explainability techniques. SHAP leverages concepts from cooperative game theory to break a prediction down and measure the impact of each feature on it.
Shapley values are defined as the average marginal contribution of a feature value across all possible feature coalitions. A technique with origins in economics and game theory, Shapley values assign fair payouts to players in a coalition depending upon their contribution to the total gain. Translating this into a machine learning scenario means assigning importance to features in a model depending on their contribution to the model’s prediction.
SHAP unifies several approaches to generate accurate local feature importance values using Shapley values, which can then be aggregated to obtain global explanations. SHAP values measure the impact on the model's prediction of a given feature taking a specific value, compared to the prediction the model would make if that feature took a baseline value. The baseline value is what the model would predict if it had no information about any feature values.
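This additive relationship can be checked directly: the baseline (expected value) plus the per-feature SHAP values reconstructs the model's raw output for a given row. A minimal sketch, assuming a small XGBoost model trained only for illustration (it is not the model built later in this post):
import shap
import xgboost as xgb

# Train a small illustrative model on the Adult dataset used later in this post.
X, y = shap.datasets.adult()
model = xgb.XGBClassifier(n_estimators=100, max_depth=4).fit(X, y.astype(int))

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:1])

# Baseline + per-feature contributions equals the model's raw (log-odds) output.
print(explainer.expected_value + shap_values.sum())
print(model.get_booster().predict(xgb.DMatrix(X.iloc[:1]), output_margin=True))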
SHAP is one of the most widely used post-hoc explainability techniques for calculating feature attributions. It is model-agnostic, can be used as both a local and a global feature attribution technique, and has credible theoretical support from economics. Additionally, a variant of SHAP for tree-based models reduces computation time considerably, helping users gain insights from models quickly.
The following section provides an example of how to use the SHAP technique.
Step 1: Training an XGBoost model and calculating SHAP values
Use the well-known Adult Income Dataset to perform the following:
- Train an XGBoost model on the given dataset to predict whether a person earns more than $50K a year. Such data could be helpful in various use cases like target marketing.
- Compute the SHAP values to explain the individual feature contributions.
- Visualize and interpret the SHAP values.
Installation
SHAP can be installed using its stand-alone Python package, shap, which is available on GitHub:
pip install shap
or
conda install -c conda-forge shap
SHAP is also natively supported by popular libraries such as XGBoost and LightGBM, as well as several R packages.
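For example, XGBoost exposes SHAP values through its native prediction API via the pred_contribs flag. A minimal sketch (the small model below is an assumption for illustration only):
import shap
import xgboost as xgb

# Train a small model purely to demonstrate XGBoost's built-in SHAP support.
X, y = shap.datasets.adult()
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4}
booster = xgb.train(params, dtrain, num_boost_round=50)

# Each row holds one SHAP value per feature plus a final bias (baseline) column.
contribs = booster.predict(xgb.DMatrix(X.iloc[:5]), pred_contribs=True)
print(contribs.shape)  # (5, n_features + 1)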
Setting up the environment
Start by setting up the environment and importing the necessary libraries:
import numpy as np
import pandas as pd
# Visualization Libraries
import matplotlib.pyplot as plt
%matplotlib inline
## Machine learning packages
from sklearn.model_selection import train_test_split
import xgboost as xgb
## Model Interpretation package
import shap
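# Load the JavaScript library used by SHAP's interactive plots in notebooks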
shap.initjs()
# Ensuring Reproducibility
SEED = 12345
# Ignoring the warnings
import warnings
warnings.filterwarnings(action = "ignore")
Dataset
This dataset comes from the UCI Machine Learning Repository and is also available on Kaggle. It contains demographic information about individuals drawn from census data, with attributes such as age, education, hours worked per week, and so on.
The shap library ships with some commonly used datasets, including the preprocessed version of the Adult Income Dataset used below.
X,y = shap.datasets.adult()
X_view,y_view = shap.datasets.adult(display=True)
X_view.head()
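The default call returns a numeric, model-ready encoding of the features, while display=True returns human-readable values (such as category names), which is convenient for inspection and plotting. With the data loaded, a natural next step is a train/test split; the sketch below reuses train_test_split and SEED from the setup above, and the 80/20 ratio is an assumption for illustration:
# Split the data for training and evaluation (the 80/20 split is illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)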