Machine learning (ML) is increasingly used across industries. Fraud detection, demand sensing, and credit underwriting are a few examples of specific use cases.
These machine learning models make decisions that affect everyday lives, so it is imperative that their predictions are fair, unbiased, and nondiscriminatory. Fair and accurate predictions become especially vital in high-risk applications where transparency and trust are crucial.
One way to ensure fairness in AI is to analyze the predictions obtained from a machine learning model. This analysis exposes disparities and provides the opportunity to diagnose and correct the underlying causes.
Explainable AI (XAI) is a field of Responsible AI dedicated to studying techniques that explain how a machine learning model makes predictions. These explanations are human-understandable, enabling all stakeholders to make sense of the model’s output and make the necessary decisions. SHAP is one such technique used widely in industry to evaluate and explain a model’s prediction.
This post explains how you can train an XGBoost model, implement the SHAP technique in Python using a CPU and GPU, and finally compare results between the two. By the end of the post, you should be able to answer the following questions:
- Why is it crucial to explain machine learning models, especially in high-stakes decisions?
- How do we differentiate between interpretable and explainable techniques?
- What is the SHAP technique, and how is it used to explain a model’s predictions?
- What is the advantage of GPU-accelerated SHAP?
Explainability versus interpretability
In the context of artificial intelligence and machine learning, it is helpful to distinguish explainability from interpretability. The terms have distinct meanings but are often used interchangeably.
In the seminal paper Psychological Foundations of Explainability and Interpretability in Artificial Intelligence, researchers at the US National Institute of Standards and Technology (NIST) propose the following definitions of explainability and interpretability:
Explainability
Explainability is a low-level, detailed mental representation that seeks to describe some complex processes. An explanation describes how some model mechanism or output came to be.
Interpretability
Interpretability is a high-level, meaningful mental representation that contextualizes a stimulus and leverages human background knowledge. An interpretable model should provide users with a description of what a data point or model output means in context.
In addition, according to Explanation in Artificial Intelligence: Insights from the Social Sciences, interpretability refers to the degree to which humans can understand and trust an ML model’s predictions.
Overview of explanation methods
The approaches to explaining model predictions can be broadly divided into model-specific and post-hoc techniques.
Model-specific
Algorithms like generalized linear models, decision trees, and generalized additive models are designed to be interpretable. These are called glassbox models because it is possible to trace and reason about how a prediction was made. The techniques used to explain such models are model-specific because each relies on a particular model's internals. For instance, interpreting the weights of a linear model is a model-specific explanation.
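As a minimal sketch of a model-specific explanation (the dataset and model below are illustrative assumptions, not part of this post's workflow), the learned weights of a fitted linear model can be read off directly:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

# Fit a simple linear model; its coefficients are the explanation.
X, y = load_diabetes(return_X_y=True, as_frame=True)
model = LinearRegression().fit(X, y)

# Each coefficient is the change in the prediction for a one-unit
# increase in that feature, holding the other features constant.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef:.2f}")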
Post-hoc
Post-hoc explainability techniques, as the name suggests, are applied after a model has been trained. Well-known post-hoc techniques include SHAP, LIME, and partial dependence plots. They are model-agnostic: they treat the model as a black box and assume access only to the model's inputs and outputs. This makes them especially useful for complex algorithms, such as boosted trees and deep neural networks, which are not explainable through model-specific techniques.
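For instance, SHAP's model-agnostic KernelExplainer only needs a prediction function and a background sample; it never inspects the model's internals. A minimal sketch, where the model and dataset are assumptions chosen purely for illustration:
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

# Any fitted model works here; the explainer only calls its prediction function.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = GradientBoostingClassifier().fit(X, y)

# Model-agnostic: pass the prediction function and a small background sample.
explainer = shap.KernelExplainer(model.predict_proba, shap.sample(X, 50))
shap_values = explainer.shap_values(X.iloc[:2])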
This post focuses on SHAP, a post-hoc technique for explaining model predictions.
Using the SHAP technique to explain models
SHAP is an acronym for SHapley Additive exPlanations. It is one of the most commonly used post-hoc explainability techniques. SHAP leverages concepts from cooperative game theory to break a prediction down and measure the impact of each feature on it.
Shapley values are defined as the average marginal contribution of a feature value across all possible feature coalitions. A technique with origins in economics and game theory, Shapley values assign fair payouts to players in a coalition depending upon their contribution to the total gain. Translating this into a machine learning scenario means assigning importance to features in a model depending on their contribution to the model’s prediction.
SHAP unifies several approaches to generate accurate local feature importance values using Shapley values, which can then be aggregated to obtain global explanations. SHAP values measure the impact on the model's prediction of a given feature taking a specific value, compared to the prediction the model would make if that feature took a baseline value. The baseline value is what the model would predict if it had no information about any feature values.
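This additive relationship can be checked directly: the baseline (expected value) plus the per-feature SHAP values reconstructs the model's raw output for a given row. A minimal sketch, assuming a small XGBoost model trained only for illustration (it is not the model built later in this post):
import shap
import xgboost as xgb

# Train a small illustrative model on the Adult dataset used later in this post.
X, y = shap.datasets.adult()
model = xgb.XGBClassifier(n_estimators=100, max_depth=4).fit(X, y.astype(int))

# TreeExplainer computes exact SHAP values for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:1])

# Baseline + per-feature contributions equals the model's raw (log-odds) output.
print(explainer.expected_value + shap_values.sum())
print(model.get_booster().predict(xgb.DMatrix(X.iloc[:1]), output_margin=True))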
SHAP is one of the most widely used post-hoc explainability techniques for calculating feature attributions. It is model-agnostic, can be used as both a local and a global feature attribution technique, and has credible theoretical support from economics. Additionally, a variant of SHAP for tree-based models reduces computation time considerably, helping users gain insights from models quickly.
The following section provides an example of how to use the SHAP technique.
Step 1: Training an XGBoost model and calculating SHAP values
Use the well-known Adult Income Dataset to perform the following:
- Train an XGBoost model on the given dataset to predict whether a person earns more than $50K a year. Such data could be helpful in various use cases like target marketing.
- Compute the SHAP values to explain the individual feature contributions.
- Visualize and interpret the SHAP values.
Installation
SHAP can be installed using its stand-alone Python package, shap, which is available on GitHub:
pip install shap
or
conda install -c conda-forge shap
SHAP is also natively supported by popular libraries such as XGBoost and LightGBM, as well as several R packages.
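For example, XGBoost exposes SHAP values through its native prediction API via the pred_contribs flag. A minimal sketch (the small model below is an assumption for illustration only):
import shap
import xgboost as xgb

# Train a small model purely to demonstrate XGBoost's built-in SHAP support.
X, y = shap.datasets.adult()
dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "max_depth": 4}
booster = xgb.train(params, dtrain, num_boost_round=50)

# Each row holds one SHAP value per feature plus a final bias (baseline) column.
contribs = booster.predict(xgb.DMatrix(X.iloc[:5]), pred_contribs=True)
print(contribs.shape)  # (5, n_features + 1)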
Setting up the environment
Start by setting up the environment and importing the necessary libraries:
import numpy as np
import pandas as pd
# Visualization Libraries
import matplotlib.pyplot as plt
%matplotlib inline
## Machine learning packages
from sklearn.model_selection import train_test_split
import xgboost as xgb
## Model Interpretation package
import shap
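# Load the JavaScript library used by SHAP's interactive plots in notebooks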
shap.initjs()
# Ensuring Reproducibility
SEED = 12345
# Ignoring the warnings
import warnings
warnings.filterwarnings(action = "ignore")
Dataset
This dataset comes from the UCI Machine Learning Repository and is also available on Kaggle. It contains demographic information about individuals drawn from census data, with attributes such as age, education, hours worked per week, and so on.
The shap library ships with some commonly used datasets, including the preprocessed version of the Adult Income Dataset used below.
X,y = shap.datasets.adult()
X_view,y_view = shap.datasets.adult(display=True)
X_view.head()
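The default call returns a numeric, model-ready encoding of the features, while display=True returns human-readable values (such as category names), which is convenient for inspection and plotting. With the data loaded, a natural next step is a train/test split; the sketch below reuses train_test_split and SEED from the setup above, and the 80/20 ratio is an assumption for illustration:
# Split the data for training and evaluation (the 80/20 split is illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED
)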