fastshap's Introduction

fastshap

CRAN status | R-CMD-check | Codecov test coverage | Lifecycle: experimental

The goal of fastshap is to provide a fast approach (at least relative to other implementations) for computing approximate Shapley values, which help explain the predictions of any machine learning model.

Installation

# Install the latest stable version from CRAN:
install.packages("fastshap")

# Install the latest development version from GitHub:
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("bgreenwell/fastshap")

fastshap's People

Contributors

bgreenwell, brandongreenwell-8451, eddelbuettel, mayer79

fastshap's Issues

Error in UseMethod("explain") : no applicable method for 'explain' applied to an object of class "xgb.Booster"

Hello,

After fitting an xgboost model, I run:

shap_values <- explain(model_n, X = trainval)

I receive this error:

Error in UseMethod("explain") : 
  no applicable method for 'explain' applied to an object of class "xgb.Booster"

If I use this line:

shap_values <- fastshap::explain(model_n, X = trainval)

I receive this error:

Error in explain.default(object, feature_names = feature_names, X = X,  : 
  argument "pred_wrapper" is missing, with no default

trainval is a data frame containing the training data. It does not contain the Y values.
I have also tried passing it as a matrix, but this gives the same error.

Any idea what could be the cause?

Thanks a lot!
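
The two errors point at S3 dispatch. One plausible reading (an assumption, since the installed fastshap version and the attached packages are not stated): the unqualified explain() resolves to a generic from another attached package, while fastshap::explain() finds no method for this class and falls back to the default method, which requires pred_wrapper. The toy generic below mimics that fallback in base R. The names explain_demo, toy_booster, and pfun are hypothetical, and this is not fastshap's code.

```r
# Toy S3 generic (hypothetical; it only mimics the dispatch pattern)
explain_demo <- function(object, ...) UseMethod("explain_demo")

# Default method: needs a user-supplied prediction wrapper
explain_demo.default <- function(object, X, pred_wrapper, ...) {
  if (missing(pred_wrapper)) {
    stop("argument \"pred_wrapper\" is missing, with no default", call. = FALSE)
  }
  pred_wrapper(object, X)
}

# An object whose class has no dedicated method falls through to the default
toy_model <- structure(list(coef = 2), class = "toy_booster")
X <- data.frame(x = 1:3)

msg <- tryCatch(explain_demo(toy_model, X = X), error = conditionMessage)
print(msg)

# Supplying a wrapper that returns one numeric prediction per row resolves it
pfun <- function(object, newdata) object$coef * newdata$x
res <- explain_demo(toy_model, X = X, pred_wrapper = pfun)
print(res)
```

For a real xgboost model, a wrapper along the lines of function(object, newdata) predict(object, newdata = data.matrix(newdata)) is a reasonable starting point, provided it returns a plain numeric vector.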

Question, not issue: SHAP values or Shapley values?

Maybe I am just very confused: does this package calculate Shapley values or SHAP values?
In the book Interpretable Machine Learning, a distinction is made between Shapley values and SHAP values.
My interpretation of the documentation and of the issue "Shap values don't add up to prediction values" is that SHAP values are calculated if exact = TRUE (I am using xgboost). Thanks for a great package.

Calculation of bias with weighting

First of all thanks for the really nice library 👍

(I can add code demonstrating the problems I found here, but I wanted to discuss the possible solutions first)

For specific purposes, I wanted to use fastshap to calculate approximate SHAP values for an xgboost object, instead of using either fastshap with exact = TRUE or xgboost's built-in predict() with predcontrib = TRUE.

The reason one might want to do this: if you use a specific loss function in your model, the exact SHAP values might not sum to the target directly; you have to apply a link function to transform them to the target scale. That is not helpful if you need the additive property.

There's one caveat here. If you pass a weight variable to xgboost::xgb.DMatrix, the bias term calculated by the method won't be a simple average of the training predictions; it will be a weighted average. I noticed that the adjust argument calculates fnull <- mean(pred_wrapper(object, newdata = X)).

I solved this by forking your explain and explain.default and adding a new argument, fnull, that defaults to NULL. If it is NULL, the baseline is calculated as the mean of the training predictions; otherwise you can supply a number (for example, the weighted mean of the training predictions, calculated outside).

What would you say would be the best way to recreate shap values from xgboost? Same as the approach that I've mentioned here?

I'm very open to creating a pull request with such a fix.
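
The difference between the two baselines can be seen without xgboost. The sketch below uses a weighted lm() as a stand-in model; the data, the weights w, and the variable names are made up for illustration, and it only shows how a user-supplied baseline (the fnull idea described above) would be computed outside explain().

```r
set.seed(1)
n <- 100
d <- data.frame(x = rnorm(n))
d$y <- 1 + 2 * d$x + rnorm(n)
w <- runif(n, 0.1, 5)  # hypothetical case weights

# Stand-in for a model fitted with observation weights
fit <- lm(y ~ x, data = d, weights = w)
preds <- predict(fit, newdata = d)

fnull_unweighted <- mean(preds)            # what a plain mean() baseline gives
fnull_weighted <- weighted.mean(preds, w)  # baseline consistent with the weights

print(c(unweighted = fnull_unweighted, weighted = fnull_weighted))
```

For weighted least squares with an intercept, the weighted mean of the fitted values equals the weighted mean of the response, which is why the weighted version is the natural baseline in this toy case. Whether the same holds for a particular xgboost objective is not something this sketch establishes.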

`explain` fails when variables are removed as part of `recipes::recipe` preprocessing

Reprex:

library(tidyverse)
library(tidymodels)
library(fastshap)

# Sample data: `mtcars` with 50% of the entries in `qsec` replaced with `NA`s
set.seed(2022)
data <- mtcars %>%
    select(-c(vs, am)) %>%
    mutate(qsec = replace(
        qsec, sample.int(nrow(mtcars), size = nrow(mtcars) / 2), NA_real_))

recipe <- recipe(mpg ~ ., data = data) %>%
    # Remove variables with more than 30% of missing data; this will be `qsec`
    step_filter_missing(all_numeric_predictors(), threshold = 0.3)

# We can confirm that `qsec` has been removed after pre-processing
# recipe %>% prep() %>% bake(new_data = NULL)

# Define & fit model to data
spec <- linear_reg() %>% set_engine("glm")
fitted_model <- workflow() %>%
    add_recipe(recipe) %>%
    add_model(spec) %>%
    fit(data = data)

# FastSHAP
fshap <- explain(
    fitted_model,
    X = recipe %>% prep() %>% bake(new_data = NULL) %>% select(-c(mpg)),
    pred_wrapper = function(model, newdata) predict(model, newdata)$.pred,
    shap_only = FALSE)

This throws an error

Error in { :
task 1 failed - "The following required columns are missing: 'qsec'."

This is because fitted_model retains a reference to qsec even though the variable was removed during pre-processing in recipe.

Question: What is the canonical way to supply X here? I could reference data directly

fshap <- explain(
    fitted_model,
    X = data %>% select(-c(mpg)),
    pred_wrapper = function(model, newdata) predict(model, newdata)$.pred,
    shap_only = FALSE)

but (1) this doesn't seem to be very tidymodels-canonical, and (2) this then includes qsec in the SHAP analysis (which it shouldn't). A fix to that issue would be to use the feature_names argument to exclude qsec, but this seems unnecessarily complicated.

What is the fastshap-way to provide X via a recipe?

Fastshap does not work with randomForest package

I tried to modify the simple example from the fastshap webpage so that it uses randomForest instead of ranger. However, the returned SHAP values are all zero. What could be wrong? The slightly modified example program is below.

Jarkko

# Load required packages
library(fastshap)  # for fast (approximate) Shapley values
library(randomForest)

# Simulate training data
trn <- gen_friedman(3000, seed = 101)
X <- subset(trn, select = -y)  # feature columns only

# Fit a random forest
set.seed(102)

rfo <- randomForest::randomForest(y ~ ., data =  trn)

# Prediction wrapper
pfun <- function(object, newdata) {
  unname(predict(object, data = newdata))
}

# Compute fast (approximate) Shapley values using 100 Monte Carlo repetitions
system.time({  # estimate run time
  set.seed(5038)
  shap <- fastshap::explain(rfo, X = X, pred_wrapper = pfun, nsim = 100)
})

# Results are returned as a tibble (with the additional "shap" class)
shap
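
A likely cause is the argument name in the wrapper: ranger's predict method takes data, whereas randomForest's, like most base R predict methods, takes newdata. An unknown data argument is silently absorbed by ..., so the wrapper returns the same predictions no matter which rows it is given, and the Monte Carlo differences cancel to zero. The sketch below reproduces the effect with lm() so that it runs in base R; the toy data are made up.

```r
set.seed(101)
trn <- data.frame(x1 = runif(50), x2 = runif(50))
trn$y <- 3 * trn$x1 - trn$x2 + rnorm(50, sd = 0.1)
fit <- lm(y ~ ., data = trn)

newdata <- trn[1:5, c("x1", "x2")]

# `data` is not a formal argument of predict.lm, so it is swallowed by `...`
# and the fitted values for all 50 training rows come back
bad <- unname(predict(fit, data = newdata))

# `newdata` is the documented argument: one prediction per supplied row
good <- unname(predict(fit, newdata = newdata))

print(length(bad))
print(length(good))
```

Under that reading, changing the wrapper above to unname(predict(object, newdata = newdata)) should give non-zero values.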

Approximate method with glm model returns zeros.

I've been perusing various sites that describe how to determine approximate values with fastshap for a binomial glm model, but so far have been unsuccessful in making it work. Here's what I have been using:

x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y  <- c(1,0,1,0,1,1,0,0,0,1)

df <- data.frame(x1, x2, x3, x4, y)

fit <- glm(y ~ ., data=df, family=binomial)
X <- model.matrix(y ~., df)[,-1]

pfun <- function(object, newdata) {
  predict(object, type="response")
}

shap <- explain(fit , X = X, pred_wrapper = pfun, nsim = 10)

Here's the result:

> summary(shap)
       x1          x2          x3          x4   
 Min.   :0   Min.   :0   Min.   :0   Min.   :0  
 1st Qu.:0   1st Qu.:0   1st Qu.:0   1st Qu.:0  
 Median :0   Median :0   Median :0   Median :0  
 Mean   :0   Mean   :0   Mean   :0   Mean   :0  
 3rd Qu.:0   3rd Qu.:0   3rd Qu.:0   3rd Qu.:0  
 Max.   :0   Max.   :0   Max.   :0   Max.   :0 

Obviously not what I expect. With the exact method, I do get values that make sense:

shap <- explain(fit , X = X, exact=TRUE, nsim = 10)
summary(shap)

       x1                x2                x3                 x4         
 Min.   :-0.3659   Min.   :-0.8149   Min.   :-0.62699   Min.   :-1.0497  
 1st Qu.:-0.3659   1st Qu.:-0.8149   1st Qu.:-0.62699   1st Qu.:-1.0497  
 Median :-0.3659   Median :-0.8149   Median : 0.06967   Median : 0.6998  
 Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
 3rd Qu.: 0.5489   3rd Qu.: 1.2223   3rd Qu.: 0.59215   3rd Qu.: 0.6998  
 Max.   : 0.8538   Max.   : 1.2223   Max.   : 0.76632   Max.   : 0.6998  

but I cannot use the exact method on my actual data because the model features are not independent. Any help getting this to work would be appreciated!
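
The zeros are consistent with the wrapper ignoring newdata: predict(object, type = "response") always returns the fitted values for the training rows, so every pair of perturbed data sets produces identical predictions. A minimal revision, assuming X stays a numeric matrix as above, converts newdata to a data frame and passes it through. The names pfun_orig and pfun_fixed are introduced here only for comparison.

```r
x1 <- c(1,1,1,0,0,0,0,0,0,0)
x2 <- c(1,0,0,1,1,1,0,0,0,0)
x3 <- c(3,2,1,3,2,1,3,2,1,3)
x4 <- c(1,0,1,1,0,1,0,1,0,1)
y  <- c(1,0,1,0,1,1,0,0,0,1)
df <- data.frame(x1, x2, x3, x4, y)

fit <- glm(y ~ ., data = df, family = binomial)
X <- model.matrix(y ~ ., df)[, -1]

# Original wrapper: newdata is never used
pfun_orig <- function(object, newdata) {
  predict(object, type = "response")
}

# Revised wrapper: predictions depend on the rows that are passed in
pfun_fixed <- function(object, newdata) {
  unname(predict(object, newdata = as.data.frame(newdata), type = "response"))
}

sub <- X[1:3, , drop = FALSE]
print(length(pfun_orig(fit, sub)))   # always the 10 training rows
print(length(pfun_fixed(fit, sub)))  # one prediction per row of `sub`
```

Passing pred_wrapper = pfun_fixed to explain() is then the change to try.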

Export variance?

Love the package! I'm wondering if there is some simple way to export the Monte Carlo variance for estimated phi values? This could help quantify uncertainty. Thanks!
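
To make the request concrete, here is a base R sketch of a permutation-sampling estimator for a single Shapley value that also reports a Monte Carlo standard error. It follows the general sampling scheme for approximate Shapley values and is not fastshap's implementation; the function mc_shapley, its arguments, and the toy linear model are all hypothetical.

```r
# Monte Carlo estimate of one Shapley value with its standard error.
# f: function taking a numeric matrix and returning one prediction per row
# X: numeric matrix of background data; x: 1-row matrix to explain; j: feature index
mc_shapley <- function(f, X, x, j, nsim = 200) {
  p <- ncol(X)
  diffs <- numeric(nsim)
  for (s in seq_len(nsim)) {
    w <- X[sample(nrow(X), 1), , drop = FALSE]  # random background row
    perm <- sample(p)                            # random feature ordering
    from_x <- perm[seq_len(which(perm == j))]    # features up to and including j
    b1 <- w
    b1[, from_x] <- x[, from_x]                  # hybrid row, feature j taken from x
    b2 <- b1
    b2[, j] <- w[, j]                            # same row, feature j taken from w
    diffs[s] <- f(b1) - f(b2)
  }
  c(estimate = mean(diffs), se = sd(diffs) / sqrt(nsim))
}

set.seed(42)
X <- matrix(rnorm(300), ncol = 3)
beta <- c(2, -1, 0)
f <- function(m) drop(m %*% beta)  # toy linear "model"
x <- X[1, , drop = FALSE]

res1 <- mc_shapley(f, X, x, j = 1)
res3 <- mc_shapley(f, X, x, j = 3)  # feature 3 has no effect in the toy model
print(res1)
print(res3)
```

The standard error shrinks at the usual 1 / sqrt(nsim) rate, so it gives a rough guide to how many repetitions are needed for a given precision.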

Shap values don't add up to prediction values

I'm not sure if I'm using this correctly, but I would have thought that the sum of the shap values should equal the prediction. In this case they don't seem to:

require(fastshap)
require(ranger)
require(tidyverse)

data(iris)

mdl <- ranger(Sepal.Length ~ . -Species,
              data = iris)
ranger_predict <- function(object, newdata){
  return(predict(object, newdata)$predictions)
}

rng_predictions <- predict(mdl, iris)$predictions

shaps <- explain(object = mdl, 
                 X = iris %>% select(-Species, - Sepal.Length), 
                 pred_wrapper = ranger_predict,
                 nsim = 1)
shap_predictions <- rowSums(shaps)
diffs <- rng_predictions - shap_predictions
plot(diffs)
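
Shapley values are defined to sum to the difference between a prediction and a baseline (typically the average prediction), not to the prediction itself. The sketch below checks this on a linear model, where exact Shapley values have a closed form if the features are treated as independent; the model and columns are chosen only for illustration.

```r
fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
X <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")]
beta <- coef(fit)[names(X)]

# Closed form for a linear model: phi_j = beta_j * (x_j - mean(x_j))
centered <- sweep(as.matrix(X), 2, colMeans(X))  # x_j - mean(x_j)
phi <- sweep(centered, 2, beta, "*")             # scale each column by beta_j

pred <- predict(fit, newdata = X)
baseline <- mean(pred)

# The row sums match pred - baseline; adding the baseline recovers pred
print(all.equal(unname(rowSums(phi)), unname(pred - baseline)))
print(all.equal(unname(rowSums(phi) + baseline), unname(pred)))
```

With Monte Carlo estimates, and especially with nsim = 1, the identity holds only approximately. The adjust = TRUE option used elsewhere in these issues is meant to enforce it.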

fastshap handling nominal variables

How does fastshap handle nominal variables?

library(tidymodels)
library(tidyverse)
library(mlbench)
library(xgboost)
library(lightgbm)
library(treesnip)
data(Glass)

head(Glass)
Glass$Type
rec <- recipe(RI ~ ., data = Glass) %>% step_scale(all_numeric())

prep_rec <- prep(rec, retain = TRUE)

split <- initial_split(Glass)

train_data <- training(split)

test_data <- testing(split)

model <-
  parsnip::boost_tree(
    mode = "regression"
  ) %>%
  set_engine("lightgbm", verbose = 0)

wf_glass <- workflow() %>%
  add_recipe(rec) %>%
  add_model(model)
fit <- wf_glass %>% parsnip::fit(data = train_data)

library(fastshap)
explain(
  object = fit %>% extract_fit_parsnip(),
  newdata = test_data %>% select(-RI) %>% as.matrix(),
  X = train_data %>% select(-RI) %>% as.matrix(),
  pred_wrapper = predict
)

This leads to the following error: Error in genFrankensteinMatrices(X, W, O, feature = column) : Not compatible with requested type: [type=character; target=double].

My guess:
This might be because calling as.matrix() on a data.frame with mixed column types produces a character matrix. That matrix can't be converted to a matrix of type double without losing the values of the factor columns (here, Type). On the other hand, as the error says, we can't use a numeric target for the regression task if all other variables are of class character.

Am I missing something or is this a possible transforming problem?
Is there an option to specify nominal/factor variables?
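
The coercion described in the guess above is easy to confirm in base R. The sketch uses a made-up two-column data frame and contrasts as.matrix(), which coerces mixed columns to character, with data.matrix(), which replaces factors by their integer codes.

```r
d <- data.frame(
  num = c(1.5, 2.5, 3.5),
  fac = factor(c("a", "b", "a"))
)

m_chr <- as.matrix(d)    # mixed column types are coerced to character
m_num <- data.matrix(d)  # factors are replaced by their integer codes

print(typeof(m_chr))
print(typeof(m_num))
print(m_num)
```

Whether integer codes are appropriate depends on how the model was trained; they impose an ordering on the levels. Keeping X as a data frame and converting inside the prediction wrapper is another option to consider.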

h2o fastshap

Hello, would you be interested in the h2o fastshap code I developed?

I modified the explain function quite a lot. It will not work within the current explain functions.

I am uncertain what to do with it, but it does work.

~4000 observations: tibble giving one row of outputs

Hi, I'm trying to generate the Shapley values for a dataset with 14 features and around 4000 observations (a data frame with 14 columns and 4000 rows). I created a model using ranger, but when I generate my Shapley values I only receive a one-row output. I get different values depending on how many observations I include, but still just one row of output. Please help.

Different Scale of SHAP values for Approx vs. ExactSHAP

Hi there,

great work with the package first and foremost.

Quick question: Does the ApproxSHAP method scale or standardize SHAP values in any way? When I create global feature attribution rankings for a GBM using Approx as well as TreeSHAP, my SHAP values end up on substantially different scales. For example, using ApproxSHAP the mean absolute values are in the range of 0.01-0.18, while they lie between 1 and 14 using TreeSHAP.

Thanks in advance!

force_plot not working

I can't get force_plot to work with the example code or the article (https://bgreenwell.github.io/fastshap/articles/forceplot.html) code.

Restarting R session...

> remove.packages("fastshap")
Removing package from ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library’
(as ‘lib’ is unspecified)
> 
> install.packages("fastshap")
trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.0/fastshap_0.0.5.tgz'
Content type 'application/x-gzip' length 561465 bytes (548 KB)
==================================================
downloaded 548 KB


The downloaded binary packages are in
	/var/folders/fs/n8fm688d7139p9zjkl1llsbdv6n1th/T//RtmpBRZmED/downloaded_packages
> 
> library(fastshap)
> 
> # Load the sample data; see ?datasets::mtcars for details
> data(mtcars)
> 
> # Fit a projection pursuit regression model
> mtcars.ppr <- ppr(mpg ~ ., data = mtcars, nterms = 1)
> 
> # Compute approximate Shapley values using 10 Monte Carlo simulations
> set.seed(101)  # for reproducibility
> shap <- explain(mtcars.ppr, X = subset(mtcars, select = -mpg), nsim = 10, 
+                 pred_wrapper = predict, adjust = TRUE)
Warning message:
The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
Using compatibility `.name_repair`.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated. 
> 
> # Visualize first explanation
> preds <- predict(mtcars.ppr, newdata = mtcars)
> x <- subset(mtcars, select = -mpg)[1L, ]  # take first row of feature values
> force_plot(shap[1L, ], baseline = mean(preds), feature_values = x)
Error in py_call_impl(callable, dots$args, dots$keywords) : 
  TypeError: save_html() got an unexpected keyword argument 'plot_html'

And here is my session info

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fastshap_0.0.5

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6         pillar_1.6.1       compiler_4.0.3     plyr_1.8.6        
 [5] tools_4.0.3        jsonlite_1.7.2     lifecycle_1.0.0    tibble_3.1.2      
 [9] gtable_0.3.0       lattice_0.20-44    png_0.1-7          pkgconfig_2.0.3   
[13] rlang_0.4.11       Matrix_1.3-3       DBI_1.1.1          rstudioapi_0.13   
[17] gridExtra_2.3      dplyr_1.0.6        generics_0.1.0     vctrs_0.3.8       
[21] xgboost_1.4.1.1    grid_4.0.3         tidyselect_1.1.1   reticulate_1.20   
[25] glue_1.4.2         data.table_1.14.1  R6_2.5.0           fansi_0.5.0       
[29] ggplot2_3.3.4      purrr_0.3.4        magrittr_2.0.1     scales_1.1.1      
[33] ellipsis_0.3.2     matrixStats_0.59.0 assertthat_0.2.1   abind_1.4-5       
[37] colorspace_2.0-1   utf8_1.2.1         munsell_0.5.0      crayon_1.4.1      
IPython could not be loaded!
> 

shap v0.36.0 causes error in force_plot

Hi Brandon,

Thanks for the great package. I've noticed that shap v0.36.0 made some changes to the API that cause fastshap::force_plot to fail. Specifically, shap$save_html(tfile, plot_html = fp) needs to be changed to shap$save_html(tfile, plot = fp) to work with the latest version of shap.

Not sure what the best workaround would be, either requiring the latest version of shap or maybe if the original method throws an error, try again with the new method?
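
The "try the original call, then retry with the new keyword" idea can be sketched with tryCatch(). To keep the example runnable without Python, save_html_old and save_html_new below are hypothetical R stand-ins for the two signatures of shap's save_html, and save_with_fallback is an illustrative helper, not part of fastshap.

```r
# Stand-ins for the two keyword conventions (hypothetical; no Python needed)
save_html_old <- function(path, plot_html) paste("saved", plot_html, "to", path)
save_html_new <- function(path, plot) paste("saved", plot, "to", path)

# Try the older keyword first; on error, retry with the newer one
save_with_fallback <- function(save_fun, path, fp) {
  tryCatch(
    save_fun(path, plot_html = fp),
    error = function(e) save_fun(path, plot = fp)
  )
}

out_old <- save_with_fallback(save_html_old, "force.html", "fp")
out_new <- save_with_fallback(save_html_new, "force.html", "fp")
print(out_old)
print(out_new)
```

A fallback like this works with both versions but can hide unrelated errors raised by the first call, so checking the installed shap version explicitly may be the cleaner choice.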

Add h2o support

Package h2o now supports feature contributions for certain models via h2o.predict_contributions().

Sum of SHAP values not equal to `pred - mean(pred)` when `exact = TRUE`

Hi! Thanks for the great package. I want to clarify a point of confusion I have before proceeding. I found the sample code you posted here and ran it locally. Quick reprex:

library(xgboost)
library(fastshap)
library(SHAPforxgboost) #to load the dataX

y_var <-  "diffcwv"
dataX <- as.matrix(dataXY_df[,-..y_var])

# hyperparameter tuning results
param_list <- list(objective = "reg:squarederror",  # For regression
                   eta = 0.02,
                   max_depth = 10,
                   gamma = 0.01,
                   subsample = 0.95
)
mod <- xgboost(data = dataX, label = as.matrix(dataXY_df[[y_var]]), 
               params = param_list, nrounds = 10, verbose = FALSE, 
               nthread = parallel::detectCores() - 2, early_stopping_rounds = 8)

# Grab SHAP values directly from XGBoost
shap <- predict(mod, newdata = dataX, predcontrib = TRUE)

# Compute shapley values 
shap2 <- explain(mod, X = dataX, exact = TRUE, adjust = TRUE)

# Compute bias term; difference between predictions and sum of SHAP values
pred <- predict(mod, newdata = dataX)
head(bias <- pred - rowSums(shap2))
#> [1] 0.4174776 0.4174775 0.4174775 0.4174775 0.4174775 0.4174776

# Compare to output from XGBoost
head(shap[, "BIAS"])
#> [1] 0.4174775 0.4174775 0.4174775 0.4174775 0.4174775 0.4174775

# Check that SHAP values sum to the difference between pred and mean(pred)
head(cbind(rowSums(shap2), pred - mean(pred)))
#>             [,1]        [,2]
#> [1,] -0.03048085 -0.03053582
#> [2,] -0.08669319 -0.08674819
#> [3,] -0.05410853 -0.05416352
#> [4,] -0.09465271 -0.09470773
#> [5,] -0.01655553 -0.01661054
#> [6,] -0.01729831 -0.01735327

In this code, the SHAP values' sum is not equal to the difference between pred and mean(pred) as suggested. Instead the SHAP values' sum is (nearly) equal to the BIAS term from the stats::predict(object, X, predcontrib = TRUE, ...) call in explain.xgb.Booster when exact = TRUE.

# Compare pred - BIAS from shap2
head(cbind(rowSums(shap2), pred - attributes(shap2)$baseline))
#>             [,1]        [,2]
#> [1,] -0.03048085 -0.03048083
#> [2,] -0.08669319 -0.08669320
#> [3,] -0.05410853 -0.05410853
#> [4,] -0.09465271 -0.09465274
#> [5,] -0.01655553 -0.01655555
#> [6,] -0.01729831 -0.01729828

So, quick questions:

  1. Should adjust = TRUE have the same effect for exact = TRUE output as it does for exact = FALSE output? In the line above (explain(mod, X = dataX, exact = TRUE, adjust = TRUE)), adjust = TRUE has no effect. It is simply passed on to the predict method of xgb.Booster and silently swallowed. Is this the intended behavior?
  2. Can you briefly explain the difference between the baseline/bias term (produced by predict(xgb.Booster, newdata = X, predcontrib = TRUE) as the last matrix column) and mean(prediction)? I scoured the xgboost/lightgbm docs but couldn't find much.

Visualization omitted, Javascript library not loaded

Hi, I followed the example code in the fastshap vignette and encountered this error:

library(fastshap)
library(xgboost)

data(boston, package = "pdp")
X <- data.matrix(subset(boston, select = -cmedv))  # matrix of feature values

set.seed(859)  # for reproducibility
bst <- xgboost(X, label = boston$cmedv, nrounds = 338, max_depth = 3, eta = 0.1,
               verbose = 0)

ex <- explain(bst, exact = TRUE, X = X)

force_plot(object = ex[1L, ], feature_values = X[1L, ], display = "html")  

Error:

 <b>Visualization omitted, Javascript library not loaded!</b><br>
  Have you run `initjs()` in this notebook? If this notebook was from another
  user you must also trust this notebook (File -> Trust notebook). If you are viewing
  this notebook on github the Javascript has been stripped for security. If you are using
  JupyterLab this error is because a JupyterLab extension has not yet been written.

Here is my working environment:

sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936 
[2] LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936
[4] LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xgboost_1.6.0.1 fastshap_0.0.7 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.9        pillar_1.7.0      compiler_4.2.1    tools_4.2.1      
 [5] digest_0.6.29     jsonlite_1.8.4    lifecycle_1.0.3   tibble_3.1.8     
 [9] gtable_0.3.1      lattice_0.20-45   pkgconfig_2.0.3   png_0.1-8        
[13] rlang_1.0.6       Matrix_1.5-3      DBI_1.1.2         cli_3.4.1        
[17] rstudioapi_0.13   fastmap_1.1.0     gridExtra_2.3     dplyr_1.0.10     
[21] rappdirs_0.3.3    generics_0.1.2    vctrs_0.5.0       rprojroot_2.0.3  
[25] grid_4.2.1        tidyselect_1.2.0  reticulate_1.26   glue_1.6.2       
[29] data.table_1.14.4 here_1.0.1        R6_2.5.1          fansi_0.5.0      
[33] ggplot2_3.4.0     magrittr_2.0.3    htmltools_0.5.3   scales_1.2.1     
[37] ellipsis_0.3.2    assertthat_0.2.1  colorspace_2.0-3  utf8_1.2.2       
[41] munsell_0.5.0     crayon_1.5.1

Keras prediction issue

I've trained a Keras sequential model with the following code:

library(keras)
library(dplyr)  # for slice(), select(), and mutate() used below
library(fastshap)

x_train <- iris |> 
  slice(1:100) |> 
  select(-Species) |> 
  as.matrix()
y_train <- iris |> 
  slice(1:100) |> 
  select(Species) |> 
  mutate(Species = as.numeric(Species)-1) |> 
  as.matrix()

mod <- keras_model_sequential()
mod |> 
  layer_dense(units = 16, input_shape = 4) |> 
  layer_activation(activation = "relu") |> 
  layer_dropout(rate = 0.3) |> 
  layer_dense(units = 1, activation = "sigmoid")
mod |> 
  compile(optimizer = "nadam",
          loss = loss_binary_crossentropy())

history <- fit(
  object = mod, 
  x = x_train, 
  y = y_train,
  batch_size = 10, 
  epochs = 30,
  validation_split = 0.1
)

Then I tried to compute explanations with fastshap's explain():

p_fun <- function(object, newdata) predict(object, x = newdata)
shp_exp <- fastshap::explain(mod, 
                             X = x_train, 
                             pred_wrapper = p_fun)
shp_exp

but my shp_exp has only one row... why?
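
One thing worth checking (an assumption, not a confirmed diagnosis) is the shape of what the wrapper returns. A Keras model with a single output unit returns an n x 1 matrix from predict(), while a prediction wrapper is generally expected to return a plain numeric vector with one value per row. The sketch below uses a hypothetical stand-in, fake_predict, to show the difference in shape and the usual fix of flattening with as.numeric().

```r
# Stand-in for a model whose predict method returns an n x 1 matrix
# (hypothetical; it mimics the output shape only)
fake_predict <- function(newdata) matrix(rowSums(newdata), ncol = 1)

x <- matrix(1:8, nrow = 4)

# Wrapper that passes the matrix through unchanged
p_matrix <- function(object, newdata) fake_predict(newdata)

# Wrapper that flattens the result to a plain vector
p_vector <- function(object, newdata) as.numeric(fake_predict(newdata))

print(dim(p_matrix(NULL, x)))     # a 4 x 1 matrix
print(dim(p_vector(NULL, x)))     # NULL, i.e. a plain vector
print(length(p_vector(NULL, x)))  # one value per row
```

Applied to the code above, that would mean wrapping the predict() call in as.numeric() inside p_fun.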

LightGBM is on CRAN

Nice package! Since LightGBM is now on CRAN, it would be straightforward to add support as with XGBoost. If you would support this idea, I can make a pull request.

Implement additive force plots

See here for examples. Here's a cheap hack using reticulate:

force_plot <- function(object, ...) {
  UseMethod("force_plot")
}

force_plot.explain <- function(object, prediction = NULL, baseline = NULL, 
                               feature_values = NULL,...) {
  object <- (object / sum(object)) * (prediction - baseline)
  shap <- reticulate::import("shap")
  fp <- shap$force_plot(
    base_value = baseline, 
    shap_values = data.matrix(object),
    features = data.matrix(feature_values),
    feature_names = names(object),
    matplotlib = FALSE,
    ...
  )
  tfile <- tempfile(fileext = ".html")
  shap$save_html(tfile, plot_html = fp)
  # Check for dependency
  if (requireNamespace("rstudioapi", quietly = TRUE)) {
    rstudioapi::viewer(tfile)
  } else if (requireNamespace("utils", quietly = TRUE)) {
    utils::browseURL(tfile)
  } else {
    stop("Packages \"rstudioapi\" or \"utils\" needed for this function to ",
         "work. Please install one of them.", call. = FALSE)
  }
}

Is it valid to aggregate fastshap values to sets of features?

Can I check if it is valid to aggregate the fastshap values for related features?

  • For example, 'height' and 'weight' into 'BMI', and
  • 2 related ordinal features into one feature?

I understand that this is more of a question than an issue, but I'm not sure where else to post it, as it pertains only to this package.

Appreciate your help on this!
Thanks in advance!
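
Mechanically, aggregation is a column sum, and it preserves the row totals. The sketch below uses made-up Shapley values and a hypothetical grouping list to show this.

```r
# Toy matrix of Shapley values (made-up numbers, one column per feature)
shap <- cbind(
  height = c(0.20, -0.10),
  weight = c(0.30, 0.05),
  age    = c(-0.15, 0.40)
)

# Hypothetical grouping of related features
groups <- list(body = c("height", "weight"), age = "age")

# Sum the columns within each group
shap_grouped <- sapply(groups, function(cols) rowSums(shap[, cols, drop = FALSE]))

# Row totals are unchanged by grouping
print(shap_grouped)
print(all.equal(rowSums(shap_grouped), rowSums(shap)))
```

Because the totals are unchanged, the grouped values still sum to the same prediction-minus-baseline quantity. Note that the summed value is not in general the same as the Shapley value obtained by treating the group as a single player, particularly when the features interact, so the interpretation deserves some care.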

All shap values were 0.

library(fastshap)
library(randomForest)
model1rf1 <- rf1[["finalModel"]]
pfun <- function(model1rf1, newdata) {
  p_deri1rf1$Y
}

# Compute fast (approximate) Shapley values using 10 Monte Carlo repetitions

system.time({  # estimate run time
  set.seed(5038)
  shap <- fastshap::explain(model1rf1, X = deridata1_x, pred_wrapper = pfun, nsim = 10)
})

When I run the above code, all the shap values I get are 0.

explain() seems to fail if X is a tibble (two different errors)

Apologies in advance if this is related to the package configuration on my machine - but explain() seems to fail for me when 'X' is a tibble. Here is an example that is reproducible on my machine:

# Load required packages
library(fastshap)  # for fast (approximate) Shapley values
library(ranger)    # for fast random forest algorithm
library(dplyr)
library(tibble)

# Simulate training data
trn <- gen_friedman(200, seed = 101)
X <- subset(trn, select = -y)  # feature columns only

# Fit a random forest
set.seed(102)
rfo <- ranger(y ~ ., data =  trn)

# Prediction wrapper
pfun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}

# Succeeds when X is a dataframe
shap <- explain(rfo, X = X, pred_wrapper = pfun, nsim = 1)

# Fails when X is a tibble (with error "Subscript `O` is a matrix, the data `X[O]` must have size 1.")
shap <- explain(rfo, X = X %>% tibble(), pred_wrapper = pfun, nsim = 1)

# Still succeeds if we add a factor predictor
trn$fac_pred <- sample(c("A", "B", "C"), 200, TRUE) %>% factor()
X <- subset(trn, select = -y)  # feature columns only
set.seed(102)
rfo <- ranger(y ~ ., data =  trn)
shap <- explain(rfo, X = X, pred_wrapper = pfun, nsim = 1)

# But fails in a new way with tibbles (with error "Error: Can't combine `x1` <double> and `fac_pred` <factor<5eb8e>>.")
shap <- explain(rfo, X = X %>% tibble(), pred_wrapper = pfun, nsim = 1)

Error when using 'autoplot' to get the contributions of a specific row

Dear Brandon Greenwell,

I'm using the great 'fastshap' library to compute approximate Shapley values. When running the first code example provided in the official CRAN document [1], the last line gives the following error:

ERROR while rich displaying an object: Error in farver::decode_colour(colors, alpha = TRUE, to = "lab", na_value = "transparent"): unused argument (na_value = "transparent")

The code is:

#
# A projection pursuit regression (PPR) example
#

# Load the sample data; see ?datasets::mtcars for details
data(mtcars)

# Fit a projection pursuit regression model
mtcars.ppr <- ppr(mpg ~ ., data = mtcars, nterms = 1)

# Compute approximate Shapley values using 10 Monte Carlo simulations
set.seed(101)  # for reproducibility
shap <- explain(mtcars.ppr, X = subset(mtcars, select = -mpg), nsim = 10, pred_wrapper = predict)
shap

# Shapley-based plots
library(ggplot2)
autoplot(shap)  # Shapley-based importance plot
autoplot(shap, type = "dependence", feature = "wt", X = mtcars)
autoplot(shap, type = "contribution", row_num = 1)  # explain first row of X

[1] https://cran.r-project.org/web/packages/fastshap/fastshap.pdf

Remove tibble dependency

Shapley values are ALWAYS all numeric, so it seems more natural to return a numeric matrix instead, which could always be converted to a data frame or tibble.

X (explain object) is a list

Hi, thanks for the awesome package.

I have one issue, though. My input X is a list, consisting of four matrices that enter the neural net at different points within the neural net. Is there any way of dealing with this?

Thanks in advance.

baseline is 0

@bgreenwell : one question about the baseline fnull:

When I run this code from the master branch, the baseline is 0. Same for ranger models.

library(fastshap)

fit <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)

shap <- fastshap::explain(
  fit, 
  X = iris[c("Sepal.Width", "Petal.Length")],
  nsim = 100, 
  pred_wrapper = predict,
  shap_only = FALSE  # Same with shap_only = TRUE
)

What am I doing wrong? :-)

Can we use fastshap to explain isolation forest in R?

Currently, fastshap is only designed for supervised learning, as per the package description. Can we use this package to explain unsupervised learning algorithms like isolation forest? I see we can already use isotree with the shapper package, and getting similar functionality in fastshap would be amazing.

Subscript `O` is a matrix, the data `X[O]` must have size 1.

Hi Brandon,

Thanks for the great explanation here. I'm using isotree for outlier detection and this has been super useful.

I'm trying to run your example code, but unfortunately I get the following error when I run it verbatim. Has something changed under the hood?

ex <- fastshap::explain(ifo, X = X, newdata = max.x, pred_wrapper = pfun, 
+                          adjust = TRUE, nsim = 1000)
Error:
! Subscript `O` is a matrix, the data `X[O]` must have size 1.
Run `rlang::last_error()` to see where the error occurred.
> rlang::last_error()
<error/tibble_error_subset_matrix_must_be_scalar>
Error:
! Subscript `O` is a matrix, the data `X[O]` must have size 1.
---
Backtrace:
  1. fastshap::explain(...)
  2. fastshap:::explain.default(...)
  3. fastshap:::explain.default(...)
  4. plyr::laply(...)
  5. plyr::llply(...)
  7. base::lapply(pieces, .fun, ...)
  8. fastshap FUN(X[[i]], ...)
  9. base::replicate(...)
 10. base::sapply(...)
 11. base::lapply(X = X, FUN = FUN, ...)
 12. fastshap FUN(X[[i]], ...)
 13. fastshap:::explain_column(...)
 15. tibble:::`[<-.tbl_df`(`*tmp*`, O, value = `<dbl>`)
 16. tibble:::tbl_subassign_matrix(x, j, value, j_arg, substitute(value))
Run `rlang::last_trace()` to see the full context.

The only change I made to the code was removing the reference to random_seed = 2223 when fitting the isolation tree model; this doesn't appear to be an argument of the isolation.forest function: https://cran.r-project.org/web/packages/isotree/isotree.pdf

{lightgbm} v4.0.0 is coming

👋 Hello! I'm James, one of the maintainers of LightGBM.

We were excited to see {fastshap} add support for LightGBM models. Thanks so much for making it easier for some R users to inspect the characteristics of their data via {lightgbm} models!

First, I wanted to let you know that the next release of LightGBM (the entire project, including the R package) will be a major version release with significant breaking changes. See this discussion of v4.0.0: microsoft/LightGBM#5153. We don't have a planned date for that release yet, as we have been struggling from a lack of maintainer attention / activity. But I expect it will be months not weeks from now.

Please open issues at https://github.com/microsoft/LightGBM/issues if there's anything we could do to make {lightgbm} easier to use with {fastshap}.

I'm opening this issue mainly to ask a question. In this period prior to v4.0.0, I'm willing to contribute and maintain the following here:

  • patches that make {fastshap} compatible with the latest release on CRAN (v3.3.2) and the upcoming release (v4.0.0) so that the next release of {lightgbm} isn't disruptive to you
  • CI job(s) testing against the latest development version of LightGBM

Are you open to such contributions?

Thanks for your time and consideration.

Error in if (ncol(res) == 1) { : argument is of length zero

Hello,

I receive the error above after I use this command:

shap_values <- fastshap::explain(model_n, X = trainval, exact = TRUE)

model_n is a multiclass classification model fitted with xgboost.
trainval is my training data.

When I run this set-up for binary classification, the shap values are calculated correctly.
When I run this set-up for multiclass classification, the error above is generated.

Do you have any idea what can be the cause?

Thanks a lot!

Calculating Shapley Values with Data Constraints

I'm trying to calculate Shapley values for a dataset that has constrained regions: the data just doesn't exist or is impossible in the real world. I've tried to illustrate this phenomenon in the example below, where I have two constraints listed. Right now, I have the pred_wrapper return NA for any input that falls inside a constrained region. As a result, the results (labeled shap in the code below) are greatly diminished (there are only 74 returned values as opposed to the 217 we started with). This effect becomes even more pronounced as more iterations (nsim) are used.

I realize this isn't necessarily a bug given the way that Shapley values are calculated, but I suppose I'm either looking for advice on how to deal with constraints or I'm submitting a feature request/enhancement.

library(fastshap)  # for fast (approximate) Shapley values
library(ranger)    # for fast random forest algorithm

trn <- gen_friedman(250, seed = 101)
constraint_1_inds = which(trn$x1 < 0.35 & trn$x5 > 0.85)
constraint_2_inds = which(trn$x3 < 0.25 & trn$x9 > 0.75)
constraint_inds   = unique(c(constraint_1_inds, constraint_2_inds))
trn <- trn[-constraint_inds, ]
dim(trn) # lost 33 data points; now at 217 instead of 250

X <- subset(trn, select = -y)  # feature columns only

set.seed(102)
rfo <- ranger(y ~ ., data =  trn)

pfun <- function(object, newdata) {
  preds = predict(object, data = newdata)$predictions
  constraint_1_inds = which(newdata$x1 < 0.35 & newdata$x5 > 0.85)
  constraint_2_inds = which(newdata$x3 < 0.25 & newdata$x9 > 0.75)
  constraint_inds   = unique(c(constraint_1_inds, constraint_2_inds))
  preds[constraint_inds] = NA
  return(preds)
}

set.seed(5038)
shap <- explain(rfo, X = X, pred_wrapper = pfun, nsim = 10)

shap[complete.cases(shap), ]  # This is only 74 cases
