koalaverse / vip

Variable Importance Plots (VIPs)

Home Page: https://koalaverse.github.io/vip/

vip: Variable Importance Plots

Overview

vip is an R package for constructing variable importance plots (VIPs). VIPs are part of a larger framework referred to as interpretable machine learning (IML), which includes (but is not limited to) partial dependence plots (PDPs) and individual conditional expectation (ICE) curves. While PDPs and ICE curves (available in the R package pdp) help visualize feature effects, VIPs help visualize feature impact (either locally or globally). An in-progress, but comprehensive, overview of IML can be found here: https://github.com/christophM/interpretable-ml-book.

Many supervised learning algorithms can naturally emit some measure of importance for the features used in the model, and these approaches are embedded in many different packages. The downside, however, is that each package uses a different function and interface, and it can be challenging (and distracting) to have to remember each one (e.g., remembering to use xgb.importance() for xgboost models and summary.gbm() for gbm models). With vip you get one consistent interface for computing variable importance for many types of supervised learning models across a number of packages. Additionally, vip offers a number of model-agnostic procedures for computing feature importance (see the next section) as well as an experimental function for quantifying the strength of potential interaction effects. For details and example usage, visit the vip package website.

Features

Model-based variable importance - Compute variable importance specific to a particular model (like a random forest, gradient boosted decision trees, or multivariate adaptive regression splines) from a wide range of R packages (e.g., randomForest, ranger, xgboost, and many more). Also supports the caret and parsnip (starting with version 0.0.4) packages.
Permutation-based variable importance - An efficient implementation of the permutation feature importance algorithm discussed in this chapter from Christoph Molnar’s Interpretable Machine Learning book.
Shapley-based variable importance - An efficient implementation of feature importance based on the popular Shapley values via the fastshap package.
Variance-based variable importance - Compute variable importance using a simple feature importance ranking measure (FIRM) approach. For details, see Greenwell et al. (2018) and Scholbeck et al. (2019).
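
The variance-based idea can be sketched in a few lines of base R. This is only a rough illustration under simple assumptions, not vip's implementation: it computes a partial dependence curve by brute force for each feature of a linear model fit to mtcars and uses the standard deviation of that curve as the score. The helper names pd_curve() and firm_score() are hypothetical.

```r
# Minimal FIRM-style sketch in base R (illustrative; not vip's own code)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Partial dependence of the prediction on one feature, computed by brute force:
# fix the feature at each grid value for every row and average the predictions
pd_curve <- function(object, data, feature, grid_size = 20) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_size)
  vapply(grid, function(value) {
    tmp <- data
    tmp[[feature]] <- value
    mean(predict(object, newdata = tmp))
  }, numeric(1))
}

# Importance = spread (standard deviation) of the partial dependence curve;
# a flat curve means the feature has little effect on the predictions
firm_score <- function(object, data, feature) {
  sd(pd_curve(object, data, feature))
}

features <- c("wt", "hp", "qsec")
scores <- vapply(features, function(f) firm_score(fit, mtcars, f), numeric(1))
sort(scores, decreasing = TRUE)
```

A flat curve yields a score near zero, so features the model barely uses sink to the bottom of the ranking.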

Installation

# The easiest way to get vip is to install it from CRAN:
install.packages("vip")

# Alternatively, you can install the development version from GitHub:
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
remotes::install_github("koalaverse/vip")

vip's Issues

Add split count method to tree-based models (where possible)

Reference: https://arxiv.org/abs/1802.03888.

Add option to plot PDPs for the top, say m, predictors

Add support for datarobot package

Add sparklyr version dependency

Add a NEWS file

TensorFlow example

Fitting the same model here, we can construct PDPs, ICE curves, and VIPs with minimal code using the pred.fun argument to pdp::partial():

# Centered ICE curves
library(pdp)
pfun <- function(object, newdata) {
  predict_proba(object, x = as.matrix(newdata)) %>%
    as.vector()  # %>% mean()  # uncomment for PDP instead of ICE curves
}
pd <- partial(model_keras, pred.var = "TotalCharges", progress = "text",
              pred.fun = pfun, train = as.data.frame(x_train_tbl))
autoplot(pd, alpha = 0.01, center = TRUE)

# Variable importance plot (VIP)
library(vip)
vip(model_keras, method = "ice", feature_names = names(x_train_tbl), 
    pred.fun = pfun, train = as.data.frame(x_train_tbl), num_features = 35)

Add public TODO list to README

Similar to the TODO list on RBitmoji's README page.

Add support for h2o package

Some support for regression models is there, but it needs to be tested! Also, h2o.varimp() output is model specific!

Error: Permutation-based variable importance scores not yet implemented.

Hi. I am using vip to measure variable importance scores from my super learner model.
And I get the following error message:

install.packages("vip")
library(vip)
set.seed(150) # for reproducibility
sl <- SuperLearner(Y = label, X = train,
                   SL.library = c("SL.randomForest", "SL.glmnet", "SL.svm"),
                   method = "method.NNLS", verbose = TRUE)
words <- colnames(train)
p1 <- vi(sl, method = "perm", obs = label, feature_names = words)
Error: Permutation-based variable importance scores not yet implemented.

What does this error message mean?
Also, does vip support super learner models?
I would appreciate your help. Thanks!

Clean up documentation

Change `top_n` argument to `num_features`

Add function to measure interaction effects

Added vint function.

Throw warning that metric should match the same one used to tune/train the model

Thanks to @mpane21 for the suggestion!

Better checks for helper functions

For example, vi() should always check for is.null(feature_names), etc.

Use a different statistic for factors

So far, the max difference seems the most reasonable; diff(range(yhat)).

Add method for rpart (and other tree-based) objects

Should have methods for:
rpart
party
partykit

Add sign column for `glm`-like models

Columns should be, for example, "Variable", "Importance", and "Sign/direction".
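
A rough sketch of what such a table could look like, using base R only. Taking the absolute value of the test statistic as the importance and labelling the direction "POS"/"NEG" are assumptions made for illustration; the issue itself only names the three columns.

```r
# Sketch of a "Variable / Importance / Sign" table for a glm-like model
fit <- glm(mpg ~ wt + qsec, data = mtcars, family = gaussian)

# Coefficient matrix: estimate, standard error, test statistic, p-value
coefs <- summary(fit)$coefficients
coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]

vi_tbl <- data.frame(
  Variable   = rownames(coefs),
  Importance = abs(coefs[, 3]),  # absolute value of the t- (or z-) statistic
  Sign       = ifelse(coefs[, "Estimate"] > 0, "POS", "NEG"),
  row.names  = NULL,
  stringsAsFactors = FALSE
)

# Most important variables first
vi_tbl[order(vi_tbl$Importance, decreasing = TRUE), ]
```

Keeping the sign in its own column lets the importance stay non-negative for plotting while the direction can still be mapped to, say, bar color.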

Add CRAN badge to README

Allow user-defined functions for metric

Add nsim argument to vi_permute()

The nsim argument (which will default to 1) enables multiple replications of shuffling each feature to reduce Monte Carlo error and help stabilize results from run to run (and maybe even produce standard deviations?).

imp <- vi_permute(fit, nsim = 10, ...)
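
To make the idea concrete, here is a base R sketch of what nsim replications could do. The function permute_importance() and its arguments are hypothetical stand-ins, not the vi_permute() implementation, and RMSE is assumed as the metric.

```r
# Sketch of permutation importance with nsim replications (base R only)
set.seed(101)  # for reproducibility
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

permute_importance <- function(object, data, target, features, nsim = 1) {
  baseline <- rmse(data[[target]], predict(object, newdata = data))
  # For each feature, shuffle it nsim times and record the increase in RMSE
  increases <- vapply(features, function(f) {
    replicate(nsim, {
      shuffled <- data
      shuffled[[f]] <- sample(shuffled[[f]])  # break the feature-target link
      rmse(data[[target]], predict(object, newdata = shuffled)) - baseline
    })
  }, numeric(nsim))
  # One row per replication, one column per feature (also when nsim = 1)
  increases <- matrix(increases, nrow = nsim, dimnames = list(NULL, features))
  data.frame(
    Variable   = features,
    Importance = colMeans(increases),      # mean increase in RMSE
    StDev      = apply(increases, 2, sd),  # NA when nsim = 1
    row.names  = NULL
  )
}

imp <- permute_importance(fit, mtcars, "mpg", c("wt", "hp", "qsec"), nsim = 10)
imp[order(imp$Importance, decreasing = TRUE), ]
```

Averaging over replications smooths out the randomness of any single shuffle, and the per-feature standard deviation falls out of the same matrix at no extra cost.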

Vignette on using `vip` with unsupported models

Add options for point size and color for when `bar = FALSE`

Add random forest support

Permutation-based VI

Maybe something like...

vip(model, type = "permute", train = trn, metric = "rmse")

New function: feature_rank() (or some other name)

Extracts the relative rank of a particular feature. For example,

feature_rank(rfo, feature_names = c("rm", "lstat"))
# [1] 1 2
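
A possible shape for such a function, as a sketch only. To stay self-contained it takes a named vector of importance scores rather than a fitted model, which is a simplification of the call above, and the scores themselves are made up.

```r
# Hypothetical feature_rank(): rank of selected features by importance
feature_rank <- function(importance, feature_names) {
  # Rank 1 = most important; `importance` is a named numeric vector
  ranks <- rank(-importance, ties.method = "min")
  unname(ranks[feature_names])
}

# Made-up importance scores, for illustration only
imp <- c(rm = 40.2, lstat = 35.7, dis = 9.1, crim = 6.4)
feature_rank(imp, feature_names = c("rm", "lstat"))
# [1] 1 2
```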

Local variable importance measures?

Are we interested in incorporating our own measures for local variable importance (e.g., Shapley, LIME)?

Add sampling option to vi_permute()

E.g.,

vi(fit, method = "permute", metric = "rmse", sample_size = 50)

vint function is broken

Unnecessary check on feature_names argument;
Passes feature_names to pdp::partial() using the wrong argument name.

Best way to handle numeric and categorical features

Argument FUN could be a list which, for example, defaults to

FUN = list(
  "con" = stats::sd,  # continuous
  "cat" = function(x) diff(range(x)) / 4  # categorical
)

However, the denominator in "cat" should depend on the number of categories, say K.
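
As a sketch of how such a list might be applied, the example below computes partial dependence values by hand for one continuous and one categorical feature and picks the statistic by feature type. The helper avg_pred() is hypothetical, and the fixed denominator of 4 is kept because the rule for adapting it to K is still open.

```r
# Sketch: choose the spread statistic by feature type (base R only)
FUN <- list(
  "con" = stats::sd,                      # continuous
  "cat" = function(x) diff(range(x)) / 4  # categorical: range / 4
)

mt <- transform(mtcars, cyl = factor(cyl))
fit <- lm(mpg ~ wt + cyl, data = mt)

# Average prediction with `feature` fixed at a single value for every row
avg_pred <- function(value, feature) {
  tmp <- mt
  tmp[[feature]] <- if (is.factor(mt[[feature]])) {
    factor(rep(value, nrow(mt)), levels = levels(mt[[feature]]))
  } else {
    rep(value, nrow(mt))
  }
  mean(predict(fit, newdata = tmp))
}

# Partial dependence values over a grid (continuous) or the levels (categorical)
wt_grid  <- seq(min(mt$wt), max(mt$wt), length.out = 20)
yhat_wt  <- vapply(wt_grid, avg_pred, numeric(1), feature = "wt")
yhat_cyl <- vapply(levels(mt$cyl), avg_pred, numeric(1), feature = "cyl")

c(wt = FUN$con(yhat_wt), cyl = FUN$cat(yhat_cyl))
```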

Add Garson and Olden methods for neural networks

This would support, for starters, vi_model.nnet(). keras would be nice too!
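
For reference, both methods can be sketched for a network with a single hidden layer and one output. The weights below are made up and bias terms are ignored; extracting the weights from an nnet or keras fit is not shown, and the code is an illustration rather than a proposed vi_model.nnet().

```r
# Sketch of Olden's and Garson's methods for one hidden layer (made-up weights)
w_in <- matrix(c( 0.8, -0.5,
                 -0.3,  0.9,
                  0.1,  0.2),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("x1", "x2", "x3"), c("h1", "h2")))  # input -> hidden
w_out <- c(h1 = 1.5, h2 = -0.7)                                      # hidden -> output

# Olden: sum over hidden units of (input-hidden weight * hidden-output weight);
# the result keeps the sign of each input's overall contribution
olden <- drop(w_in %*% w_out)

# Garson: absolute contributions, normalized within each hidden unit,
# then summed per input and rescaled so the scores sum to one
contrib <- abs(sweep(w_in, 2, w_out, `*`))
share   <- sweep(contrib, 2, colSums(contrib), `/`)
garson  <- rowSums(share) / sum(rowSums(share))

olden
garson
```

Olden's scores are signed, which would pair naturally with a sign/direction column, whereas Garson's are relative magnitudes only.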

Add tests

Should not be hard to reach >90% test coverage.

Introductory vignette

Add type argument back to vi_model for xgboost

Case-specific variable importance scores

Using method = "ice" it is possible to compute VI scores for each individual case. For example,

vi(fit, method = "ice", case = TRUE)

What's the best way to plot this?
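
To give a feel for what case-specific scores might contain, here is a base R sketch that builds ICE curves by hand and takes the spread of each case's own curve. The helper ice_curves() is hypothetical, and averaging the per-case scores is only one possible global summary.

```r
# Sketch: case-specific importance from ICE curves (base R; not vip's code)
fit <- lm(mpg ~ wt * hp, data = mtcars)  # the interaction makes curves differ by case

# ICE curves for one feature: one row per case, one column per grid value
ice_curves <- function(object, data, feature, grid_size = 20) {
  grid <- seq(min(data[[feature]]), max(data[[feature]]), length.out = grid_size)
  vapply(grid, function(value) {
    tmp <- data
    tmp[[feature]] <- value
    unname(predict(object, newdata = tmp))
  }, numeric(nrow(data)))
}

ice_wt <- ice_curves(fit, mtcars, "wt")

# One score per case: the standard deviation of that case's ICE curve
case_scores <- apply(ice_wt, 1, sd)

# One possible global summary: the average of the per-case scores
global_score <- mean(case_scores)
```

Since the result is one number per case and feature, a histogram or boxplot per feature is one candidate for plotting it.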

Additional types of plots?

For example, bar = FALSE might give a dotchart instead.

New options

Add options to control bar width (when bar = TRUE) and orientation (e.g., horizontal = TRUE).

`get_feature_names()` returns `NULL` for ranger objects whenever `importance = "none"`

Should check object$forest$independent.variable.names first since write.forest = TRUE by default in ranger. Otherwise, should throw an error.
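
The proposed behavior might look roughly like the sketch below. To avoid depending on ranger, it uses a plain list that mimics only the forest$independent.variable.names field; the function name and error message are hypothetical.

```r
# Sketch of the proposed lookup, using a plain list in place of a ranger fit
get_ranger_feature_names <- function(object) {
  feature_names <- object$forest$independent.variable.names
  if (is.null(feature_names)) {
    # Fail loudly instead of silently returning NULL
    stop("Unable to recover feature names; refit with `write.forest = TRUE` ",
         "or supply `feature_names` directly.", call. = FALSE)
  }
  feature_names
}

# Mock object with a stored forest
mock_fit <- list(forest = list(independent.variable.names = c("wt", "hp", "qsec")))
get_ranger_feature_names(mock_fit)

# Mock object without a stored forest: the error message is captured here
res <- tryCatch(get_ranger_feature_names(list(forest = NULL)),
                error = function(e) conditionMessage(e))
```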

Get added to https://github.com/jphall663/awesome-machine-learning-interpretability

XGBoost and Rborist support

Add model method for PLS

See, for example, the code here: http://mevik.net/work/software/VIP.R

Improve range-based estimate for categorical variables

Update the range-based statistic (currently, range/4) to adapt to the number of categories K.

Support for earth models

update dependencies?

What are your thoughts on using ggplot2 >= 3.0.0, moving dplyr >= 0.7.0 to Imports, and getting rid of plyr (no commits in over two years) and magrittr/tibble (dplyr exports the pipe and tibble()) from Imports? I could do a pull request if you're interested.

Start thinking package restructure

Helper functions:

vi_model()  # a generic with various methods: vi_model.randomForest()
vi_partial()
vi_permute()
vi_shapley()  # <- this would def be nice!
...

which would then be called by vi() as in

vi(fit, type = "model", ...)

Note: will have to figure out how to deal with naming conflicts! For example, if fit is a "randomForest" object, then issues would arise with

vi(fit, type = "model", type = 2)

One workaround is to prefix argument names with a ., as in plyr:

vi(fit, .type = "model", .feature_names = c("x1", "x2"),
   # Arguments to be passed on through ...
   type = 2)
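
The conflict and the workaround can be reproduced with a few toy functions. All names below (vi_model_sketch(), vi_clash(), vi_dot()) are hypothetical stand-ins and not part of vip.

```r
# Sketch of the naming conflict and the dot-prefix workaround

# Stand-in for a model-specific method that has its own `type` argument
vi_model_sketch <- function(object, type = 1) {
  paste("model-based importance, type =", type)
}

# Conflict: `type` belongs to vi_clash(), so a second `type` cannot be supplied
vi_clash <- function(object, type = "model", ...) vi_model_sketch(object, ...)
clash <- tryCatch(vi_clash(NULL, type = "model", type = 2),
                  error = function(e) conditionMessage(e))

# Workaround: dot-prefixed arguments leave `type` free to pass through `...`
vi_dot <- function(object, .type = "model", ...) {
  switch(.type,
         model = vi_model_sketch(object, ...),
         stop("Unknown .type: ", .type, call. = FALSE))
}
vi_dot(NULL, .type = "model", type = 2)
```

Because "type" is not a prefix of ".type", R's partial argument matching does not capture it, so it reaches the method through the dots untouched.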