Comments (15)

thomasp85 avatar thomasp85 commented on July 22, 2024

Does it take hours to explain a single observation, or all 1000? If it’s the latter, then that is expected. If it’s the former, there is a potential bug...

from lime.

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

How many features are you using in your explanation? (I have used it with millions of columns because of n-grams, and it works quite rapidly.) If you have long texts and are trying to get many variables, then it’s expected that the complexity explodes because of the large number of possibilities to explore.

johanneswaage avatar johanneswaage commented on July 22, 2024

I'm using between 2000 and 5000 features: a few hundred as one-hot categoricals, and the rest as numeric tf-idfs. The matrix is very sparse.
I'm explaining one observation, one class, and 5 features.
Here's the runtime for a sample set with 500 rows trained with an xgboost model.

[screenshot of runtime output]

Debugging shows the primary time sink is
dist <- c(0, dist(feature_scale(perms, explainer$feature_distribution, explainer$feature_type, explainer$bin_continuous), method = dist_fun)[seq_len(n_permutations-1)])
in R/dataframe.R.

Thanks,
Johannes

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

The file R/dataframe.R should not be reached at all!
Can you show your source code? It seems to me that you are not providing the data as a character vector.

johanneswaage avatar johanneswaage commented on July 22, 2024

I’m not accessing dataframe.R directly; I was just pointing out the code step that takes the lion’s share of the time. I train my model on preprocessed and tokenized data, so I’m not sure why I would provide the data as a character vector when I have a numeric matrix?

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

OK, I now understand what is happening.
There are two functions that perform the work, and they are S3 methods, meaning that depending on how you provide the data, one or the other is used.
It’s up to lime to generate the features it will use for the explanation.
You are providing a matrix, so S3 dispatch routes you to the data.frame method.
That means you are not using the text explanation. Please follow the vignette; there is a section dedicated to text data.
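To make the dispatch point concrete, here is a minimal, self-contained sketch (not lime's actual code): an S3 generic picks its method from the class of its first argument, so a data.frame and a character vector take entirely different code paths.

```r
# Minimal illustration of S3 dispatch -- not lime internals.
# lime() behaves analogously: a data.frame/matrix input is routed to the
# tabular method, while a character vector reaches the text method.
describe <- function(x) UseMethod("describe")
describe.data.frame <- function(x) "tabular path"
describe.character  <- function(x) "text path"

describe(data.frame(a = 1))  # "tabular path"
describe("some raw text")    # "text path"
```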

johanneswaage avatar johanneswaage commented on July 22, 2024

Ah, OK, I see what you mean. In my dataset matrix (or data frame), I have a mix of one-hot encoded categoricals and the TF-IDF features. I guess this is a pretty common use case. Would this mean that I would have to get separate explanations for the categoricals (using the data.frame method) and the text (using the character method)?

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

Nope, transform your categories into words (like cat_X, cat_Y); then the whole thing will become one text matrix in the end.
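A minimal sketch of that suggestion (the variable names here are made up for illustration): encode each categorical level as a pseudo-word and append it to the raw text, so each observation becomes a single character string that the text method can tokenize like any other document.

```r
# Hypothetical example data; in practice these would be your documents
# and your categorical metadata columns.
docs  <- c("good product fast shipping", "slow delivery")
cat_1 <- c("x", "y")
cat_2 <- c("b", "b")

# Turn each category level into a pseudo-word and append it to the text,
# so tokenization treats it like any other word.
texts <- paste(docs,
               paste0("cat1_", cat_1),
               paste0("cat2_", cat_2))
texts[1]  # "good product fast shipping cat1_x cat2_b"
```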

johanneswaage avatar johanneswaage commented on July 22, 2024

Dear both,
Here's a reprex that closely mimics my problem. The dataset is a combination of character words transformed to tf-idfs and categorical metadata. I'm keen to hear how this data is best run through lime.

library(tidyverse)
library(caret)
library(xgboost)

no_observations <- 500
no_tf_idf_features <- 1000

# Create simulated response
response <- sample(letters[1:5], no_observations, replace = TRUE)

# Create simulated TF_IDF matrix
tf_idf <- rnorm(no_observations * no_tf_idf_features) %>% matrix(ncol = no_tf_idf_features)
colnames(tf_idf) <- paste0("tfidf_", seq_len(no_tf_idf_features))
tf_idf_tbl <- as_tibble(tf_idf)

# Create simulated categorical
cat_tbl <- tibble(
  cat_1 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_2 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_3 = sample(letters[1:10], no_observations, replace = TRUE)
)

# Merge together and one-hot-encode
train_data <- bind_cols(cat_tbl,
                        tf_idf_tbl)

train_data_m <- Matrix::sparse.model.matrix(~.-1, data = train_data)

# Train using caret
cv_5 <- trainControl(method = "cv", 
                     number = 5, 
                     allowParallel = FALSE)

single_grid <- expand.grid(
  nrounds = 25,
  max_depth = 5,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.5,
  min_child_weight = 1,
  subsample = 1
)

xgb_no_tune <- caret::train(x = train_data_m,
                            y = response,
                            method = "xgbTree",
                            metric = "Accuracy",
                            trControl = cv_5,
                            tuneGrid = single_grid,
                            nthread = 1)

# Explain using lime
explainer <- lime::lime(train_data, xgb_no_tune, bin_continuous = TRUE)
explanation <- lime::explain(x = train_data[1, ], explainer = explainer, labels = "a", n_features = 5)

Obviously, you would probably explain on a test set, but nonetheless...

Thanks for your time and help,
/J

thomasp85 avatar thomasp85 commented on July 22, 2024

Unfortunately, lime is really not able to handle this kind of mixed model type very well at the moment... I'll try to think about this a bit, but it is a hard problem to solve.

jdegange avatar jdegange commented on July 22, 2024

I'm having issues trying to run this library in a Spark environment. Curious if anyone has found a workaround for this, or more generally for running it in a parallel framework?

platinum736 avatar platinum736 commented on July 22, 2024

Hi all,
I have encountered similar slowness when running explain on a particular instance; it takes more than a minute per instance. I have 1800 features in my model, which are a combination of BoW and numeric features.

Please let me know if this is expected behaviour.

Thanks.

thomasp85 avatar thomasp85 commented on July 22, 2024

@platinum736 can I get you to open a new issue? This issue is not concerned with running lime on Spark.

platinum736 avatar platinum736 commented on July 22, 2024

Hi @thomasp85 ,

I am not running lime on Spark; this is on a plain pandas dataframe.

thomasp85 avatar thomasp85 commented on July 22, 2024

Then you should post your question on the Python repository :-)
