Comments (15)

thomasp85 avatar thomasp85 commented on July 22, 2024

Does it take hours to explain a single observation, or all 1000? If it’s the latter, then that is expected. If it’s the former, there is a potential bug...

from lime.

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

How many features are you using in your explanation? (I have used it with millions of columns because of n-grams, and it works quite rapidly.) If you have long texts and are trying to get many variables, then it’s expected that the complexity explodes because of the large number of possibilities to explore.

johanneswaage avatar johanneswaage commented on July 22, 2024

I'm using between 2000 and 5000 features: a few hundred as one-hot categoricals, and the rest as numeric tf-idfs. The matrix is very sparse.
I'm explaining one observation, one class, and 5 features.
Here's the runtime for a sample set with 500 rows trained with an xgboost model.

[screenshot of runtime output]

Debugging shows the primary time sink is
dist <- c(0, dist(feature_scale(perms, explainer$feature_distribution, explainer$feature_type, explainer$bin_continuous), method = dist_fun)[seq_len(n_permutations-1)])
in R/dataframe.R.

Thanks,
Johannes

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

The file R/dataframe.R should not be reached at all!
Can you show your source code? It seems to me that you are not providing the data as a character vector.

johanneswaage avatar johanneswaage commented on July 22, 2024

I’m not accessing dataframe.R directly; I was just pointing out the code step that takes the lion’s share of the time. I train my model on preprocessed and tokenized data, so I’m not sure why I would provide the data as a character vector when I have a numeric matrix?

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

OK, I now understand what is happening.
There are two functions that perform the work, and they are S3 methods, meaning that depending on how you provide the data, one or the other is used.
It’s up to lime to generate the features it will use for the explanation.
You are providing a matrix, so S3 dispatch routes you to the data.frame method.
That means you are not using the text explanation. Please follow the vignette; there is a section dedicated to text data.
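To make the dispatch point concrete, here is a minimal, self-contained sketch (not lime's actual code): an S3 generic picks its method from the class of its first argument, so a data.frame and a character vector take entirely different code paths.

```r
# Minimal illustration of S3 dispatch -- not lime internals.
# lime() behaves analogously: a data.frame/matrix input is routed to the
# tabular method, while a character vector reaches the text method.
describe <- function(x) UseMethod("describe")
describe.data.frame <- function(x) "tabular path"
describe.character  <- function(x) "text path"

describe(data.frame(a = 1))  # "tabular path"
describe("some raw text")    # "text path"
```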

johanneswaage avatar johanneswaage commented on July 22, 2024

Ah, OK, I see what you mean. In my dataset matrix (or data frame), I have a mix of one-hot encoded categoricals and the TF-IDF features. I guess this is a pretty common use case. Would this mean that I would have to get separate explanations for the categoricals (using the data.frame method) and the text (using the character method)?

pommedeterresautee avatar pommedeterresautee commented on July 22, 2024

Nope, transform your categories into words (like cat_X, cat_Y); then the whole thing will become one text matrix in the end.
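A minimal sketch of that suggestion (the variable names here are made up for illustration): encode each categorical level as a pseudo-word and append it to the raw text, so each observation becomes a single character string that the text method can tokenize like any other document.

```r
# Hypothetical example data; in practice these would be your documents
# and your categorical metadata columns.
docs  <- c("good product fast shipping", "slow delivery")
cat_1 <- c("x", "y")
cat_2 <- c("b", "b")

# Turn each category level into a pseudo-word and append it to the text,
# so tokenization treats it like any other word.
texts <- paste(docs,
               paste0("cat1_", cat_1),
               paste0("cat2_", cat_2))
texts[1]  # "good product fast shipping cat1_x cat2_b"
```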

johanneswaage avatar johanneswaage commented on July 22, 2024

Dear both,
Here's a reprex that closely mimics my problem. The dataset is a combination of character words transformed to tf-idfs and categorical metadata. I'm keen to hear how this data is best run through lime.

library(tidyverse)
library(caret)
library(xgboost)

no_observations <- 500
no_tf_idf_features <- 1000

# Create simulated response
response <- sample(letters[1:5], no_observations, replace = TRUE)

# Create simulated TF_IDF matrix
tf_idf <- rnorm(no_observations * no_tf_idf_features) %>% matrix(ncol = no_tf_idf_features)
colnames(tf_idf) <- paste0("tfidf_", seq_len(no_tf_idf_features))
tf_idf_tbl <- as_tibble(tf_idf)

# Create simulated categorical
cat_tbl <- tibble(
  cat_1 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_2 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_3 = sample(letters[1:10], no_observations, replace = TRUE)
)

# Merge together and one-hot-encode
train_data <- bind_cols(cat_tbl,
                        tf_idf_tbl)

train_data_m <- Matrix::sparse.model.matrix(~.-1, data = train_data)

# Train using caret
cv_5 <- trainControl(method = "cv", 
                     number = 5, 
                     allowParallel = FALSE)

single_grid <- expand.grid(
  nrounds = 25,
  max_depth = 5,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.5,
  min_child_weight = 1,
  subsample = 1
)

xgb_no_tune <- caret::train(x = train_data_m,
                            y = response,
                            method = "xgbTree",
                            metric = "Accuracy",
                            trControl = cv_5,
                            tuneGrid = single_grid,
                            nthread = 1)

# Explain using lime
explainer <- lime::lime(train_data, xgb_no_tune, bin_continuous = TRUE)
explanation <- lime::explain(x = train_data[1, ], explainer = explainer, labels = "a", n_features = 5)

Obviously, you would probably explain on a test set, but nonetheless...

Thanks for your time and help,
/J

thomasp85 avatar thomasp85 commented on July 22, 2024

Unfortunately, lime is really not able to handle this kind of mixed model type very well at the moment... I'll try to think about this a bit, but it is a hard problem to solve.

jdegange avatar jdegange commented on July 22, 2024

I'm having issues trying to run this library in a Spark environment. Curious if anyone has found a workaround for this, or more generally for running it in a parallel framework?

platinum736 avatar platinum736 commented on July 22, 2024

Hi all,
I have encountered similar slowness when running explain on a particular instance; it takes more than a minute per instance. I have 1800 features in my model, which are a combination of BoW and numeric features.

Please let me know if this is expected behaviour.

Thanks.

thomasp85 avatar thomasp85 commented on July 22, 2024

@platinum736 can I get you to open a new issue? This issue is not concerned with running lime on Spark.

platinum736 avatar platinum736 commented on July 22, 2024

Hi @thomasp85 ,

I am not running lime on Spark; this is on a plain pandas dataframe.

thomasp85 avatar thomasp85 commented on July 22, 2024

Then you should post your question on the Python repository :-)
