Comments (15)
Does it take hours to explain a single observation, or all 1000? If it's the latter, that is expected. If it's the former, there is a potential bug...
from lime.
How many features are you using in your explanation? (I've used it with millions of columns because of n-grams, and it works quite rapidly.) If you have long texts and are trying to get many variables, then it's expected that the complexity explodes because of the large number of possibilities to explore.
I'm using between 2000 and 5000 features: a few hundred as one-hot categoricals, and the rest as numeric tf-idfs. The matrix is very sparse.
I'm explaining one observation, one class and 5 features.
Here's the runtime for a sample set with 500 rows trained with an xgboost model.
Debugging shows that the primary time sink is
dist <- c(0, dist(feature_scale(perms, explainer$feature_distribution, explainer$feature_type, explainer$bin_continuous), method = dist_fun)[seq_len(n_permutations-1)])
in R/dataframe.R
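For context, a likely reason that line dominates: `stats::dist()` computes the full pairwise distance matrix, which is O(n²·p) work, even though only the distances from the permutations to the explained case (the first row) are kept. The following is an illustrative sketch of that difference, not lime's actual code:

```r
# Illustrative sketch (not lime's code): dist() does all-pairs work,
# but only the distances to the first row are actually needed.
set.seed(1)
perms <- matrix(rnorm(500 * 50), nrow = 500)  # 500 permutations, 50 features

# All-pairs Euclidean distances, then keep row 1's entries: O(n^2 * p)
d_all <- as.matrix(dist(perms))[1, -1]

# Distances to row 1 directly: O(n * p)
d_row <- sqrt(colSums((t(perms[-1, ]) - perms[1, ])^2))

all.equal(unname(d_all), unname(d_row))  # TRUE
```

With thousands of permutations, the all-pairs computation (and its memory footprint) grows quadratically, which would match the runtimes reported above.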
Thanks,
Johannes
The file R/dataframe.R should not be accessed!
Can you show your source code? It seems to me that you are not providing the data as a character vector.
I'm not accessing dataframe.R directly; I was just pointing out the code step that takes the lion's share of the time. I train my model on preprocessed and tokenized data, so I'm not sure why I would provide the data as a character vector when I have a numeric matrix.
OK, I now understand what is happening.
There are two functions that perform the work, and they are S3 methods, meaning that depending on how you provide the data, you end up using one of them.
It's up to lime to generate the features it will use for the explanation.
You are providing a matrix, so S3 dispatch routes you to the data.frame method.
So you are not using the text explanation. Please follow the vignette; there is a part dedicated to text data.
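The dispatch mechanism described above can be sketched with a toy S3 generic (the function names here are hypothetical, used only to mirror how lime picks an implementation from the class of its input):

```r
# Minimal sketch of S3 dispatch: the class of the first argument
# decides which method runs, just as with lime's input handling.
describe_input <- function(x) UseMethod("describe_input")
describe_input.character  <- function(x) "text path"
describe_input.data.frame <- function(x) "tabular path"
describe_input.default    <- function(x) "tabular path (matrix etc.)"

describe_input("raw document text")        # -> "text path"
describe_input(data.frame(tfidf_1 = 0.2))  # -> "tabular path"
describe_input(matrix(0, 1, 1))            # -> "tabular path (matrix etc.)"
```

So a numeric matrix or data frame of tf-idf columns never reaches the text-specific code path, regardless of where the columns came from.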
Ah, OK, I see what you mean. In my dataset matrix (or data frame), I have a mix of one-hot encoded categoricals and TF-IDF features. I guess this is a pretty common use case. Would this mean that I would have to get separate explanations for the categoricals (using the data.frame method) and the text (using the character method)?
Nope, transform your categories into words (like cat_X, cat_Y); then the whole thing will become a matrix in the end.
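One way to read that suggestion (a hypothetical sketch, not a lime API): append the categorical levels to each document as pseudo-words, so everything can be passed as a single character vector to the text path, where the tokenizer treats `cat1_X` like any other word.

```r
# Hypothetical sketch: fold categorical metadata into the document text
# as pseudo-words so a single character vector can feed the text explainer.
docs  <- c("great product fast shipping", "broke after a week")
cat_1 <- c("X", "Y")

docs_with_cats <- paste(docs, paste0("cat1_", cat_1))
docs_with_cats
# [1] "great product fast shipping cat1_X" "broke after a week cat1_Y"
```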
Dear both,
Here's a reprex that closely mimics my problem. The dataset is a combination of character words transformed to TF-IDFs, plus categorical metadata. I'm keen to hear how this data is best run through lime.
library(tidyverse)
library(caret)
library(xgboost)

no_observations <- 500
no_tf_idf_features <- 1000

# Create simulated response
response <- sample(letters[1:5], no_observations, replace = TRUE)

# Create simulated TF-IDF matrix
tf_idf <- rnorm(no_observations * no_tf_idf_features) %>%
  matrix(ncol = no_tf_idf_features)
colnames(tf_idf) <- paste0("tfidf_", seq_len(no_tf_idf_features))
tf_idf_tbl <- tf_idf %>% as_tibble()

# Create simulated categoricals
cat_tbl <- tibble(
  cat_1 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_2 = sample(letters[1:10], no_observations, replace = TRUE),
  cat_3 = sample(letters[1:10], no_observations, replace = TRUE)
)

# Merge together and one-hot encode
train_data <- bind_cols(cat_tbl, tf_idf_tbl)
train_data_m <- Matrix::sparse.model.matrix(~ . - 1, data = train_data)

# Train using caret
cv_5 <- trainControl(method = "cv",
                     number = 5,
                     allowParallel = FALSE)
single_grid <- expand.grid(
  nrounds = 25,
  max_depth = 5,
  eta = 0.1,
  gamma = 0,
  colsample_bytree = 0.5,
  min_child_weight = 1,
  subsample = 1
)
xgb_no_tune <- caret::train(x = train_data_m,
                            y = response,
                            method = "xgbTree",
                            metric = "Accuracy",
                            trControl = cv_5,
                            tuneGrid = single_grid,
                            nthread = 1)

# Explain using lime
explainer <- lime::lime(train_data, xgb_no_tune, bin_continuous = TRUE)
explanation <- lime::explain(x = train_data[1, ], explainer = explainer,
                             labels = "a", n_features = 5)
Obviously, you would probably explain on a test set, but nonetheless...
Thanks for your time and help.
/J
Unfortunately, lime is really not able to handle this type of mixed model very well at the moment... I'll try to think about this a bit, but it is a hard problem to solve.
I'm having issues trying to run this library in a Spark environment. Curious whether anyone has found a workaround for this, or more generally for running it in a parallel framework?
Hi All,
I have encountered similar slowness issues while running explain on a particular instance; it takes more than a minute per instance. I have 1800 features in my model, which are a combination of BoW and numeric features.
Please let me know if this is expected behaviour.
Thanks.
@platinum736, can I get you to open a new issue... this issue is not concerned with running lime on Spark.
Hi @thomasp85 ,
I am not running lime on Spark; this is on a plain pandas DataFrame.
Then you should post your question on the python repository :-)