predict-idlab / powershap

A power-full Shapley feature selection method.
License: Other
Thanks for the great library!
We want to upgrade our workflow to Python 3.11, but it looks like this library doesn't support it. I was able to install it with pip install --ignore-requires-python powershap==0.0.9, but I get the following error when trying to import powershap:
>>> from powershap import PowerShap
/Users/alex/miniforge3/envs/datachat_py3.11_arm64/lib/python3.11/site-packages/shap/utils/_clustering.py:34: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@jit
[... the same NumbaDeprecationWarning repeats for the remaining @jit / @numba.jit decorators in shap/utils/_clustering.py (lines 53, 62, 68, 76), shap/links.py (lines 4, 9, 14, 19), shap/utils/_masked_model.py (lines 361, 383), and shap/maskers/_tabular.py (lines 184, 195) ...]
IPython could not be loaded!
Side note, you may want to change the file naming scheme in PyPI to reflect this. The name of the newest release powershap-0.0.9-py3-none-any.whl
is misleading if this can't be installed on any Python 3 distribution.
For ref, here are my environment details:
python:3.11-slim (tried this as well and got the same error)

Hi,
With version 0.0.7 I am receiving the following error when trying to pickle (i.e. via joblib):
_pickle.PicklingError: Can't pickle <function PowerShap.__init__.<locals>._infinite_splitter.<locals>.split at 0x0000020167F551F0>: it's not found as powershap.powershap.PowerShap.__init__.<locals>._infinite_splitter.<locals>.split
Powershap works great; it's just saving the fit for future use with pickle that is the problem. I have to use pickle because it's part of joblib, which is caching my functions using "memory".
Here's a reproducible example:
self.selector = PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True), cv=cv, verbose=True)
self.selector.fit(X, y)
import joblib
joblib.dump(self, "self.job")
Thanks in advance for any help.
*** Update: this is not a major issue. I worked around the problem by storing the feature names instead of the fitted selector, and I'm happy to close the issue if this is not a priority.
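For reference, a minimal sketch of that workaround, assuming PowerShap exposes the usual scikit-learn selector API (get_support) and that X is the training DataFrame:

import joblib

# Persist only the selected feature names, not the fitted PowerShap object
# (whose internal closure can't be pickled).
selected_features = list(X.columns[selector.get_support()])
joblib.dump(selected_features, "selected_features.joblib")

# Later: reload the names and subset new data.
selected_features = joblib.load("selected_features.joblib")
X_new_selected = X_new[selected_features]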
Hey all, I know Boruta is an "all relevant feature selection" method vs., e.g., mRMR, which aims to find a minimal optimal set of features. This is described here.
I'm just wondering whether powershap is an all-relevant feature selection method or a minimal-optimal feature selection method? Thanks!
Hi! Thank you for creating this library, it's super useful!
We're using it with pandas 2.0, without any errors so far. Would it be possible to relax the version constraints, to enable usage with the new major version?
Thank you!
Hi!
Appreciate the work on this project.
scikit-learn announced a gradual brownout of the sklearn dependency on PyPI. Even though the main branch on this repository tags scikit-learn correctly, the latest release on PyPI (rc2) still lists it as sklearn.
We had a discussion about this on this PR, but I believe a release was tagged before the sklearn -> scikit-learn change was made.
Can we expect a new release to be tagged and pushed to PyPI before 12/01?
Thanks!
Thought it may be helpful/convenient to have a link to the license in the readme.
I'm facing a weird error when using a LightGBM model as the underlying model with the selector.
I could find a simple repro using the titanic dataset:
X - bug_features.csv
y - bug_label.csv
Categorical features and Data types info
cats --> ['Cabin', 'Embarked', 'Gender', 'Name', 'Parch', 'Pclass', 'SibSp', 'Ticket'],
X.dtypes -->
Age float64
Cabin category
Embarked category
Fare float64
Gender category
Name category
Parch category
PassengerId int64
Pclass category
SibSp category
Ticket category
Code snippet used to fit the selector:
from lightgbm import LGBMClassifier
from powershap import PowerShap

lgb_clf = LGBMClassifier(random_state=42, n_estimators=10)

sel = PowerShap(model=lgb_clf, stratify=True, fit_kwargs={'categorical_feature': cats})

selector = PowerShap(
    model=lgb_clf,
    stratify=True,
    verbose=False,
    automatic=True,
)
selector.fit(X, y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [19], in <cell line: 12>()
5 _stratify = True
6 selector = PowerShap(
7 model=lgb_clf,
8 stratify=_stratify,
9 verbose=False,
10 automatic=True
11 )
---> 12 selector.fit(X, y, categorical_feature=cats)
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/powershap.py:393, in PowerShap.fit(self, X, y, stratify, groups, **kwargs)
387 loop_its = 10
388 self._print(
389 "Automatic mode enabled: Finding the minimal required powershap",
390 f"iterations for significance of {self.power_alpha}.",
391 )
--> 393 shaps_df = self._explainer.explain(
394 X=X,
395 y=y,
396 loop_its=loop_its,
397 val_size=self.val_size,
398 stratify=stratify,
399 groups=groups,
400 cv_split=self.cv, # pass the wrapped cv split function
401 show_progress=self.show_progress,
402 **kwargs,
403 )
405 processed_shaps_df = powerSHAP_statistical_analysis(
406 shaps_df,
407 self.power_alpha,
408 self.power_req_iterations,
409 include_all=self.include_all,
410 )
412 if self.automatic:
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/shap_wrappers/shap_explainer.py:175, in ShapExplainer.explain(self, X, y, loop_its, val_size, stratify, groups, cv_split, random_seed_start, show_progress, **kwargs)
172 Y_train = y[np.sort(train_idx)]
173 Y_val = y[np.sort(val_idx)]
--> 175 Shap_values = self._fit_get_shap(
176 X_train=X_train,
177 Y_train=Y_train,
178 X_val=X_val,
179 Y_val=Y_val,
180 random_seed=i + random_seed_start,
181 **kwargs,
182 )
184 Shap_values = np.abs(Shap_values)
186 if len(np.shape(Shap_values)) > 2:
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/shap_wrappers/shap_explainer.py:251, in LGBMExplainer._fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs)
248 from copy import copy
250 PowerShap_model = copy(self.model).set_params(random_seed=random_seed)
--> 251 PowerShap_model.fit(X_train, Y_train, eval_set=(X_val, Y_val))
252 # Calculate the shap values
253 C_explainer = shap.TreeExplainer(PowerShap_model)
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/sklearn.py:967, in LGBMClassifier.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
964 else:
965 valid_sets[i] = (valid_x, self._le.transform(valid_y))
--> 967 super().fit(X, _y, sample_weight=sample_weight, init_score=init_score, eval_set=valid_sets,
968 eval_names=eval_names, eval_sample_weight=eval_sample_weight,
969 eval_class_weight=eval_class_weight, eval_init_score=eval_init_score,
970 eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds,
971 verbose=verbose, feature_name=feature_name, categorical_feature=categorical_feature,
972 callbacks=callbacks, init_model=init_model)
973 return self
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/sklearn.py:748, in LGBMModel.fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
745 evals_result = {}
746 callbacks.append(record_evaluation(evals_result))
--> 748 self._Booster = train(
749 params=params,
750 train_set=train_set,
751 num_boost_round=self.n_estimators,
752 valid_sets=valid_sets,
753 valid_names=eval_names,
754 fobj=self._fobj,
755 feval=eval_metrics_callable,
756 init_model=init_model,
757 feature_name=feature_name,
758 callbacks=callbacks
759 )
761 if evals_result:
762 self._evals_result = evals_result
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/engine.py:271, in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
269 # construct booster
270 try:
--> 271 booster = Booster(params=params, train_set=train_set)
272 if is_valid_contain_train:
273 booster.set_train_data_name(train_data_name)
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:2605, in Booster.__init__(self, params, train_set, model_file, model_str, silent)
2598 self.set_network(
2599 machines=machines,
2600 local_listen_port=params["local_listen_port"],
2601 listen_time_out=params.get("time_out", 120),
2602 num_machines=params["num_machines"]
2603 )
2604 # construct booster object
-> 2605 train_set.construct()
2606 # copy the parameters from train_set
2607 params.update(train_set.get_params())
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:1815, in Dataset.construct(self)
1812 self._set_init_score_by_predictor(self._predictor, self.data, used_indices)
1813 else:
1814 # create train
-> 1815 self._lazy_init(self.data, label=self.label,
1816 weight=self.weight, group=self.group,
1817 init_score=self.init_score, predictor=self._predictor,
1818 silent=self.silent, feature_name=self.feature_name,
1819 categorical_feature=self.categorical_feature, params=self.params)
1820 if self.free_raw_data:
1821 self.data = None
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:1474, in Dataset._lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
1472 self.pandas_categorical = reference.pandas_categorical
1473 categorical_feature = reference.categorical_feature
-> 1474 data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data,
1475 feature_name,
1476 categorical_feature,
1477 self.pandas_categorical)
1478 label = _label_from_pandas(label)
1480 # process for args
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:594, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
592 if bad_indices:
593 bad_index_cols_str = ', '.join(data.columns[bad_indices])
--> 594 raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
595 "Did not expect the data types in the following fields: "
596 f"{bad_index_cols_str}")
597 data = data.values
598 if data.dtype != np.float32 and data.dtype != np.float64:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Let me know if I am using the API incorrectly or missing an argument. I tried passing the categorical features list into the fit call as a kwarg, but that didn't help either.
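A possible workaround on my side (an assumption, not an official fix) would be to encode the category columns as integer codes before fitting, so the data survives conversion to a plain numeric array:

# Encode the pandas `category` columns as integer codes; -1 marks missing categories.
X_enc = X.copy()
for col in cats:
    X_enc[col] = X_enc[col].cat.codes

selector.fit(X_enc, y)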
Library details:
Name: powershap
Version: 0.0.8rc2
Name: lightgbm
Version: 3.3.2
Name: pandas
Version: 1.4.2
Name: scikit-learn
Version: 1.1.1
Name: numpy
Version: 1.21.6
https://engineering.linkedin.com/blog/2022/fasttreeshap--accelerating-shap-value-computation-for-trees
It looks like it's a drop-in replacement, which we'll start testing next week!
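For context, the swap being tested is roughly the following (a sketch; assumes the fasttreeshap package, which mirrors shap.TreeExplainer's interface, and a fitted tree model):

import fasttreeshap

# fasttreeshap.TreeExplainer is advertised as a drop-in for shap.TreeExplainer
explainer = fasttreeshap.TreeExplainer(fitted_model)  # fitted_model, X_val are placeholders
shap_values = explainer.shap_values(X_val)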
Wondering if there's a timeline for a release that supports numpy>=1.21? The main motivation is that installing powershap in a Python environment becomes troublesome, especially when other libraries in the environment set hard requirements on numpy>1.21.

Hi,
Thank you for this great library!
My data is time series (ordinal) and would like to provide my own cv iterator such as:
cv = GapKFold(n_splits=3, gap_before=self.gap_before, gap_after=self.gap_after)
I would like to do the following:
selector = PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True), verbose=True)
selector.fit(X, y, cv=cv)
It appears from a cursory look at the code that the data is split using train_test_split, which shuffles the data. I was hoping to be able to either provide a custom cv or set the shuffle=False parameter.
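For clarity, what I'd like to write is roughly the following (a sketch, assuming a powershap version whose constructor accepts a cv splitter, as in the CatBoost snippet earlier on this page, and that GapKFold follows the scikit-learn splitter interface):

from catboost import CatBoostClassifier
from powershap import PowerShap
from tscv import GapKFold  # assumed source of the GapKFold splitter

# Time-ordered splits with gaps, instead of a shuffled train_test_split
cv = GapKFold(n_splits=3, gap_before=2, gap_after=2)

selector = PowerShap(
    model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True),
    cv=cv,
    verbose=True,
)
selector.fit(X, y)  # X, y: the time-series features and labels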
Thanks for your consideration
I would like to perform a group-based train-test split (to avoid data leakage in the powershap feature selection).
To do so, I would pass a groups argument to fit and use GroupShuffleSplit in the powershap explain method.
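A standalone sketch of the kind of split I mean (scikit-learn only; groups is an illustrative array with one group id per sample):

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# Train and validation indices never share a group, preventing leakage
train_idx, val_idx = next(splitter.split(X, y, groups=groups))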
Is there any reason pandas needs to be <1.4?
I'm currently using pandas 1.5.3, should I expect any issues with powershap? If so, where?
PS: I know about virtualenvs, etc, etc. It's just that I don't quite understand how pandas >1.4 would affect the package all that much.
Here is my code to get XGBoost working (from a notebook):
from powershap import PowerShap
import powershap.shap_wrappers.shap_explainer as se
import powershap.shap_wrappers.shap_explainer_factory as sef

import numpy as np  # needed for the np.array return annotation
import shap
import xgboost as xgb
from typing import Callable


class XGBoostExplainer(se.ShapExplainer):
    @staticmethod
    def supports_model(model) -> bool:
        supported_models = [xgb.XGBRegressor, xgb.XGBClassifier]
        return isinstance(model, tuple(supported_models))

    def _validate_data(self, validate_data: Callable, X, y, **kwargs):
        kwargs["force_all_finite"] = False  # xgboost allows NaNs and infs in X
        kwargs["dtype"] = None  # allow non-numeric data
        return super()._validate_data(validate_data, X, y, **kwargs)

    def _fit_get_shap(
        self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs
    ) -> np.array:
        # Fit the model
        params = self.model.get_params()
        # PowerShap_model = self.model.copy().set_params(random_seed=random_seed)
        PowerShap_model = self.model.__class__(**{**params, 'random_seed': random_seed})
        PowerShap_model.fit(X_train, Y_train)  # , eval_set=(X_val, Y_val))
        # Calculate the shap values
        C_explainer = shap.TreeExplainer(PowerShap_model)
        return C_explainer.shap_values(X_val)

    def _get_more_tags(self):
        return {"allow_nan": True}


# Register the custom explainer so the factory can find it for XGBoost models
sef.ShapExplainerFactory._explainer_models.append(XGBoostExplainer)
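With the explainer registered, a rough usage sketch (X and y here are placeholder training data, not from the snippet above):

# Hypothetical usage check: PowerShap should now route XGBoost models to
# XGBoostExplainer via the factory lookup above.
selector = PowerShap(model=xgb.XGBClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)  # X, y: placeholder feature matrix / target
X_selected = selector.transform(X)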
Dear PowerShap authors,
I am Antonio, Machine Learning Scientist @ Booking.com.
We found this package quite nice and useful. I have a question about the code:
In this line the **kwargs are not passed to the fit method of LGBMModel. Since they are not passed, it's not possible to set extra parameters for the fit method.
This means that the **kwargs and **fit_kwargs specified in the PowerShap class initialization are not used.
Could you please check and maybe fix this in a new release?
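For illustration, the kind of change I mean is sketched below, mirroring the LGBMExplainer._fit_get_shap frame shown in the traceback earlier on this page (a sketch, not the actual library code):

def _fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs):
    from copy import copy

    import shap

    PowerShap_model = copy(self.model).set_params(random_seed=random_seed)
    # Forward **kwargs so fit-time options (e.g. categorical_feature) reach LightGBM
    PowerShap_model.fit(X_train, Y_train, eval_set=(X_val, Y_val), **kwargs)
    # Calculate the shap values
    C_explainer = shap.TreeExplainer(PowerShap_model)
    return C_explainer.shap_values(X_val)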
Cheers
Antonio
While running selector.fit(), I get the following warning:
Failed to converge on a solution.
What does this indicate and what can be done to remedy this situation?
Is there any way to use powershap with CatBoostRegressor, training from a file (TSV)?
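The only indirect route I can think of (an assumption on my side, not a documented powershap feature) is to load the TSV into pandas first and then fit as usual:

import pandas as pd
from catboost import CatBoostRegressor
from powershap import PowerShap

# "train.tsv" and the "target" column name are illustrative placeholders
df = pd.read_csv("train.tsv", sep="\t")
X, y = df.drop(columns=["target"]), df["target"]

selector = PowerShap(model=CatBoostRegressor(n_estimators=250, verbose=0))
selector.fit(X, y)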
Is there currently a way to pass a sklearn.pipeline.Pipeline object as the model parameter? I can't seem to do it, and I think being able to would be better for the internal cross-validation.
For example, imagine that you have a model defined as a pipeline that first performs one or two preprocessing steps, which may include operations that should not be fit on the validation or test set, e.g. filling missing values with the mean.
Right now, I'm preprocessing my data and then passing just the pipeline's final step to the powershap object.
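To illustrate, this is the kind of pipeline I mean (a sketch with a hypothetical preprocessing step; passing pipe as model is what I'd like, the last lines are what I do instead):

from catboost import CatBoostClassifier
from powershap import PowerShap
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Ideally the imputer would be refit on each internal train split only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", CatBoostClassifier(n_estimators=250, verbose=0)),
])

# Current workaround: preprocess up front, then hand only the final step to PowerShap
X_prep = SimpleImputer(strategy="mean").fit_transform(X)
selector = PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0))
selector.fit(X_prep, y)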
By default, CatBoost accepts categorical columns. Can we expect support for this in powershap as well?
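For context, this is the plain-CatBoost behaviour I'm referring to (cat_cols, X, y are illustrative names):

from catboost import CatBoostClassifier

# CatBoost handles string/category columns natively when they are listed
# in cat_features; no manual encoding is needed.
model = CatBoostClassifier(n_estimators=250, verbose=0, cat_features=cat_cols)
model.fit(X, y)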