predict-idlab / powershap

A power-full Shapley feature selection method.
License: Other
Thanks for the great library!
We want to upgrade our workflow to Python 3.11, but it looks like this library doesn't support it. I was able to install it with pip install --ignore-requires-python powershap==0.0.9, but I get the following error when trying to import powershap:
>>> from powershap import PowerShap
/Users/alex/miniforge3/envs/datachat_py3.11_arm64/lib/python3.11/site-packages/shap/utils/_clustering.py:34: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details.
@jit
[... the same NumbaDeprecationWarning repeats for the remaining @jit / @numba.jit decorators in shap/utils/_clustering.py (lines 53, 62, 68, 76), shap/links.py (lines 4, 9, 14, 19), shap/utils/_masked_model.py (lines 361, 383), and shap/maskers/_tabular.py (lines 184, 195) ...]
IPython could not be loaded!
Side note, you may want to change the file naming scheme in PyPI to reflect this. The name of the newest release powershap-0.0.9-py3-none-any.whl
is misleading if this can't be installed on any Python 3 distribution.
For ref, here are my environment details:
python:3.11-slim (tried this as well and got the same error)

Hi,
With version 0.0.7 I am receiving the following error when trying to pickle (i.e. via joblib):
_pickle.PicklingError: Can't pickle <function PowerShap.__init__.<locals>._infinite_splitter.<locals>.split at 0x0000020167F551F0>: it's not found as powershap.powershap.PowerShap.__init__.<locals>._infinite_splitter.<locals>.split
Powershap works great; it's just saving the fit for future use with pickle that is the problem. I have to use pickle because it's part of joblib, which is caching my functions using "memory".
Here's a reproducible example:
self.selector = PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True), cv=cv, verbose=True)
self.selector.fit(X, y)
import joblib
joblib.dump(self, "self.job")
Thanks in advance for any help.
*** Update: this is not a major issue. I worked around the problem by storing the feature names instead of the fitted selector, and I'm happy to close the issue if this is not a priority.
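For reference, a minimal sketch of that workaround, assuming PowerShap exposes the usual scikit-learn selector API (get_support) and that X is the training DataFrame:

import joblib

# Persist only the selected feature names, not the fitted PowerShap object
# (whose internal closure can't be pickled).
selected_features = list(X.columns[selector.get_support()])
joblib.dump(selected_features, "selected_features.joblib")

# Later: reload the names and subset new data.
selected_features = joblib.load("selected_features.joblib")
X_new_selected = X_new[selected_features]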
Hey all, I know Boruta is an "all relevant feature selection" method vs., e.g., mRMR, which aims to find a minimal optimal set of features. This is described here.
I'm just wondering whether powershap is an all-relevant feature selection method or a minimal-optimal feature selection method? Thanks!
Hi! Thank you for creating this library, it's super useful!
We're using it with pandas 2.0, without any errors so far. Would it be possible to relax the version constraints, to enable usage with the new major version?
Thank you!
Hi!
Appreciate the work on this project.
scikit-learn announced a gradual brownout of the sklearn dependency on PyPI. Even though the main branch on this repository tags scikit-learn correctly, the latest release on PyPI (rc2) still lists it as sklearn.
We had a discussion about this on this PR, but I believe a release was tagged before the sklearn -> scikit-learn change was made.
Can we expect a new release to be tagged and pushed to PyPI before 12/01?
Thanks!
Thought it may be helpful/convenient to have a link to the license in the readme.
I'm facing a weird error when using a LightGBM model as the underlying model with the selector.
I could find a simple repro using the titanic dataset:
X - bug_features.csv
y - bug_label.csv
Categorical features and Data types info
cats --> ['Cabin', 'Embarked', 'Gender', 'Name', 'Parch', 'Pclass', 'SibSp', 'Ticket'],
X.dtypes -->
Age float64
Cabin category
Embarked category
Fare float64
Gender category
Name category
Parch category
PassengerId int64
Pclass category
SibSp category
Ticket category
Code snippet used to fit the selector:
from lightgbm import LGBMClassifier
from powershap import PowerShap

lgb_clf = LGBMClassifier(random_state=42, n_estimators=10)

sel = PowerShap(model=lgb_clf, stratify=True, fit_kwargs={'categorical_feature': cats})

selector = PowerShap(
    model=lgb_clf,
    stratify=True,
    verbose=False,
    automatic=True,
)
selector.fit(X, y)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [19], in <cell line: 12>()
5 _stratify = True
6 selector = PowerShap(
7 model=lgb_clf,
8 stratify=_stratify,
9 verbose=False,
10 automatic=True
11 )
---> 12 selector.fit(X, y, categorical_feature=cats)
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/powershap.py:393, in PowerShap.fit(self, X, y, stratify, groups, **kwargs)
387 loop_its = 10
388 self._print(
389 "Automatic mode enabled: Finding the minimal required powershap",
390 f"iterations for significance of {self.power_alpha}.",
391 )
--> 393 shaps_df = self._explainer.explain(
394 X=X,
395 y=y,
396 loop_its=loop_its,
397 val_size=self.val_size,
398 stratify=stratify,
399 groups=groups,
400 cv_split=self.cv, # pass the wrapped cv split function
401 show_progress=self.show_progress,
402 **kwargs,
403 )
405 processed_shaps_df = powerSHAP_statistical_analysis(
406 shaps_df,
407 self.power_alpha,
408 self.power_req_iterations,
409 include_all=self.include_all,
410 )
412 if self.automatic:
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/shap_wrappers/shap_explainer.py:175, in ShapExplainer.explain(self, X, y, loop_its, val_size, stratify, groups, cv_split, random_seed_start, show_progress, **kwargs)
172 Y_train = y[np.sort(train_idx)]
173 Y_val = y[np.sort(val_idx)]
--> 175 Shap_values = self._fit_get_shap(
176 X_train=X_train,
177 Y_train=Y_train,
178 X_val=X_val,
179 Y_val=Y_val,
180 random_seed=i + random_seed_start,
181 **kwargs,
182 )
184 Shap_values = np.abs(Shap_values)
186 if len(np.shape(Shap_values)) > 2:
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/powershap/shap_wrappers/shap_explainer.py:251, in LGBMExplainer._fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs)
248 from copy import copy
250 PowerShap_model = copy(self.model).set_params(random_seed=random_seed)
--> 251 PowerShap_model.fit(X_train, Y_train, eval_set=(X_val, Y_val))
252 # Calculate the shap values
253 C_explainer = shap.TreeExplainer(PowerShap_model)
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/sklearn.py:967, in LGBMClassifier.fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
964 else:
965 valid_sets[i] = (valid_x, self._le.transform(valid_y))
--> 967 super().fit(X, _y, sample_weight=sample_weight, init_score=init_score, eval_set=valid_sets,
968 eval_names=eval_names, eval_sample_weight=eval_sample_weight,
969 eval_class_weight=eval_class_weight, eval_init_score=eval_init_score,
970 eval_metric=eval_metric, early_stopping_rounds=early_stopping_rounds,
971 verbose=verbose, feature_name=feature_name, categorical_feature=categorical_feature,
972 callbacks=callbacks, init_model=init_model)
973 return self
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/sklearn.py:748, in LGBMModel.fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks, init_model)
745 evals_result = {}
746 callbacks.append(record_evaluation(evals_result))
--> 748 self._Booster = train(
749 params=params,
750 train_set=train_set,
751 num_boost_round=self.n_estimators,
752 valid_sets=valid_sets,
753 valid_names=eval_names,
754 fobj=self._fobj,
755 feval=eval_metrics_callable,
756 init_model=init_model,
757 feature_name=feature_name,
758 callbacks=callbacks
759 )
761 if evals_result:
762 self._evals_result = evals_result
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/engine.py:271, in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
269 # construct booster
270 try:
--> 271 booster = Booster(params=params, train_set=train_set)
272 if is_valid_contain_train:
273 booster.set_train_data_name(train_data_name)
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:2605, in Booster.__init__(self, params, train_set, model_file, model_str, silent)
2598 self.set_network(
2599 machines=machines,
2600 local_listen_port=params["local_listen_port"],
2601 listen_time_out=params.get("time_out", 120),
2602 num_machines=params["num_machines"]
2603 )
2604 # construct booster object
-> 2605 train_set.construct()
2606 # copy the parameters from train_set
2607 params.update(train_set.get_params())
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:1815, in Dataset.construct(self)
1812 self._set_init_score_by_predictor(self._predictor, self.data, used_indices)
1813 else:
1814 # create train
-> 1815 self._lazy_init(self.data, label=self.label,
1816 weight=self.weight, group=self.group,
1817 init_score=self.init_score, predictor=self._predictor,
1818 silent=self.silent, feature_name=self.feature_name,
1819 categorical_feature=self.categorical_feature, params=self.params)
1820 if self.free_raw_data:
1821 self.data = None
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:1474, in Dataset._lazy_init(self, data, label, reference, weight, group, init_score, predictor, silent, feature_name, categorical_feature, params)
1472 self.pandas_categorical = reference.pandas_categorical
1473 categorical_feature = reference.categorical_feature
-> 1474 data, feature_name, categorical_feature, self.pandas_categorical = _data_from_pandas(data,
1475 feature_name,
1476 categorical_feature,
1477 self.pandas_categorical)
1478 label = _label_from_pandas(label)
1480 # process for args
File ~/.pyenv/versions/3.8.13/envs/comp_worker/lib/python3.8/site-packages/lightgbm/basic.py:594, in _data_from_pandas(data, feature_name, categorical_feature, pandas_categorical)
592 if bad_indices:
593 bad_index_cols_str = ', '.join(data.columns[bad_indices])
--> 594 raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
595 "Did not expect the data types in the following fields: "
596 f"{bad_index_cols_str}")
597 data = data.values
598 if data.dtype != np.float32 and data.dtype != np.float64:
ValueError: DataFrame.dtypes for data must be int, float or bool.
Did not expect the data types in the following fields: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Let me know if I am using the API incorrectly or missing an argument. I tried passing the categorical features list into the fit call as a kwarg, but that didn't help either.
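A possible workaround on my side (an assumption, not an official fix) would be to encode the category columns as integer codes before fitting, so the data survives conversion to a plain numeric array:

# Encode the pandas `category` columns as integer codes; -1 marks missing categories.
X_enc = X.copy()
for col in cats:
    X_enc[col] = X_enc[col].cat.codes

selector.fit(X_enc, y)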
Library details:
Name: powershap
Version: 0.0.8rc2
Name: lightgbm
Version: 3.3.2
Name: pandas
Version: 1.4.2
Name: scikit-learn
Version: 1.1.1
Name: numpy
Version: 1.21.6
https://engineering.linkedin.com/blog/2022/fasttreeshap--accelerating-shap-value-computation-for-trees
It looks like it's a drop-in replacement, which we'll start testing next week!
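For context, the swap being tested is roughly the following (a sketch; assumes the fasttreeshap package, which mirrors shap.TreeExplainer's interface, and a fitted tree model):

import fasttreeshap

# fasttreeshap.TreeExplainer is advertised as a drop-in for shap.TreeExplainer
explainer = fasttreeshap.TreeExplainer(fitted_model)  # fitted_model, X_val are placeholders
shap_values = explainer.shap_values(X_val)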
Wondering if there's a timeline for a release that supports numpy>=1.21? The main motivation is that installing powershap in a Python environment becomes troublesome, especially when other libraries in the environment set hard requirements on numpy>1.21.

Hi,
Thank you for this great library!
My data is time series (ordinal) and would like to provide my own cv iterator such as:
cv = GapKFold(n_splits=3, gap_before=self.gap_before, gap_after=self.gap_after)
I would like to do the following:
selector = PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True), verbose=True)
selector.fit(X, y, cv=cv)
It appears from a cursory look at the code that the data is split using train_test_split, which shuffles the data. I was hoping to be able to either provide a custom cv or set the shuffle=False parameter.
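For clarity, what I'd like to write is roughly the following (a sketch, assuming a powershap version whose constructor accepts a cv splitter, as in the CatBoost snippet earlier on this page, and that GapKFold follows the scikit-learn splitter interface):

from catboost import CatBoostClassifier
from powershap import PowerShap
from tscv import GapKFold  # assumed source of the GapKFold splitter

# Time-ordered splits with gaps, instead of a shuffled train_test_split
cv = GapKFold(n_splits=3, gap_before=2, gap_after=2)

selector = PowerShap(
    model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True),
    cv=cv,
    verbose=True,
)
selector.fit(X, y)  # X, y: the time-series features and labels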
Thanks for your consideration
I would like to perform a group-based train-test split (to avoid data leakage in the powershap feature selection).
To do so, I would pass a groups argument to fit and use GroupShuffleSplit in the powershap explain method.
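A standalone sketch of the kind of split I mean (scikit-learn only; groups is an illustrative array with one group id per sample):

from sklearn.model_selection import GroupShuffleSplit

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
# Train and validation indices never share a group, preventing leakage
train_idx, val_idx = next(splitter.split(X, y, groups=groups))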
Is there any reason pandas needs to be <1.4?
I'm currently using pandas 1.5.3, should I expect any issues with powershap? If so, where?
PS: I know about virtualenvs, etc, etc. It's just that I don't quite understand how pandas >1.4 would affect the package all that much.
Here is my code to get XGBoost working (from a notebook):
from powershap import PowerShap
import powershap.shap_wrappers.shap_explainer as se
import powershap.shap_wrappers.shap_explainer_factory as sef

import numpy as np  # needed for the np.array return annotation
import shap
import xgboost as xgb
from typing import Callable


class XGBoostExplainer(se.ShapExplainer):
    @staticmethod
    def supports_model(model) -> bool:
        supported_models = [xgb.XGBRegressor, xgb.XGBClassifier]
        return isinstance(model, tuple(supported_models))

    def _validate_data(self, validate_data: Callable, X, y, **kwargs):
        kwargs["force_all_finite"] = False  # xgboost allows NaNs and infs in X
        kwargs["dtype"] = None  # allow non-numeric data
        return super()._validate_data(validate_data, X, y, **kwargs)

    def _fit_get_shap(
        self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs
    ) -> np.array:
        # Fit the model
        params = self.model.get_params()
        # PowerShap_model = self.model.copy().set_params(random_seed=random_seed)
        PowerShap_model = self.model.__class__(**{**params, 'random_seed': random_seed})
        PowerShap_model.fit(X_train, Y_train)  # , eval_set=(X_val, Y_val))
        # Calculate the shap values
        C_explainer = shap.TreeExplainer(PowerShap_model)
        return C_explainer.shap_values(X_val)

    def _get_more_tags(self):
        return {"allow_nan": True}


# Register the custom explainer so the factory can find it for XGBoost models
sef.ShapExplainerFactory._explainer_models.append(XGBoostExplainer)
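With the explainer registered, a rough usage sketch (X and y here are placeholder training data, not from the snippet above):

# Hypothetical usage check: PowerShap should now route XGBoost models to
# XGBoostExplainer via the factory lookup above.
selector = PowerShap(model=xgb.XGBClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)  # X, y: placeholder feature matrix / target
X_selected = selector.transform(X)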
Dear PowerShap authors,
I am Antonio, Machine Learning Scientist @ Booking.com.
We found this package quite nice and useful. I have a question about the code:
In this line the **kwargs are not passed to the fit method of LGBMModel. Since they are not passed, it's not possible to set extra parameters for the fit method.
This means that the **kwargs and **fit_kwargs specified in the PowerShap class initialization are not used.
Could you please check and maybe fix this in a new release?
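For illustration, the kind of change I mean is sketched below, mirroring the LGBMExplainer._fit_get_shap frame shown in the traceback earlier on this page (a sketch, not the actual library code):

def _fit_get_shap(self, X_train, Y_train, X_val, Y_val, random_seed, **kwargs):
    from copy import copy

    import shap

    PowerShap_model = copy(self.model).set_params(random_seed=random_seed)
    # Forward **kwargs so fit-time options (e.g. categorical_feature) reach LightGBM
    PowerShap_model.fit(X_train, Y_train, eval_set=(X_val, Y_val), **kwargs)
    # Calculate the shap values
    C_explainer = shap.TreeExplainer(PowerShap_model)
    return C_explainer.shap_values(X_val)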
Cheers
Antonio
While running selector.fit(), I get the following warning:
Failed to converge on a solution.
What does this indicate and what can be done to remedy this situation?
Is there any way to use powershap with CatBoostRegressor, training from a file (TSV)?
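The only indirect route I can think of (an assumption on my side, not a documented powershap feature) is to load the TSV into pandas first and then fit as usual:

import pandas as pd
from catboost import CatBoostRegressor
from powershap import PowerShap

# "train.tsv" and the "target" column name are illustrative placeholders
df = pd.read_csv("train.tsv", sep="\t")
X, y = df.drop(columns=["target"]), df["target"]

selector = PowerShap(model=CatBoostRegressor(n_estimators=250, verbose=0))
selector.fit(X, y)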
Is there currently a way to pass a sklearn.pipeline.Pipeline object as the model parameter? I can't seem to do it, and I think being able to would be better for the internal cross-validation.
For example, imagine that you have a model defined as a pipeline that first performs one or two preprocessing steps, which may include operations that should not be fit on the validation or test set, e.g. filling missing values with the mean.
Right now, I'm preprocessing my data and then passing just the pipeline's final step to the powershap object.
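To illustrate, this is the kind of pipeline I mean (a sketch with a hypothetical preprocessing step; passing pipe as model is what I'd like, the last lines are what I do instead):

from catboost import CatBoostClassifier
from powershap import PowerShap
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Ideally the imputer would be refit on each internal train split only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", CatBoostClassifier(n_estimators=250, verbose=0)),
])

# Current workaround: preprocess up front, then hand only the final step to PowerShap
X_prep = SimpleImputer(strategy="mean").fit_transform(X)
selector = PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0))
selector.fit(X_prep, y)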
By default, CatBoost accepts categorical columns. Can we expect support for this in powershap as well?
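For context, this is the plain-CatBoost behaviour I'm referring to (cat_cols, X, y are illustrative names):

from catboost import CatBoostClassifier

# CatBoost handles string/category columns natively when they are listed
# in cat_features; no manual encoding is needed.
model = CatBoostClassifier(n_estimators=250, verbose=0, cat_features=cat_cols)
model.fit(X, y)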