manuel-calzolari / shapicant

Feature selection package based on SHAP and target permutation, for pandas and Spark

Home Page: https://shapicant.readthedocs.io

License: MIT License

Python 100.00%

shapicant's Introduction

shapicant

shapicant is a feature selection package based on SHAP [LUN] and target permutation, for pandas and Spark.

It is inspired by PIMP [ALT], with some differences:

PIMP fits a probability distribution to the population of null importances or, alternatively, uses a non-parametric estimation of the PIMP p-values. Instead, shapicant only implements the non-parametric estimation.
For the non-parametric estimation, PIMP computes the fraction of null importances that are more extreme than the true importance (i.e. r/n). Instead, shapicant computes it as (r+1)/(n+1) [NOR].
PIMP uses the Gini importance of Random Forest models or the Mutual Information criterion. Instead, shapicant uses SHAP values.
While feature importance measures such as the Gini importance show an absolute feature importance, SHAP provides both positive and negative impacts. Instead of taking the mean absolute value of the SHAP values for each feature as feature importance, shapicant takes the mean value for positive and negative SHAP values separately. The true importance needs to be consistently higher than null importances for both positive and negative impacts. For multi-class classification, the true importance needs to be higher for at least one of the classes.
While feature importance measures such as the Gini importance of Random Forest models are computed on the training set, SHAP values can be computed out-of-sample. Therefore, shapicant allows them to be computed on a distinct validation set. To decide whether to compute them on the training set or on a validation set, you can refer to this discussion for "Training vs. Test Data" (it talks about PFI [BRE], which is a different algorithm, but the general idea is still applicable).
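The sign-split importance and the (r+1)/(n+1) estimate described above can be sketched in plain Python. This is only an illustration under stated assumptions, not shapicant's actual implementation: the names `signed_mean_importance` and `empirical_p_value` are hypothetical, an empty side is reported as 0.0, and "more extreme" is taken to mean a null importance whose magnitude is at least that of the true importance.

```python
from typing import Sequence, Tuple


def signed_mean_importance(shap_values: Sequence[float]) -> Tuple[float, float]:
    """Return (mean of positive SHAP values, mean of negative SHAP values).

    Hypothetical helper: a side with no values is reported as 0.0.
    """
    pos = [v for v in shap_values if v > 0]
    neg = [v for v in shap_values if v < 0]
    pos_mean = sum(pos) / len(pos) if pos else 0.0
    neg_mean = sum(neg) / len(neg) if neg else 0.0
    return pos_mean, neg_mean


def empirical_p_value(true_importance: float, null_importances: Sequence[float]) -> float:
    """Compute (r + 1) / (n + 1), as in [NOR].

    r is the number of null importances at least as extreme as the true one,
    n is the number of null importances (i.e. the number of permutations).
    """
    n = len(null_importances)
    r = sum(1 for null in null_importances if abs(null) >= abs(true_importance))
    return (r + 1) / (n + 1)


# SHAP values of one feature over five samples (made-up numbers).
shap_for_feature = [0.4, -0.1, 0.2, -0.3, 0.0]
pos_mean, neg_mean = signed_mean_importance(shap_for_feature)

# Positive-side importances of the same feature under nine target permutations.
null_pos_means = [0.05, 0.10, 0.02, 0.31, 0.07, 0.01, 0.04, 0.09, 0.03]
p_pos = empirical_p_value(pos_mean, null_pos_means)
```

With r/n, a feature whose true importance exceeds all n null importances would get a p-value of exactly 0. The (r+1)/(n+1) form never goes below 1/(n+1), so the smallest attainable p-value depends on the number of permutations.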

Permuting the response vector instead of permuting features has some advantages:

The dependence between predictor variables remains unchanged.
The number of permutations can be much smaller than the number of predictor variables for high dimensional datasets (unlike PFI [BRE]) and there is no need to add shadow features (unlike Boruta [KUR]).
Since the feature set does not change across iterations, the distributed implementation is more straightforward.
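The points above can be illustrated with a small, dependency-free sketch of target permutation. It rests on assumptions: `null_importances` and `abs_covariance_importance` are hypothetical names, and the covariance-based score is only a toy stand-in for "fit a model and compute SHAP values". What it shows is that only the response is shuffled, the feature matrix is passed through untouched, and the number of iterations is chosen independently of the number of features.

```python
import random
from typing import Callable, List, Sequence

Matrix = Sequence[Sequence[float]]


def null_importances(
    X: Matrix,
    y: Sequence[float],
    importance_fn: Callable[[Matrix, Sequence[float]], List[float]],
    n_iter: int,
    seed: int = 0,
) -> List[List[float]]:
    """Collect one importance vector per random permutation of the target."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_iter):
        y_perm = list(y)
        rng.shuffle(y_perm)  # only the response vector is shuffled
        results.append(importance_fn(X, y_perm))  # X is never modified
    return results


def abs_covariance_importance(X: Matrix, y: Sequence[float]) -> List[float]:
    """Toy importance: absolute covariance between each column and the target."""
    n = len(y)
    y_mean = sum(y) / n
    scores = []
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        col_mean = sum(col) / n
        cov = sum((c - col_mean) * (t - y_mean) for c, t in zip(col, y)) / n
        scores.append(abs(cov))
    return scores


# Two features; the target depends on the first one only.
X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [2.0 * row[0] for row in X]

true_scores = abs_covariance_importance(X, y)
nulls = null_importances(X, y, abs_covariance_importance, n_iter=10, seed=0)
```

Each row of `nulls` holds one null importance per feature, so a per-feature comparison of `true_scores` against the corresponding column of `nulls` gives the inputs for the non-parametric p-value.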

Installation

Dependencies

shapicant requires:

Python (>= 3.6)
shap (>= 0.36.0)
numpy
pandas
scikit-learn
tqdm

For Spark, we also need:

pyspark (>= 3.0)
pyarrow

User installation

The easiest way to install shapicant is using pip

pip install shapicant

or conda

conda install -c conda-forge shapicant

Documentation

Installation instructions, the API reference and examples can be found in the documentation (https://shapicant.readthedocs.io).

References

[LUN]

Lundberg, S., & Lee, S.I. (2017). A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems (pp. 4765–4774).

[ALT]

Altmann, A., Toloşi, L., Sander, O., & Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26 (10), 1340–1347.

[NOR]

North, B. V., Curtis, D., & Sham, P. C. (2002). A note on the calculation of empirical P values from Monte Carlo procedures. American Journal of Human Genetics, 71 (2), 439–441.

[BRE]

Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), 5–32.

[KUR]

Kursa, M., & Rudnicki, W. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36, 1–13.

shapicant's Issues

All-relevant vs Minimum-optimal feature selection?

Hi! I was just wondering whether shapicant aims to perform All-relevant feature selection (as per Boruta, e.g.) or Minimum-optimal feature selection (as per mRMR, e.g.)? I'm referring to the distinction described here.

Thanks!

Cannot clone object '<shapicant._pandas_selector.PandasSelector object

I am trying to use shapicant as a feature selector within a sklearn pipeline as follows:

explainer_type = shap.TreeExplainer
shap = PandasSelector(clf_dim, explainer_type, n_iter=20, random_state=0)

pipe_dim = Pipeline(
    steps=[
        ('enc', encode), 
        ('dim', shap), 
        ('clf', clf), 
    ], 
)

However, when trying to fit the pipeline, I get the error:

TypeError: Cannot clone object '<shapicant._pandas_selector.PandasSelector object at 0x00000158C66CD190>' (type <class 'shapicant._pandas_selector.PandasSelector'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' method.

so unfortunately I can't use it in the pipeline.

Is this something that can be fixed, perhaps by inheriting from the sklearn class TransformerMixin?
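For context on the error above: scikit-learn's clone rebuilds an unfitted copy of an estimator from the dictionary returned by get_params(), and in scikit-learn that method comes from sklearn.base.BaseEstimator rather than from TransformerMixin. The following dependency-free sketch mimics that idea. It is not scikit-learn's code and not a patch for shapicant: `ParamsMixin`, `ToySelector` and `naive_clone` are hypothetical names, and the constructor signature is only modelled on the call shown in this issue.

```python
import inspect


class ParamsMixin:
    """Hypothetical mixin exposing constructor arguments via get_params()."""

    def get_params(self, deep=True):
        # Read the parameter names from __init__ and look up same-named attributes.
        init_sig = inspect.signature(type(self).__init__)
        names = [name for name in init_sig.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self


class ToySelector(ParamsMixin):
    """Stand-in for a selector; every constructor argument is stored unchanged."""

    def __init__(self, estimator, explainer_type, n_iter=100, random_state=None):
        self.estimator = estimator
        self.explainer_type = explainer_type
        self.n_iter = n_iter
        self.random_state = random_state


def naive_clone(obj):
    """Simplified idea behind cloning: rebuild a fresh object from get_params()."""
    return type(obj)(**obj.get_params())


original = ToySelector("model", "TreeExplainer", n_iter=20, random_state=0)
rebuilt = naive_clone(original)
```

The sketch only works because `__init__` stores each argument under an attribute of the same name, which is the convention scikit-learn estimators follow.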

support for xgboost enable_categorical

XGBoost has supported categorical features since version 1.6, but I am running into an error when using them with shapicant. Here is a minimal example:

import pandas as pd
import numpy as np
from shapicant import PandasSelector
import shap
import xgboost as xgb

num_features = pd.DataFrame(np.random.random((100, 4)), columns=list(range(4)))
categoricals = pd.DataFrame(np.random.randint(1, 10, (100, 3)), dtype="category", columns=list(range(4, 7)))

X_train = pd.concat([num_features, categoricals], axis=1, )
X_test = X_train.copy()
y_train = np.random.random((100, ))
y_test = np.random.random((100, ))

params = {
    "colsample_bynode": (len(num_features) + len(categoricals)) ** .5 / (len(num_features) + len(categoricals)),
    "learning_rate": 1,
    "max_depth": 5,
    "num_boost_round": 1,
    "num_parallel_tree": 100,
    "objective": "reg:logistic",
    "subsample": 0.62,
    "enable_categorical": True,
    "tree_method": "hist",
    "booster": "gbtree",
    "eval_metric": ["logloss", "rmse"],
    "base_score": y_train.mean(),
}

model = xgb.XGBRFRegressor(**params, random_state=42)
model.fit(X_train, y_train)

# Use PandasSelector with 30 iterations
explainer_type = shap.TreeExplainer
selector = PandasSelector(model, explainer_type, n_iter=30, random_state=42)

selector.fit(
    X_train,
    y_train,
    X_validation=X_test,
    estimator_params={
        "eval_set": [(X_test, y_test)]
    },
)

This results in:

[10:01:13] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1667849614592/work/src/learner.cc:767: 
Parameters: { "num_boost_round" } are not used.
Computing true SHAP values:   0%|          | 0/30 [00:00<?, ?it/s][10:01:14] WARNING: /Users/runner/miniforge3/conda-bld/xgboost-split_1667849614592/work/src/learner.cc:767: 
Parameters: { "num_boost_round" } are not used.
[0]	validation_0-logloss:0.71347	validation_0-rmse:0.29523
Computing true SHAP values:   0%|          | 0/30 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3457, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-36-cb665d5c0d88>", line 36, in <module>
    selector.fit(
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shapicant/_pandas_selector.py", line 85, in fit
    true_pos_shap_values, true_neg_shap_values = self._get_shap_values(
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shapicant/_pandas_selector.py", line 199, in _get_shap_values
    explainer = self.explainer_type(self.estimator, **explainer_type_params or {})
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shap/explainers/_tree.py", line 149, in __init__
    self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shap/explainers/_tree.py", line 859, in __init__
    xgb_loader = XGBTreeModelLoader(self.original_model)
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/shap/explainers/_tree.py", line 1431, in __init__
    self.buf = xgb_model.save_raw()
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/xgboost/core.py", line 2408, in save_raw
    _check_call(
  File "/Users/cdalmaso/opt/anaconda3/lib/python3.9/site-packages/xgboost/core.py", line 279, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [10:01:14] /Users/runner/miniforge3/conda-bld/xgboost-split_1667849614592/work/src/tree/tree_model.cc:869: Check failed: !HasCategoricalSplit(): Please use JSON/UBJSON for saving models with categorical splits.
Stack trace:
  [bt] (0) 1   libxgboost.dylib                    0x000000017fb0ed98 dmlc::LogMessageFatal::~LogMessageFatal() + 124
  [bt] (1) 2   libxgboost.dylib                    0x000000017fccca40 xgboost::RegTree::Save(dmlc::Stream*) const + 1184
  [bt] (2) 3   libxgboost.dylib                    0x000000017fc102a4 xgboost::gbm::GBTreeModel::Save(dmlc::Stream*) const + 312
  [bt] (3) 4   libxgboost.dylib                    0x000000017fc1b390 xgboost::LearnerIO::SaveModel(dmlc::Stream*) const + 1224
  [bt] (4) 5   libxgboost.dylib                    0x000000017fb2eb2c XGBoosterSaveModelToBuffer + 788
  [bt] (5) 6   libffi.8.dylib                      0x00000001019e804c ffi_call_SYSV + 76
  [bt] (6) 7   libffi.8.dylib                      0x00000001019e57d4 ffi_call_int + 1336
  [bt] (7) 8   _ctypes.cpython-39-darwin.so        0x0000000101c8c544 _ctypes_callproc + 1324
  [bt] (8) 9   _ctypes.cpython-39-darwin.so        0x0000000101c86850 PyCFuncPtr_call + 1176

I am running xgboost==1.7.1 and shapicant==0.4.0

NaN values handling

Hello,
I have a problem with your library when the data contains NaN values.
Error:
Caused by: org.apache.spark.SparkException: Encountered NaN while assembling a row with handleInvalid = "error". Consider removing NaNs from dataset or using handleInvalid = "keep" or "skip".

Update code to use the concat method instead of append

Hey there future-self,

Just dropping a note to remind you to update the code to replace the pandas append() method with the concat() function. The append() method was deprecated in pandas 1.4 and removed in pandas 2.0, so this change is needed for the code to keep working with current pandas versions.
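A minimal sketch of the migration, assuming the code accumulates rows into a DataFrame. The column names and values are made up for illustration and are not taken from shapicant's source.

```python
import pandas as pd

history = pd.DataFrame({"feature": ["a"], "p_value": [0.04]})
new_row = pd.DataFrame({"feature": ["b"], "p_value": [0.20]})

# Old style, no longer available in pandas >= 2.0:
#     history = history.append(new_row, ignore_index=True)

# Equivalent call with concat:
history = pd.concat([history, new_row], ignore_index=True)

# When accumulating in a loop, collect the pieces in a list and concatenate once,
# which avoids copying the growing DataFrame on every iteration.
pieces = [
    pd.DataFrame({"feature": [name], "p_value": [p]})
    for name, p in [("c", 0.50), ("d", 0.01)]
]
history = pd.concat([history, *pieces], ignore_index=True)
```

Passing ignore_index=True renumbers the rows, which matches what append(..., ignore_index=True) used to do.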