Example code that I used for a similar purpose is below. It would need to be refactored and fitted into the probatus API.
from typing import List, Tuple

import numpy as np
import pandas as pd
import lightgbm
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from pandas.api.types import is_numeric_dtype
from probatus.feature_elimination import ShapRFECV
from probatus.stat_tests import DistributionStatistics, AutoDist
feature_names = ['f1_missing', 'f2_missing', 'f3_unstable', 'f4_unstable', 'f5_unstable', 'f6_correlated', 'f7_correlated', 'f8_correlated', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20']
# Prepare two samples
X, y = make_classification(n_samples=1000, class_sep=0.05, n_informative=6, n_features=20,
random_state=1, n_redundant=10, n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)
X['f1_missing'] = X['f1_missing'].apply(lambda x: x if np.random.rand()<0.5 else np.nan)
X['f2_missing'] = X['f2_missing'].apply(lambda x: x if np.random.rand()<0.8 else np.nan)
X['f7_correlated'] = X['f6_correlated'] + 0.5
X['f8_correlated'] = X['f6_correlated'] * 0.5
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Introduce shift in certain features in test
X_test['f3_unstable'] = X_test['f3_unstable'] + 2
X_test['f4_unstable'] = X_test['f4_unstable'] + 2
X_test['f5_unstable'] = X_test['f5_unstable'] + 2
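As a quick sanity check (my addition, not part of the original snippet), the induced missingness, correlation, and shift can be verified directly:
# Sanity check (illustration only): verify the induced data properties
print(X[['f1_missing', 'f2_missing']].isna().mean())                        # roughly 0.5 and 0.2 missing
print(X_train[['f6_correlated', 'f7_correlated', 'f8_correlated']].corr())  # pairwise correlations of 1.0
print(X_test['f3_unstable'].mean() - X_train['f3_unstable'].mean())         # shift of roughly 2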
def clean_correlation_matrix(corr_matrix: np.ndarray, replaced_value: int = -1) -> np.ndarray:
    """
    Overwrite the diagonal and lower-triangle values of a 2D correlation matrix with replaced_value.

    Args:
        corr_matrix (np.ndarray): 2D matrix with correlation values.
        replaced_value (int, optional): Value that is imputed, by default -1.

    Returns:
        (np.ndarray) Correlation matrix with only the strict upper triangle preserved.
    """
    # Work on a copy so the caller's matrix is not mutated.
    cleaned_corr_matrix = corr_matrix.copy()
    for i in range(len(cleaned_corr_matrix)):
        for j in range(len(cleaned_corr_matrix)):
            if i >= j:
                cleaned_corr_matrix[i, j] = replaced_value
    return cleaned_corr_matrix
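As a small illustration (my addition), masking a 3x3 matrix keeps only the strict upper triangle:
# Illustration: diagonal and lower triangle are overwritten with -1
demo = np.array([[1.0, 0.8, 0.3],
                 [0.8, 1.0, 0.6],
                 [0.3, 0.6, 1.0]])
print(clean_correlation_matrix(demo))
# [[-1.   0.8  0.3]
#  [-1.  -1.   0.6]
#  [-1.  -1.  -1. ]]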
def remove_correlated_features(corr_matrix: np.ndarray, correlation_threshold: float, feature_names: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """
    Iteratively remove correlated features. At each step it looks at the most correlated feature pair and asks the
    user to select which feature to keep based on expert knowledge. The process continues until all pairs have a
    correlation below correlation_threshold.

    Args:
        corr_matrix (np.ndarray): 2D correlation matrix.
        correlation_threshold (float): Threshold above which correlated features are removed.
        feature_names (list of str): List of feature names.

    Returns:
        (list, list, list) Feature names remaining after removal, the feature selected at each iteration of removal,
        and the feature removed at each iteration.
    """
    # Take absolute value of correlations
    corr_matrix = np.abs(corr_matrix)
    # Replace diagonal and lower-triangle values
    clean_corr_matrix = clean_correlation_matrix(corr_matrix, -1)
    # Init variables
    selected_features = np.array([])
    removed_features = np.array([])
    remaining_features = np.array(feature_names)
    # Iteratively remove correlated features one by one while any correlation is above the threshold
    while np.max(clean_corr_matrix) > correlation_threshold:
        position_max_correlated_features = np.unravel_index(clean_corr_matrix.argmax(),
                                                            clean_corr_matrix.shape)
        print(f'Select one of the following features with correlation '
              f'{clean_corr_matrix[position_max_correlated_features]}: \n'
              f'0 for {remaining_features[position_max_correlated_features[0]]} \n'
              f'1 for {remaining_features[position_max_correlated_features[1]]}')
        selected_option = int(input())
        removed_option = 1 - selected_option
        # Append for tracking results
        removed_features = np.append(removed_features,
                                     remaining_features[position_max_correlated_features[removed_option]])
        selected_features = np.append(selected_features,
                                      remaining_features[position_max_correlated_features[selected_option]])
        print(f'Removing {remaining_features[position_max_correlated_features[removed_option]]}, Num of features left'
              f' {len(clean_corr_matrix) - 1}')
        # Remove the feature from the name list and from both axes of the matrix
        remaining_features = np.delete(remaining_features, position_max_correlated_features[removed_option])
        clean_corr_matrix = np.delete(clean_corr_matrix, position_max_correlated_features[removed_option], 0)
        clean_corr_matrix = np.delete(clean_corr_matrix, position_max_correlated_features[removed_option], 1)
    return list(remaining_features), list(selected_features), list(removed_features)
corr_matrix_numeric, p_val_matrix_numeric = spearmanr(X_train[feature_names], nan_policy='omit')
remaining_features, selected_features, removed_features = remove_correlated_features(corr_matrix_numeric, 0.95, feature_names)
print(f'Dropping highly correlated features {removed_features}.')
X_train = X_train.drop(columns=removed_features)
X_test = X_test.drop(columns=removed_features)
feature_names = [feature for feature in feature_names if feature not in removed_features]
You could add a class to the feature_elimination module. Its API would look like this:
__init__(correlation_type='spearman' or 'pearson', correlation_threshold=0.95, **kwargs passed to the correlation method)
fit(X, y)
compute() -> pd.DataFrame
fit_compute(X, y)
plot()  # Correlation matrix (possibly before and after removal)
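A minimal sketch of such a class, assuming a non-interactive policy (always keep the first feature of the most correlated pair) in place of the input() prompt above; the class name, defaults, and internals are my assumptions, not an existing probatus API:
# Hypothetical sketch, not an existing probatus class.
class CorrelatedFeatureRemoval:
    def __init__(self, correlation_type='spearman', correlation_threshold=0.95, **correlation_kwargs):
        self.correlation_type = correlation_type
        self.correlation_threshold = correlation_threshold
        self.correlation_kwargs = correlation_kwargs

    def fit(self, X, y=None):
        # Mask the lower triangle, then greedily drop one feature of the most
        # correlated pair until all remaining pairs fall below the threshold.
        corr = np.abs(X.corr(method=self.correlation_type, **self.correlation_kwargs).to_numpy())
        corr = clean_correlation_matrix(corr)
        features = np.array(X.columns)
        removed = []
        while np.max(corr) > self.correlation_threshold:
            i, j = np.unravel_index(corr.argmax(), corr.shape)
            removed.append(features[j])  # non-interactive policy: keep feature i, drop feature j
            features = np.delete(features, j)
            corr = np.delete(np.delete(corr, j, 0), j, 1)
        self.remaining_features_ = list(features)
        self.removed_features_ = list(removed)
        return self

    def compute(self):
        # One row per removed feature.
        return pd.DataFrame({'removed_feature': self.removed_features_})

    def fit_compute(self, X, y=None):
        return self.fit(X, y).compute()

    def plot(self):
        # Correlation matrix (possibly before and after removal); omitted in this sketch.
        raise NotImplementedError
Usage would then be e.g. CorrelatedFeatureRemoval(correlation_threshold=0.95).fit_compute(X_train).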