
scikit-psl's Introduction


Probabilistic Scoring Lists

Probabilistic scoring lists are incremental models that evaluate one feature of the dataset at a time. PSLs can be seen as an extension of scoring systems in two ways:

  • they can be evaluated at any stage, allowing one to trade off model complexity and prediction speed.
  • they provide probabilistic predictions instead of a deterministic decision for each possible score.

Scoring systems are used as decision support systems for human experts, e.g. in medical or judicial decision making.

This implementation adheres to the scikit-learn API.

Install

pip install scikit-psl

Usage

For more examples have a look at the examples folder, but here is a simple one:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from skpsl import ProbabilisticScoringList

# Generating synthetic data with continuous features and a binary target variable
X, y = make_classification(n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

psl = ProbabilisticScoringList({-1, 1, 2})
psl.fit(X_train, y_train)
print(f"Brier score: {psl.score(X_test, y_test, -1):.4f}")
"""
Brier score: 0.2438  (lower is better)
"""

df = psl.inspect(5)
print(df.to_string(index=False, na_rep="-", justify="center", float_format=lambda x: f"{x:.2f}"))
"""
 Stage Threshold  Score  T = -2  T = -1  T = 0  T = 1  T = 2  T = 3  T = 4  T = 5
  0            -     -       -       -   0.51      -      -      -      -      - 
  1     >-2.4245  2.00       -       -   0.00      -   0.63      -      -      - 
  2     >-0.9625 -1.00       -    0.00   0.00   0.48   1.00      -      -      - 
  3      >0.4368 -1.00    0.00    0.00   0.12   0.79   1.00      -      -      - 
  4     >-0.9133  1.00    0.00    0.00   0.12   0.12   0.93   1.00      -      - 
  5      >2.4648  2.00    0.00    0.00   0.07   0.07   0.92   1.00   1.00   1.00 
"""

scikit-psl's People

Contributors

jonashanselle, stheid


Forkers

emil1314

scikit-psl's Issues

Lookahead should respect predefined scores even if the solution is suboptimal

len_pre = min(len(set(predef_features) & remaining_features), len_)
len_rest = len_ - len_pre
if strict and predef_features:
    prefixes = [
        [f_ for f_ in predef_features if f_ in remaining_features][:len_pre]
    ]
else:
    prefixes = permutations(
        set(predef_features) & remaining_features, len_pre
    )

The lookahead should not consider any additional features for selection until the prefix is exhausted.

Maybe that is already the case, and maybe we need the lookahead regardless, because we might need to look ahead when selecting the score for the current feature.

This just needs to be checked, as it might save runtime.

Integrate beta calibration into our core repository

Since our change requests for the beta calibration repository are being ignored and PyPI packages can only have PyPI dependencies, we currently cannot publish our master version to PyPI.

My suggestion is to integrate the beta calibration (or at least the relevant code) into our own repository.

Representation as decision tree

A PSL can also be represented as a decision tree. This could serve both as a development and inspection tool and as an alternative to the tabular form in applications. There are certainly several possible representations, and implementations can vary as well. The graphviz package might be useful for an implementation; scikit-learn also offers an inspection method for trees, sklearn.tree.plot_tree.

Example:
[decision-tree visualization attached to the issue]
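
A minimal sketch of what such a rendering could look like, assuming only the inspect() table shown in the README above; the chain-per-stage layout and the helper name psl_to_graph are illustrative, not an implemented API:

from graphviz import Digraph

def psl_to_graph(psl, stages):
    """Render the first `stages` stages of a fitted PSL as a simple stage chain."""
    df = psl.inspect(stages)
    dot = Digraph(comment="Probabilistic scoring list")
    prev = None
    for _, row in df.iterrows():
        stage = int(row["Stage"])
        if stage == 0:
            continue  # stage 0 has no feature, threshold or score
        node_id = f"stage{stage}"
        dot.node(node_id, f"Stage {stage}\nfeature {row['Threshold']} -> score {row['Score']:+.0f}")
        if prev is not None:
            dot.edge(prev, node_id)
        prev = node_id
    return dot

# psl_to_graph(psl, 5).render("psl_stages", format="png")  # writes psl_stages.png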

Additional Hyperparams of the classifierAtK

The classifier at k should be made more versatile by:

  • allowing isotonic regression or logistic regression to be selected as the calibrator

In the PSL:

  • allow providing an arbitrary scoring function for optimizing the loss
  • forward all necessary parameters to the classifier at k in the form of a dictionary; that allows forwarding all parameters in an easily extensible way (see the interface sketch below)
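
A possible interface sketch for the dictionary-forwarding idea; the keyword arguments stage_clf_params, calibrator and loss are hypothetical names used for illustration, not the current API:

psl = ProbabilisticScoringList(
    {-1, 1, 2},
    stage_clf_params={"calibrator": "isotonic"},  # hypothetical: forwarded verbatim to the classifier at k
    loss="brier",                                 # hypothetical: pluggable scoring function
)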

Subset summation in Classifier at k is wrong

Currently, the set of total scores is calculated incorrectly if there are two or more identical scores.

E.g. scores = [3, 3] should yield the total scores [0, 3, 6], but it will probably yield [0, 3].

Maybe the implementation is actually correct right now, but it must be checked thoroughly.
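
A minimal reference sketch of a subset summation that handles duplicate scores correctly, against which the Classifier at k could be tested (the function name is illustrative):

from itertools import combinations

def total_scores(scores):
    """All achievable total scores, counting duplicate scores separately."""
    return sorted({sum(c) for r in range(len(scores) + 1) for c in combinations(scores, r)})

assert total_scores([3, 3]) == [0, 3, 6]  # not [0, 3]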

Train PSL via Rank loss

The StageCLF gets an additional fit function that

  • splits the data into positive samples P and negative samples N
  • creates a dataset D = P × N
  • lets the score function and binarizer use the average rank loss (see the sketch below):
    • h(p) < h(n) -> 1
    • h(p) = h(n) -> .5
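
A minimal sketch of the pairwise rank loss described above, assuming h_pos and h_neg are the stage classifier's scores for the positive and negative samples (the names are illustrative):

import numpy as np

def avg_rank_loss(h_pos, h_neg):
    """Average rank loss over all pairs in D = P x N: 1 if h(p) < h(n), 0.5 on ties."""
    diff = h_pos[:, None] - h_neg[None, :]          # pairwise differences h(p) - h(n)
    return np.mean((diff < 0) + 0.5 * (diff == 0))  # misranked pairs count 1, ties count 0.5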

Extend PSL to be capable of working with continuous datasets

In v0.2.0 the PSL can only work with binary datasets. There is a binarization class that allows for continuous datasets, but doing the binarization in a preprocessing step is probably not optimal.

Future extensions could:

  • allow for categorical inputs (producing set-style decision rules)
  • allow multiclass prediction

implement psl.describe(·)

It would be nice to have a describe function that can print the model up to a given stage:

def describe(self, max_stage=-1, feature_names=None):
    # proposed stub: print the model up to max_stage, optionally using feature names
    pass


psl = ProbabilisticScoringList().fit(X,y)
psl.describe()

[example output screenshot attached to the issue]
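
A minimal sketch, assuming describe() simply reuses the existing inspect() table; feature-name handling is omitted because the inspect() output shown in the README does not expose a feature column:

def describe(psl, max_stage, feature_names=None):
    """Print the fitted scoring list up to max_stage (standalone sketch, not the final API)."""
    df = psl.inspect(max_stage)
    print(df.to_string(index=False, na_rep="-", justify="center",
                       float_format=lambda x: f"{x:.2f}"))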

implement expected entropy as function of y_true and y_proba instead of X

# current implementation (computed from X)
def _expected_entropy(self, X):
    total_scores, score_freqs = np.unique(X @ self.scores_, return_counts=True)
    true_proba = self.regressor.transform(total_scores)
    entropy_values = entropy([1 - true_proba, true_proba], base=2)
    return np.sum((score_freqs / X.size) * entropy_values)

# proposed replacement (computed from the predicted probabilities)
def _expected_entropy(y_true, y_proba):
    proba, score_freqs = np.unique(y_proba, return_counts=True)
    entropy_values = entropy([1 - proba, proba], base=2)
    return entropy_values @ (score_freqs / len(y_proba))

There are two cases to prove:

  1. All scores map to different probabilities (true for strictly monotonic regressors like the sigmoid, often true for isotonic regression).
     • In this case the frequencies are exactly the same and the result is identical.
  2. Some scores coincide on the same probability.
     • In this case the entropy values are exactly the same; the groups of scores that map to the same probability are merged and their weights are added up as well, so it should still give the same result.

The expected entropy should even simplify to:

def _expected_entropy(y_true, y_proba):
    return entropy([1 - y_proba, y_proba], base=2).mean()
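
A quick numerical check of the claimed equivalence, assuming y_proba holds the per-sample positive-class probabilities produced by the stage classifier:

import numpy as np
from scipy.stats import entropy

y_proba = np.array([0.1, 0.1, 0.6, 0.6, 0.6, 0.9])

# grouped form: unique probabilities weighted by their relative frequencies
proba, counts = np.unique(y_proba, return_counts=True)
grouped = entropy([1 - proba, proba], base=2) @ (counts / len(y_proba))

# per-sample form: mean binary entropy over all samples
per_sample = entropy([1 - y_proba, y_proba], base=2).mean()

assert np.isclose(grouped, per_sample)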

Use cachetools to cache already evaluated models during lookahead

https://stackoverflow.com/questions/30730983/make-lru-cache-ignore-some-of-the-function-arguments

When extracting the actual model evaluation from the _optimize function, we need to pass the training data and so on to this cached function. However, we don't want the cache to check the training data for equality.

  • A better way would be to use this library to ignore the irrelevant parameters of the cached function,
  • and to always clear the cache at the beginning of the fit method.

Maybe it is even more elegant to define that function as a closure inside the fit function, so we can be sure the cache is clean. The problem here is that this might interfere with joblib. A possible setup is sketched below.
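
A possible setup sketched with cachetools; the function name _evaluate_stage, its parameters and the dummy loss are illustrative, only the key handling and the cache clearing matter here:

import numpy as np
from cachetools import LRUCache, cached
from cachetools.keys import hashkey

_stage_cache = LRUCache(maxsize=2048)

# the key ignores X and y, so the (large, unhashable) training arrays are never compared
@cached(_stage_cache, key=lambda feature, score, threshold, X, y: hashkey(feature, score, threshold))
def _evaluate_stage(feature, score, threshold, X, y):
    # placeholder loss; the real code would fit and score the classifier at k here
    return float(np.mean((X[:, feature] > threshold) != y))

# at the beginning of fit(): drop results from any previous dataset
_stage_cache.clear()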

Feature Dropout during psl learning

Idea

Sometimes features of the dataset are not available at runtime. Hence the PSL should be more robust when learning from such features in case they are unavailable; also, a default score must be assigned if the feature really does become unavailable!

An idea would be to flag features as "potentially missing" at runtime, and perhaps the PSL could also be made robust to those features not being set in the training data.

One-shot Optimization

Currently we greedily select one feature after the other by evaluating all remaining features in every step.

An idea would be to evaluate each feature-score pair only once, then fit a global ordering of the features and only refit the scores in each iteration, or something similar.

Basically, see how well the algorithm performs with fewer compute resources.

[optional] optimization of internal threshold calculation

Currently we optimize the feature, score and threshold in the following way:

for each feature
  for each score
    optimize threshold

However, the threshold is independent of the score, so we could first optimize the threshold for the feature (depending on the context of already selected features) and only then go into the score loop:

for each feature
  optimize threshold
  for each score
    evaluate score
