
scikit-psl's Introduction


Probabilistic Scoring Lists

Probabilistic scoring lists are incremental models that evaluate one feature of the dataset at a time. PSLs can be seen as an extension of scoring systems in two ways:

  • they can be evaluated at any stage, allowing one to trade off model complexity and prediction speed.
  • they provide probabilistic predictions instead of a deterministic decision for each possible score.

Scoring systems are used as decision support systems for human experts, e.g. in medical or judicial decision making.

This implementation adheres to the scikit-learn API.

Install

pip install scikit-psl

Usage

For more examples have a look at the examples folder, but here is a simple one:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from skpsl import ProbabilisticScoringList

# Generating synthetic data with continuous features and a binary target variable
X, y = make_classification(n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

psl = ProbabilisticScoringList({-1, 1, 2})
psl.fit(X_train, y_train)
print(f"Brier score: {psl.score(X_test, y_test, -1):.4f}")
"""
Brier score: 0.2438  (lower is better)
"""

df = psl.inspect(5)
print(df.to_string(index=False, na_rep="-", justify="center", float_format=lambda x: f"{x:.2f}"))
"""
 Stage Threshold  Score  T = -2  T = -1  T = 0  T = 1  T = 2  T = 3  T = 4  T = 5
  0            -     -       -       -   0.51      -      -      -      -      - 
  1     >-2.4245  2.00       -       -   0.00      -   0.63      -      -      - 
  2     >-0.9625 -1.00       -    0.00   0.00   0.48   1.00      -      -      - 
  3      >0.4368 -1.00    0.00    0.00   0.12   0.79   1.00      -      -      - 
  4     >-0.9133  1.00    0.00    0.00   0.12   0.12   0.93   1.00      -      - 
  5      >2.4648  2.00    0.00    0.00   0.07   0.07   0.92   1.00   1.00   1.00 
"""

scikit-psl's People

Contributors

jonashanselle, stheid


Forkers

emil1314

scikit-psl's Issues

Lookahead should respect predefined scores even if the solution is suboptimal

len_pre = min(len(set(predef_features) & remaining_features), len_)
len_rest = len_ - len_pre
if strict and predef_features:
    prefixes = [
        [f_ for f_ in predef_features if f_ in remaining_features][:len_pre]
    ]
else:
    prefixes = permutations(
        set(predef_features) & remaining_features, len_pre
    )

The lookahead should not consider any additional features for selection until the prefix is exhausted.

Maybe that is already the case, and maybe we need the lookahead regardless, because we might need to look ahead when selecting the score for the current feature.

This just needs to be checked, as it might save runtime.

Integrate beta calibration into our core repository

Since our change requests for the beta calibration repository are being ignored and PyPI packages can only have PyPI dependencies, we currently cannot publish our master version to PyPI.

My suggestion is to integrate the beta calibration (or at least the relevant code) into our own repository.

Representation as decision tree

A PSL can also be represented as a decision tree. This could serve both as a development and inspection tool and as an alternative to the tabular form in applications. There are certainly several possible representations, and implementations can vary as well. The graphviz package might be useful for an implementation; scikit-learn also offers an inspection method for trees, sklearn.tree.plot_tree.

Example:
[decision-tree visualization attached to the issue]
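
A minimal sketch of what such a rendering could look like, assuming only the inspect() table shown in the README above; the chain-per-stage layout and the helper name psl_to_graph are illustrative, not an implemented API:

from graphviz import Digraph

def psl_to_graph(psl, stages):
    """Render the first `stages` stages of a fitted PSL as a simple stage chain."""
    df = psl.inspect(stages)
    dot = Digraph(comment="Probabilistic scoring list")
    prev = None
    for _, row in df.iterrows():
        stage = int(row["Stage"])
        if stage == 0:
            continue  # stage 0 has no feature, threshold or score
        node_id = f"stage{stage}"
        dot.node(node_id, f"Stage {stage}\nfeature {row['Threshold']} -> score {row['Score']:+.0f}")
        if prev is not None:
            dot.edge(prev, node_id)
        prev = node_id
    return dot

# psl_to_graph(psl, 5).render("psl_stages", format="png")  # writes psl_stages.png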

Additional Hyperparams of the classifierAtK

The classifier at k should be made more versatile by:

  • allowing isotonic regression or logistic regression to be selected as the calibrator

In the PSL:

  • allow providing an arbitrary scoring function for optimizing the loss
  • forward all necessary parameters to the classifier at k in the form of a dictionary; that allows forwarding all parameters in an easily extensible way (see the interface sketch below)
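
A possible interface sketch for the dictionary-forwarding idea; the keyword arguments stage_clf_params, calibrator and loss are hypothetical names used for illustration, not the current API:

psl = ProbabilisticScoringList(
    {-1, 1, 2},
    stage_clf_params={"calibrator": "isotonic"},  # hypothetical: forwarded verbatim to the classifier at k
    loss="brier",                                 # hypothetical: pluggable scoring function
)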

Subset summation in Classifier at k is wrong

Currently, the set of total scores is calculated incorrectly if there are two or more identical scores.

E.g. scores = [3, 3] should yield the total scores [0, 3, 6], but it will probably yield [0, 3].

Maybe the implementation is actually correct right now, but it must be checked thoroughly.
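
A minimal reference sketch of a subset summation that handles duplicate scores correctly, against which the Classifier at k could be tested (the function name is illustrative):

from itertools import combinations

def total_scores(scores):
    """All achievable total scores, counting duplicate scores separately."""
    return sorted({sum(c) for r in range(len(scores) + 1) for c in combinations(scores, r)})

assert total_scores([3, 3]) == [0, 3, 6]  # not [0, 3]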

Train PSL via Rank loss

The StageCLF gets an additional fit function that

  • splits the data into positive samples P and negative samples N
  • creates a dataset D = P × N
  • lets the score function and binarizer use the average rank loss (see the sketch below):
    • h(p) < h(n) -> 1
    • h(p) = h(n) -> .5
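
A minimal sketch of the pairwise rank loss described above, assuming h_pos and h_neg are the stage classifier's scores for the positive and negative samples (the names are illustrative):

import numpy as np

def avg_rank_loss(h_pos, h_neg):
    """Average rank loss over all pairs in D = P x N: 1 if h(p) < h(n), 0.5 on ties."""
    diff = h_pos[:, None] - h_neg[None, :]          # pairwise differences h(p) - h(n)
    return np.mean((diff < 0) + 0.5 * (diff == 0))  # misranked pairs count 1, ties count 0.5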

Extend PSL to be capable of working with continuous datasets

In v0.2.0 the PSL can only work with binary datasets. There is a binarization class that allows for continuous datasets, but doing the binarization in a preprocessing step is probably not optimal.

Future extensions could:

  • allow for categorical inputs (producing set-style decision rules)
  • allow multiclass prediction

implement psl.describe(·)

It would be nice to have a describe function that can print the model up to a given stage:

def describe(self, max_stage=-1, feature_names=None):
    # proposed stub: print the model up to max_stage, optionally using feature names
    pass


psl = ProbabilisticScoringList().fit(X,y)
psl.describe()

[example output screenshot attached to the issue]
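
A minimal sketch, assuming describe() simply reuses the existing inspect() table; feature-name handling is omitted because the inspect() output shown in the README does not expose a feature column:

def describe(psl, max_stage, feature_names=None):
    """Print the fitted scoring list up to max_stage (standalone sketch, not the final API)."""
    df = psl.inspect(max_stage)
    print(df.to_string(index=False, na_rep="-", justify="center",
                       float_format=lambda x: f"{x:.2f}"))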

implement expected entropy as function of y_true and y_proba instead of X

# current implementation (computed from X)
def _expected_entropy(self, X):
    total_scores, score_freqs = np.unique(X @ self.scores_, return_counts=True)
    true_proba = self.regressor.transform(total_scores)
    entropy_values = entropy([1 - true_proba, true_proba], base=2)
    return np.sum((score_freqs / X.size) * entropy_values)

# proposed replacement (computed from the predicted probabilities)
def _expected_entropy(y_true, y_proba):
    proba, score_freqs = np.unique(y_proba, return_counts=True)
    entropy_values = entropy([1 - proba, proba], base=2)
    return entropy_values @ (score_freqs / len(y_proba))

There are two cases to prove:

  1. All scores map to different probabilities (true for strictly monotonic regressors like the sigmoid, often true for isotonic regression).
     • In this case the frequencies are exactly the same and the result is identical.
  2. Some scores coincide on the same probability.
     • In this case the entropy values are exactly the same; the groups of scores that map to the same probability are merged and their weights are added up as well, so it should still give the same result.

The expected entropy should even simplify to:

def _expected_entropy(y_true, y_proba):
    return entropy([1 - y_proba, y_proba], base=2).mean()
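
A quick numerical check of the claimed equivalence, assuming y_proba holds the per-sample positive-class probabilities produced by the stage classifier:

import numpy as np
from scipy.stats import entropy

y_proba = np.array([0.1, 0.1, 0.6, 0.6, 0.6, 0.9])

# grouped form: unique probabilities weighted by their relative frequencies
proba, counts = np.unique(y_proba, return_counts=True)
grouped = entropy([1 - proba, proba], base=2) @ (counts / len(y_proba))

# per-sample form: mean binary entropy over all samples
per_sample = entropy([1 - y_proba, y_proba], base=2).mean()

assert np.isclose(grouped, per_sample)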

Use cachetools to cache already evaluated models during lookahead

https://stackoverflow.com/questions/30730983/make-lru-cache-ignore-some-of-the-function-arguments

When extracting the actual model evaluation from the _optimize function, we need to pass the training data and so on to this cached function. However, we don't want the cache to check the training data for equality.

  • A better way would be to use this library to ignore the irrelevant parameters of the cached function,
  • and to always clear the cache at the beginning of the fit method.

Maybe it is even more elegant to define that function as a closure inside the fit function, so we can be sure the cache is clean. The problem here is that this might interfere with joblib. A possible setup is sketched below.
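
A possible setup sketched with cachetools; the function name _evaluate_stage, its parameters and the dummy loss are illustrative, only the key handling and the cache clearing matter here:

import numpy as np
from cachetools import LRUCache, cached
from cachetools.keys import hashkey

_stage_cache = LRUCache(maxsize=2048)

# the key ignores X and y, so the (large, unhashable) training arrays are never compared
@cached(_stage_cache, key=lambda feature, score, threshold, X, y: hashkey(feature, score, threshold))
def _evaluate_stage(feature, score, threshold, X, y):
    # placeholder loss; the real code would fit and score the classifier at k here
    return float(np.mean((X[:, feature] > threshold) != y))

# at the beginning of fit(): drop results from any previous dataset
_stage_cache.clear()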

Feature Dropout during psl learning

Idea

Sometimes features of the dataset are not available at runtime. Hence the PSL should be more robust when learning from such features in case they are unavailable; also, a default score must be assigned if the feature really does become unavailable!

An idea would be to flag features as "potentially missing" at runtime, and perhaps the PSL could also be made robust to those features not being set in the training data.

One-shot Optimization

Currently we greedily select one feature after the other by evaluating all remaining features in every step.

An idea would be to evaluate each feature-score pair only once, then fit a global ordering of the features and only refit the scores in each iteration, or something similar.

Basically, see how well the algorithm performs with fewer compute resources.

[optional] optimization of internal threshold calculation

Currently we optimize the feature, score and threshold in the following way:

for each feature
  for each score
    optimize threshold

However, the threshold is independent of the score, so we could first optimize the threshold for the feature (depending on the context of already selected features) and only then go into the score loop:

for each feature
  optimize threshold
  for each score
    evaluate score
