
veridical-flow's Introduction

vflow logo

A library for making stability analysis simple. Easily evaluate the effect of judgment calls on your data-science pipeline (e.g. the choice of imputation strategy)!


Why use vflow?

Using vflow's simple wrappers facilitates many best practices for data science, as laid out in the predictability, computability, and stability (PCS) framework for veridical data science. The goal of vflow is to make it easy to build data-science pipelines that follow PCS by providing intuitive low-code syntax, efficient and flexible computational backends via Ray, and well-documented, reproducible experimentation via MLflow.

  • Computation: automatic parallelization and caching throughout the pipeline

  • Reproducibility: automatic experiment tracking and saving

  • Prediction: filter the pipeline by training and validation performance

  • Stability: replace a single function (e.g. preprocessing) with a set of functions and easily assess the stability of downstream results

Below is a simple example of an entire data-science pipeline with several perturbations (e.g. different data subsamples, models, and metrics) written using vflow.

import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from vflow import Vset, init_args

# initialize data
X, y = make_classification()
X_train, X_test, y_train, y_test = init_args(
    train_test_split(X, y),
    names=["X_train", "X_test", "y_train", "y_test"],  # optionally name the args
)

# subsample data
subsampling_funcs = [sklearn.utils.resample for _ in range(3)]
subsampling_set = Vset(
    name="subsampling", vfuncs=subsampling_funcs, output_matching=True
)
X_trains, y_trains = subsampling_set(X_train, y_train)

# fit models
models = [LogisticRegression(), DecisionTreeClassifier()]
modeling_set = Vset(name="modeling", vfuncs=models, vfunc_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)
preds_test = modeling_set.predict(X_test)

# get metrics
binary_metrics_set = Vset(
    name="binary_metrics",
    vfuncs=[accuracy_score, balanced_accuracy_score],
    vfunc_keys=["Acc", "Bal_Acc"],
)
binary_metrics = binary_metrics_set.evaluate(preds_test, y_test)

Once we've written this pipeline, we can easily measure the stability of metrics (e.g. "Accuracy") with respect to our choice of subsampling or model, as sketched below.
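For example, one way to summarize this stability is to convert the metrics dict into a DataFrame and aggregate over the perturbations. This is a minimal sketch assuming vflow's dict_to_df and perturbation_stats helpers (their exact names and signatures may differ slightly between versions):

from vflow import dict_to_df, perturbation_stats

# one row per pipeline path (subsample x model x metric)
df = dict_to_df(binary_metrics)

# mean/std of each metric across the subsampling and modeling perturbations
stats = perturbation_stats(df, "subsampling", "modeling")
print(stats)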

Documentation

See the docs for an API reference.

Notebook examples

Note that some of these notebooks require additional dependencies beyond those required for vflow. To install them all, run pip install vflow[nb].

Synthetic classification

Enhancer genomics

fMRI voxel prediction

Fashion MNIST classification

Feature importance stability

Clinical decision rule vetting

Installation

Stable version

pip install vflow

Development version (unstable)

pip install vflow@git+https://github.com/Yu-Group/veridical-flow

References

@software{duncan2020vflow,
   author = {Duncan, James and Kapoor, Rush and Agarwal, Abhineet and Singh, Chandan and Yu, Bin},
   doi = {10.21105/joss.03895},
   month = {1},
   title = {{VeridicalFlow: a Python package for building trustworthy data science pipelines with PCS}},
   url = {https://doi.org/10.21105/joss.03895},
   year = {2022}
}

veridical-flow's People

Contributors

aagarwal1996, csinva, danielskatz, jbytecode, jpdunc23, kmichael08, matthewfeickert, rushk014, ssaxena00


veridical-flow's Issues

perturbation_stats unclear on mismatched Subkeys

It is unclear how perturbation_stats should handle multiple Subkeys with the same origin (and thus the same column name in df).
Currently, attempting to group on a duplicated column throws ValueError: Grouper for 'subsample' not 1-dimensional.
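This error comes from pandas itself; here is a minimal illustration outside vflow (hypothetical data, not vflow code):

import pandas as pd

# grouping on a duplicated column name is not allowed in pandas
df = pd.DataFrame([[0, 1], [0, 2]], columns=['subsample', 'subsample'])
df.groupby('subsample')  # ValueError: Grouper for 'subsample' not 1-dimensional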

An illustrative example of this issue arises if we take the exact example pipeline from #35 and attempt to use a single subsample Vset with output_matching=False (so the X_trains/X_tests will match properly) instead of the two. Now if we want to predict with uncertainty over subsamples, it is unclear what this means. I think there are two options:

  • My initial thought was that we could implement a way to distinguish identical mismatched Subkeys (maybe by appending -i)
  • Alternatively/additionally, we could try to support multidimensional grouping in perturbation_stats

Illustrative Example

from functools import partial

import numpy as np
import sklearn.datasets
import sklearn.utils
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from vflow import Vset, init_args  # PREV_KEY's import path may vary by vflow version

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=5)
X_train, X_test, y_train, y_test = init_args(train_test_split(X, y), names=['xtr', 'xte', 'ytr', 'yte'])

subsampling_funcs = [partial(sklearn.utils.resample, n_samples=80, random_state=i) for i in range(5)]
subsampling_set = Vset(name='subsample', modules=subsampling_funcs)
X_trains, y_trains = subsampling_set(X_train, y_train)
X_tests, y_tests = subsampling_set(X_test, y_test)

models = [LogisticRegression(max_iter=1000, tol=0.1), DecisionTreeClassifier()]
modeling_set = Vset(name='model', modules=models, module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)

# mean predictions over test-set subsamples, rounded to class labels
mean_dict, std_dict, pred_stats_df = modeling_set.predict(X_tests, with_uncertainty=True, group_by=['subsample'])
mean_dict = {k: np.round(v) if k != PREV_KEY else v for k, v in mean_dict.items()}

[JOSS Review] Example Notebooks

(as part of: openjournals/joss-reviews#3895)

I noticed some issues with the example notebooks that I summarized here:

  • In general, it would be very beneficial to have a bit more explanation in the example notebooks in order to better understand the cool features and advantages of VeridicalFlow.

The 00_synthetic_classification.ipynb notebook raises the following exception when executed: module 'sklearn' has no attribute 'datasets'. My suggestion would be to add the line import sklearn.datasets. Furthermore, the function sklearn.datasets.load_boston is deprecated:

    'load_boston' is deprecated in 1.0 and will be removed in 1.2.  
    

    Maybe you want to consider using another dataset in this example to keep it working with future scikit-learn versions?

  • Another general improvement suggestion would be to include the example notebooks directly in the documentation instead of linking to the notebooks in the GitHub repo.

    Your current solution implies that the example notebooks always have to be fully executed before committing to the repository so that their output is included. However, this has some drawbacks:

    • In general, it's discouraged to commit notebooks with their output since every execution would change the notebook file
    • And, of course, you would always need to remember to actually run the notebooks before committing.

    For building docs with Sphinx this could be done using nbsphinx. To be honest, I don't know how it works with pdoc, but I'm sure similar solutions exist.

prediction_uncertainty breaks subkey matching


Currently, prediction_uncertainty uses dict_to_df to perform aggregation and then recreates Subkeys.
As a result, the recreated Subkeys lose their prior _output_matching and _sep_dicts_id information. This can break pipelines that rely on mean_dict/std_dict in later parts of the pipeline.

Reproducible Example

from functools import partial

import numpy as np
import sklearn.datasets
import sklearn.utils
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

from vflow import Vset, init_args  # PREV_KEY's import path may vary by vflow version

X, y = sklearn.datasets.make_classification(n_samples=100, n_features=5)
X_train, X_test, y_train, y_test = init_args(train_test_split(X, y), names=['xtr', 'xte', 'ytr', 'yte'])

subsampling_funcs = [partial(sklearn.utils.resample, n_samples=80, random_state=i) for i in range(5)]
subsampling_set = Vset(name='subsample', modules=subsampling_funcs, output_matching=True)
X_trains, y_trains = subsampling_set(X_train, y_train)

subsampling_set_test = Vset(name='subsample_test', modules=subsampling_funcs, output_matching=True)
X_tests, y_tests = subsampling_set_test(X_test, y_test)

models = [LogisticRegression(max_iter=1000, tol=0.1), DecisionTreeClassifier()]
modeling_set = Vset(name='model', modules=models, module_keys=["LR", "DT"])
modeling_set.fit(X_trains, y_trains)

# mean predictions over test-set subsamples, rounded to class labels
mean_dict, std_dict, pred_stats_df = modeling_set.predict(X_tests, with_uncertainty=True, group_by=['subsample_test'])
mean_dict = {k: np.round(v) if k != PREV_KEY else v for k, v in mean_dict.items()}

# binary_metrics_set as defined in the README example above
failed_metrics = binary_metrics_set.evaluate(mean_dict, y_tests)
failed_metrics

failed_metrics should have matched mean_dict keys of the form (subsample_test_0, ) with y_tests keys of the form (yte, subsample_test_0), but it cannot because the recreated Subkeys have different _sep_dicts_id values.

Expected Output

{(subsample_test_0, yte, Acc): 0.925,
 (subsample_test_1, yte, Acc): 0.9125,
 (subsample_test_2, yte, Acc): 0.8875,
 (subsample_test_3, yte, Acc): 0.9,
 (subsample_test_4, yte, Acc): 0.9,
 (subsample_test_0, yte, Bal_Acc): 0.9268292682926829,
 (subsample_test_1, yte, Bal_Acc): 0.9,
 (subsample_test_2, yte, Bal_Acc): 0.8902439024390244,
 (subsample_test_3, yte, Bal_Acc): 0.9024390243902439,
 (subsample_test_4, yte, Bal_Acc): 0.9069767441860466, ...}

Actual Output

{(subsample_test_0, yte, subsample_test_0, Acc): 0.9125,
 (subsample_test_0, yte, subsample_test_1, Acc): 0.5875,
 (subsample_test_0, yte, subsample_test_2, Acc): 0.4875,
 (subsample_test_0, yte, subsample_test_3, Acc): 0.475,
 (subsample_test_0, yte, subsample_test_4, Acc): 0.6,
 (subsample_test_1, yte, subsample_test_0, Acc): 0.625,
 (subsample_test_1, yte, subsample_test_1, Acc): 0.925,
 (subsample_test_1, yte, subsample_test_2, Acc): 0.525,
 (subsample_test_1, yte, subsample_test_3, Acc): 0.4375,
 (subsample_test_1, yte, subsample_test_4, Acc): 0.5375, ... }

[JOSS Review] Code

(as part of: openjournals/joss-reviews#3895)

When looking at the code there are some points you might want to address which would make usage of the package even better:

  • For some functions and classes, docstrings are missing.
  • In general, the docstring formatting is inconsistent; some docstrings start with lowercase letters, some with capital letters, some have a Short Summary at the beginning, some don't.
  • To easily detect those issues and to further improve code quality I would recommend using a code analysis tool such as prospector, which includes pylint for linting and pep8 for checking against style conventions. This could easily be integrated into your existing workflow and also works well with GitHub Actions. This would also make it easier for contributors to get up to speed quickly with VeridicalFlow.

Error when calling `fit_transform` for `Vset` with `is_async=True`

In the example below, when using a Vset with is_async=True, the transform method expects to get a ray.ObjectRef and calls ray.get on it, but instead gets an array:

from vflow import build_vset, init_args

import numpy as np

from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

import ray

ray.init(num_cpus=4)

X, y = make_regression(n_samples=1000, n_features=100, n_informative=1)

X_trainval, X_test, y_trainval, y_test = train_test_split(X, y)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval)

X_train, y_train = init_args([X_train, y_train], names=['X_train', 'y_train'])
X_val, y_val = init_args([X_val, y_val], names=['X_val', 'y_val'])

# create a Vset for bootstrapping from data 10 times
# we use lazy=True so that the data will not be resampled until needed
boot_set = build_vset('boot', resample, reps=10, lazy=True)

# bootstrap from the training data by calling boot_set
X_trains, y_trains = boot_set(X_train, y_train)

# hyperparameters to try
pca_params = {
    'n_components': [10, 20, 50],
    'svd_solver': ['randomized', 'full', 'auto']
}

# we could instead pass a list of distinct models and corresponding param dicts
pca_set = build_vset('PCA', PCA, pca_params, is_async=True)

X_trains_pca = pca_set.fit_transform(X_trains)
TypeError: Attempting to call `get` on the value [[-0.73763296 -1.64044139 -0.74793088 ... -0.1085027  -0.25652127
   0.11583096]
...

See #50 for a possible workaround until this is fixed.
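A generic defensive pattern for this kind of mismatch (a sketch only, not vflow's actual fix and not necessarily the workaround in #50) is to resolve Ray references only when the value really is one:

import ray

def maybe_get(value):
    # resolve Ray object references; pass ordinary values through unchanged
    return ray.get(value) if isinstance(value, ray.ObjectRef) else value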

Test xfails

As far as I understand xfail tests, the marker is supposed to be a temporary flag rather than a permanent test for a failing case.
Could you explain why you are using it here?

# this test is expected to fail because combine_dicts() makes the
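For context, a minimal sketch of how pytest's xfail marker is typically used (illustrative only, not the actual test from this repository):

import pytest

# xfail marks a test that is expected to fail: pytest reports it as "xfailed"
# rather than as a failure, and as "xpassed" if it unexpectedly passes
@pytest.mark.xfail(reason="documents a known limitation")
def test_known_limitation():
    assert 1 + 1 == 3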

Pass list of Callables and list of param_dicts to build_vset

Hello again.

I'm having trouble passing a list of Callables and a list of param_dicts to build_vset. The following error occurs: obj must be callable.
I'm passing a list of sklearn models and a corresponding list of parameter dicts. According to the documentation, this should work.
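A minimal sketch of the failing call (the model and parameter choices here are illustrative, not from the original report):

from vflow import build_vset
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

models = [LogisticRegression, DecisionTreeClassifier]
param_dicts = [
    {'C': [0.1, 1.0]},
    {'max_depth': [3, 5]},
]

# per the documentation this should create one vfunc per model/parameter
# combination, but it currently fails with: obj must be callable
model_set = build_vset('models', models, param_dicts)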

Build_vset not building vfuncs properly

A call to build_vset fails to set the vfuncs attribute of the Vset correctly: it is not combining the parameters in param_dict into all combinations.

from sklearn.ensemble import RandomForestRegressor

from vflow import build_vset

param_dict = {
    'n_estimators': [100, 200, 300],
    'min_samples_split': [2, 10],  # default value comes first
    'max_features': ['sqrt', 'log2']
}
rf_set = build_vset('RF', RandomForestRegressor, param_dict, criterion = 'absolute_error')
assert len(rf_set.modules) == 3*2*2

Returns

AssertionError                            Traceback (most recent call last)
Input In [2], in <module>
      1 param_dict = {
      2     'n_estimators': [100, 200, 300],
      3     'min_samples_split': [2, 10],  # default value comes first
      4     'max_features': ['sqrt', 'log2']
      5 }
      6 rf_set = build_vset('RF', RandomForestRegressor, param_dict, criterion = 'absolute_error')
----> 7 assert len(rf_set.modules) == 3*2*2
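For reference, the expected number of vfuncs is the size of the Cartesian product of the parameter lists; a quick sanity check independent of vflow:

from itertools import product

param_dict = {
    'n_estimators': [100, 200, 300],
    'min_samples_split': [2, 10],
    'max_features': ['sqrt', 'log2'],
}

# one vfunc per combination of the three parameter lists
expected_n_vfuncs = len(list(product(*param_dict.values())))
assert expected_n_vfuncs == 3 * 2 * 2  # 12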
