imodels-data's Introduction

imodels🔍 data

Tabular data for various problems, especially for high-stakes rule-based modeling with the imodels package.

See also https://huggingface.co/imodels

Includes the following datasets and more (see notebooks for more details on the datasets).

To download, use the "Name" field as the key: e.g. imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels').

Name                   Samples  Features  Class 0  Class 1  Majority class %
heart                      270        15      150      120              55.6
breast_cancer              277        17      196       81              70.8
haberman                   306         3       81      225              73.5
credit_g                  1000        60      300      700              70.0
csi_pecarn_prop           3313        97     2773      540              83.7
csi_pecarn_pred           3313        39     2773      540              83.7
juvenile_clean            3640       286     3153      487              86.6
compas_two_year_clean     6172        20     3182     2990              51.6
enhancer                  7809        80     7115      694              91.1
fico                     10459        23     5000     5459              52.2
iai_pecarn_prop          12044        73    11841      203              98.3
iai_pecarn_pred          12044        58    11841      203              98.3
credit_card_clean        30000        33    23364     6636              77.9
tbi_pecarn_prop          42428       223    42052      376              99.1
tbi_pecarn_pred          42428       121    42052      376              99.1
readmission_clean       101763       150    54861    46902              53.9
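
The last column follows directly from the two class counts: it is the share of the larger class, which is also the accuracy of a model that always predicts that class. A minimal sketch of the arithmetic, where the function name `majority_class_pct` is made up for illustration and is not part of imodels:

```python
def majority_class_pct(class_0: int, class_1: int) -> float:
    """Share of the larger class, as a percentage rounded to one decimal."""
    total = class_0 + class_1
    return round(100 * max(class_0, class_1) / total, 1)


# Counts taken from the table above.
print(majority_class_pct(150, 120))    # heart -> 55.6
print(majority_class_pct(3182, 2990))  # compas_two_year_clean -> 51.6
```

For the heavily imbalanced datasets (for example the PECARN ones, at 98 to 99%), this number is a useful baseline to keep in mind when reading accuracy figures.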

Data usage

First, install the imodels package: pip install imodels. Then, use the imodels.get_clean_dataset function.

imodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') -> Tuple[numpy.ndarray, numpy.ndarray, list]
"""
Fetch clean data (as numpy arrays) from various sources including imodels, pmlb, openml, and sklearn. If the data has not been downloaded yet, it is downloaded and cached; otherwise it is loaded from the local copy.

Parameters
----------
dataset_name: str
    dataset_name - unique dataset identifier
data_source: str
    options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'
data_path: str
    path to load/save data (default: 'data')

Returns
-------
X: np.ndarray
    features
y: np.ndarray
    outcome
feature_names: list
"""

Example

import imodels

# download compas dataset from imodels
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')
# download ionosphere dataset from pmlb
X, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')
# download liver dataset from openml
X, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')
# download ca housing from sklearn
X, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')
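
To check a loaded classification dataset against the summary table, a small helper along these lines may be useful. It is a sketch: `summarize_dataset` and its `loader` argument are hypothetical names introduced here, and it assumes integer class labels in `y` (so it does not apply to regression data such as `california_housing`).

```python
from collections import Counter


def summarize_dataset(name, data_source="imodels", loader=None):
    """Return (samples, features, class counts) for one classification dataset."""
    if loader is None:
        # Imported lazily so the helper can be exercised without imodels installed.
        import imodels
        loader = imodels.get_clean_dataset
    X, y, feature_names = loader(name, data_source=data_source)
    counts = Counter(int(label) for label in y)
    return len(X), len(feature_names), dict(counts)


# Downloads the data on first use, so this needs network access:
# print(summarize_dataset("compas_two_year_clean"))
```

The returned tuple can be compared with the Samples, Features, Class 0, and Class 1 columns of the table.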

Data info

Data comes from various sources - please cite those sources appropriately.

notebooks_fetch_data contains notebooks which download and preprocess the data

data_cleaned contains the cleaned csv file for each dataset
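
If you work with a cleaned csv directly instead of going through imodels.get_clean_dataset, something like the following sketch could split it into features and outcome. It assumes a header row, numeric values in every cell, and the outcome in the last column; none of these is stated above, so check the file first. The function name `read_cleaned_csv` and the demo values are made up.

```python
import csv
import io


def read_cleaned_csv(file_obj):
    """Split a cleaned CSV into (X, y, feature_names).

    Assumes a header row, numeric cells, and the outcome in the LAST column.
    """
    reader = csv.reader(file_obj)
    header = next(reader)
    rows = [[float(v) for v in row] for row in reader if row]
    X = [row[:-1] for row in rows]
    y = [row[-1] for row in rows]
    return X, y, header[:-1]


# Tiny in-memory stand-in for a file from data_cleaned (values are made up).
demo = io.StringIO("feat_a,feat_b,outcome\n25,0,0\n41,3,1\n")
X, y, feature_names = read_cleaned_csv(demo)
print(feature_names)  # ['feat_a', 'feat_b']
```

For a real file, pass an open file handle, e.g. `open(path, newline="")`, in place of the `io.StringIO` object.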

Clinical decision-rule (PECARN) datasets

To use any of the clinical decision-rule datasets, you must first accept the research data use agreement here.

There are two versions of each PECARN (TBI, IAI, and CSI) dataset.

  • prop: missing values have not been imputed
  • pred: missing values have been imputed

csi_pecarn_pred.csv note: unlike the rest of the datasets in this repo, which are fully cleaned, csi_pecarn_pred.csv contains a variable ("SITE") that should be removed before fitting models.
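
One way to drop that variable is to compute which columns to keep from the feature names. The sketch below assumes the column is named exactly "SITE"; if it was encoded into several columns the names may differ, so inspect `feature_names` first. `columns_without` is a hypothetical helper, not an imodels function.

```python
def columns_without(feature_names, drop=("SITE",)):
    """Return (indices to keep, remaining names) after excluding `drop`."""
    keep = [i for i, name in enumerate(feature_names) if name not in drop]
    return keep, [feature_names[i] for i in keep]


# Made-up feature names, for illustration only.
names = ["feat_a", "SITE", "feat_b"]
keep, kept_names = columns_without(names)
print(keep)        # [0, 2]
print(kept_names)  # ['feat_a', 'feat_b']
```

With the NumPy array `X` returned by imodels.get_clean_dataset, `X[:, keep]` then selects the remaining columns.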

Dataset     Task                                                                    Size                             References
iai_pecarn  Predict intra-abdominal injury requiring acute intervention before CT  12,044 patients, 203 with IAI-I  📄, 🔗
tbi_pecarn  Predict traumatic brain injuries before CT                              42,412 patients, 376 with ciTBI  📄, 🔗
csi_pecarn  Predict cervical spine injury in children                               3,314 patients, 540 with CSI     📄, 🔗

Miscellaneous notes

The breast_cancer dataset here is not the extremely common Wisconsin breast-cancer dataset but rather this dataset from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.

Some other cool datasets:

  • moleculenet - benchmarks for molecular datasets
  • srbench - benchmarking for symbolic regression
  • big-bench - language modeling benchmarks

imodels-data's Issues

Add FICO dataset

Fair Isaac (FICO) credit risk dataset (FICO et al., 2018) used for the Explainable ML Challenge
