imodels🔍 data

Tabular data for various problems, especially for high-stakes rule-based modeling with the imodels package.

Includes the following datasets and more (see notebooks for more details on the datasets).

To download, use the "Name" field as the key: e.g. imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels').

Name	Samples	Features	Class 0	Class 1	Majority class %
heart	270	15	150	120	55.6
breast_cancer	277	17	196	81	70.8
haberman	306	3	81	225	73.5
credit_g	1000	60	300	700	70
csi_pecarn_prop	3313	97	2773	540	83.7
csi_pecarn_pred	3313	39	2773	540	83.7
juvenile_clean	3640	286	3153	487	86.6
compas_two_year_clean	6172	20	3182	2990	51.6
enhancer	7809	80	7115	694	91.1
fico	10459	23	5000	5459	52.2
iai_pecarn_prop	12044	73	11841	203	98.3
iai_pecarn_pred	12044	58	11841	203	98.3
credit_card_clean	30000	33	23364	6636	77.9
tbi_pecarn_prop	42428	223	42052	376	99.1
tbi_pecarn_pred	42428	121	42052	376	99.1
readmission_clean	101763	150	54861	46902	53.9

Data usage

First, install the imodels package: pip install imodels. Then, use the imodels.get_clean_dataset function.

imodels.get_clean_dataset(dataset_name: str, data_source: str = 'imodels', data_path='data') ‑> Tuple[numpy.ndarray, numpy.ndarray, list]
"""
Fetch clean data (as numpy arrays) from various sources including imodels, pmlb, openml, and sklearn. If data is not downloaded, will download and cache. Otherwise will load locally

Parameters
----------
dataset_name: str
    dataset_name - unique dataset identifier
data_source: str
    options: 'imodels', 'pmlb', 'sklearn', 'openml', 'synthetic'
data_path: str
    path to load/save data (default: 'data')

Returns
-------
X: np.ndarray
    features
y: np.ndarray
    outcome
feature_names: list
"""

Example

# download compas dataset from imodels
X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')
# download ionosphere dataset from pmlb
X, y, feature_names = imodels.get_clean_dataset('ionosphere', data_source='pmlb')
# download liver dataset from openml
X, y, feature_names = imodels.get_clean_dataset('8', data_source='openml')
# download ca housing from sklearn
X, y, feature_names = imodels.get_clean_dataset('california_housing', data_source='sklearn')

Data info

Data comes from various sources - please cite those sources appropriately.

notebooks_fetch_data contains notebooks which download and preprocess the data

data_cleaned contains the cleaned csv file for each dataset

Clinical decision-rule (PECARN) datasets

To use any of the clinical decision-rule datasets, you must first accept the research data use agreement here.

There are two versions of each PECARN (TBI, IAI, and CSI) dataset.

prop: missing values have not been imputed
pred: missing values have been imputed

csi_pecarn_pred.csv note: unlike the rest of the datasets in this repo, which are fully cleaned, csi_pecarn_pred.csv contains a variable ("SITE") that should be removed before fitting models.

Dataset	Task	Size	References
iai_pecarn	Predict intra-abdominal injury requiring acute intervention before CT	12,044 patients, 203 with IAI-I	📄, 🔗
tbi_pecarn	Predict traumatic brain injuries before CT	42,412 patients, 376 with ciTBI	📄, 🔗
csi_pecarn	Predict cervical spine injury in children	3,314 patients, 540 with CSI	📄, 🔗

Miscellaneous notes

The breast_cancer dataset here is not the extremely common Wisconsin breast-cancer dataset but rather this dataset from OpenML. Preprocessing (e.g. dropping missing values) results in the cleaned data having n=277, p=17, rather than the original n=286, p=9.

Some other cool datasets:

moleculenet - benchmarks for molecular datasets
srbench - benchmarking for symbolic regression
big-bench - language modeling benchmarks

csinva / imodels-data Goto Github PK

imodels-data's Introduction

imodels🔍 data

Data usage

Example

Data info

Clinical decision-rule (PECARN) datasets

Miscellaneous notes

imodels-data's People

Contributors

Stargazers

Watchers

Forkers

imodels-data's Issues

Add dataset metadata / links

Look at moleculenet

Add FICO dataset

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent