The complicated_cdi_prediction from mld3

complicated_cdi_prediction's Introduction

Overview

This is the code repository for the manuscript "Using Machine Learning and the Electronic Health Record to Predict Complicated Clostridium difficile Infection".
It implements a pipeline for cross-validating, training, and evaluating a regularized logistic regression model.

import helpers_log_reg as log_reg
Call function: log_reg.do_log_reg(X, y, feature_dict, train_indices, test_indices, C_range, k_best_range, n_random_iters, random_split)
- Input: feature matrix, labels, indices (for temporal data splits), hyperparameter ranges, and settings for random data splits
- Does: five-fold cross-validation and model training
- Output: optimal hyperparameters, model performance, and figures
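
For readers unfamiliar with the term, five-fold cross-validation can be sketched as below. This is a generic illustration using only the standard library, not the code in helpers_log_reg. The helper name k_fold_indices and the contiguous fold assignment are assumptions; the repository may shuffle, stratify, or order its folds differently.

```python
def k_fold_indices(n_examples, n_folds=5):
    """Yield (train, validation) index lists for k-fold cross-validation.

    Hypothetical helper for illustration only; folds are contiguous blocks.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_examples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        # Every index outside the held-out block is used for training.
        train = list(range(0, start)) + list(range(start + size, n_examples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_examples=10, n_folds=5))
for train, validation in folds:
    print("validation:", validation, "train:", train)
```

Each example is held out exactly once, so every candidate hyperparameter setting is scored on data the model did not see during fitting.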

X: n x d feature matrix, where n = number of examples and d = number of features
y: label vector of length n. The i-th row (example) in X must correspond to the i-th label in y
feature_dict: for random splits. You can pass in a dictionary of features with key=index and value=feature name for analysis of the most important features
train_indices: for temporal splits. These must be the first 80% (or 50%, 60%, 90%, etc.) of the rows in X for the temporal cross-validation to work correctly. For example, with an 80/20 split of a 100x20 X, train_indices must be the first 80 rows
test_indices: for temporal splits. These must be the last 20% (or 50%, 40%, 10%, etc.) of the rows in X
C_range: list of values of the regularization hyperparameter C to test (L2 regularization)
k_best_range: list of values of the hyperparameter k to test for chi-square feature selection. k_best_range = [X.shape[1]] means no filter feature selection
n_random_iters: for random splits. Number of random split experiments to run
random_split: True for random splits, False for a (single) temporal split
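
The arguments described above could be prepared along the following lines. This is a minimal sketch under stated assumptions: temporal_split_indices is a hypothetical helper, the feature names and hyperparameter values are made up, and the rows of X are assumed to be in time order already. Whether do_log_reg expects plain lists or NumPy arrays, and whether it searches C and k jointly, is not stated here, so check helpers_log_reg before relying on this.

```python
from itertools import product


def temporal_split_indices(n_rows, train_fraction=0.8):
    """Return (train_indices, test_indices) for a temporal split.

    Hypothetical helper: assumes the rows of X are already in time order,
    so the earliest rows train the model and the latest rows test it.
    """
    n_train = int(round(n_rows * train_fraction))
    train_indices = list(range(n_train))
    test_indices = list(range(n_train, n_rows))
    return train_indices, test_indices


# Illustrative shape only: a 100 x 20 feature matrix.
n_rows, n_features = 100, 20
train_indices, test_indices = temporal_split_indices(n_rows, 0.8)

# feature_dict maps column index -> feature name (names here are made up).
feature_names = ["feature_%d" % j for j in range(n_features)]
feature_dict = dict(enumerate(feature_names))

# Candidate hyperparameters; the specific values are assumptions.
C_range = [0.01, 0.1, 1.0]
k_best_range = [10, n_features]  # k = n_features keeps every feature

# The (C, k) pairs a joint grid search would visit.
grid = list(product(C_range, k_best_range))
print(len(train_indices), len(test_indices), len(grid))
```

With these values in hand, the call takes the form shown earlier, with random_split=False for the temporal split or random_split=True together with n_random_iters for random splits.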