Input: feature matrix, labels, indices (for temporal data splits), hyperparameter ranges, and settings for random data splits
Does: five-fold cross-validation and model training
Output: optimal hyperparameters, model performance, and figures
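A minimal sketch of the kind of search summarized above, not the authors' implementation. It assumes scikit-learn, a logistic regression classifier (the source does not name the model), and non-negative features, which chi-square selection requires. The helper name `search_hyperparameters` and the pipeline step names are hypothetical.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def search_hyperparameters(X, y, C_range, k_best_range):
    """Hypothetical helper: five-fold grid search over C and k."""
    pipeline = Pipeline([
        # Filter step: keep the k features with the highest chi-square scores.
        ("select", SelectKBest(score_func=chi2)),
        # Assumed classifier; LogisticRegression uses an L2 penalty by default.
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    param_grid = {
        "select__k": list(k_best_range),
        "clf__C": list(C_range),
    }
    # cv=5 gives five-fold cross-validation for every (k, C) pair.
    search = GridSearchCV(pipeline, param_grid, cv=5)
    search.fit(X, y)
    return search
```

After fitting, `search.best_params_` holds the selected C and k, and `search.best_score_` holds the mean cross-validated score for that pair.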
Parameters
X: n x d feature matrix, where n = number of examples and d = number of features
y: n-length labels vector. The ith row (example) in X must correspond to the ith label in y
feature_dict: for random splits. Optional dictionary mapping feature index (key) to feature name (value), used to analyze the most important features
train_indices: for temporal splits. These must be the first 80% (or 50%, 60%, 90%, etc.) of the rows in X for the temporal cross-validation to work correctly. For example, if X is 100x20 and the split is 80/20, train_indices must be the first 80 rows
test_indices: for temporal splits. These must be the remaining rows at the end of X, i.e. the last 20% (or 50%, 40%, 10%, etc.)
C_range: list of values to test for C, the L2 regularization hyperparameter
k_best_range: list of values to test for k, the number of features kept by chi-square feature selection. k_best_range = [X.shape[1]] keeps every feature, i.e. no filter feature selection
n_random_iters: for random splits. Number of random-split experiments to run
random_split: True for random splits, False for (single) temporal splits
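The index requirements for temporal splits can be sketched as follows. This is an illustration under assumptions, not part of the original code: the helper name `temporal_split_indices` and its `train_fraction` argument are hypothetical, and the rows of X are assumed to already be in chronological order.

```python
def temporal_split_indices(n_rows, train_fraction=0.8):
    """Hypothetical helper: leading rows for training, trailing rows for testing."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    n_train = int(round(n_rows * train_fraction))
    # Training rows are the first block of X; test rows are everything after it.
    train_indices = list(range(n_train))
    test_indices = list(range(n_train, n_rows))
    return train_indices, test_indices


# Example: for a 100x20 X with an 80/20 split, rows 0-79 train and rows 80-99 test.
train_indices, test_indices = temporal_split_indices(100, train_fraction=0.8)
```

The two lists are contiguous and do not overlap, so no test row precedes a training row.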
Authors
Benjamin Y. Li, Jeeheh Oh, Vincent B. Young, Krishna Rao, and Jenna Wiens