
R/sl3: Super Machine Learning with Pipelines


A flexible implementation of the Super Learner ensemble machine learning system

Authors: Jeremy Coyle, Nima Hejazi, Ivana Malenica, Rachael Phillips, and Oleg Sofrygin


What’s sl3?

sl3 is an implementation of the Super Learner ensemble machine learning algorithm of van der Laan, Polley, and Hubbard (2007). The Super Learner algorithm performs ensemble learning in one of two fashions:

  1. The discrete Super Learner can be used to select the best prediction algorithm from among a supplied library of machine learning algorithms (“learners” in the sl3 nomenclature) – that is, the discrete Super Learner selects the single learning algorithm that minimizes the cross-validated risk.
  2. The ensemble Super Learner can be used to assign weights to a set of specified learning algorithms (from a user-supplied library of such algorithms) so as to create a combination of these learners that minimizes the cross-validated risk. This notion of weighted combinations has also been referred to as stacked regression (Breiman 1996) and stacked generalization (Wolpert 1992).

Looking for long-form documentation or a walkthrough of the sl3 package? Don’t worry! Just browse the chapter in our book.


Installation

Install the most recent version from the master branch on GitHub via remotes:

remotes::install_github("tlverse/sl3")

Past stable releases may be located via the releases page on GitHub and may be installed by including the appropriate major version tag. For example,

remotes::install_github("tlverse/sl3@v1.4.2")

To contribute, check out the devel branch and consider submitting a pull request.


Issues

If you encounter any bugs or have any specific feature requests, please file an issue.


Examples

sl3 makes it essentially trivial to apply screening algorithms and learning algorithms, combine both types of algorithms into a stacked regression model, and cross-validate this whole process. The best way to understand this is to see the sl3 package in action:

set.seed(49753)
library(tidyverse)
library(data.table)
library(SuperLearner)
library(origami)
library(sl3)

# load example data set
data(cpp)
cpp <- cpp %>%
  dplyr::filter(!is.na(haz)) %>%
  mutate_all(~ replace(., is.na(.), 0))

# use covariates of interest and the outcome to build a task object
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
task <- sl3_Task$new(
  data = cpp,
  covariates = covars,
  outcome = "haz"
)

# set up screeners and learners via built-in functions and pipelines
slscreener <- Lrnr_pkg_SuperLearner_screener$new("screen.glmnet")
glm_learner <- Lrnr_glm$new()
screen_and_glm <- Pipeline$new(slscreener, glm_learner)
SL.glmnet_learner <- Lrnr_pkg_SuperLearner$new(SL_wrapper = "SL.glmnet")

# stack learners into a model (including screeners and pipelines)
learner_stack <- Stack$new(SL.glmnet_learner, glm_learner, screen_and_glm)
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()
head(preds)
#>    Lrnr_pkg_SuperLearner_SL.glmnet Lrnr_glm_TRUE
#> 1:                       0.3525946    0.36298498
#> 2:                       0.3525946    0.36298498
#> 3:                       0.2442593    0.25993072
#> 4:                       0.2442593    0.25993072
#> 5:                       0.2442593    0.25993072
#> 6:                       0.0269504    0.05680264
#>    Pipeline(Lrnr_pkg_SuperLearner_screener_screen.glmnet->Lrnr_glm_TRUE)
#> 1:                                                            0.36228209
#> 2:                                                            0.36228209
#> 3:                                                            0.25870995
#> 4:                                                            0.25870995
#> 5:                                                            0.25870995
#> 6:                                                            0.05600958

Parallelization with futures

While it’s straightforward to fit a stack of learners (as above), it’s easy to take advantage of sl3’s built-in parallelization support too. To do this, you can simply choose a plan() from the future ecosystem.

# let's load the future package and set 4 cores for parallelization;
# note that plan(multicore) is not supported on Windows -- use
# plan(multisession) there instead
library(future)
plan(multicore, workers = 4L)

# now, let's re-train our Stack in parallel
stack_fit <- learner_stack$train(task)
preds <- stack_fit$predict()

Controlling the number of CV folds

In the above examples, we fit stacks of learners, but didn’t create a Super Learner ensemble, which uses cross-validation (CV) to build the ensemble model. For the sake of computational expedience, we may be interested in lowering the number of CV folds (from 10). Let’s take a look at how to do both below.

# first, let's instantiate some more learners and create a Super Learner
mean_learner <- Lrnr_mean$new()
rf_learner <- Lrnr_ranger$new()
sl <- Lrnr_sl$new(mean_learner, glm_learner, rf_learner)

# CV folds are controlled in the sl3_Task object; we can lower the number of
# folds simply by specifying this in creating the Task
task <- sl3_Task$new(
  data = cpp,
  covariates = covars,
  outcome = "haz",
  folds = 5L
)

# now, let's fit the Super Learner with just 5-fold CV, then get predictions
sl_fit <- sl$train(task)
sl_preds <- sl_fit$predict()

The folds argument to sl3_Task supports both integers (for V-fold CV) and all of the CV schemes supported in the origami package. To see the full list, query ?fold_funs from within R or take a look at origami’s online documentation.
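For example, clustered V-fold CV can be set up by building a folds object with origami and passing it to the task. This is a sketch assuming origami's make_folds()/folds_vfold interface and that the cpp data's subjid column identifies clusters; consult ?origami::make_folds for the authoritative signature.

```r
# clustered 5-fold CV: rows sharing a subjid stay in the same fold
# (assumes origami's make_folds()/folds_vfold API -- see ?make_folds)
library(origami)

folds <- make_folds(
  cluster_ids = cpp$subjid,
  fold_fun = folds_vfold,
  V = 5
)

task_clustered <- sl3_Task$new(
  data = cpp,
  covariates = covars,
  outcome = "haz",
  folds = folds
)
```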


Learner Properties

Properties supported by sl3 learners are presented in the following table:

| learner | binomial | categorical | continuous | cv | density | h2o | ids | importance | offset | preprocessing | sampling | screener | timeseries | weights | wrapper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lrnr_arima | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_bartMachine | √ | x | √ | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_bayesglm | √ | x | √ | x | x | x | x | x | √ | x | x | x | x | √ | x |
| Lrnr_bilstm | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_bound | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | √ | √ |
| Lrnr_caret | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | x | √ |
| Lrnr_cv | x | x | x | √ | x | x | x | x | x | x | x | x | x | x | √ |
| Lrnr_cv_selector | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | √ | √ |
| Lrnr_dbarts | √ | x | √ | x | x | x | x | x | x | x | x | x | x | √ | x |
| Lrnr_define_interactions | x | x | x | x | x | x | x | x | x | √ | x | x | x | x | x |
| Lrnr_density_discretize | x | x | x | x | √ | x | x | x | x | x | x | x | x | x | x |
| Lrnr_density_hse | x | x | x | x | √ | x | x | x | x | x | x | x | x | x | x |
| Lrnr_density_semiparametric | x | x | x | x | √ | x | x | x | x | x | √ | x | x | x | x |
| Lrnr_earth | √ | x | √ | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_expSmooth | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_gam | √ | x | √ | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_gbm | √ | x | √ | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_glm | √ | x | √ | x | x | x | x | x | √ | x | x | x | x | √ | x |
| Lrnr_glm_fast | √ | x | √ | x | x | x | x | x | √ | x | x | x | x | √ | x |
| Lrnr_glmnet | √ | √ | √ | x | x | x | √ | x | x | x | x | x | x | √ | x |
| Lrnr_grf | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | √ | x |
| Lrnr_gru_keras | √ | √ | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_gts | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_h2o_glm | √ | √ | √ | x | x | √ | x | x | √ | x | x | x | x | √ | x |
| Lrnr_h2o_grid | √ | √ | √ | x | x | √ | x | x | √ | x | x | x | x | √ | x |
| Lrnr_hal9001 | √ | x | √ | x | x | x | √ | x | x | x | x | x | x | √ | x |
| Lrnr_haldensify | x | x | x | x | √ | x | x | x | x | x | x | x | x | x | x |
| Lrnr_HarmonicReg | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_hts | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_independent_binomial | x | √ | x | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_lightgbm | √ | √ | √ | x | x | x | x | √ | √ | x | x | x | x | √ | x |
| Lrnr_lstm_keras | √ | √ | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_mean | √ | √ | √ | x | x | x | x | x | √ | x | x | x | x | √ | x |
| Lrnr_multiple_ts | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_multivariate | x | √ | x | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_nnet | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | √ | x |
| Lrnr_nnls | x | x | √ | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_optim | √ | √ | √ | x | x | x | x | x | √ | x | x | x | x | √ | x |
| Lrnr_pca | x | x | x | x | x | x | x | x | x | √ | x | x | x | x | x |
| Lrnr_pkg_SuperLearner | √ | x | √ | x | x | x | √ | x | x | x | x | x | x | √ | √ |
| Lrnr_pkg_SuperLearner_method | √ | x | √ | x | x | x | x | x | x | x | x | x | x | √ | √ |
| Lrnr_pkg_SuperLearner_screener | √ | x | √ | x | x | x | √ | x | x | x | x | x | x | √ | √ |
| Lrnr_polspline | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | √ | x |
| Lrnr_pooled_hazards | x | √ | x | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_randomForest | √ | √ | √ | x | x | x | x | √ | x | x | x | x | x | x | x |
| Lrnr_ranger | √ | √ | √ | x | x | x | x | √ | x | x | x | x | x | √ | x |
| Lrnr_revere_task | x | x | x | √ | x | x | x | x | x | x | x | x | x | x | √ |
| Lrnr_rpart | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | √ | x |
| Lrnr_rugarch | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_screener_augment | x | x | x | x | x | x | x | x | x | x | x | √ | x | x | x |
| Lrnr_screener_coefs | x | x | x | x | x | x | x | x | x | x | x | √ | x | x | x |
| Lrnr_screener_correlation | √ | √ | √ | x | x | x | x | x | x | x | x | √ | x | x | x |
| Lrnr_screener_importance | x | x | x | x | x | x | x | x | x | x | x | √ | x | x | x |
| Lrnr_sl | x | x | x | √ | x | x | x | x | x | x | x | x | x | x | √ |
| Lrnr_solnp | √ | √ | √ | x | x | x | x | x | √ | x | x | x | x | √ | x |
| Lrnr_solnp_density | x | x | x | x | √ | x | x | x | x | x | x | x | x | x | x |
| Lrnr_stratified | √ | x | √ | x | x | x | x | x | x | x | x | x | x | x | √ |
| Lrnr_subset_covariates | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_svm | √ | √ | √ | x | x | x | x | x | x | x | x | x | x | x | x |
| Lrnr_ts_weights | x | x | x | √ | x | x | x | x | x | x | x | x | x | x | √ |
| Lrnr_tsDyn | x | x | √ | x | x | x | x | x | x | x | x | x | √ | x | x |
| Lrnr_xgboost | √ | √ | √ | x | x | x | x | √ | √ | x | x | x | x | √ | x |


Contributions

Contributions are very welcome. Interested contributors should consult our contribution guidelines prior to submitting a pull request.


Citation

After using the sl3 R package, please cite the following:

 @software{coyle2021sl3-rpkg,
      author = {Coyle, Jeremy R and Hejazi, Nima S and Malenica, Ivana and
        Phillips, Rachael V and Sofrygin, Oleg},
      title = {{sl3}: Modern Pipelines for Machine Learning and {Super
        Learning}},
      year = {2021},
      howpublished = {\url{https://github.com/tlverse/sl3}},
      note = {{R} package version 1.4.2},
      url = {https://doi.org/10.5281/zenodo.1342293},
      doi = {10.5281/zenodo.1342293}
    }

License

© 2017-2021 Jeremy R. Coyle, Nima S. Hejazi, Ivana Malenica, Rachael V. Phillips, Oleg Sofrygin

The contents of this repository are distributed under the GPL-3 license. See file LICENSE for details.


References

Breiman, Leo. 1996. “Stacked Regressions.” Machine Learning 24 (1): 49–64.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).

Wolpert, David H. 1992. “Stacked Generalization.” Neural Networks 5 (2): 241–59.


sl3's Issues

Handle character vectors in Tasks

Currently, running a simple Super Learner with sl3 on the included cpp_imputed data set with the full set of variables causes an issue due to the presence of variables of class character. To resolve this, sl3 should have a built-in function that coerces variables from character to factor (and I guess back to character when displaying results?), probably as part of the sl3_Task$new() method. As an example, the following code -- courtesy of @jeremyrcoyle -- implements a new function char_to_factor that handles this coercion on a given input data set.

library(sl3)
library(data.table)
set.seed(37942)
data(cpp_imputed)

setDT(cpp_imputed)
char_to_factor <- function(data) {
  classes <- sapply(data, data.class)
  char_cols <- names(classes)[which(classes == "character")]
  set(data, , char_cols, data[, lapply(.SD, as.factor), .SDcols = char_cols])
}
char_to_factor(cpp_imputed)

task <- sl3_Task$new(data = cpp_imputed, covariates = colnames(cpp_imputed)[-6], outcome = "bmi")
lrn_mean <- Lrnr_mean$new()
lrn_glm <- Lrnr_glm$new()
lrn_rf <- Lrnr_randomForest$new()
lrn_sl <- Lrnr_sl$new(learners = list(lrn_mean, lrn_glm, lrn_rf), metalearner = Lrnr_nnls$new())
sl_trained <- lrn_sl$train(task)

make_learner and $new() error with list arguments

Instantiating a learner with additional (modifying) arguments does not appear to work when the additional arguments are passed in as a single list. For some reason, the given learner is apparently instantiated fine but in a problematic manner that causes failure when the $train method of the learner is invoked. The following minimal working example illustrates the problem:

library(sl3)

# example data and sl3 task
data(cpp_imputed)
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
            "sexn")
outcome <- "haz"
task <- sl3_Task$new(cpp_imputed, covariates = covars, outcome = outcome)

# create list of additional arguments to be passed in to learner
lrnrs_args_list <- list(nbins = 5, bin_method = "equal.len", pool = FALSE)

# initialize learner in three different ways (the third is problematic)
sl_condensier_1 <- Lrnr_condensier$new(nbins = 5, bin_method = "equal.len",
                                       pool = FALSE)
sl_condensier_2 <- make_learner(Lrnr_condensier, nbins = 5,
                                bin_method = "equal.len", pool = FALSE)
sl_condensier_3 <- make_learner(Lrnr_condensier, lrnrs_args_list)

# train the 3 learners (the third line will fail)
sl_condensier_1_fit <- sl_condensier_1$train(task)
sl_condensier_2_fit <- sl_condensier_2$train(task)
sl_condensier_3_fit <- sl_condensier_3$train(task)

All of these calls should in principle produce the exact same learner, and it is easy to see why it might be convenient to pass in extra/additional arguments as a single (potentially pre-made) list object. It is unclear why the third line above fails. The error produced is as follows:

> sl_condensier_3_fit <- sl_condensier_3$train(task)
Failed on Lrnr_condensier_5_20_FALSE_NA_FALSE_NULL
Error in (function (X, Y, input_data, bin_method = c("equal.mass", "equal.len",  :
  bin_method argument must be either 'equal.len', 'equal.mass' or 'dhist'

Handle different outcome types

We should infer outcome types from the discreteness of Y (and also allow the user to specify them), and then make sure the learners correctly handle them. Currently, I think the options should be continuous, categorical, and bounded continuous.
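One possible inference rule, written as a standalone sketch (infer_outcome_type() is a hypothetical helper illustrating the proposal, not an sl3 function, and its thresholds are arbitrary):

```r
# hypothetical helper sketching the proposed inference rule; not sl3 code
infer_outcome_type <- function(y) {
  vals <- unique(y[!is.na(y)])
  if (length(vals) == 2) {
    "binomial"
  } else if (is.factor(y) || is.character(y)) {
    "categorical"
  } else if (all(y >= 0 & y <= 1, na.rm = TRUE)) {
    "bounded continuous"
  } else {
    "continuous"
  }
}

infer_outcome_type(c(0, 1, 1, 0))  # "binomial"
infer_outcome_type(rnorm(100))
```

A user-supplied outcome_type argument would simply override this inference.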

Default learner instances

Each learner should have at least one pre-instantiated version, so that users don't have to call, for example, GLM_Learner$new() to use the most basic version of the learner

Add grid learners for `h2o.grid` and `xgboost`

Unified handling of `id`, `weights` and `offset` across all learners

  • proposal: add offset to Learner_Task, specified as the column name of offsets in the input data. The name of the offset column is stored in private$.nodes$offset. Need to settle on the default values for the offsets.

  • Currently, the active members Learner_Task$weights and Learner_Task$id return a vector of defaults, even if the user specified no weights / ids.

  • Keep the default Learner_Task$id as the vector seq_len(nobs). These ids are most likely never going to be used for learner training, only for executing CV schemes (not 100% sure).

  • The current default for Learner_Task$weights is the vector rep.int(1, nobs). While this seems to work for all currently used learners, it might be undesirable to always pass this to every single learner (i.e., we may not want to provide default weights to all learners when the user is not anticipating that and hasn't requested it explicitly). In general, always using default weights should never change the model fit, but it might have an effect on learner performance (unclear, but best to be safe). So there needs to be a clear mechanism for using or not using the weights within each learner:

    • proposal: add an argument use_weights = FALSE to all learners that are capable of using the weights. Default is to never use weights, unless specifically requested, even if the task included new, user-defined weights. Doesn't seem like a good option, adding too much complexity. @jeremyrcoyle thoughts?
    • alternative proposal: set the default Learner_Task$weights to NULL and only use weights when !is.null(...). When non-NULL weights are necessary as defaults, generate those on the fly as rep.int(1L, nobs). @jeremyrcoyle thoughts?
  • Setting neutral defaults for offset might create a lot of confusion. The neutral default offset may change with the link function, depending on how these offsets are used. For instance, for family="binomial" with GLM, the input offset will not be converted to the logit-linear scale. If this conversion of the offset occurs inside the learner (e.g., a TMLE learner), then a default offset of 0 would imply that the actual offset used is qlogis(0) = -Inf. This will work fine if all the learners assume that the offsets are already transformed to the scale of the link function, but that will be hard to enforce and maintain.

    • proposal: Set default offset=0. Add an argument use_offset = FALSE to all learners that are capable of using the offset. Default is to never use offset, unless specifically requested, even if the task included new, user-defined offset. Doesn't seem like a good option, too complex. @jeremyrcoyle thoughts?
    • alternative proposal: set default Learner_Task$offset to NULL and only use offset when !is.null(...). That way either the user is always responsible for providing interpretable offsets or the learner is responsible for generating correct offsets (on the right scale). @jeremyrcoyle thoughts?

Add conditional density estimation from condensier as a learner

Some kind of 'cross product' operator

Given two sets of learners this would create a stack of pipelines where each learner in the stack is a pipeline with a unique combination of a stage 1 and stage 2 learner
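Such an operator can be composed from existing sl3 primitives. Below is a hypothetical sketch: pipeline_cross() is not an sl3 function, and the screener/learner choices are just examples.

```r
# hypothetical "cross product" operator built from Pipeline and Stack;
# pipeline_cross() is illustrative, not part of the sl3 API
library(sl3)

pipeline_cross <- function(stage1_learners, stage2_learners) {
  # one Pipeline per (stage 1, stage 2) combination
  pipelines <- unlist(
    lapply(stage1_learners, function(l1) {
      lapply(stage2_learners, function(l2) Pipeline$new(l1, l2))
    }),
    recursive = FALSE
  )
  do.call(Stack$new, pipelines)
}

screeners <- list(Lrnr_pkg_SuperLearner_screener$new("screen.glmnet"),
                  Lrnr_pkg_SuperLearner_screener$new("screen.corP"))
learners  <- list(Lrnr_glm$new(), Lrnr_mean$new())
cross_stack <- pipeline_cross(screeners, learners)  # 2 x 2 = 4 pipelines
```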

Avoid refitting learners

Any given combination of a learner and a task should only be fit once. For example, if the same screening learner is used before a number of predictive learners, we should not refit the screening learner each time. One approach might be some form of memoization: before fitting a learner to a task, check whether we have already done so. This might create issues for garbage collection.
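A minimal memoization sketch follows; fit_cache and cached_train() are illustrative, not part of sl3, and it assumes learner$name plus the task's uuid uniquely key a fit (check ?sl3_Task for the actual fields).

```r
# illustrative memoization of learner fits, keyed on (learner, task);
# assumes learner$name and task$uuid exist -- verify against sl3's docs
library(digest)

fit_cache <- new.env(parent = emptyenv())

cached_train <- function(learner, task) {
  key <- digest::digest(list(learner$name, task$uuid))
  if (!exists(key, envir = fit_cache, inherits = FALSE)) {
    assign(key, learner$train(task), envir = fit_cache)
  }
  get(key, envir = fit_cache, inherits = FALSE)
}
```

A production version would need a cache-eviction policy to address the garbage-collection concern above.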

Pipeline support for already fit learners

Sometimes a pipeline might mix learners that are already fit with learners that have yet to be fit. Currently, when a pipeline fits, it fits all its learners. Instead, it might check which ones have yet to be fit, and only fit those.
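This check can be sketched using the is_trained flag that learners already expose; train_missing_fits() itself is hypothetical.

```r
# sketch of the proposed behavior: only train learners that lack a fit
train_missing_fits <- function(learners, task) {
  lapply(learners, function(lrnr) {
    if (lrnr$is_trained) lrnr else lrnr$train(task)
  })
}
```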

Discuss: Learner$train() should return self rather than fit_object

A nice thing about R6 classes (in my opinion) is that they allow one to do the following type of chaining:

 task <- Learner_Task$new(cpp, outcome = "haz")
 h2o_glm <- h2o_GLM_Learner$new()$train(task)

Since train() right now returns the actual fit_object, rather than self, the reference to the above R6 learner object has now been lost and this will not work:

  preds <- h2o_glm$predict()

Is there a reason to force train() to return fit_object rather than self? We could have a method Learner$print() that automatically does something like:
if(self$is_trained) print(private$.fit_object)

Learner Registry

We need to build capacity similar to SuperLearner's listWrappers and mlr's listLearners. I think the easiest way to do this is with a registry object that keeps track of all available learners. It should also allow searching by property
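Recent versions of sl3 provide listing helpers along these lines; the names below are from current releases, so check ?sl3_list_learners for the exact interface.

```r
# assumes the listing helpers available in recent sl3 versions
library(sl3)

sl3_list_learners()                         # all registered learners
sl3_list_learners(properties = "binomial")  # search by supported property
sl3_list_properties()                       # all recognized properties
```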

Discrete SL

Support for discrete SL from a combination of a model stack and loss function.

Dissemination: blog post with step by step application of pipelines

One of the big motivators for pipelines was scikit-learn pipelines and the fact that nothing like this was available in R. That still remains true. Since there are still new blog posts coming out that talk about the virtues of pipelines in Python, it only seems natural to create a mirror blog post describing how exactly the same operations can be performed with sl3:

https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html

Error handling for failed learners

What happens when a learner fails to fit or predict? Further, what happens if that learner is in a Pipeline, or a stack, or is being cross-validated?

outcome_type is meaningless to Lrnr_glm_fast learner

All attempts give an error. Perhaps we should add this to the test set once we figure out what is causing it. The same issue seems to affect Lrnr_glm; not sure if all learners are affected though.

  data(cpp_imputed)
  covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs",
              "sexn")
  cpp_haz_01range <- cpp_imputed
  cpp_haz_01range[["haz_01range"]] <- rep_len(c(0.1, 0.9), nrow(cpp_imputed))
  task_01range <- sl3_Task$new(cpp_haz_01range, covariates = covars, outcome = "haz_01range")
  fglm_learner <- Lrnr_glm_fast$new(outcome_type = "continuous")
  fglm_learner$train(task_01range)

Failed on Lrnr_glm_fast_TRUE_Cholesky_continuous
Error in outcome_type$glm_family : 
  $ operator is invalid for atomic vectors

  fglm_learner <- Lrnr_glm_fast$new(outcome_type = "binomial")
  fglm_learner$train(task_01range)

Failed on Lrnr_glm_fast_TRUE_Cholesky_binomial
Error in outcome_type$glm_family : 
  $ operator is invalid for atomic vectors

  fglm_learner <- Lrnr_glm_fast$new(outcome_type = "quasibinomial")
  fglm_learner$train(task_01range)

Failed on Lrnr_glm_fast_TRUE_Cholesky_quasibinomial
Error in outcome_type$glm_family : 
  $ operator is invalid for atomic vectors

Discuss: Naming learner R6 classes and .R files

Can we reverse the current naming scheme from h2o_GLM_Learner to Learner_h2o_GLM?

That way all learner R files will be sorted "together" and will be easily findable.

Just a suggestion. Can we discuss pros and cons of different naming approaches here?

Pipeline cannot be passed to Lrnr_sl "as is", needs to be wrapped up in a list

Replication example below:

data(cpp)
cpp <- cpp[!is.na(cpp[, "haz"]), ]
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn")
cpp[is.na(cpp)] <- 0
outcome <- "haz"
task <- sl3_Task$new(cpp, covariates = covars, outcome = outcome)
make_inter <- Lrnr_define_interactions$new(interactions=list(c("apgar1","parity"),c("apgar5","parity")))

glm_learner <- Lrnr_glm$new()
glmnet_learner <- Lrnr_glmnet$new(nlambda = 5)
learners = Stack$new(glm_learner, glmnet_learner)
pipe <- Pipeline$new(make_inter, learners)
sl1 <- make_learner(Lrnr_sl, pipe, metalearner = Lrnr_solnp$new())
fit <- sl1$train(task)

Returns an error:

Error in do.call(Stack$new, learners) : second argument must be a list

sl3_Task: proper handling of outcome=NULL or allow missing outcome

Learner task does not have to include an outcome column. For example, for true prediction with last learner in the pipeline there might be no outcomes available (e.g., new subjects with unlabelled outcomes). The prediction routine doesn't really care if there are outcomes. However, sl3_Task currently requires that the outcome node is provided to the constructor:

newtask <- sl3_Task$new(data.table(val = rep(1,20)), covariates = "val")

Error in .subset2(public_bind_env, "initialize")(...) : argument "outcome" is missing, with no default

Specifying NULL outcome works fine, but may lead to a downstream error if the outcome were to be requested by accident:

newtask <- sl3_Task$new(data.table(val = rep(1,20)), covariates = "val", outcome = NULL)
newtask$Y

Error in .subset2(x, i, exact = exact) : attempt to select less than one element in get1index

We should come up with a better handling for this. Perhaps add extra check in sl3_Task$Y? Maybe allow missing(outcome) in new, in which case its always set to NULL?

Trying to run Super Learner with a Pipeline throws a data.table error

Trying to pipeline an interaction learner with a stack and then trying to do a super learner on that results in the following error data.table:

data(cpp)
cpp <- cpp[!is.na(cpp[, "haz"]), ]
covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn")
cpp[is.na(cpp)] <- 0
outcome <- "haz"
task <- sl3_Task$new(cpp, covariates = covars, outcome = outcome)

make_inter <- Lrnr_define_interactions$new(interactions=list(c("apgar1","parity"),c("apgar5","parity")))

glm_learner <- Lrnr_glm$new()
glmnet_learner <- Lrnr_glmnet$new(nlambda = 5)
learners = Stack$new(glm_learner, glmnet_learner)
pipe <- Pipeline$new(make_inter, learners)
sl1 <- make_learner(Lrnr_sl, list(pipe), metalearner = Lrnr_solnp$new())
fit <- sl1$train(task)
Error in set(data, j = col_names, value = new_data) : 
  It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to datatable-help.
In addition: There were 50 or more warnings (use warnings() to see the first 50)
Failed on chain
Error in self$compute_step() : 
  Error in set(data, j = col_names, value = new_data) : 
  It appears that at some earlier point, names of this data.table have been reassigned. Please ensure to use setnames() rather than names<- or colnames<-. Otherwise, please report to datatable-help.

In addition, the code produced 50 of the following possibly related warnings:

Warning messages:
1: In task$add_interactions(self$params$interactions) :
  Interaction column apgar1_parity is already defined, so skipping
...
50: In task$add_interactions(self$params$interactions) :
  Interaction column apgar5_parity is already defined, so skipping

wrapping Super Learner in Lrnr_cv

I ran into the following error trying to cross-validate a Superlearner by wrapping it in Lrnr_cv:

Error in .subset2(public_bind_env, "initialize")(...) :
(list) object cannot be coerced to type 'logical'
In addition: Warning message:
In all(learners_trained) : coercing argument of type 'list' to logical

CODE USED

# set up parameters for the Super Learner
library(sl3)
library(dplyr)
data(cpp_imputed)

# only looking at continuous covariates in cpp_imputed
classes <- sapply(cpp_imputed, data.class)
indx.char <- which(classes == "character")
all.vars <- names(cpp_imputed)
cpp_edited <- select(cpp_imputed, all.vars[-indx.char])

# create Super Learner, then pipe into cross-validation learner
learner.list <- as.list(c(lrnr_glm, lrnr_mean, lrnr_RF, lrnr_xgboost, lrnr_glm_fast))
metalearner <- make_learner(Lrnr_nnls)
SL <- Lrnr_sl$new(learners = learner.list, metalearner = metalearner)
slcv_pipe <- make_learner(Pipeline, SL, Lrnr_cv)

Error: stack failed when calling `Lrnr_condensier` with message: Lrnr_condensier_equal.len_25_TRUE_NA_FALSE_NULL.

When I try to call Lrnr_condensier and the train function, sl3 gives the following error (reproduced on both Mac and Linux):

Error in private$.train(subsetted_task, trained_sublearners) :
  All learners in stack have failed
In addition: Warning messages:
1: In private$.train(subsetted_task, trained_sublearners) :
  the stack failed with message: Lrnr_condensier_equal.len_25_TRUE_NA_FALSE_NULL. It will be removed from
2: In private$.train(subsetted_task, trained_sublearners) :
  the stack failed with message: Lrnr_condensier_equal.mass_20_TRUE_NA_FALSE_NULL. It will be removed from
3: In private$.train(subsetted_task, trained_sublearners) :
  the stack failed with message: Lrnr_condensier_equal.len_35_TRUE_NA_FALSE_NULL. It will be removed from
Failed on Stack
Error in self$compute_step() :
  Error in private$.train(subsetted_task, trained_sublearners) :
  All learners in stack have failed

You can reproduce the error with the following code:

library("simcausal")
D <- DAG.empty()
D <-
  D + node("W1", distr = "rbern", prob = 0.5) +
  node("W2", distr = "rbern", prob = 0.3) +
  node("W3", distr = "rbern", prob = 0.3) +
  node("sA.mu", distr = "rconst", const = (0.98 * W1 + 0.58 * W2 + 0.33 * W3)) +
  node("sA", distr = "rnorm", mean = sA.mu, sd = 1)
D <- set.DAG(D, n.test = 10)
datO <- sim(D, n = 10000, rndseed = 12345)


library("condensier")
library("sl3")
# ================================================================================
task <- sl3_Task$new(datO, covariates=c("W1", "W2", "W3"),outcome="sA")

lrn <- Lrnr_condensier$new(task, nbins = 35, bin_method = "equal.len", pool = TRUE, bin_estimator =
                             Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic"))
lrn1 <- Lrnr_condensier$new(task, nbins = 25, bin_method = "equal.len", pool = TRUE,
                            bin_estimator = Lrnr_glm_fast$new(family = "binomial"))
lrn2 <- Lrnr_condensier$new(task, nbins = 20, bin_method = "equal.mass", pool = TRUE,
                            bin_estimator = Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic"))
lrn3 <- Lrnr_condensier$new(task, nbins = 35, bin_method = "equal.len", pool = TRUE,
                            bin_estimator = Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic"))

sl <- Lrnr_sl$new(learners = list(lrn1, lrn2, lrn3),
                  metalearner = Lrnr_solnp_density$new())
sl_fit <- sl$train(task)

Easier access to SL library fits

In SuperLearner, users can retrieve fits as follows:

glm_fitted <- SL_fit$fitLibrary$SL.glm

In sl3, the analogous line would look like:

sl1_fit$fit_object$full_fit$fit_object$learner_fits[[1]]

which is obviously not as user-friendly.
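One possible convenience is an accessor on the fit itself; recent sl3 versions expose a learner_fits active binding along these lines, though the exact field and fit names (e.g. "Lrnr_glm_TRUE") vary by version, so treat this as a sketch.

```r
# sketch of a friendlier accessor; verify $learner_fits and the fit name
# against your installed sl3 version
glm_fitted <- sl1_fit$learner_fits[["Lrnr_glm_TRUE"]]
```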

Document Learners

Currently using @rdname undocumented_learner for a number of learners that I haven't had a chance to document yet

offset not working for glm and xgb

library(data.table)
library(sl3)

set.seed(10)
lrnr_glm <- make_learner(Lrnr_glm)
lrnr_xgboost = make_learner(Lrnr_xgboost, nrounds = 1000)

# typical gendata function
gendata = function (n, g0, Q0) 
{
  W1 = runif(n, -3, 3)
  W2 = rnorm(n)
  W3 = runif(n)
  W4 = rnorm(n)
  A = rbinom(n, 1, g0(W1, W2, W3, W4))
  Y = rbinom(n, 1, Q0(A, W1, W2, W3, W4))
  data.frame(A, W1, W2, W3, W4, Y)
}

# g and barQ defined
g0 = function (W1, W2, W3, W4) 
{
  plogis(0.5 * (-0.8 * W1 + 0.39 * W2 + 0.08 * W3 - 0.12 * 
                  W4 - 0.15))
}

Q0 = function (A, W1, W2, W3, W4) 
{
  plogis(0.14 * (2 * A + 20 * cos(W1) * A + cos(W1) - 4 * A * 
                   (W2^2) + 3 * cos(W4) * A + A * W1^2))
}

# generate the data and make an offset
n=300
data=data.frame(gendata(n, g0, Q0))
offset = rep(.1, nrow(data))

# set a task with and without an offset
test = make_sl3_Task(data = data, covariates = c("A", "W1", "W2", "W3", "W4"), 
                     outcome = "Y", offset = offset)
test1 = make_sl3_Task(data = data, covariates = c("A", "W1", "W2", "W3", "W4"), 
                      outcome = "Y")

# train on both tasks
xgb_test <- lrnr_xgboost$train(test)
xgb_test1 <- lrnr_xgboost$train(test1)

# predictions are the same, so the offset is not operational
xgb_test$predict()[1:10]
xgb_test1$predict()[1:10]

# glm not working with offset:
glm_test <- lrnr_glm$train(test)
glm_test1 <- lrnr_glm$train(test1)

This gives the following error:
Failed on Lrnr_glm
Error in linkinv(eta <- eta + offset) :
REAL() can only be applied to a 'numeric', not a 'list'

Support parallelization via future

Things like model stacking and cross-validation need to be parallelizable. We should continue to use the future package. We need to investigate the performance of nested uses of future_lapply and how to parallelize at the right levels. Ideally, when parallelizing across cross-validated stacked learners (i.e., a Super Learner), we can parallelize to nodes = nlearners * nfolds.
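The nodes = nlearners * nfolds idea amounts to flattening the two nested loops into one job grid. A dependency-free sketch of that flattening (plain lapply stands in for future.apply::future_lapply, which would distribute the jobs across workers):

```r
# Flatten learners x folds into a single job grid so one *apply call can
# fan out to nlearners * nfolds workers, instead of nesting two levels.
nlearners <- 3
nfolds <- 5
jobs <- expand.grid(learner = seq_len(nlearners), fold = seq_len(nfolds))

# One flat call per (learner, fold) pair; swap lapply for
# future.apply::future_lapply to parallelize.
results <- lapply(seq_len(nrow(jobs)), function(i) {
  sprintf("learner %d on fold %d", jobs$learner[i], jobs$fold[i])
})
length(results)  # 15
```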

Improve learner naming

It's useful for learners to have "names" for things like model stacking. Right now, names are generated as a haphazard combination of the class name and any specified params; somebody needs to spend some time thinking about how to do this more cleanly.
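One cleaner convention would be the class name plus a canonical "key_value" rendering of any specified params. A sketch, with the function name invented for illustration:

```r
# Hypothetical naming scheme: class name, then each param as "name_value",
# joined deterministically so the same spec always yields the same name.
make_learner_name <- function(class_name, params = list()) {
  if (length(params) == 0) return(class_name)
  parts <- paste(names(params), unlist(params), sep = "_")
  paste(c(class_name, parts), collapse = "_")
}

make_learner_name("Lrnr_xgboost", list(nrounds = 50))  # "Lrnr_xgboost_nrounds_50"
make_learner_name("Lrnr_glm")                          # "Lrnr_glm"
```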

Learner(s) for preprocessing data

Things like indicators of missingness, conversion of factor variables to indicators, and interaction variables. This could be one learner with many options or several separate learners.
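A rough sketch of the transformations such a learner would perform, in plain base R on a data.frame rather than as an actual Lrnr_* implementation (interaction terms omitted for brevity):

```r
# Sketch (base R, not a real Lrnr_* class) of the preprocessing steps named
# above: missingness indicators with median imputation, and factor-to-
# indicator expansion.
preprocess <- function(df) {
  out <- df
  for (col in names(df)) {
    x <- df[[col]]
    if (is.numeric(x) && any(is.na(x))) {
      out[[paste0("delta_", col)]] <- as.integer(!is.na(x))  # missingness indicator
      out[[col]][is.na(x)] <- median(x, na.rm = TRUE)        # simple imputation
    }
    if (is.factor(x)) {
      dummies <- model.matrix(~ x - 1)                       # one column per level
      colnames(dummies) <- paste0(col, "_", levels(x))
      out[[col]] <- NULL
      out <- cbind(out, dummies)
    }
  }
  out
}

df <- data.frame(a = c(1, NA, 3), b = factor(c("u", "v", "u")))
preprocess(df)
```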

Support for different kinds of chaining

Currently, calling Learner$chain() defaults to creating a task with a new set of covariates, defined as the predictions from the Learner. This works for many applications of chaining, but sometimes we also want to redefine things like the outcome and the weights. For instance, this will be helpful for applications like learning optimal treatment rules and using the package to fit TMLEs.
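A dependency-free sketch of what generalized chaining might look like, with the role the predictions play in the next task made explicit (the function and column names are invented for illustration):

```r
# Sketch: let the caller choose what the previous learner's predictions
# become in the chained data, instead of always appending covariates.
chain_data <- function(data, preds, role = c("covariates", "outcome", "weights")) {
  role <- match.arg(role)
  out <- data
  if (role == "covariates") out$pred <- preds   # current Learner$chain() behavior
  if (role == "outcome")    out$y_new <- preds  # e.g., a pseudo-outcome for a TMLE step
  if (role == "weights")    out$wts <- preds    # e.g., inverse-probability weights
  out
}

chain_data(data.frame(x = 1:3), c(0.1, 0.2, 0.3), role = "weights")
```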

Error: `UQE()` can only be used within a quasiquoted argument

After installing the latest versions of sl3/delayed/origami, every attempt to train a super learner in sl3 crashes and burns on my system. Not sure why it's running fine on Travis. Any ideas / clues where to look?

This is just running one of the tests from https://github.com/jeremyrcoyle/sl3/blob/f40c784d56e603b2b9df41b5edef0b801fb2df66/tests/testthat/test_sl.R#L4-L21

The error is the same regardless of what code I run.

R version 3.4.2 (2017-09-28) -- "Short Summer"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

> options(sl3.verbose = TRUE)
> library(sl3)

sl3 1.0.0
Please note the package is in early stages of development. Check often for updates and report bugs at http://github.com/jeremyrcoyle/sl3. 

> library(origami)
origami: Generalized Cross-Validation Framework
Version: 0.8.2
> library(SuperLearner)
Loading required package: nnls
Super Learner
Version: 2.0-22
Package created on 2017-07-18

> 
> data(cpp_imputed)
> covars <- c("apgar1", "apgar5", "parity", "gagebrth", "mage", "meducyrs", "sexn")
> outcome <- "haz"
> task <- sl3_Task$new(data.table::copy(cpp_imputed), covariates = covars, outcome = outcome)
> task2 <- sl3_Task$new(data.table::copy(cpp_imputed), covariates = covars, outcome = outcome)
> 
> glm_learner <- Lrnr_glm$new()
> glmnet_learner <- Lrnr_pkg_SuperLearner$new("SL.glmnet")
> subset_apgar <- Lrnr_subset_covariates$new(covariates = c("apgar1", "apgar5"))
> learners <- list(glm_learner, glmnet_learner, subset_apgar)
> sl1 <- make_learner(Lrnr_sl, learners, glm_learner)
> 
> sl1_fit <- sl1$train(task)
Error: `UQE()` can only be used within a quasiquoted argument

Dissemination: SuperLearner -> sl3 translation guide

To help with user migration, it would be really nice to have a "translation guide" for experienced users of the SuperLearner package moving to sl3. This would take the place of a "SuperLearner3()"-style wrapper function, instead helping users find the correct sl3 idiom(s) to use in place of a given SuperLearner function or argument. An example of such would be the following:

library(SuperLearner)
library(sl3)
set.seed(57192)

# toy data
n <- 1000
p <- 3
x <- as.data.frame(replicate(p, rnorm(n, mean = 0, sd = 1)))
y <- sin(x[, 1]) + x[, 2] + rnorm(n, mean = 0, sd = 0.3)
dat <- as.data.frame(cbind(x, y))

# using SuperLearner
sl_lib <- c("SL.mean", "SL.glm")
sl_old <- SuperLearner(Y = y, X = x, SL.library = sl_lib)
y_pred_old <- predict(sl_old)$pred

# using sl3
sl_task <- make_sl3_Task(covariates = colnames(dat)[seq_len(p)], outcome = "y", data = dat)
sl3_lib <- make_learner_stack("Lrnr_mean", "Lrnr_glm_fast")
sl_new <- Lrnr_sl$new(learners = sl3_lib, metalearner = Lrnr_nnls$new())
sl_trained <- sl_new$train(sl_task)
y_pred_new <- sl_trained$predict()

# squared error difference
mean((y_pred_old - y_pred_new)^2)

...with the point being that an experienced user (of SuperLearner, new to sl3) would be able to easily pick up the similarities between, for example, make_learner_stack and SL.library, or calling the R6 .$predict() method of the trained sl3 model vs. the S3 predict(...) method for an object of class SuperLearner. This is just an early-stage idea and could definitely use refinement and input from further discussion.

Speedglm in benchmark comparison

In the benchmark comparison, is there a reason that speedglm is used in sl3 but not in SuperLearner? I doubt it will affect the results much, given the improvements from delayed etc., but it would be good to at least clarify that choice in the comparison, or use SL.speedglm in classic SuperLearner if it's faster than SL.glm.

Rewrite readme

The current README is from SuperLearner; it needs to be updated, and CI needs to be set up again.

General tests

We should write a minimal set of tests that any new learner must pass before being integrated.
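A sketch of what such a contract check might assert, run here against a mock "learner" built from closures rather than a real Lrnr_* class (the real check would call Lrnr_*$new()$train(task)):

```r
# Minimal learner contract: training yields a fit, and the fit's
# predictions are numeric with one value per observation in the task.
check_learner_contract <- function(learner, task) {
  fit <- learner$train(task)
  preds <- fit$predict()
  stopifnot(is.numeric(preds), length(preds) == task$n)
  TRUE
}

# Stand-in mean "learner" and task, for illustration only:
mock_task <- list(y = c(1, 2, 3), n = 3)
mock_mean_learner <- list(
  train = function(task) {
    mu <- mean(task$y)
    list(predict = function() rep(mu, task$n))
  }
)

check_learner_contract(mock_mean_learner, mock_task)  # TRUE
```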

Learner Wishlist

  • gam (via mgcv)
  • ranger
  • screening methods (see SuperLearner::listWrappers() for a list). I think Sara also has a bunch.
  • svm
  • mars (via earth)
  • rpart
  • caret
  • gbm
  • hal (via hal9001)
  • polynomial splines (via polspline)

wrapper function for SuperLearner

A nice function that does everything the current SuperLearner function does. Something for more novice users who don't want to worry about R6, pipelines, and such.
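A hypothetical sketch of such a wrapper, mirroring the SuperLearner() calling convention. The function name, signature, and ".outcome" column are all invented for illustration; the body leans on sl3 calls shown elsewhere in this document and is untested here:

```r
# Hypothetical novice-friendly wrapper: Y/X in, trained Super Learner out,
# with no R6 or pipeline plumbing exposed to the caller.
sl3_fit <- function(Y, X, learners, metalearner = Lrnr_nnls$new()) {
  dat <- data.frame(X, .outcome = Y)  # ".outcome" is an illustrative name
  task <- make_sl3_Task(data = dat, covariates = colnames(X), outcome = ".outcome")
  sl <- Lrnr_sl$new(learners = learners, metalearner = metalearner)
  sl$train(task)
}
```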

Poor error messaging from Lrnr_cv$new(...)$train(...)

It is currently very hard to debug the individual learners when doing cross-validation. Effectively, origami::cross_validate returns nothing and provides no messages when individual learners fail; the only message returned is the following, which is quite uninformative:

> cv_lrn <- Lrnr_cv$new(list(lrn1, lrn2, lrn3, lrn4))$train(task_1)
Warning message:
In cross_validate(cv_train, folds, learner, task, .combine = F,  :
  All iterations resulted in errors

As a result, it's very time-consuming to debug the learners with CV. Is there any way to pass error messages from inside this function call (for example, when getOption("sl3.verbose") == TRUE)?

https://github.com/jeremyrcoyle/sl3/blob/e29892838e08985e92e5910ba86439605835d95a/R/Lrnr_cv.R#L70-L74
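One low-cost option is to wrap the per-fold training call in tryCatch, keep the condition message, and report it when the verbose option is set. A dependency-free sketch:

```r
# Surface per-learner errors instead of swallowing them: capture the
# condition message and optionally emit it under getOption("sl3.verbose").
safe_train <- function(train_fun, ...) {
  tryCatch(
    list(ok = TRUE, fit = train_fun(...)),
    error = function(e) {
      if (isTRUE(getOption("sl3.verbose"))) {
        message("learner failed: ", conditionMessage(e))
      }
      list(ok = FALSE, error = conditionMessage(e))
    }
  )
}

res <- safe_train(function() stop("singular design matrix"))
res$error  # "singular design matrix"
```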

Unify internal structure for predictions

  • Currently, a matrix is used to store all predictions.
  • Should switch to data.table, with one column per model.
  • Each column can be of arbitrary type (i.e., a list) or numeric, depending on the regression problem (binary classification / regression, or multinomial classification).
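A minimal sketch of the proposed structure, with illustrative column names:

```r
library(data.table)

# One data.table, one column per model: numeric for regression / binary
# classification, a list column holding one probability vector per
# observation for multinomial classification.
preds <- data.table(
  lrnr_glm  = c(0.1, 0.7, 0.4),
  lrnr_mult = list(c(a = 0.2, b = 0.8),
                   c(a = 0.5, b = 0.5),
                   c(a = 0.9, b = 0.1))
)
```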
