tidymodels / parsnip Goto Github PK

View Code? Open in Web Editor NEW

568.0 30.0 78.0 28.98 MB

A tidy unified interface to models

Home Page: https://parsnip.tidymodels.org

License: Other

R 100.00%

parsnip's Introduction

parsnip

Introduction

The goal of parsnip is to provide a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying packages.

Installation

# The easiest way to get parsnip is to install all of tidymodels:
install.packages("tidymodels")

# Alternatively, install just parsnip:
install.packages("parsnip")

# Or the development version from GitHub:
# install.packages("pak")
pak::pak("tidymodels/parsnip")

Getting started

One challenge with different modeling functions available in R that do the same thing is that they can have different interfaces and arguments. For example, to fit a random forest regression model, we might have:

# From randomForest
rf_1 <- randomForest(
  y ~ ., 
  data = dat, 
  mtry = 10, 
  ntree = 2000, 
  importance = TRUE
)

# From ranger
rf_2 <- ranger(
  y ~ ., 
  data = dat, 
  mtry = 10, 
  num.trees = 2000, 
  importance = "impurity"
)

# From sparklyr
rf_3 <- ml_random_forest(
  dat, 
  intercept = FALSE, 
  response = "y", 
  features = names(dat)[names(dat) != "y"], 
  col.sample.rate = 10,
  num.trees = 2000
)

Note that the model syntax can be very different and that the argument names (and formats) are also different. This is a pain if you switch between implementations.

In this example:

the type of model is “random forest”,
the mode of the model is “regression” (as opposed to classification, etc), and
the computational engine is the name of the R package.

The goals of parsnip are to:

Separate the definition of a model from its evaluation.
Decouple the model specification from the implementation (whether the implementation is in R, spark, or something else). For example, the user would call rand_forest instead of ranger::ranger or other specific packages.
Harmonize argument names (e.g. n.trees, ntrees, trees) so that users only need to remember a single name. This will help across model types too so that trees will be the same argument across random forest as well as boosting or bagging.

Using the example above, the parsnip approach would be:

library(parsnip)

rand_forest(mtry = 10, trees = 2000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 10
#>   trees = 2000
#> 
#> Engine-Specific Arguments:
#>   importance = impurity
#> 
#> Computational engine: ranger

The engine can be easily changed. To use Spark, the change is straightforward:

rand_forest(mtry = 10, trees = 2000) %>%
  set_engine("spark") %>%
  set_mode("regression")
#> Random Forest Model Specification (regression)
#> 
#> Main Arguments:
#>   mtry = 10
#>   trees = 2000
#> 
#> Computational engine: spark

Either one of these model specifications can be fit in the same way:

set.seed(192)
rand_forest(mtry = 10, trees = 2000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~10,      x), num.trees = ~2000, importance = ~"impurity", num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
#> 
#> Type:                             Regression 
#> Number of trees:                  2000 
#> Sample size:                      32 
#> Number of independent variables:  10 
#> Mtry:                             10 
#> Target node size:                 5 
#> Variable importance mode:         impurity 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       5.976917 
#> R squared (OOB):                  0.8354559

A list of all parsnip models across different CRAN packages can be found at https://www.tidymodels.org/find/parsnip.

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

For questions and discussions about tidymodels packages, modeling, and machine learning, please post on RStudio Community.
If you think you have encountered a bug, please submit an issue.
Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
Check out further details on contributing guidelines for tidymodels packages and how to get help.

parsnip's People

Contributors

Stargazers

Watchers

Forkers

gshotwell dynamicwebpaige hlynurhallgrims joshzyj ledell grayskripko lrodriguezlujan guhjy navdeep-g malcolmbarrett louisahsmith jayhesselberth armanabraham minghao2016 stjordanis dpcscience dfalbel conradbm dincerti fahadshery xiaofei1010 jimhester smingerson patr1ckm fdrennan akatie davechilders cristian-dinu-69 bradthiessen hfrick wulixin jyuu blairj09 xiaosongz franciscopalomares abichat chrismuris jaredlander lepennec aespar21 athospd rorynolan ravihela jas0nwhite dirkschumacher kwiscion emilhvitfeldt yadevi markfairbanks shosaco mitushiananya jtlandis jcpsantiago mdancho84 mgaldame jonthegeek mattblackwell cderv jmarshallnz pursuitofdatascience alexisbocuze franzbischoff patelis joeycouse mattwarkentin bcjaeger schoonees jbarsotti thierrymoudiki gast1111 gmcmacran kondou449600 camargoatair allenlile drhmoosavi kscott-1 jodavid

parsnip's Issues

ranger with probabilities = TRUE

When getting class predictions, the post-processor has a bug:

obj <- 
  rand_forest(mode = "classification", others = list(probability = TRUE)) %>% 
  fit(Species ~ ., data = iris, engine = "ranger") 

predict(obj, newdata = iris[1:4, -5])

Installation error

ERROR: dependency 'modelgenerics' is not available for package 'parsnip'

https://github.com/topepo/parsnip/blob/master/DESCRIPTION#L39
I suppose it should be replaced with "r-lib/generics"

https://github.com/topepo/parsnip/blob/master/DESCRIPTION#L24

add a print method for model_spec

It looks like all the print methods are specific to the model

Remove S3 from model functions

Previously, the idea was to figure out the mode from data, so there were different methods for formulas, recipes etc.

However, the mode is manually specified now so we don't need a class for the models (and different methods).

If we kept them, we would need to always specify the mode even when there is on'y one choice (e.g. logistic regression). For example:

> library(parsnip)
> logistic_reg()
Logistic Regression Model Specification (classification)


> logistic_reg(mixture = varying())
Error in varying() : This is a placeholder and should not be evaluated
>
> #we would need to do:
> logistic_reg(mode = "classification", mixture = varying())
Logistic Regression Model Specification (classification)

Main Arguments:
mixture:   varying()

formula:formula calls are borked

This works:

rand_forest(mode = "regression") %>%
  fit(mpg ~ ., data = mtcars, engine = "ranger")

but not this:

foo <- function(dat, ...) {
  rand_forest(mode = "regression") %>%
    fit(..., data = dat, engine = "ranger")
}
foo(mtcars, mpg ~ .)
#  Error in eval_tidy(data) : object 'dat' not found

We try to avoid getting fancy by avoiding execution of the data object until it is needed here but we pass in the call$data expression. eval_tidy is digging in the wrong place! This should probably be a quosure to capture the environment (or try not to be fancy).

We do the same thing with the formula too.

parameterize prediction functions

Figure out how to pass arguments down to the prediction code.

For example, with glmnet and other sub-model enabled prediction methods, figure out how to pass args to get other parameter estimates back.

Reporting uncertainty

I'm reworking lots of broom::augment() methods at the moment and am discovering that packages do some crazy stuff to report uncertainty. Defining some standards for reporting uncertainty early on seems like a good idea.

For classification problems, reporting the class probabilities makes sense, but can become problematic for outcomes with high cardinalities. Nobody wants 1000 columns of class probabilities. One option is to just report the most likely class along with it's probability, or the top k = 5 or so classes by default.

For regression problems I think there's more nuance. Open questions:

Is the best way to report uncertainty in a regression outcome to add a column of standard errors se_fit or similar?
How should users specify that they want confidence intervals vs prediction intervals?
Should confidence intervals or prediction intervals be the default reporting option?

fit() should use `rlang::new_environment`

https://github.com/topepo/parsnip/blob/master/R/fit.R#L107

Vignette update request

Just took a look over the making a parsnip model object from scratch vignette. There was a lot going on and I had a bit of difficulty putting the pieces together from my short glance through. I think the vignette might benefit from being broken into two separate pieces:

Creating new model specifications
Adding a new estimator for an existing model specification
- Aside: it would be easy to confuse an engine with an estimator, so we should probably document the difference somewhere.

I finally have parsnip running on my laptop and I'm going to try to use it exclusively for a regression course this semester and see where I run into problems. Some things I imagine I'll be building fairly early on:

Mixed models. I should just be able to pass formulas to lme4 with the usual random effects syntax, right?
Robust covariance estimators. The easy way to do this is probably via a new lm_robust engine.

data descriptors for spark data objects

This will need to be different code for spark objects that should emulate get_descr_form (since we've constrained spark objects to the formula method).

non-right censoring for survival models with recipes

Surv objects of type= "counting", "interval1" or "interval2" are not currently supported.

This would mostly be for defining roles but in the case of interval censoring is is difficult. If we let two time variables be used with the outcome role, we don't know what their order should be (and the recipe might change the order). If we used roles like tmin and tmax, then juice and helpers won't recognize them as outcomes.

A more pipeable fit() interface

I gave a little thought to the fit() interface problem and this is what I came up with. I don't really like the interface arg name but thats just a naming thing.

# helper if required
xy <- function(x, y) {
  list(x = x, y = y)
}

# notice how engine would come before the _optional_ data param for pipeability
# all required params are now moved to the front
# engine could come before interface if you want to keep interface+data together
fit <- function(model_spec, interface, engine, data, control, ...) {
  #switch based on interface being a formula VS list
}

linear_reg() %>%
  fit(y ~ x1 + x2, "lm", fit_data)

linear_reg() %>%
  fit(xy(fit_data[,c("x1", "x2")], fit_data[,c("y")]), "lm")

# slightly simpler
xy_defn <- xy(
  x = fit_data[,c("x1", "x2")], 
  y = fit_data[,c("y")]
)

linear_reg() %>%
  fit(xy_defn, "lm")

switching between prediction arguments

For "multi-use" models, there might be prediction arguments that are variable depending on the type of prediction being requested.

For example, with randomForest , predict should use the argument type = "response" when predicting the classes or numbers but should use type = "prob" for predict_probs.

We could parameterize translate to work with specific classes with set this as appropriate. Right now, the prediction call object is created in predict.model_fit and similars. translate could do this too based on the embedded model specification sub-object.

Harmonize names of `predict_()` functions and `object$spec$method$`

Because having predict_num() go with method$pred is kind of confusing.

Also predict_classprob() goes with method$prob.

include namespaces in the fit call

corollary: don't fully load the underlying package, only the namespace

easily populate parameters from dials

If you use

param_grid <- random_grid(mtry, min_n, size = 5)

and want to populate a model specification, it gets kludgy.

If param is a row of param_grid:

   rand_forest(mtry = param[["mtry"]], min_n = param[["min_n"]])
# or
update(object, mtry = param[["mtry"]], min_n = param[["min_n"]])

there should be some version of update that automatically populates the parameters (which is why the names are standardized between dials and parsnip). Maybe coopt merge or some other relevant verb?

rename some objects

alternates is poorly named. defaults? The same is slightly true for the prediction function's args.

The constructors aren't really constructors (I think that they originally were constructor functions). Change them to "modules".

The {model name}_{engine}_fit objects contain everything so maybe call them {model name}_{engine}_data?

intermediary constructors for different engines

This would help further modify default arguments as well as protect against common issues.

For example, randomForest has an option called importance that is used a lot and it takes a logical value. ranger has an argument of the same name but it takes character strings. A ranger-specific constructor (or function) can be used to protect against this problem.

Also, since some primary arguments are available for each engine (e.g. regularization in glm), we also need to intercept and/or modify these arguments when used inappropriately (instead of just ignoring them).

Differentiating between models, estimators and engines

I'm think I can finally translate the thoughts from the modeling abstraction essay (a separate doc that grew out of #19) into parsnip terms. Some concepts to start:

A model is a family of probability distributions or functions. That is, a model is set.
An estimator is a way to calculate the parameters of a model from a dataset. Note that hyperparameters are most often properties of estimators.
The resulting estimates are a fit (I think @topepo often refers to this a sub-model). This is an element of the model.
There are often multiple algorithms and implementations of the same estimator. In this case, using parsnip terminology, each implementation is a different engine.

Estimators are typically implicit

lm specifies the OLS estimator for the linear model
glmnet specifies the elastic net estimator for the linear model

Estimator selection should be explicit

Something along the lines of

ols_hc1_fit <- linear_reg() %>% 
  linear_estimator(coefs = "ols", coef_covariance = "HC1") %>% 
  fit_xy(
    x = ...,
    y = ...,
    engine = "lm_robust"
  )

Perhaps the linear_reg() isn't necessary here, but it does feel the most explicit / low-level to me. In particular, I think it's important to explicitly select an estimator, rather than letting it be implicit in engine. All estimators are not created equal.

Different estimators should have informative subclasses

Currently the parsnip behavior is to always produce a model_fit object:

ols <- linear_reg() %>% 
  fit(hp ~ ., data = mtcars, engine = "lm")

class(ols)
># [1] "model_fit"

I'm strongly of the opinion that ols should have subclasses that indicate:

the model_fit was estimated using ordinary least squares
the model_fit object contains a single fit/submodel, as opposed to a set of fits/submodels

Without this differentiation I don't think it's possible to meaningfully define methods on ols for inference. Consider the following methods, all for the linear model:

plot_lasso_path() only makes sense for a set of fits from the LASSO estimator
coef_standard_errors() makes sense for a fit from the OLS estimator but not the LASSO estimator
interpret_coefficients() should have different behavior for an OLS fit and a GEE fit

What is a model?

Coffee-addled rant here, but bear with me. I think it'll be really valuable as the tidymodels universe takes off to have a clear and well documented definition of what a model is.

In classic statistics land, if you have some data x that live in a space X, a model is a distribution P(X) indexed by parameters theta. In linear regression with three features, theta lives in R^3. Then a fit often refers P(X) where we've picked a particular theta to work with, and there's an isomorphism between R^3 and all possible fits.

(Aside: calling a particular theta a fit isn't great language because fit should be a verb referring to model fitting, not a noun referring to the object returned by the fitting process).

To me, a key question is how do we express this idea in code. For example, if we write out a linear model:

y = theta_0 + theta_1 * x_1 + ... + theta_p * x_p + epsilon

where epsilon are IID Gaussian, then the following are all the same model (in the sense that they all have the same parameter space)

OLS
LASSO
RIDGE
Any other penalized regression technique
OLS estimated with Horseshoe priors
Etc

Sure, for the penalized regression methods you have to estimate the penalization parameter, but this is a hyperparameter, which I think we can broadly think about as a parameter that we have to estimate but that we don't really care what value it takes on. So these all have the same parameter space, but different hyperparameter spaces. Another way to express this same idea is that what differentiates MCP from LASSO from OLS, etc, etc is not that they are different models but rather that they are different techniques for estimating the same model.

(Aside: one interesting question is whether or not hierarchical models belong on the list above. I think it depends on whether or not you care about the group level parameters, in which case you are now in a new parameter space. OLS with HC errors is another interesting case to think about. In this case the model is still the linear model, but now we're more explicitly declaring that we want to estimate the covariance matrix, and also that we are going to use, say, HC1 to do so. I'd still call this a linear model, but only if the original definition of the linear model specified covariance as an estimand).

If I'm going to actually implement things in code, I want to work with an object that specifies the estimation method, which likely is closely tied to a hyperparameter space.

I think that a parsnip model specification shouldn't work with the classical stats sense of a model like we're defined above, but rather should encapsulate all the things you need to do to get parameters back. Parsnip is already doing a lot of this, but I think there's a lot of value in being very clear about what a parsnip object should specify. In my mind this includes, at the minimum:

The estimand, or parameters you want to estimate, mostly implicit in the model you select
The estimating procedure (i.e. LARS, or the analytic solution to OLS). Often implicit in the package you call.
Any hyperparameter spaces (ex: lambda in R+ for LASSO)
Procedures for picking hyperparameters (ex: random search over bootstrap samples picking the smallest lambda within 1 SE of the minimum RMSE)

For now I think it makes sense to call this a model specification, but I think it's critically important to distinguish between the model and the model plus all this other stuff. Similarly, after the model fitting process, when you have many different fits (one for each hyperparameter combination, say), there are tasks that involve working with all the fits together (you might be curious which LASSO variable entered the model first), and tasks that involving working with just one fit (i.e. looking at the LASSO coefficients themselves).

I strongly believe that a good interface very clearly differentiates between a group of fits together, and single fit, and provides type-safe methods for working with each of these.

Related issue: canonical modelling examples

A related issue is to find canonical modelling examples that are sufficient to develop our intuition about what the code objects should look like. OLS is too simple because it doesn't need a lot of the machinery that other models need. I think that a good starting place is to have one canonical example where we can employ the submodel trick (penalized regression seems like a good place to start), and one where we can't (maybe SVMs here?). Another way to think about this: we should have one canonical example where there is exploitable structure in the hyperparameter space, and one canonical example where there isn't.

next set of models

👆Already in parsnip or adjacent package
👇working on or thinking about

ordinal regression
rotation forests

determine convention for available data characteristics

When fit is called, there will be a need to pass unevaluated arguments to the fit function. For example, someone might want to use:

rand_forest(mtry = expr(floor(sqrt(ncol(x)))))

but x might not exist when this expression is evaluated.

There should be a standard set of objects that are guaranteed to be available at the time of fit, such as:

number of columns
number of row
minimum and maximum class sizes

This will need to be well documented; the number of columns depends on whether dummy variables have been created or not. As such, this might vary depending on how the data are exposed via fit.

Perhaps variables such as .n, .p and others could be used so that

rand_forest(mtry = expr(floor(sqrt(.p))))

would work.

Installation problems

Hello! I'm having troubles installing parsnip here from github.

devtools::install_github("topepo/parsnip")
#> Using GitHub PAT from envvar GITHUB_PAT
#> Downloading GitHub repo topepo/parsnip@master
#> from URL https://api.github.com/repos/topepo/parsnip/zipball/master
#> Installing parsnip
#> Using GitHub PAT from envvar GITHUB_PAT
#> Using GitHub PAT from envvar GITHUB_PAT
#> '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
#>   --no-environ --no-save --no-restore --quiet CMD INSTALL  \
#>   '/private/var/folders/nj/s2k7d2_93t9_87brhynnfwvc0000gn/T/Rtmpo0ApMz/devtools147a71ced9c8/topepo-parsnip-de55683'  \
#>   --library='/Library/Frameworks/R.framework/Versions/3.5/Resources/library'  \
#>   --install-tests
#>
#> ERROR: dependency ‘modelgenerics’ is not available for package ‘parsnip’
#> * removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/parsnip’
#> Installation failed: Command failed (1)

Created on 2018-08-26 by the reprex
package (v0.2.0).

It appears that the problems comes modelgenerics which is being downloaded remotely as tidymodels/modelgenerics. however that links directly to r-lib/generics.

fully evaluate primary arguments

such as mtry for random forests. Currently:

args <- list(
  mtry = rlang::enquo(mtry),
  trees = rlang::enquo(trees),
  min_n = rlang::enquo(min_n)
)

passing contrasts and other model.matrix options

In a lot of cases, there will be some data conversion from a data frame to a model matrix. There needs to be a clean interface so that the usual options can be passed along.

Also, we might need a flag for when to stop at a model frame and when to go all the way with calling model.matrix.

type_sum method

For better tibble printing

Separate backend for tidy prediction

I've been reworking the augment() methods and it's rapidly becoming clear that dealing with idiosyncratic predict() methods is going to slow down progress immensely.

In the end broom and parsnip are both going to want to wrap a bajillion predict methods, and we should report predictions in the same way for consistencies sake. I think we should move this functionality to a separate package. Potentially we could use the prediction package by Thomas Leeper, but we should decide on the behavior we want first.

If we define a new generic / series of generics, we can then test these behaviors in modeltests and allow other modelling package developers to guarantee that their predict() methods are sane and consistent.

What I want from a predict method:

Returns a tidy tibble
Never drops missing data (i.e. matches the behavior of predict.lm(..., na.action = na.pass))
Consistent naming of fitted values
Uncertainty in predictions

I want all of these to be guaranteed, and for methods that cannot meet these guarantees, I want an informative error rather than a partially correct output.

Add short description and references for models

Since people will be using models that they have never researched, make it easier for them to get a rough understanding of what they are asking for.

test abnormal class values

These become class names eventually so test back-ticks and maybe standardize on tibbles

protect against common interface mistakes

Didn't use named arguments and wrong arg is picked up:

# `ovarian` gets picked up as a `recipe` argument
fit(tt, Surv(futime, fustat) ~ ecog.ps + rx, ovarian)

Directly try to fit without fit:

surv_reg(Surv(futime, fustat) ~ ecog.ps + rx, data = ovarian)

parameter name changes

change units to hidden_units and weight_decay to regularization

predict(..., type = "oob")

I couldn't find out-of-bag predictions. In case it will be added keep this issue.
A couple of examples

rand_forest(mode = "classification") %>% 
    fit(hp ~ ., data = mtcars, engine = "ranger") %>% 
    {.$fit$predictions[1:5]}
# [1] 127.44091 128.56689  94.72514 121.45245 165.98219


rand_forest(mode = "classification", others = list(probability = T)) %>% 
    fit(hp ~ ., data = mtcars, engine = "ranger") %>% 
    {.$fit$predictions[1:5, 1:5]}
#            110         93        175        105         245
# [1,] 0.24082457 0.024893872 0.19493817 0.01449105 0.040247785
# [2,] 0.25530948 0.038556354 0.16503503 0.01379800 0.039072892
# [3,] 0.14358238 0.000000000 0.02238186 0.02633872 0.002298851
# [4,] 0.05724574 0.053273285 0.10391786 0.08190140 0.009538507
# [5,] 0.09842348 0.009909852 0.16075958 0.04509576 0.079740338


rand_forest(mode = "classification")) %>% 
    fit(hp ~ ., data = mtcars, engine = "randomForest") %>% 
    {.$fit$predicted[1:5]}
#         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive Hornet Sportabout 
#         123.09100         123.47858          96.17599         125.62471         164.67905

detect when to calculate descriptors

Right now, the fit function makes available certain variables that characterize the training data at the time of model fitting. The two underlying functions that do this are get_descr_form and get_descr_xy.

These may be costly, so we should have some code that determines if any data descriptors are used in the argument values. If at least one is found, we can execute the code to make them available,

how to include data checks and/or manipulations

For example, some functions have specific data requirements:

some must have a matrix as input (instead of a data frame or matrix) (e.g. glmnet),
want factors to be encoded as integers :-O (tensorflow),
a sparse matrix is required (xgboost), or
some require specific data types such as all factor predictors and so on.

Some of this could occur be modifying the default argument to the function (e.g. x = as.matrix(x)) but it would probably be better to include some code or module that checks or modifies the data.

The problem is the different interfaces. We should need one for formula, recipes, and x/y interfaces.

Regression Example: Error in current_env() : could not find function "current_env"

I was going through the Regression Example. Below is my code.

#devtools::install_github("topepo/modelgenerics")
#devtools::install_github("topepo/parsnip", dependencies=TRUE)
#install.packages("AmesHousing")
#install.packages("rsample")
#devtools::install_github("imbs-hl/ranger")
#install.packages("rlang")

library(parsnip)
library(AmesHousing)
library(tidyverse)
library(rsample)
library(ranger)

ames <- make_ames()
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", p = 0.75)

ames_train <- training(data_split)
ames_test  <- testing(data_split)

rf_defaults <- rand_forest(mode = "regression")
rf_defaults

preds <- c("Longitude", "Latitude", "Lot_Area", "Neighborhood", "Year_Sold")

rf_xy_fit <- rf_defaults %>%
  fit(
    x = ames_train[, preds],
    y = log10(ames_train$Sale_Price),
    engine = "ranger"
  )

I get the following error

Error in current_env() : could not find function "current_env"

I thought this was an rlang issue to I removed and reinstalled rlang. Then I thought is was a session issue so I restarted R, still no luck. I was wondering if you had any insight into this error.

predict method

Change the predict function to predict_num and write a general wrapper predict method that switches between these.

appropriate printing of vector arguments

> linear_reg(regularization = c(0.01, 0.10))
Linear Regression Model Specification (regression)

Main Arguments:
  regularization = c("0.01", "0.10")

more options for argument control

Currently, there is a "protect" field that is a list of argument names that the uer should not be able to mess with. For example, with stats::glm, "data" should not be modified until fit is run and so on.

There should be at least one other option though:

some parameters we might want to always be included regardless of whether they have been modified from their original. Examples include "family" for logistic regression and "seed" for ranger.

callbacks for mlp models in keras

Maybe have a callback option that enables a list of callbacks to be used (see here).

Parsnip name

Hi Max,

Great meeting you at rstudio::conf. So you did decide to ultimately call it parsnip? :-)

Cheers,
Bohdan

check data type for the model mode

Test to see if integers work for classification, etc.

Should this be another module?

add test cases for multivariate neural networks

Reminded by topepo/caret#466...

This should work but there are no test cases.

demote prediction helper functions to internal

For less confusing documentation.

Is `fit` actually checking for installed libs?

If I remove randomForest and then try and run:

fit(rand_forest(), formula = Species ~ ., data = iris, engine = "randomForest")

I get the following traceback

 Error in loadNamespace(name) : there is no package called ‘randomForest’ 
15.
stop(e) 
14.
value[[3L]](cond) 
13.
tryCatchOne(expr, names, parentenv, handlers[[1L]]) 
12.
tryCatchList(expr, classes, parentenv, handlers) 
11.
tryCatch(loadNamespace(name), error = function(e) stop(e)) 
10.
getNamespace(ns) 
9.
asNamespace(ns) 
8.
getExportedValue(pkg, name) 
7.
randomForest::randomForest 
6.
eval_tidy(e, ...) at fit.R#272
5.
eval_mod(fit_call, capture = control$verbosity == 0, catch = control$catch, 
    env = env, ...) at fit_helpers.R#107
4.
xy_xy(object = object, env = env, control = control, target = target) at fit_helpers.R#138
3.
form_xy(object = object, control = control, env = eval_env, target = object$method$fit$interface, 
    ...) at fit.R#135
2.
fit.model_spec(rand_forest(), formula = Species ~ ., data = iris, 
    engine = "randomForest") at models.R#116
1.
fit(rand_forest(), formula = Species ~ ., data = iris, engine = "randomForest")

Pretty sure that this line in fit() is supposed to be right after the check_engine() line:

# populate `method` with the details for this model type
    object <- get_method(object, engine = object$engine)

because check_installs() and load_libs() both use things from x$method$library (which I think is now x$method$libs) but x$method isn't populated until get_method() is run

Should "main" model spec args be protected from use in `others=`?

Meaning, should this be allowed?

> translate(linear_reg(penalty = 1, others = list(lambda = 1.5)), "glmnet")
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 1

Engine-Specific Arguments:
  lambda = 1.5

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    lambda = 1, lambda = 1.5, family = "gaussian")

Should be able to prevent this with the use of protect = c("lambda") in linear_reg_glmnet_data

glmnet predictions and lambda

We want to have the number of rows in the prediction results the same as the number of rows in new_data.

Right now, the code will make predictions at all lambda values contained in the model fit. Here is an example:

> all_lambda <- 
+     linear_reg() %>%
+     fit(mpg ~ ., data = mtcars, engine = "glmnet")
> 
> predict(all_lambda, new_data = mtcars[1:3, -1])
# A tibble: 237 x 2
   .pred_values .pred_lambda
          <dbl>        <dbl>
 1         20.1         5.15
 2         20.1         5.15
 3         20.1         5.15
 4         20.4         4.69
 5         20.2         4.69
 6         20.5         4.69
 7         20.5         4.27
 8         20.4         4.27
 9         21.0         4.27
10         20.7         3.89
# ... with 227 more rows
> # yuk ".pred_lambda" needs to go regardless
> 
> length(unique(.Last.value$.pred_lambda))
[1] 79

The new multi_predict will generate predictions at multiple lambda values and should be preferred in this case. I suggest that

predict only produces predictions at a single lambda (and otherwise throws an error that directs people to multi_predict for this instance).
We could write some specialized predict methods for the glmnet subclasses (e.g. multnet, lognet, etc) that have a penalty argument that accepts a single value. This would appear seamless to to the user since

predict(all_lambda, new_data = mtcars[1:3, -1])               # errors but
predict(all_lambda, new_data = mtcars[1:3, -1], penalty = .1) # would work

since

> class(all_lambda)
[1] "model_fit" "_elnet"

(and predict._elnet would just call predict.model_fit with a single parameter value)

crappy straw-man alternative: The issue is that a lot of people are going to leave penalty unspecified, expect to get all possible predictions back, and be frustrated that they have to use a different predict function to get them. Making predict.model_fit make predictions at all lambdas is possible but then it behaves differently for this model (which is how we got into this mess).

Any other suggestions?

fixed method dispatch

tasks for user-defined models

export and document check_empty_ellipse, make_classes, model_printer, and show_call
document with model_fit and model_spec object structures
complete vignette on creating models from scratch.