tidymodels / parsnip
A tidy unified interface to models
Home Page: https://parsnip.tidymodels.org
License: Other
These become class names eventually, so test back-ticks and maybe standardize on tibbles.
I couldn't find out-of-bag predictions. Keep this issue open in case they are added.
A couple of examples:
rand_forest(mode = "classification") %>%
fit(hp ~ ., data = mtcars, engine = "ranger") %>%
{.$fit$predictions[1:5]}
# [1] 127.44091 128.56689 94.72514 121.45245 165.98219
rand_forest(mode = "classification", others = list(probability = T)) %>%
fit(hp ~ ., data = mtcars, engine = "ranger") %>%
{.$fit$predictions[1:5, 1:5]}
# 110 93 175 105 245
# [1,] 0.24082457 0.024893872 0.19493817 0.01449105 0.040247785
# [2,] 0.25530948 0.038556354 0.16503503 0.01379800 0.039072892
# [3,] 0.14358238 0.000000000 0.02238186 0.02633872 0.002298851
# [4,] 0.05724574 0.053273285 0.10391786 0.08190140 0.009538507
# [5,] 0.09842348 0.009909852 0.16075958 0.04509576 0.079740338
rand_forest(mode = "classification")) %>%
fit(hp ~ ., data = mtcars, engine = "randomForest") %>%
{.$fit$predicted[1:5]}
# Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
# 123.09100 123.47858 96.17599 125.62471 164.67905
Previously, the idea was to figure out the mode from the data, so there were different methods for formulas, recipes, etc.
However, the mode is manually specified now, so we don't need a class for the models (and different methods).
If we kept them, we would need to always specify the mode even when there is only one choice (e.g. logistic regression). For example:
> library(parsnip)
> logistic_reg()
Logistic Regression Model Specification (classification)
> logistic_reg(mixture = varying())
Error in varying() : This is a placeholder and should not be evaluated
>
> #we would need to do:
> logistic_reg(mode = "classification", mixture = varying())
Logistic Regression Model Specification (classification)
Main Arguments:
mixture: varying()
That is, should this be allowed?
> translate(linear_reg(penalty = 1, others = list(lambda = 1.5)), "glmnet")
Linear Regression Model Specification (regression)
Main Arguments:
penalty = 1
Engine-Specific Arguments:
lambda = 1.5
Computational engine: glmnet
Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(),
lambda = 1, lambda = 1.5, family = "gaussian")
Should be able to prevent this with the use of `protect = c("lambda")` in `linear_reg_glmnet_data`; a rough sketch follows.
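One possible shape for that guard, assuming `protect` is just a character vector checked when engine-specific arguments are merged (the function name here is hypothetical, not parsnip's API):

```r
# Hypothetical guard: refuse engine-specific args that clash with protected names
check_protected_args <- function(engine_args, protect) {
  clash <- intersect(names(engine_args), protect)
  if (length(clash) > 0) {
    stop("These arguments cannot be passed via `others`: ",
         paste(clash, collapse = ", "), call. = FALSE)
  }
  engine_args
}

# e.g. check_protected_args(list(lambda = 1.5), protect = c("lambda")) would error
```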
Change `units` to `hidden_units` and `weight_decay` to `regularization`.
Since people will be using models that they have never researched, make it easier for them to get a rough understanding of what they are asking for.
I was going through the Regression Example. Below is my code.
#devtools::install_github("topepo/modelgenerics")
#devtools::install_github("topepo/parsnip", dependencies=TRUE)
#install.packages("AmesHousing")
#install.packages("rsample")
#devtools::install_github("imbs-hl/ranger")
#install.packages("rlang")
library(parsnip)
library(AmesHousing)
library(tidyverse)
library(rsample)
library(ranger)
ames <- make_ames()
set.seed(4595)
data_split <- initial_split(ames, strata = "Sale_Price", p = 0.75)
ames_train <- training(data_split)
ames_test <- testing(data_split)
rf_defaults <- rand_forest(mode = "regression")
rf_defaults
preds <- c("Longitude", "Latitude", "Lot_Area", "Neighborhood", "Year_Sold")
rf_xy_fit <- rf_defaults %>%
fit(
x = ames_train[, preds],
y = log10(ames_train$Sale_Price),
engine = "ranger"
)
I get the following error:
Error in current_env() : could not find function "current_env"
I thought this was an rlang issue, so I removed and reinstalled rlang. Then I thought it was a session issue, so I restarted R; still no luck. I was wondering if you had any insight into this error.
Because having `predict_num()` go with `method$pred` is kind of confusing. Also, `predict_classprob()` goes with `method$prob`.
# `ovarian` gets picked up as a `recipe` argument
fit(tt, Surv(futime, fustat) ~ ecog.ps + rx, ovarian)
`fit`: `surv_reg(Surv(futime, fustat) ~ ecog.ps + rx, data = ovarian)`
Currently, there is a "protect" field that is a list of argument names that the uer should not be able to mess with. For example, with stats::glm
, "data" should not be modified until fit
is run and so on.
There should be at least one other option though: an engine-specific constructor, e.g. for `ranger`. This would help further modify default arguments as well as protect against common issues.
For example, `randomForest` has an option called `importance` that is used a lot, and it takes a logical value. `ranger` has an argument of the same name, but it takes character strings. A `ranger`-specific constructor (or function) can be used to protect against this problem, as sketched below.
Also, since not every primary argument is available for each engine (e.g. regularization for `glm`), we also need to intercept and/or modify these arguments when they are used inappropriately (instead of just ignoring them).
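A minimal sketch of such an engine-specific check (the function name and the `"impurity"` fallback are assumptions for illustration, not parsnip internals):

```r
# Hypothetical ranger-specific argument check: translate randomForest-style
# logical `importance` values into the character values ranger expects
check_ranger_args <- function(args) {
  if (is.logical(args$importance)) {
    args$importance <- if (isTRUE(args$importance)) "impurity" else "none"
    warning("ranger's `importance` takes a character string; converted to ",
            shQuote(args$importance), call. = FALSE)
  }
  args
}

check_ranger_args(list(importance = TRUE, num.trees = 500))
#> Warning: ranger's `importance` takes a character string; converted to 'impurity'
```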
It looks like all the print methods are specific to the model.
Coffee-addled rant here, but bear with me. I think it'll be really valuable, as the tidymodels universe takes off, to have a clear and well-documented definition of what a model is.
In classic statistics land, if you have some data `x` that live in a space `X`, a model is a distribution `P(X)` indexed by parameters `theta`. In linear regression with three features, `theta` lives in `R^3`. Then a fit often refers to `P(X)` where we've picked a particular `theta` to work with, and there's an isomorphism between `R^3` and all possible fits.
(Aside: calling a particular `theta` a fit isn't great language, because `fit` should be a verb referring to model fitting, not a noun referring to the object returned by the fitting process.)
To me, a key question is how we express this idea in code. For example, if we write out a linear model:
y = theta_0 + theta_1 * x_1 + ... + theta_p * x_p + epsilon
where the `epsilon` are IID Gaussian, then the following are all the same model (in the sense that they all have the same parameter space).
Sure, for the penalized regression methods you have to estimate the penalization parameter, but this is a hyperparameter, which I think we can broadly think about as a parameter that we have to estimate but that we don't really care what value it takes on. So these all have the same parameter space, but different hyperparameter spaces. Another way to express this same idea is that what differentiates MCP from LASSO from OLS, etc, etc is not that they are different models but rather that they are different techniques for estimating the same model.
(Aside: one interesting question is whether or not hierarchical models belong on the list above. I think it depends on whether or not you care about the group level parameters, in which case you are now in a new parameter space. OLS with HC errors is another interesting case to think about. In this case the model is still the linear model, but now we're more explicitly declaring that we want to estimate the covariance matrix, and also that we are going to use, say, HC1 to do so. I'd still call this a linear model, but only if the original definition of the linear model specified covariance as an estimand).
If I'm going to actually implement things in code, I want to work with an object that specifies the estimation method, which likely is closely tied to a hyperparameter space.
I think that a parsnip model specification shouldn't work with the classical stats sense of a model like we've defined above, but rather should encapsulate all the things you need to do to get parameters back. Parsnip is already doing a lot of this, but I think there's a lot of value in being very clear about what a parsnip object should specify. In my mind this includes, at the minimum:
- the hyperparameter space (e.g. `lambda in R+` for LASSO)
- a rule for selecting among fits (e.g. the `lambda` within 1 SE of the minimum RMSE)

For now I think it makes sense to call this a model specification, but I think it's critically important to distinguish between the model and the model plus all this other stuff. Similarly, after the model fitting process, when you have many different fits (one for each hyperparameter combination, say), there are tasks that involve working with all the fits together (you might be curious which LASSO variable entered the model first), and tasks that involve working with just one fit (i.e. looking at the LASSO coefficients themselves).
I strongly believe that a good interface very clearly differentiates between a group of fits and a single fit, and provides type-safe methods for working with each of these.
A related issue is to find canonical modelling examples that are sufficient to develop our intuition about what the code objects should look like. OLS is too simple because it doesn't need a lot of the machinery that other models need. I think that a good starting place is to have one canonical example where we can employ the submodel trick (penalized regression seems like a good place to start), and one where we can't (maybe SVMs here?). Another way to think about this: we should have one canonical example where there is exploitable structure in the hyperparameter space, and one canonical example where there isn't.
I think I can finally translate the thoughts from the modeling abstraction essay (a separate doc that grew out of #19) into `parsnip` terms. Some concepts to start: in `parsnip` terminology, each implementation is a different engine; `lm` specifies the OLS estimator for the linear model, and `glmnet` specifies the elastic net estimator for the linear model. Something along the lines of:
ols_hc1_fit <- linear_reg() %>%
linear_estimator(coefs = "ols", coef_covariance = "HC1") %>%
fit_xy(
x = ...,
y = ...,
engine = "lm_robust"
)
Perhaps the `linear_reg()` isn't necessary here, but it does feel the most explicit / low-level to me. In particular, I think it's important to explicitly select an estimator, rather than letting it be implicit in `engine`. All estimators are not created equal.
Currently the `parsnip` behavior is to always produce a `model_fit` object:
ols <- linear_reg() %>%
fit(hp ~ ., data = mtcars, engine = "lm")
class(ols)
># [1] "model_fit"
I'm strongly of the opinion that `ols` should have subclasses that indicate:

- the `model_fit` was estimated using ordinary least squares
- the `model_fit` object contains a single fit/submodel, as opposed to a set of fits/submodels

Without this differentiation I don't think it's possible to meaningfully define methods on `ols` for inference. Consider the following methods, all for the linear model:

- `plot_lasso_path()` only makes sense for a set of fits from the LASSO estimator
- `coef_standard_errors()` makes sense for a fit from the OLS estimator but not the LASSO estimator
- `interpret_coefficients()` should have different behavior for an OLS fit and a GEE fit

A subclassing sketch follows.
should have different behavior for an OLS fit and a GEE fitJust took a look over the making a parsnip model object from scratch vignette. There was a lot going on and I had a bit of difficulty putting the pieces together from my short glance through. I think the vignette might benefit from being broken into two separate pieces:
I finally have `parsnip` running on my laptop and I'm going to try to use it exclusively for a regression course this semester and see where I run into problems. Some things I imagine I'll be building fairly early on:

- models via `lme4` with the usual random effects syntax, right?
- an `lm_robust` engine.

Test to see if integers work for classification, etc.
Should this be another module? Maybe have a `callback` option that enables a list of callbacks to be used (see here).
This will need to be different code for spark objects, which should emulate `get_descr_form` (since we've constrained `spark` objects to the formula method).
corollary: don't fully load the underlying package, only the namespace
Hi Max,
Great meeting you at rstudio::conf. So you did decide to ultimately call it parsnip
? :-)
Cheers,
Bohdan
Right now, the `fit` function makes available certain variables that characterize the training data at the time of model fitting. The two underlying functions that do this are `get_descr_form` and `get_descr_xy`.
These may be costly, so we should have some code that determines if any data descriptors are used in the argument values. If at least one is found, we can execute the code to make them available.
In a lot of cases, there will be some data conversion from a data frame to a model matrix. There needs to be a clean interface so that the usual options can be passed along.
Also, we might need a flag for when to stop at a model frame and when to go all the way with calling `model.matrix`, roughly as sketched below.
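A minimal sketch of that flag, with an assumed `composition` argument name (not an existing parsnip option):

```r
# Convert a formula + data frame, stopping at a model frame or going all
# the way to a model matrix depending on `composition`
convert_form <- function(formula, data,
                         composition = c("model.matrix", "model.frame")) {
  composition <- match.arg(composition)
  mf <- stats::model.frame(formula, data)       # keeps factors as factors
  if (composition == "model.frame") {
    return(mf)
  }
  stats::model.matrix(attr(mf, "terms"), mf)    # expands dummy variables
}

head(convert_form(mpg ~ cyl + factor(gear), mtcars))
```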
There are a number of test cases that pass in a fresh interactive session but fail when running `devtools::check()`.
I'm reworking lots of `broom::augment()` methods at the moment and am discovering that packages do some crazy stuff to report uncertainty. Defining some standards for reporting uncertainty early on seems like a good idea.
For classification problems, reporting the class probabilities makes sense, but this can become problematic for outcomes with high cardinality. Nobody wants 1000 columns of class probabilities. One option is to just report the most likely class along with its probability, or the top `k = 5` or so classes by default (see the sketch below).
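One possible shape for the top-k reduction (a sketch, not an existing parsnip function):

```r
# Keep only the k most likely classes per row of a class-probability matrix
top_k_classes <- function(prob_matrix, k = 5) {
  lapply(seq_len(nrow(prob_matrix)), function(i) {
    p <- prob_matrix[i, ]
    sort(p, decreasing = TRUE)[seq_len(min(k, length(p)))]
  })
}

probs <- matrix(c(0.7, 0.2, 0.1, 0.1, 0.3, 0.6), nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("a", "b", "c")))
top_k_classes(probs, k = 2)
# row 1 keeps a = 0.7, b = 0.2; row 2 keeps c = 0.6, b = 0.3
```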
For regression problems I think there's more nuance. Open questions: should standard errors of prediction be returned via `se_fit` or similar?

`check_empty_ellipse`, `make_classes`, `model_printer`, and `show_call`; the `model_fit` and `model_spec` object structures.

- `kknn` package
- `decision_tree` via `rpart`, `C5.0`, `spark` (others?)
- `kernlab`
- `glmnet` and `spark`
- `earth` package
- `klaR`, `rpart` version and side package
- `linear_reg`
- `multilevelmod` package
- `censored` package

👆 Already in `parsnip` or an adjacent package
👇 Working on or thinking about
If I remove `randomForest` and then try to run:
fit(rand_forest(), formula = Species ~ ., data = iris, engine = "randomForest")
I get the following traceback:
Error in loadNamespace(name) : there is no package called ‘randomForest’
15. stop(e)
14. value[[3L]](cond)
13. tryCatchOne(expr, names, parentenv, handlers[[1L]])
12. tryCatchList(expr, classes, parentenv, handlers)
11. tryCatch(loadNamespace(name), error = function(e) stop(e))
10. getNamespace(ns)
9. asNamespace(ns)
8. getExportedValue(pkg, name)
7. randomForest::randomForest
6. eval_tidy(e, ...) at fit.R#272
5. eval_mod(fit_call, capture = control$verbosity == 0, catch = control$catch, env = env, ...) at fit_helpers.R#107
4. xy_xy(object = object, env = env, control = control, target = target) at fit_helpers.R#138
3. form_xy(object = object, control = control, env = eval_env, target = object$method$fit$interface, ...) at fit.R#135
2. fit.model_spec(rand_forest(), formula = Species ~ ., data = iris, engine = "randomForest") at models.R#116
1. fit(rand_forest(), formula = Species ~ ., data = iris, engine = "randomForest")
Pretty sure that this line in `fit()` is supposed to be right after the `check_engine()` line:
# populate `method` with the details for this model type
object <- get_method(object, engine = object$engine)
because `check_installs()` and `load_libs()` both use things from `x$method$library` (which I think is now `x$method$libs`), but `x$method` isn't populated until `get_method()` is run.
`Surv` objects of `type = "counting"`, `"interval1"`, or `"interval2"` are not currently supported.
This would mostly be for defining roles, but in the case of interval censoring it is difficult. If we let two time variables be used with the `outcome` role, we don't know what their order should be (and the recipe might change the order). If we used roles like `tmin` and `tmax`, then `juice` and helpers won't recognize them as outcomes.
Such as `mtry` for random forests. Currently:
args <- list(
mtry = rlang::enquo(mtry),
trees = rlang::enquo(trees),
min_n = rlang::enquo(min_n)
)
For example, some functions have specific data requirements: a numeric matrix (`glmnet`), arrays or tensors (`tensorflow`), or a special data structure (`xgboost`). Some of this could occur by modifying the default argument to the function (e.g. `x = as.matrix(x)`), but it would probably be better to include some code or module that checks or modifies the data, as sketched below.
The problem is the different interfaces: we would need one each for the formula, recipe, and x/y interfaces.
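An illustrative per-engine converter (the function name and mapping are assumptions for the sketch, not parsnip internals):

```r
# Coerce predictor data to the structure a given engine requires
convert_predictors <- function(x, engine) {
  switch(engine,
    glmnet  = as.matrix(x),                        # glmnet wants a numeric matrix
    xgboost = xgboost::xgb.DMatrix(as.matrix(x)),  # xgboost prefers an xgb.DMatrix
    x                                              # default: pass the data frame through
  )
}

convert_predictors(mtcars[, -1], "glmnet")[1:2, 1:3]
```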
`alternates` is poorly named; maybe `defaults`? The same is slightly true for the prediction function's `args`.
The constructors aren't really constructors (I think that they originally were constructor functions). Change them to "modules". The `{model name}_{engine}_fit` objects contain everything, so maybe call them `{model name}_{engine}_data`?
Change the `predict` function to `predict_num` and write a general wrapper `predict` method that switches between these, for less confusing documentation.
Figure out how to pass arguments down to the prediction code. For example, with `glmnet` and other sub-model-enabled prediction methods, figure out how to pass args to get other parameter estimates back.
Reminded by topepo/caret#466...
This should work but there are no test cases.
I gave a little thought to the `fit()` interface problem and this is what I came up with. I don't really like the `interface` arg name, but that's just a naming thing.
# helper if required
xy <- function(x, y) {
list(x = x, y = y)
}
# notice how engine would come before the _optional_ data param for pipeability
# all required params are now moved to the front
# engine could come before interface if you want to keep interface+data together
fit <- function(model_spec, interface, engine, data, control, ...) {
#switch based on interface being a formula VS list
}
linear_reg() %>%
fit(y ~ x1 + x2, "lm", fit_data)
linear_reg() %>%
fit(xy(fit_data[,c("x1", "x2")], fit_data[,c("y")]), "lm")
# slightly simpler
xy_defn <- xy(
x = fit_data[,c("x1", "x2")],
y = fit_data[,c("y")]
)
linear_reg() %>%
fit(xy_defn, "lm")
> linear_reg(regularization = c(0.01, 0.10))
Linear Regression Model Specification (regression)
Main Arguments:
regularization = c("0.01", "0.10")
For better tibble printing
This works:
rand_forest(mode = "regression") %>%
fit(mpg ~ ., data = mtcars, engine = "ranger")
but not this:
foo <- function(dat, ...) {
rand_forest(mode = "regression") %>%
fit(..., data = dat, engine = "ranger")
}
foo(mtcars, mpg ~ .)
# Error in eval_tidy(data) : object 'dat' not found
We try to avoid getting fancy by not executing the `data` object until it is needed, but we pass in the `call$data` expression. `eval_tidy` is digging in the wrong place! This should probably be a quosure, to capture the environment (or we should try not to be fancy). We do the same thing with the formula too. A sketch of the quosure approach follows.
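A small demonstration of the quosure fix (illustrative only; parsnip's real internals differ):

```r
library(rlang)

# Capturing `data` as a quosure records the caller's environment,
# so evaluation later still finds `dat` inside foo()
capture_then_eval <- function(data) {
  q <- enquo(data)   # expression + environment
  eval_tidy(q)       # evaluated where `dat` actually lives
}

foo <- function(dat) capture_then_eval(dat)
identical(foo(mtcars), mtcars)
#> [1] TRUE
```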
When `fit` is called, there will be a need to pass unevaluated arguments to the fit function. For example, someone might want to use:
rand_forest(mtry = expr(floor(sqrt(ncol(x)))))
but `x` might not exist when this expression is evaluated.
There should be a standard set of objects that are guaranteed to be available at the time of fit. This will need to be well documented; the number of columns depends on whether dummy variables have been created or not. As such, this might vary depending on how the data are exposed via `fit`.
Perhaps variables such as `.n`, `.p`, and others could be used so that
rand_forest(mtry = expr(floor(sqrt(.p))))
would work, along the lines of the sketch below.
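A sketch of how such descriptors could be bound at fit time (the descriptor names `.n` and `.p` are as proposed above; the helper itself is hypothetical):

```r
library(rlang)

# Evaluate a deferred argument with data descriptors in scope
eval_with_descriptors <- function(arg_expr, x) {
  descriptors <- list(.n = nrow(x), .p = ncol(x))
  eval_tidy(arg_expr, data = descriptors)
}

eval_with_descriptors(expr(floor(sqrt(.p))), mtcars[, -1])
#> [1] 3
```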
For "multi-use" models, there might be prediction arguments that are variable depending on the type of prediction being requested.
For example, with randomForest
, predict
should use the argument type = "response"
when predicting the classes or numbers but should use type = "prob"
for predict_probs
.
We could parameterize translate
to work with specific classes with set this as appropriate. Right now, the prediction call object is created in predict.model_fit
and similars. translate
could do this too based on the embedded model specification sub-object.
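A hypothetical helper for the `randomForest` case (not parsnip code; the function name is made up):

```r
# Pick randomForest's `type` argument based on the kind of prediction requested
rf_predict_type <- function(prediction) {
  switch(prediction,
    class   = "response",  # predicted classes
    numeric = "response",  # predicted numbers (regression)
    prob    = "prob",      # class probabilities
    stop("Unknown prediction type: ", prediction)
  )
}

rf_predict_type("prob")
#> [1] "prob"
```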
ERROR: dependency 'modelgenerics' is not available for package 'parsnip'
https://github.com/topepo/parsnip/blob/master/DESCRIPTION#L39
I suppose it should be replaced with "r-lib/generics"
https://github.com/topepo/parsnip/blob/master/DESCRIPTION#L24
We want the number of rows in the prediction results to be the same as the number of rows in `new_data`, perhaps along the lines of the sketch below.
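One way this could work when an engine silently drops incomplete rows (a sketch; the helper name and behavior are assumptions):

```r
# Pad engine predictions back out to one value per row of new_data
pad_predictions <- function(pred, new_data) {
  complete <- stats::complete.cases(new_data)
  out <- rep(NA_real_, nrow(new_data))
  out[complete] <- pred   # assumes the engine predicted only the complete rows
  out
}
```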
Right now, the code will make predictions at all lambda values contained in the model fit. Here is an example:
> all_lambda <-
+ linear_reg() %>%
+ fit(mpg ~ ., data = mtcars, engine = "glmnet")
>
> predict(all_lambda, new_data = mtcars[1:3, -1])
# A tibble: 237 x 2
.pred_values .pred_lambda
<dbl> <dbl>
1 20.1 5.15
2 20.1 5.15
3 20.1 5.15
4 20.4 4.69
5 20.2 4.69
6 20.5 4.69
7 20.5 4.27
8 20.4 4.27
9 21.0 4.27
10 20.7 3.89
# ... with 227 more rows
> # yuk ".pred_lambda" needs to go regardless
>
> length(unique(.Last.value$.pred_lambda))
[1] 79
The new `multi_predict` will generate predictions at multiple lambda values and should be preferred in this case. I suggest that `predict` only produce predictions at a single lambda (and otherwise throw an error that directs people to `multi_predict` for this case).
We could write some specialized predict methods for the `glmnet` subclasses (e.g. `multnet`, `lognet`, etc.) that have a `penalty` argument that accepts a single value. This would appear seamless to the user, since
predict(all_lambda, new_data = mtcars[1:3, -1]) # errors but
predict(all_lambda, new_data = mtcars[1:3, -1], penalty = .1) # would work
because
> class(all_lambda)
[1] "model_fit" "_elnet"
(and `predict._elnet` would just call `predict.model_fit` with a single parameter value, roughly as sketched below).
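A rough sketch of such a method (illustrative; the real parsnip implementation may differ):

```r
# Dispatch for glmnet elastic-net fits: require exactly one penalty value,
# then fall through to the standard predict method
predict._elnet <- function(object, new_data, penalty = NULL, ...) {
  if (is.null(penalty) || length(penalty) != 1) {
    stop("Please supply a single `penalty` value; ",
         "use `multi_predict()` for several values.", call. = FALSE)
  }
  object$spec$args$penalty <- penalty
  predict.model_fit(object, new_data = new_data, ...)
}
```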
Crappy straw-man alternative: the issue is that a lot of people are going to leave `penalty` unspecified, expect to get all possible predictions back, and be frustrated that they have to use a different predict function to get them. Making `predict.model_fit` produce predictions at all lambdas is possible, but then it behaves differently for this model (which is how we got into this mess).
Any other suggestions?
fixed method dispatch
I've been reworking the `augment()` methods and it's rapidly becoming clear that dealing with idiosyncratic `predict()` methods is going to slow down progress immensely.
In the end, `broom` and `parsnip` are both going to want to wrap a bajillion predict methods, and we should report predictions in the same way for consistency's sake. I think we should move this functionality to a separate package. Potentially we could use the `prediction` package by Thomas Leeper, but we should decide on the behavior we want first.
If we define a new generic / series of generics, we can then test these behaviors in `modeltests` and allow other modelling package developers to guarantee that their `predict()` methods are sane and consistent.
What I want from a predict method: for example, predictions for every row of new data, even in the presence of missing values (cf. `predict.lm(..., na.action = na.pass)`). I want all of these to be guaranteed, and for methods that cannot meet these guarantees, I want an informative error rather than a partially correct output.
Hello! I'm having troubles installing parsnip here from github.
devtools::install_github("topepo/parsnip")
#> Using GitHub PAT from envvar GITHUB_PAT
#> Downloading GitHub repo topepo/parsnip@master
#> from URL https://api.github.com/repos/topepo/parsnip/zipball/master
#> Installing parsnip
#> Using GitHub PAT from envvar GITHUB_PAT
#> Using GitHub PAT from envvar GITHUB_PAT
#> '/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file \
#> --no-environ --no-save --no-restore --quiet CMD INSTALL \
#> '/private/var/folders/nj/s2k7d2_93t9_87brhynnfwvc0000gn/T/Rtmpo0ApMz/devtools147a71ced9c8/topepo-parsnip-de55683' \
#> --library='/Library/Frameworks/R.framework/Versions/3.5/Resources/library' \
#> --install-tests
#>
#> ERROR: dependency ‘modelgenerics’ is not available for package ‘parsnip’
#> * removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/parsnip’
#> Installation failed: Command failed (1)
Created on 2018-08-26 by the reprex package (v0.2.0).
It appears that the problem comes from modelgenerics, which is being downloaded remotely as tidymodels/modelgenerics; however, that links directly to r-lib/generics.
When getting class predictions, the post-processor has a bug:
obj <-
rand_forest(mode = "classification", others = list(probability = TRUE)) %>%
fit(Species ~ ., data = iris, engine = "ranger")
predict(obj, newdata = iris[1:4, -5])
related to tidymodels/recipes#181...
Use a generic to create a textual summary of a model (e.g. "a random forest classifier with mtry = 4 and the number of trees = 1000")
If you use
param_grid <- random_grid(mtry, min_n, size = 5)
and want to populate a model specification, it gets kludgy. If `param` is a row of `param_grid`:
rand_forest(mtry = param[["mtry"]], min_n = param[["min_n"]])
# or
update(object, mtry = param[["mtry"]], min_n = param[["min_n"]])
There should be some version of `update` that automatically populates the parameters (which is why the names are standardized between `dials` and `parsnip`). Maybe co-opt `merge` or some other relevant verb? A sketch of the idea follows.
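A sketch of what that verb could look like (the name `populate` is made up for illustration):

```r
library(rlang)

# Splice a row of a parameter grid into a model specification via update()
populate <- function(spec, params) {
  exec(update, spec, !!!as.list(params))
}

# populate(rand_forest(), param_grid[1, ]) would stand in for
# update(object, mtry = param[["mtry"]], min_n = param[["min_n"]])
```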
I am trying to do some parameter tuning with the `ranger` engine in `parsnip`. I see that there is a `varying()` parameter, but I am confused about how to implement tuning with it.
I really liked your tuning example here from the `rsample` package. I would benefit from documentation on how to implement this nested resampling strategy in `parsnip`.