tidymodels / hardhat


Construct Modeling Packages

Home Page: https://hardhat.tidymodels.org

License: Other

R 100.00%

hardhat's Introduction

tidymodels

Overview

tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.

It includes a core set of packages that are loaded on startup:

broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.
dials has tools to create and manage values of tuning parameters.
dplyr contains a grammar for data manipulation.
ggplot2 implements a grammar of graphics.
infer is a modern approach to statistical inference.
parsnip is a tidy, unified interface to creating models.
purrr is a functional programming toolkit.
recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other help tools.
rsample has infrastructure for resampling data so that models can be assessed and empirically validated.
tibble has a modern re-imagining of the data frame.
tune contains the functions to optimize model hyper-parameters.
workflows has methods to combine pre-processing steps and models into a single object.
yardstick contains tools for evaluating models (e.g. accuracy, RMSE, etc.).

A list of all tidymodels functions across different CRAN packages can be found at https://www.tidymodels.org/find/.

You can install the released version of tidymodels from CRAN with:

install.packages("tidymodels")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidymodels")

When loading the package, the versions and conflicts are listed:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.2.0 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

For questions and discussions about tidymodels packages, modeling, and machine learning, please post on RStudio Community.
Most issues will likely belong on the GitHub repo of an individual package. If you think you have encountered a bug with the tidymodels metapackage itself, please submit an issue.
Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
Check out further details on contributing guidelines for tidymodels packages and how to get help.

hardhat's People

jyuu marlycormar gravitytrope batpigandme davisvaughan jroachell15 franzbischoff jonthegeek cregouby frank113

hardhat's Issues

Export add_intercept_column()

Maybe also warn that we aren't doing anything if an intercept col exists already

mold.data.frame() should be documented with default_xy_engine()

Do the same with the other engines

mold() is more of a stub that pushes you through to the other engine specific mold() pages.

That is where documentation for x and y is done, which makes the documentation cleaner to read.

validate_new_data_level_order()

for ordered factors

Add more documentation to other enforce functions specifying what should be run first

indicators = FALSE behavior

These should all throw warnings of some kind.

Maybe when checking the formula RHS with indicators = FALSE, we should look for only + and names, and warn about anything else! (rather than special casing everything)

library(hardhat)
library(gapminder)
gapminder <- gapminder[1:5,]
mold(year ~ year*continent, gapminder, indicators = FALSE)
#> $predictors
#> # A tibble: 5 x 2
#>    year continent
#>   <int> <fct>    
#> 1  1952 Asia     
#> 2  1957 Asia     
#> 3  1962 Asia     
#> 4  1967 Asia     
#> 5  1972 Asia     
#> 
#> $outcomes
#> # A tibble: 5 x 1
#>    year
#>   <int>
#> 1  1952
#> 2  1957
#> 3  1962
#> 4  1967
#> 5  1972
#> 
#> $preprocessor
#> Formula Preprocessor: 
#>  
#> # Predictors: 2 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: FALSE 
#> 
#> $offset
#> NULL
mold(year ~ year*continent - year, gapminder, indicators = FALSE)
#> $predictors
#> # A tibble: 5 x 2
#>    year continent
#>   <int> <fct>    
#> 1  1952 Asia     
#> 2  1957 Asia     
#> 3  1962 Asia     
#> 4  1967 Asia     
#> 5  1972 Asia     
#> 
#> $outcomes
#> # A tibble: 5 x 1
#>    year
#>   <int>
#> 1  1952
#> 2  1957
#> 3  1962
#> 4  1967
#> 5  1972
#> 
#> $preprocessor
#> Formula Preprocessor: 
#>  
#> # Predictors: 2 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: FALSE 
#> 
#> $offset
#> NULL
mold(year ~ (year+continent+pop)^2, gapminder, indicators = FALSE)
#> $predictors
#> # A tibble: 5 x 3
#>    year continent      pop
#>   <int> <fct>        <int>
#> 1  1952 Asia       8425333
#> 2  1957 Asia       9240934
#> 3  1962 Asia      10267083
#> 4  1967 Asia      11537966
#> 5  1972 Asia      13079460
#> 
#> $outcomes
#> # A tibble: 5 x 1
#>    year
#>   <int>
#> 1  1952
#> 2  1957
#> 3  1962
#> 4  1967
#> 5  1972
#> 
#> $preprocessor
#> Formula Preprocessor: 
#>  
#> # Predictors: 3 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: FALSE 
#> 
#> $offset
#> NULL

Created on 2019-02-16 by the reprex package (v0.2.1.9000)
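The "only + and names" check suggested above could be sketched roughly like this. It is a minimal base R sketch, not hardhat code: `rhs_is_simple()` is a hypothetical helper name, and treating parentheses and numeric constants as "anything else" is an assumption about how strict the check should be.

```r
# Hypothetical helper, not part of hardhat: returns TRUE when the RHS of a
# formula is built only from `+` and bare variable names.
rhs_is_simple <- function(formula) {
  # The RHS is the last element of the formula call (works for one-sided too)
  rhs <- formula[[length(formula)]]

  is_simple <- function(expr) {
    # A bare name such as `year` is allowed
    if (is.symbol(expr)) {
      return(TRUE)
    }
    # A call to `+` is allowed if all of its arguments are themselves simple
    if (is.call(expr) && identical(expr[[1]], as.name("+"))) {
      return(all(vapply(as.list(expr)[-1], is_simple, logical(1))))
    }
    # Everything else (`*`, `:`, `^`, `-`, `(`, constants, function calls)
    FALSE
  }

  is_simple(rhs)
}

rhs_is_simple(year ~ year + continent)           # only `+` and names
rhs_is_simple(year ~ year * continent)           # uses `*`
rhs_is_simple(year ~ (year + continent + pop)^2) # uses `^` and `(`
```

With something like this, mold() could warn whenever `indicators = FALSE` and the check returns FALSE, instead of special casing each operator.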

Do we really need forge_impl()?

It would get around the non-exported generic extensibility problem if we just moved it into forge()

Also, does forge() need to be generic? Error catching should be done by $clean()

mold() arguments

Rather than mold() with indicators and intercept, those args should only be in the engine as they are properties of the engine to begin with.

So mold(x, data) would use the default engine with intercept=FALSE but if you want to use an intercept you'd do mold(x, data, engine = default_formula_engine(intercept = TRUE))

Should hardhat have typed error messages?

Print method for a preprocessor

Maybe move engine$clean() and engine$process() argument checks into the base engine

So new_engine() could check engine$mold$clean() for an engine arg (not the data as that differs)

and engine$forge$clean() could be checked for engine and new_data

`mold()` signature regarding the dots

mold(x, data, ..., engine = NULL) because engine is "details" and it is good practice to make users be explicit about this

validate_new_data_predictors_exist()

Add tests for more complex interaction specifications

These are all correct but should be tested

library(hardhat)
library(gapminder)

# year + continent + year:continent
mold(year ~ year*continent, gapminder)
#> $predictors
#> # A tibble: 1,704 x 10
#>     year continentAfrica continentAmeric… continentAsia continentEurope
#>    <dbl>           <dbl>            <dbl>         <dbl>           <dbl>
#>  1  1952               0                0             1               0
#>  2  1957               0                0             1               0
#>  3  1962               0                0             1               0
#>  4  1967               0                0             1               0
#>  5  1972               0                0             1               0
#>  6  1977               0                0             1               0
#>  7  1982               0                0             1               0
#>  8  1987               0                0             1               0
#>  9  1992               0                0             1               0
#> 10  1997               0                0             1               0
#> # … with 1,694 more rows, and 5 more variables: continentOceania <dbl>,
#> #   `year:continentAmericas` <dbl>, `year:continentAsia` <dbl>,
#> #   `year:continentEurope` <dbl>, `year:continentOceania` <dbl>
#> 
#> $outcomes
#> # A tibble: 1,704 x 1
#>     year
#>    <int>
#>  1  1952
#>  2  1957
#>  3  1962
#>  4  1967
#>  5  1972
#>  6  1977
#>  7  1982
#>  8  1987
#>  9  1992
#> 10  1997
#> # … with 1,694 more rows
#> 
#> $preprocessor
#> Formula Preprocessor: 
#>  
#> # Predictors: 2 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: TRUE 
#> 
#> $offset
#> NULL

# basically year + continent
mold(year ~ year*continent - year:continent, gapminder)
#> $predictors
#> # A tibble: 1,704 x 6
#>     year continentAfrica continentAmeric… continentAsia continentEurope
#>    <dbl>           <dbl>            <dbl>         <dbl>           <dbl>
#>  1  1952               0                0             1               0
#>  2  1957               0                0             1               0
#>  3  1962               0                0             1               0
#>  4  1967               0                0             1               0
#>  5  1972               0                0             1               0
#>  6  1977               0                0             1               0
#>  7  1982               0                0             1               0
#>  8  1987               0                0             1               0
#>  9  1992               0                0             1               0
#> 10  1997               0                0             1               0
#> # … with 1,694 more rows, and 1 more variable: continentOceania <dbl>
#> 
#> $outcomes
#> # A tibble: 1,704 x 1
#>     year
#>    <int>
#>  1  1952
#>  2  1957
#>  3  1962
#>  4  1967
#>  5  1972
#>  6  1977
#>  7  1982
#>  8  1987
#>  9  1992
#> 10  1997
#> # … with 1,694 more rows
#> 
#> $preprocessor
#> Formula Preprocessor: 
#>  
#> # Predictors: 2 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: TRUE 
#> 
#> $offset
#> NULL

# year, continent, pop, all 2nd ord interact
mold(year ~ (year+continent+pop)^2, gapminder)
#> $predictors
#> # A tibble: 1,704 x 16
#>     year continentAfrica continentAmeric… continentAsia continentEurope
#>    <dbl>           <dbl>            <dbl>         <dbl>           <dbl>
#>  1  1952               0                0             1               0
#>  2  1957               0                0             1               0
#>  3  1962               0                0             1               0
#>  4  1967               0                0             1               0
#>  5  1972               0                0             1               0
#>  6  1977               0                0             1               0
#>  7  1982               0                0             1               0
#>  8  1987               0                0             1               0
#>  9  1992               0                0             1               0
#> 10  1997               0                0             1               0
#> # … with 1,694 more rows, and 11 more variables: continentOceania <dbl>,
#> #   pop <dbl>, `year:continentAmericas` <dbl>, `year:continentAsia` <dbl>,
#> #   `year:continentEurope` <dbl>, `year:continentOceania` <dbl>,
#> #   `year:pop` <dbl>, `continentAmericas:pop` <dbl>,
#> #   `continentAsia:pop` <dbl>, `continentEurope:pop` <dbl>,
#> #   `continentOceania:pop` <dbl>
#> 
#> $outcomes
#> # A tibble: 1,704 x 1
#>     year
#>    <int>
#>  1  1952
#>  2  1957
#>  3  1962
#>  4  1967
#>  5  1972
#>  6  1977
#>  7  1982
#>  8  1987
#>  9  1992
#> 10  1997
#> # … with 1,694 more rows
#> 
#> $preprocessor
#> Formula Preprocessor: 
#>  
#> # Predictors: 3 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: TRUE 
#> 
#> $offset
#> NULL

# year + year:continent
mold(pop ~ year + continent %in% year, gapminder)
#> $predictors
#> # A tibble: 1,704 x 6
#>     year `year:continent… `year:continent… `year:continent…
#>    <dbl>            <dbl>            <dbl>            <dbl>
#>  1  1952                0                0             1952
#>  2  1957                0                0             1957
#>  3  1962                0                0             1962
#>  4  1967                0                0             1967
#>  5  1972                0                0             1972
#>  6  1977                0                0             1977
#>  7  1982                0                0             1982
#>  8  1987                0                0             1987
#>  9  1992                0                0             1992
#> 10  1997                0                0             1997
#> # … with 1,694 more rows, and 2 more variables:
#> #   `year:continentEurope` <dbl>, `year:continentOceania` <dbl>
#> 
#> $outcomes
#> # A tibble: 1,704 x 1
#>         pop
#>       <int>
#>  1  8425333
#>  2  9240934
#>  3 10267083
#>  4 11537966
#>  5 13079460
#>  6 14880372
#>  7 12881816
#>  8 13867957
#>  9 16317921
#> 10 22227415
#> # … with 1,694 more rows
#> 
#> $preprocessor
#> Formula Preprocessor: 
#>  
#> # Predictors: 2 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: TRUE 
#> 
#> $offset
#> NULL

Created on 2019-02-16 by the reprex package (v0.2.1.9000)

(Documentation) pkgdown reference update

To reflect having engines now

mold(formula, dummy = TRUE) for factors

Tree based methods might want to use the formula method but not expand factors to dummies (straight up factors, or interactions with factors), but they might still want purely numeric interactions.

Should the LHS be checked for interaction terms? Other things?

Maybe don't allow for `type =` flexibility

It would greatly simplify some things, and make it straightforward to add a run_model_matrix = FALSE arg to the formula method of mold() if we always returned a tibble.

It would also generally clean up the mold() fn call. And things would generally be more type stable for the developer, at the cost of some performance loss if the user passes in a matrix to mold(x = <matrix>) and then the developer wants that back as a matrix. (Then again, they wouldn't have to use mold() if they didn't want to, and we could export the add_intercept_column() fn if we wanted to)

Offsets

mold() currently allows for offsets in the formula method directly ~ offset(Sepal.Length) but you don't get them back at all. We should return the offset as a slot in the return value from mold() as a tibble with 1 column, .offset. Extract them in bake_terms_() with model.offset() if they exist.

In forge(), we could do the same thing.

The actual preprocessor for the terms method should store an offset = FALSE indicator to know whether or not we need to look for it in forge()
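A rough sketch of the extraction step, using only base R. `extract_offset()` is a hypothetical name, and a plain data frame stands in for the 1 column tibble to keep the example dependency free; this is not how bake_terms_() is actually written.

```r
# Hypothetical helper, not hardhat's implementation: pull the offset out of a
# model frame and return it as a one-column data frame named `.offset`,
# or NULL when the formula has no offset.
extract_offset <- function(formula, data) {
  frame <- stats::model.frame(formula, data)
  offset <- stats::model.offset(frame)

  if (is.null(offset)) {
    return(NULL)
  }

  data.frame(.offset = offset)
}

with_offset <- extract_offset(Sepal.Width ~ Species + offset(Sepal.Length), iris)
head(with_offset)

# No offset in the formula, so nothing is returned
extract_offset(Sepal.Width ~ Species, iris)
```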

offset should count as an `extras` for the formula interface

but shouldn't be present in the others

forge(outcomes= TRUE) when using the XY method

Since I add a default column name, .outcome, to y when it is converted to a tibble, I could just have forge() look for a column named .outcome to be there in new_data. This would require 0 extra effort on my part, it would do this already if I didn’t prematurely error out.

This would also allow the user to pass in a data frame for y (where you obviously know the column name for the outcome) and then request outcomes to be processed in forge(). (You currently can’t do this because that also goes through the XY method)

The only extra thing I would do is have a special check in place if forge(outcomes = TRUE) is requested, and the ".outcome" column doesn’t exist in new_data. It would make it very clear that the user passed a vector to y and that vector was given the name .outcome so that is what forge() is looking for.
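The special check could look something like the sketch below. `check_outcome_column()` and the wording of the error are made up for illustration; only the `.outcome` default name comes from the description above.

```r
# Hypothetical check, not part of hardhat: fail early, with an explanation,
# when outcomes are requested but the default outcome column is missing.
check_outcome_column <- function(new_data, outcome_name = ".outcome") {
  if (!outcome_name %in% names(new_data)) {
    stop(
      "`new_data` has no column named \"", outcome_name, "\". ",
      "A vector was supplied as `y`, so the outcome was given this default name.",
      call. = FALSE
    )
  }
  invisible(new_data)
}

# Passes silently when the column is present
check_outcome_column(data.frame(x = 1:3, .outcome = c(2.5, 3.1, 4.0)))

# Errors with the explanatory message when it is not
try(check_outcome_column(data.frame(x = 1:3)))
```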

Data first functions

scream(new_data, preprocessor, outcome) not scream(preprocessor, new_data, outcome)

Should outcome_levels be held onto?

To validate new_data outcomes with preprocess(outcome = TRUE)

mold() with dots doesn't remove the LHS

library(hardhat)

mold(Species ~ ., iris)
#> $predictors
#> # A tibble: 150 x 7
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Speciessetosa
#>           <dbl>       <dbl>        <dbl>       <dbl>         <dbl>
#>  1          5.1         3.5          1.4         0.2             1
#>  2          4.9         3            1.4         0.2             1
#>  3          4.7         3.2          1.3         0.2             1
#>  4          4.6         3.1          1.5         0.2             1
#>  5          5           3.6          1.4         0.2             1
#>  6          5.4         3.9          1.7         0.4             1
#>  7          4.6         3.4          1.4         0.3             1
#>  8          5           3.4          1.5         0.2             1
#>  9          4.4         2.9          1.4         0.2             1
#> 10          4.9         3.1          1.5         0.1             1
#> # … with 140 more rows, and 2 more variables: Speciesversicolor <dbl>,
#> #   Speciesvirginica <dbl>
#> 
#> $outcomes
#> # A tibble: 150 x 1
#>    Species
#>    <fct>  
#>  1 setosa 
#>  2 setosa 
#>  3 setosa 
#>  4 setosa 
#>  5 setosa 
#>  6 setosa 
#>  7 setosa 
#>  8 setosa 
#>  9 setosa 
#> 10 setosa 
#> # … with 140 more rows
#> 
#> $engine
#> Formula Engine: 
#>  
#> # Predictors: 5 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#>   Indicators: TRUE 
#> 
#> $extras
#> $extras$offset
#> NULL

Created on 2019-03-01 by the reprex package (v0.2.1.9000)

More flexible LHS terms object

Investigate using 0 row slice for `info`

prepare and preprocess aren't type stable wrt the outcome

prepare() can return a vector, matrix, or data frame depending on the preprocessor and whether or not we are doing multivariate. preprocess() returns a data.frame for the formula method, or a tibble for recipes, which is a little better. Do we need an arg for outcome type?

Clean up structure of preprocessor elements

All predictor elements should be in a list together

default_preprocessor objects should have a slot for the predictors

Since all of the other preprocessors contain the info about the predictors. That way shrink() can work without the user implementing a model that holds onto predictors.

checking if variable is being used as an outcome and predictor

I don't know whether it would fit within this package, or somewhere else. But would it be beneficial to alert (or error) the user if a variable appears on both the left and right side of a formula?
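For what it's worth, a check like this is short in base R. The sketch below uses a hypothetical `shared_vars()` helper; it compares variable names only and does not expand `.` on the RHS, so `Species ~ .` would need the data to be checked properly.

```r
# Hypothetical helper, not part of hardhat: names that appear on both sides
# of a two-sided formula.
shared_vars <- function(formula) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3L)
  lhs_vars <- all.vars(formula[[2]])
  rhs_vars <- all.vars(formula[[3]])
  intersect(lhs_vars, rhs_vars)
}

shared_vars(year ~ year * continent) # `year` is on both sides
shared_vars(pop ~ year + continent)  # nothing shared
```

A wrapper could then warn or error whenever the result is non-empty.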

new_default_formula_engine() helpers

So we need default_formula_engine()

This might mean we need the constructor to have the full argument set again, and can be subclassed. This would simplify refresh_engine()

new_data factors with a subset of original factor levels

Posted from slack:

So here is a question for something forge() could do. Say in the fit function you had a factor f and it has levels c("a", "b", "c")

Then you went to predict 1 new value, and it had that same f factor predictor, but it just happened to only have levels "a".

I don’t think you want forge() to fail here, but I also don’t think you want it to do…nothing.

We have all of the information required to recode that factor using factor(<new_data_factor>, levels = <original_levels>) and then pass it along if required.

Does this seem sensible? Note that we still warn and coerce new factor levels to NA, but this is when the factor has a subset of the originals
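A small base R illustration of the recoding described above. The variable names are just for the example; the second half shows the existing "novel level becomes NA" case for contrast.

```r
# Levels seen when the model was fit
original_levels <- c("a", "b", "c")

# New data where the factor only carries a subset of the original levels
new_f <- factor("a")
levels(new_f)

# Recode against the original levels; values are kept, levels are restored
recoded <- factor(new_f, levels = original_levels)
levels(recoded)

# A genuinely new level ("d") is not in the original set and becomes NA
novel <- factor(c("a", "d"))
recoded_novel <- factor(novel, levels = original_levels)
recoded_novel
```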

Generalize

What if we exported all of the new_preprocessor() functions, and their engines? And then standardized the preprocessor engines to all have a mold() and forge() function attached? Basically how the default engine has process() right now. This would let us export the functionality of what bake_terms_engine() does for us in a clean way, wrapped up in preprocessor$engine$forge() that should have standard args across engines

Vignette on using hardhat in a package

Maybe a super simple model example showing when to call mold() and forge()

Should `engine` be something else?

So we don't clash with the parsnip idea of an engine?

spruce_conf_int()

What are the inputs and outputs?

What about classification vs regression? Different number of output columns.

Add documentation (to package-template?) on case weights

They aren't supported directly by mold(), but would be up to the modeler to use correctly. Maybe we can provide a validation function to ensure they are valid looking case weights (integer-ish, same length as x, etc)

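# Sketch only: linear_regression_impl(), linear_reg_obj(), and spruce_() are placeholders
linear_regression <- function(x, y, case_weights) {
  # mold() does not see the case weights; they go straight to the implementation
  processed <- mold(x, y)
  fit <- linear_regression_impl(
    x = processed$predictors,
    y = processed$outcomes,
    case_weights = case_weights
  )
  linear_reg_obj(fit, pre = processed$preprocessor)
}

# Method name matches the class created by linear_reg_obj()
predict.linear_reg_obj <- function(object, new_data) {
  new_data <- forge(object$pre, new_data)
  ...
  spruce_()
}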
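The validation function mentioned above might look roughly like this. `validate_case_weights()` is a hypothetical name; the integer-ish and length checks come from the description, while rejecting NA and negative weights is an extra assumption.

```r
# Hypothetical validator, not part of hardhat.
validate_case_weights <- function(case_weights, x) {
  if (!is.numeric(case_weights)) {
    stop("`case_weights` must be numeric.", call. = FALSE)
  }
  # NROW() handles vectors, matrices, and data frames alike
  if (length(case_weights) != NROW(x)) {
    stop("`case_weights` must have one value per row of `x`.", call. = FALSE)
  }
  # Assumption: missing or negative weights are never valid
  if (anyNA(case_weights) || any(case_weights < 0)) {
    stop("`case_weights` must be non-missing and non-negative.", call. = FALSE)
  }
  # Integer-ish: doubles are fine as long as they hold whole numbers
  if (any(case_weights != trunc(case_weights))) {
    stop("`case_weights` must be integer-ish.", call. = FALSE)
  }
  invisible(case_weights)
}

validate_case_weights(c(1, 2, 1), data.frame(a = 1:3))
try(validate_case_weights(c(1.5, 2, 1), data.frame(a = 1:3)))
```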

better error message if outcome is not in new_data and preprocess() requests it

`forge()` doesn't need `...`

standard helpers for output lists

returned from forge-process-predictors and forge-process-outcomes function: (output would get directly assigned to either predictors/outcomes so the name isn't that important)

list(
    engine = engine,
    output = list(
      data = data,
      extras = NULL
    )
  )

returned from forge-process functions:

list(
    engine = engine,
    predictors = .predictors,
    outcomes = .outcomes
  )

returned from forge-clean

list(
    engine = engine,
    new_data = new_data
  )

returned from mold-process-predictors/outcomes:

list(
    engine = engine,
    output = list(
      data = data,
      info = info,
      extras = NULL
    )
  )

returned from mold-process:

list(
    engine = engine,
    predictors = predictors,
    outcomes = outcomes
  )

returned from mold-clean

list(
    engine = engine,
    data = data
  )

returned from mold-clean-xy

list(
    engine = engine,
    x = x,
    y = y
  )

(Documentation) Fix offset->extras mentions

Better input validation to mold()

I don't think data is currently being checked

extract_info()

Takes in a data frame or matrix and returns an info list with names, levels, classes. This is the easiest way to expose this part to developers
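One possible shape for this, as a base R sketch. The function is suffixed `_sketch` to make clear it is not the real extract_info(); converting matrices to data frames first and storing levels only for factor columns are assumptions.

```r
# Hypothetical sketch, not hardhat's implementation.
extract_info_sketch <- function(x) {
  # Assumption: matrices are converted so both inputs share one code path
  if (is.matrix(x)) {
    x <- as.data.frame(x, stringsAsFactors = FALSE)
  }
  stopifnot(is.data.frame(x))

  list(
    names = names(x),
    classes = lapply(x, class),
    # Only factor columns have levels worth recording
    levels = lapply(Filter(is.factor, x), levels)
  )
}

info <- extract_info_sketch(iris)
info$names
info$classes$Sepal.Length
info$levels$Species
```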

Maybe have `model_frame()`

And return a list of frame (a tibble) and terms

Silently add missing levels back

This might require rethinking the naming of the enforce_ functions

Add a test for nested inline offsets

Just to show this is the same as base R, these are not recognized

# not recognized as offset! good!
library(gapminder)
mf <- model.frame(country ~ log(offset(year)), gapminder)
attr(mf, "terms")
#> country ~ log(offset(year))
#> attr(,"variables")
#> list(country, log(offset(year)))
#> attr(,"factors")
#>                   log(offset(year))
#> country                           0
#> log(offset(year))                 1
#> attr(,"term.labels")
#> [1] "log(offset(year))"
#> attr(,"order")
#> [1] 1
#> attr(,"intercept")
#> [1] 1
#> attr(,"response")
#> [1] 1
#> attr(,".Environment")
#> <environment: R_GlobalEnv>
#> attr(,"predvars")
#> list(country, log(offset(year)))
#> attr(,"dataClasses")
#>           country log(offset(year)) 
#>          "factor"         "numeric"
head(model.matrix(terms(mf), mf))
#>   (Intercept) log(offset(year))
#> 1           1          7.576610
#> 2           1          7.579168
#> 3           1          7.581720
#> 4           1          7.584265
#> 5           1          7.586804
#> 6           1          7.589336
model.offset(mf)
#> NULL

Created on 2019-02-16 by the reprex package (v0.2.1.9000)

Should predictor classes be held onto?

If a matrix is used, store the column names as all numeric. The new_data can be a data frame or matrix so we would still need to validate data frame input.

We could use .MFclass() because it collapses integer/double together and has special handling for matrices.

What about outcome classes? For preprocess(outcome = TRUE) we might need to validate these too.
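For reference, this is how stats::.MFclass() behaves on a few inputs. The snippet just calls the base R function directly and says nothing about how hardhat would end up storing the result.

```r
# Integer and double both collapse to "numeric"; factors and characters
# keep their own labels
classes <- vapply(
  list(int = 1L, dbl = 1.5, fct = factor("a"), chr = "a"),
  stats::.MFclass,
  character(1)
)
classes

# Numeric matrices get a special label that includes the column count
matrix_class <- stats::.MFclass(matrix(1:4, ncol = 2))
matrix_class
```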

Investigate using vctrs for coercion

Coercion of new_data columns into their correct type (maybe characters can be coerced to factors automatically)
