stacks - tidy model stacking

stacks is an R package for model stacking that aligns with the tidymodels framework. Model stacking is an ensembling method that takes the outputs of many models and combines them to generate a new model (referred to as an ensemble in this package) that generates predictions informed by each of its members.

The process goes something like this:

  1. Define candidate ensemble members using functionality from rsample, parsnip, workflows, recipes, and tune
  2. Initialize a data_stack object with stacks()
  3. Iteratively add candidate ensemble members to the data_stack with add_candidates()
  4. Evaluate how to combine their predictions with blend_predictions()
  5. Fit candidate ensemble members with non-zero stacking coefficients with fit_members()
  6. Predict on new data with predict()

You can install the (unstable) development version of this package with the following code:

remotes::install_github("tidymodels/stacks", ref = "main")

Rather than diving right into the implementation, we’ll focus here on how the pieces fit together, conceptually, in building an ensemble with stacks. See the basics vignette for an example of the API in action!

a grammar

At the highest level, ensembles are formed from model definitions. In this package, model definitions are an instance of a minimal workflow, containing a model specification (as defined in the parsnip package) and, optionally, a preprocessor (as defined in the recipes package). Model definitions specify the form of candidate ensemble members.

To be used in the same ensemble, each of these model definitions must share the same resample. This rsample rset object, when paired with the model definitions, can be used to generate the tuning/fitting results objects for the candidate ensemble members with tune.

The package will sometimes refer to sub-models. A sub-model is an untrained candidate ensemble member, whereas an ensemble member is a sub-model that has actually been selected for use in the ensemble (usually via a non-zero stacking coefficient) and possibly trained. Once selected, an ensemble member is no longer regarded as resulting from a specific model definition.

Sub-models first come together in a data_stack object through the add_candidates() function. Principally, these objects are just tibbles, where the first column gives the true outcome in the assessment set, and the remaining columns give the predictions from each candidate ensemble member. (When the outcome is numeric, there’s only one column per candidate ensemble member. Classification requires as many columns per candidate as there are levels in the outcome variable.) They also bring along a few extra attributes to keep track of model definitions.
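To make that layout concrete, here is a minimal sketch in base R of what such a table might look like. It uses plain data frames rather than real data_stack objects, and every column name and value is hypothetical, chosen only for illustration.

```r
# Hypothetical regression stack: the true outcome, then ONE prediction
# column per candidate ensemble member.
reg_stack <- data.frame(
  outcome   = c(3.1, 4.7, 5.2),
  lin_reg_1 = c(3.0, 4.9, 5.0),
  tree_1    = c(3.4, 4.4, 5.5)
)

# Hypothetical classification stack: the true outcome, then one
# class-probability column PER OUTCOME LEVEL for each candidate.
cls_outcome <- factor(c("yes", "no", "yes"), levels = c("yes", "no"))
cls_stack <- data.frame(
  outcome           = cls_outcome,
  .pred_yes_logit_1 = c(0.8, 0.3, 0.6),
  .pred_no_logit_1  = c(0.2, 0.7, 0.4),
  .pred_yes_tree_1  = c(0.9, 0.4, 0.5),
  .pred_no_tree_1   = c(0.1, 0.6, 0.5)
)

# Two candidates in each case, but the column counts differ.
n_candidates <- 2
ncol(reg_stack) - 1 == n_candidates                        # one column each
ncol(cls_stack) - 1 == n_candidates * nlevels(cls_outcome) # one per level each
```

Both comparisons hold for these toy tables: the regression stack has one column per candidate, and the classification stack has one column per candidate per outcome level.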

Then, the data stack can be evaluated using blend_predictions() to determine how best to combine the outputs from each of the candidate member models.

Note that the fitting process is not sensitive to model definition membership. That is, while fitting an ensemble from a stack, the components are regarded as candidate ensemble members rather than as sub-models.

The outputs of the candidate members are likely to be highly correlated. Thus, depending on the degree of regularization you choose, the coefficients for (possibly) many of the members will be zeroed out: their predictions will have no influence on the final output, and those terms will be dropped.
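The effect of regularization on correlated candidates can be sketched in base R. This is not the code stacks runs; it is a small, hand-written coordinate-descent solver for a non-negative lasso, applied to simulated predictions. The function nn_lasso(), the candidate names, and the penalty values are all assumptions made for illustration.

```r
# Non-negative lasso by coordinate descent (illustrative, base R only).
# Minimizes (1 / (2n)) * sum((y - b0 - X %*% beta)^2) + lambda * sum(beta)
# subject to beta >= 0.
nn_lasso <- function(X, y, lambda, n_iter = 500) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  b0 <- mean(y)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Residual with candidate j's own contribution left out
      r <- y - b0 - drop(X[, -j, drop = FALSE] %*% beta[-j])
      rho <- sum(X[, j] * r) / n
      z <- sum(X[, j]^2) / n
      # Shrink by the penalty, then clamp at zero
      beta[j] <- max(0, (rho - lambda) / z)
    }
    b0 <- mean(y - drop(X %*% beta))
  }
  list(intercept = b0, coefficients = setNames(beta, colnames(X)))
}

# Simulated predictions: two strong, highly correlated candidates and one
# weaker candidate (all hypothetical).
set.seed(42)
n <- 1000
truth <- rnorm(n)
preds <- cbind(
  cand_a = truth + rnorm(n, sd = 0.3),
  cand_b = truth + rnorm(n, sd = 0.3),
  cand_c = 0.5 * truth + rnorm(n, sd = 1)
)

light <- nn_lasso(preds, truth, lambda = 0.001)
heavy <- nn_lasso(preds, truth, lambda = 0.3)

# Compare which candidates keep a non-zero coefficient at each penalty.
rbind(light = light$coefficients, heavy = heavy$coefficients)
```

Comparing the two rows shows how a larger penalty shrinks every coefficient and can push the weaker candidate's coefficient to exactly zero, which is the sense in which a member is dropped from the ensemble.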

These stacking coefficients then determine which sub-models become ensemble members: sub-models with non-zero stacking coefficients are fitted, and together they make up a model_stack object.

This model stack object, returned by fit_members(), is ready to predict on new data!
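The grammar described above can be sketched end to end as follows. This is a minimal illustration rather than the package's canonical example: the mtcars data, the two model definitions, and the grid size are assumptions chosen to keep it short, and it presumes the tidymodels, stacks, and rpart packages are installed.

```r
library(tidymodels)
library(stacks)

set.seed(1)

# One shared set of resamples for every model definition
folds <- vfold_cv(mtcars, v = 5)

# A shared preprocessor (illustrative: predict mpg from all other columns)
rec <- recipe(mpg ~ ., data = mtcars)

# Model definition 1: linear regression, a single candidate member
lin_res <- workflow() %>%
  add_recipe(rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit_resamples(resamples = folds, control = control_stack_resamples())

# Model definition 2: a decision tree with a tuned parameter, so each
# grid point becomes its own candidate member (sub-model)
tree_res <- workflow() %>%
  add_recipe(rec) %>%
  add_model(
    decision_tree(cost_complexity = tune()) %>%
      set_engine("rpart") %>%
      set_mode("regression")
  ) %>%
  tune_grid(resamples = folds, grid = 4, control = control_stack_grid())

# Data stack -> stacking coefficients -> fitted model stack
ens <- stacks() %>%
  add_candidates(lin_res) %>%
  add_candidates(tree_res) %>%
  blend_predictions() %>%
  fit_members()

# Predict on new data (here the training data, purely for illustration)
preds <- predict(ens, new_data = mtcars)
```

The control_stack_resamples() and control_stack_grid() helpers ask tune to save the assessment-set predictions and the workflow, which add_candidates() needs in order to build the data stack.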

At a high level, the process follows the six steps listed in the introduction above.

The API for the package closely mirrors these ideas. See the basics vignette for an example of how this grammar is implemented!

contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

In the stacks package, some test objects take too long to build with every commit. If your contribution changes the structure of data_stack or model_stack objects, please regenerate these test objects by running the scripts in man-roxygen/example_models.Rmd, including those with chunk options eval = FALSE.

Contributors: simonpcouch, topepo
