modeldata's Introduction

tidymodels

Overview

tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.

It includes a core set of packages that are loaded on startup:

  • broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.

  • dials has tools to create and manage values of tuning parameters.

  • dplyr contains a grammar for data manipulation.

  • ggplot2 implements a grammar of graphics.

  • infer is a modern approach to statistical inference.

  • parsnip is a tidy, unified interface to creating models.

  • purrr is a functional programming toolkit.

  • recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other helper tools.

  • rsample has infrastructure for resampling data so that models can be assessed and empirically validated.

  • tibble has a modern re-imagining of the data frame.

  • tune contains the functions to optimize model hyper-parameters.

  • workflows has methods to combine pre-processing steps and models into a single object.

  • yardstick contains tools for evaluating models (e.g. accuracy, RMSE, etc.).

A list of all tidymodels functions across different CRAN packages can be found at https://www.tidymodels.org/find/.

You can install the released version of tidymodels from CRAN with:

install.packages("tidymodels")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidymodels")

When loading the package, the versions and conflicts are listed:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.2.0 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

modeldata's People

Contributors

davisvaughan, emilhvitfeldt, hfrick, juliasilge, mdogucu, simonpcouch, topepo

modeldata's Issues

Consider replacing okc_text dataset

There was some controversy around this dataset, and around OkCupid data in general (see https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release and https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/), and in some ways okc_text feels a little icky to use.

In addition, the data doesn't lend itself to modeling tasks well.

It would be nice if we could find a clearly public dataset that doesn't have these problems.

As far as I can see, {textrecipes} is the only package using it: https://github.com/search?p=1&q=okc_text&type=Code. So the repercussions from this change will mostly be on my shoulders.

CC @juliasilge

Maybe use a cache for larger data sets

To get around CRAN's package size limit, we could host the data sets on GitHub in this repo, have URLs that point to them, and cache them on the user's machine.

I imagine it would look like:

data_ames <- function() {
  if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}

We could follow the lead of pak, which uses the following function to determine where R's global permanent cache is:

https://github.com/r-lib/pak/blob/e65de1e9630dbfcaf1044718b742bf806486b107/R/utils.R#L84

and then we could save into <cache-path>/model-data/ames.rds

To be even faster, we would only load the data once per R session. Once we load it from the cache directory, we would store it in an environment internal to modeldata and pull it from there each time data_ames() is called. So it might look more like:

data_ames <- function() {
  if (has_data_in_internal_environment("ames")) {
    get_data_from_internal_environment("ames")
  } else if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}

The datasets themselves would actually live in a folder in this repo that would be .Rbuildignore-d. For example: inst/data/ames.rds and then ignore inst/data
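
A minimal sketch of the three-level lookup described above (session environment, then on-disk cache, then download). Everything here is hypothetical: get_data(), cache_dir(), cache_path(), and the fetch argument are illustration names, not modeldata functions, and tools::R_user_dir() (R >= 4.0.0) stands in for the pak helper linked above.

```r
# Hypothetical helpers for illustration; none of these exist in modeldata.

# Session-level store, so each data set is read from disk at most once.
.data_env <- new.env(parent = emptyenv())

# Assumption: use base R's per-user cache directory instead of pak's helper.
cache_dir <- function() {
  file.path(tools::R_user_dir("modeldata", which = "cache"), "model-data")
}

cache_path <- function(name, dir = cache_dir()) {
  file.path(dir, paste0(name, ".rds"))
}

# `fetch` is any function that takes a data set name and returns the data,
# e.g. one that downloads an .rds file from a URL in the repo.
get_data <- function(name, fetch, dir = cache_dir()) {
  # 1. Already loaded in this R session?
  if (exists(name, envir = .data_env, inherits = FALSE)) {
    return(get(name, envir = .data_env))
  }

  # 2. Not yet in the on-disk cache? Fetch it and cache it.
  path <- cache_path(name, dir)
  if (!file.exists(path)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    saveRDS(fetch(name), path)
  }

  # 3. Read from the cache and remember it for the rest of the session.
  value <- readRDS(path)
  assign(name, value, envir = .data_env)
  value
}
```

Under these assumptions, data_ames() would reduce to a one-line wrapper such as get_data("ames", fetch = some_download_function), where some_download_function is again a placeholder.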

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt

message id: euphoric_snowdog

Release modeldata 1.0.1

Prepare for release:

  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::cloud_check()
  • Update cran-comments.md

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

Release modeldata 0.2.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::cloud_check()
  • Update cran-comments.md
  • git push
  • Draft blog post
  • Slack link to draft blog in #open-source-comms

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Have data sets with character variables

From a quick look through the data sets, I couldn't find any that had categorical variables encoded as character variables. It is good practice to have these variables encoded as factor variables, but it does make it harder to create good examples of turning character variables into factor variables.

I don't know the fix, since changing these variables could cause annoying downstream problems.
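
For context, this is the kind of example that needs a character-encoded column. The sketch below uses a toy data frame invented for illustration (pets is not a modeldata data set) and base R only:

```r
# Toy data, invented for illustration; not a modeldata data set.
pets <- data.frame(
  species = c("cat", "dog", "dog", "bird"),
  weight_kg = c(4.1, 12.3, 9.8, 0.1),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns alone.
chr_cols <- vapply(pets, is.character, logical(1))
pets[chr_cols] <- lapply(pets[chr_cols], factor)

str(pets)
```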

Release modeldata 0.1.0

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Release modeldata 1.2.0

Prepare for release:

  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::cloud_check()
  • Update cran-comments.md
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post
  • Ping Tracy Teal on Slack

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Release modeldata 1.0.0

Prepare for release:

  • Check current CRAN check results
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::cloud_check()
  • Update cran-comments.md
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post
  • Ping Tracy Teal on Slack

Submit to CRAN:

  • usethis::use_version('major')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Add str(dataset) to all data sets

I preface this issue by acknowledging that this package should be kept as small as possible.

It would be nice if all the data sets had str(dataset) in examples like Chicago does. It makes it a little easier to quickly find a data set with the features you need.

Upkeep for modeldata

Pre-history

  • usethis::use_readme_rmd()
  • usethis::use_roxygen_md()
  • usethis::use_github_links()
  • usethis::use_pkgdown_github_pages()
  • usethis::use_tidy_github_labels()
  • usethis::use_tidy_style()
  • usethis::use_tidy_description()
  • urlchecker::url_check()

2020

  • usethis::use_package_doc()
    Consider letting usethis manage your @importFrom directives here.
    usethis::use_import_from() is handy for this.
  • usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
  • Align the names of R/ files and test/ files for workflow happiness.
    usethis::rename_files() can be helpful.

2021

  • usethis::use_tidy_dependencies()
  • usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
  • Remove check environments section from cran-comments.md
  • Bump required R version in DESCRIPTION to 3.4
  • Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions
  • Add RStudio to DESCRIPTION as funder, if appropriate

2022

URL for rcompanion is currently bad

In the docs for crickets, we have this:

modeldata/R/crickets.R

Lines 13 to 14 in df553cb

#' @source Mangiafico, S. 2015. "An R Companion for the Handbook of Biological
#' Statistics." \url{https://rcompanion.org/handbook/}.

That URL currently gives:

Error: Could not resolve host: rcompanion.org

I don't know at this point if it is down for good or not, but let's check it out before the next release.

Release modeldata 0.1.1

Prepare for release:

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

Release modeldata 1.3.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Bump required R version in DESCRIPTION to 3.6
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • revdepcheck::cloud_check()
  • Update cran-comments.md
  • git push
  • Draft blog post
  • Slack link to draft blog in #open-source-comms

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Finish blog post
  • Tweet

Can't access data sets using `::`

Hi,

Thanks for this package and all the work put in tmwr!

I noticed that I can't access a data set using ::. Is there a particular reason why the data sets cannot be accessed this way, as in other data packages?

library(modeldata)

modeldata::ames
#> Error: 'ames' is not an exported object from 'namespace:modeldata'
modeldata::credit_data
#> Error: 'credit_data' is not an exported object from 'namespace:modeldata'

# example 1
library(babynames)
babynames::babynames
#> # A tibble: 1,924,665 × 5
#>     year sex   name          n   prop
#>    <dbl> <chr> <chr>     <int>  <dbl>
#>  1  1880 F     Mary       7065 0.0724
#>  2  1880 F     Anna       2604 0.0267
#>  3  1880 F     Emma       2003 0.0205
#>  4  1880 F     Elizabeth  1939 0.0199
#>  5  1880 F     Minnie     1746 0.0179
#>  6  1880 F     Margaret   1578 0.0162
#>  7  1880 F     Ida        1472 0.0151
#>  8  1880 F     Alice      1414 0.0145
#>  9  1880 F     Bertha     1320 0.0135
#> 10  1880 F     Sarah      1288 0.0132
#> # … with 1,924,655 more rows


# example 2
library(nycflights13)
nycflights13::airlines
#> # A tibble: 16 × 2
#>    carrier name                       
#>    <chr>   <chr>                      
#>  1 9E      Endeavor Air Inc.          
#>  2 AA      American Airlines Inc.     
#>  3 AS      Alaska Airlines Inc.       
#>  4 B6      JetBlue Airways            
#>  5 DL      Delta Air Lines Inc.       
#>  6 EV      ExpressJet Airlines Inc.   
#>  7 F9      Frontier Airlines Inc.     
#>  8 FL      AirTran Airways Corporation
#>  9 HA      Hawaiian Airlines Inc.     
#> 10 MQ      Envoy Air                  
#> 11 OO      SkyWest Airlines Inc.      
#> 12 UA      United Air Lines Inc.      
#> 13 US      US Airways Inc.            
#> 14 VX      Virgin America             
#> 15 WN      Southwest Airlines Co.     
#> 16 YV      Mesa Airlines Inc.

Created on 2022-05-10 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Spanish_Spain.utf8
#>  ctype    Spanish_Spain.utf8
#>  tz       America/Santiago
#>  date     2022-05-10
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  babynames    * 1.0.1   2021-04-12 [1] CRAN (R 4.2.0)
#>  cli            3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  crayon         1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
#>  knitr          1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modeldata    * 0.1.1   2021-07-14 [1] CRAN (R 4.2.0)
#>  nycflights13 * 1.0.2   2021-04-12 [1] CRAN (R 4.2.0)
#>  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown      2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  tibble         3.1.6   2021-11-07 [1] CRAN (R 4.2.0)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun           0.30    2022-03-02 [1] CRAN (R 4.2.0)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/jbkun/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Thanks so much in advance.
Kind regards,

Release modeldata 1.1.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Check if any deprecation processes should be advanced, as described in Gradual deprecation
  • Polish NEWS
  • devtools::build_readme()
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::cloud_check()
  • Update cran-comments.md
  • git push
  • Draft blog post
  • Slack link to draft blog in #open-source-comms

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • git push
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Upkeep for modeldata

2023

Necessary:

  • Update copyright holder in DESCRIPTION: person(given = "Posit Software, PBC", role = c("cph", "fnd"))
  • Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
  • Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
  • usethis::use_tidy_coc()
  • usethis::use_tidy_github_actions()

Optional:

  • Review 2022 checklist to see if you completed the pkgdown updates
  • Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
  • Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()
  • use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
  • Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples

call `str(dataset)` in examples

This wouldn't make much of a difference when viewing help files, but it would make the pkgdown site much more informative. For example, when looking for a dataset with a date column or with a certain range of rows, I could quickly click through links in the Reference page rather than calling str() on each dataset in the console:

example without str

vs.

example with str
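
As a rough illustration of the console workflow this would replace, the sketch below scans a set of data frames for a Date column and reports row counts. The two data frames and the helper has_col_of_class() are invented for this example and are not part of modeldata:

```r
# Toy stand-ins for packaged data sets; names and contents are invented.
candidates <- list(
  rides = data.frame(date = as.Date("2020-01-01") + 0:2, n = c(10, 12, 9)),
  sizes = data.frame(id = 1:5, size = c("s", "m", "l", "m", "s"))
)

# TRUE if any column of `df` inherits from class `cls`.
has_col_of_class <- function(df, cls) {
  any(vapply(df, inherits, logical(1), what = cls))
}

# Names of the data frames that contain at least one Date column.
with_dates <- names(Filter(function(df) has_col_of_class(df, "Date"), candidates))

# Row count of each candidate, to filter by size.
n_rows <- vapply(candidates, nrow, integer(1))
```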

Multi class data set

I've been working on the {themis} documentation, and I had a hard time finding a good example data set that

  1. Includes multiple classes (4 or more)
  2. Doesn't have any missing data in the response
  3. Has a response that is reasonable for the data set

For now, I'm using the credit_data data set with the Home variable, but I need to do some prep work:

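library(dplyr)
data(credit_data, package = "modeldata")
# Drop the "ignore" rows, then store Home as character instead of factor
credit_data0 <- credit_data %>%
  filter(Home != "ignore") %>%
  mutate(Home = as.character(Home))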

For illustration, it works fine, but I worry about giving the wrong signal by balancing over an obvious non-response.
