Issues for modeldata from tidymodels

Multi class data set

I've been working with the {themis} documentation, and I had a hard time finding a good example data set that

Includes multiple classes (4 or more)
Doesn't have any missing data in the response
Has a response that is reasonable for the data set

For now, I'm using the credit_data data set with the Home variable, but I need to do some prep work:

library(dplyr)
library(modeldata)
# Drop the "ignore" level and store Home as a character vector
credit_data0 <- credit_data %>%
  filter(Home != "ignore") %>%
  mutate(Home = as.character(Home))

For illustration, it works fine, but I worry about giving the wrong signal by balancing over an obvious non-response

Have data sets with character variables

From a quick look through the data sets, I couldn't find any that had categorical variables encoded as character variables. It is good practice to have these variables encoded as factor variables, but it does make it harder to create good examples of turning character variables into factor variables.

I don't know the fix since changing these variables could have annoying downstream problems
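For context, the kind of example that a character-encoded data set would enable might look like the sketch below. The `homes` data frame is made up for illustration and is not a modeldata data set; only base R is used.

```r
# Toy data with a categorical variable stored as character (illustrative only).
homes <- data.frame(
  home = c("rent", "owner", "rent", "parents"),
  income = c(120, 200, 95, 80),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving other columns untouched.
is_chr <- vapply(homes, is.character, logical(1))
homes[is_chr] <- lapply(homes[is_chr], factor)

str(homes)
```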

Release modeldata 1.1.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Maybe use a cache for larger data sets

To get around CRAN's package size limit, we could try to have URLs that point to data sets which would live on GitHub in this repo, and then cache them on the user's machine.

I imagine it would look like:

data_ames <- function() {
  if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}

We could follow the lead of pak, which uses the following function to determine where R's global permanent cache is:

https://github.com/r-lib/pak/blob/e65de1e9630dbfcaf1044718b742bf806486b107/R/utils.R#L84

and then we could save into <cache-path>/model-data/ames.rds

To be even faster, we would only load the data once per R session. Once we load it from the cache directory, we would store it in an environment internal to modeldata and pull it from there each time data_ames() is called. So it might look more like:

data_ames <- function() {
  if (has_data_in_internal_environment("ames")) {
    get_data_from_internal_environment("ames")
  } else if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}

The datasets themselves would actually live in a folder in this repo that would be .Rbuildignore-d. For example: inst/data/ames.rds and then ignore inst/data
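The three-tier lookup described above (session environment, then disk cache, then download) could be sketched roughly as follows. This is only an illustration under several assumptions: `cached_data()`, `.modeldata_env`, and the `fetch` argument are hypothetical names that do not exist in modeldata, the URL is a placeholder, and `tools::R_user_dir()` (available in R >= 4.0.0) stands in for the pak helper linked above.

```r
# Hypothetical sketch; none of these helpers exist in modeldata.
# Internal environment that holds data sets already loaded in this session.
.modeldata_env <- new.env(parent = emptyenv())

cached_data <- function(name,
                        cache_dir = tools::R_user_dir("modeldata", which = "cache"),
                        fetch = function(name, dest) {
                          # Placeholder URL; the real location is undecided.
                          url <- paste0("https://example.com/modeldata/", name, ".rds")
                          utils::download.file(url, dest, mode = "wb", quiet = TRUE)
                        }) {
  # 1. Already loaded in this R session?
  if (exists(name, envir = .modeldata_env, inherits = FALSE)) {
    return(get(name, envir = .modeldata_env, inherits = FALSE))
  }

  # 2. Not on disk yet? Fetch it into the cache directory first.
  path <- file.path(cache_dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    fetch(name, path)
  }

  # 3. Read from the disk cache and remember it for the rest of the session.
  data <- readRDS(path)
  assign(name, data, envir = .modeldata_env)
  data
}

# A user-facing accessor would then be a thin wrapper.
data_ames <- function() cached_data("ames")
```

Passing `fetch` as an argument keeps the download step replaceable, which makes the lookup logic easy to exercise without network access.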

Release modeldata 0.1.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Release modeldata 1.0.0

Prepare for release:

Submit to CRAN:

usethis::use_version('major')
devtools::submit_cran()
Approve email

Wait for CRAN...

Upkeep for modeldata

2023

Necessary:

Update copyright holder in DESCRIPTION: person(given = "Posit Software, PBC", role = c("cph", "fnd"))
Double check license file uses '[package] authors' as copyright holder. Run use_mit_license()
Update logo (https://github.com/rstudio/hex-stickers); run use_tidy_logo()
usethis::use_tidy_coc()
usethis::use_tidy_github_actions()

Optional:

Review 2022 checklist to see if you completed the pkgdown updates
Prefer pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
~~Consider running use_tidy_dependencies() and/or replace compat files with use_standalone()~~
~~use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers~~
~~Add alt-text to pictures, plots, etc; see https://posit.co/blog/knitr-fig-alt/ for examples~~

Release modeldata 0.1.1

Prepare for release:

Submit to CRAN:

usethis::use_version('patch')
devtools::submit_cran()
Approve email

Wait for CRAN...

Accepted 🎉
usethis::use_github_release()
usethis::use_dev_version()

Release modeldata 1.2.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

sparse data set

Mainly to be used in examples and unit tests.

It would be nice if {Matrix} didn't need to be added to Suggests/Imports.

Add str(dataset) to all data sets

I preface this issue by acknowledging that this package should be kept as small as possible.

It would be nice if all the data sets had str(dataset) in examples like Chicago does. It makes it a little easier to quickly find a data set with the features you need.
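In roxygen terms, the request would amount to something like the fragment below in each data set's documentation file. Here `my_dataset` is a placeholder rather than an actual modeldata data set, and the title, description, and other tags are omitted.

```r
#' @examples
#' data(my_dataset)
#' str(my_dataset)
"my_dataset"
```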

Release modeldata 1.0.0

Prepare for release:

Submit to CRAN:

usethis::use_version('major')
devtools::submit_cran()
Approve email

Wait for CRAN...

Upkeep for modeldata

Pre-history

usethis::use_readme_rmd()
usethis::use_roxygen_md()
usethis::use_github_links()
usethis::use_pkgdown_github_pages()
usethis::use_tidy_github_labels()
usethis::use_tidy_style()
usethis::use_tidy_description()
urlchecker::url_check()

2020

usethis::use_package_doc()
Consider letting usethis manage your @importFrom directives here.
usethis::use_import_from() is handy for this.
usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
Align the names of R/ files and test/ files for workflow happiness.
usethis::rename_files() can be helpful.

2021

usethis::use_tidy_dependencies()
usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
Remove check environments section from cran-comments.md
Bump required R version in DESCRIPTION to 3.4
Use lifecycle instead of artisanal deprecation messages, as described in Communicate lifecycle changes in your functions
Add RStudio to DESCRIPTION as funder, if appropriate

2022

usethis::use_tidy_coc()
Update errors to rlang 1.0.0. Helpful guides:
https://rlang.r-lib.org/reference/topic-error-call.html
https://rlang.r-lib.org/reference/topic-error-chaining.html
https://rlang.r-lib.org/reference/topic-condition-formatting.html
Update pkgdown site using instructions at https://tidytemplate.tidyverse.org
Re-publish released site using r-lib/pkgdown#2051
Ensure pkgdown development is mode: auto in pkgdown config
Handle and close any still-open master --> main issues
Update README badges, instructions in r-lib/usethis#1594

Consider replacing okc_text dataset

There was some controversy around this dataset and OkCupid data in general (see https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release and https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/), and in some ways okc_text feels a little icky to use.

In addition, the data doesn't lend itself to modeling tasks well.

It would be nice if we could find a clearly public dataset that doesn't have these problems.

As far as I can see, {textrecipes} is the only package using it: https://github.com/search?p=1&q=okc_text&type=Code. So the repercussions from this change will mostly be on my shoulders.

CC @juliasilge

call `str(dataset)` in examples

This wouldn't make much of a difference when viewing help files, but it would make the pkgdown site much more informative. For example, when looking for a dataset with a date column or with a certain range of rows, I could quickly click through links in the Reference page rather than calling str() on each dataset in the console.

Title of "grants" data set is incorrectly labelled "Ames Housing Data"

As the title says. See https://modeldata.tidymodels.org/reference/grants.html

Can't access data sets using `::`

Hi,

Thanks for this package and all the work put in tmwr!

I noticed I can't access a data set using ::. Is there a particular reason why data sets cannot be accessed this way, as in other data packages?

library(modeldata)

modeldata::ames
#> Error: 'ames' is not an exported object from 'namespace:modeldata'
modeldata::credit_data
#> Error: 'credit_data' is not an exported object from 'namespace:modeldata'

# example 1
library(babynames)
babynames::babynames
#> # A tibble: 1,924,665 × 5
#>     year sex   name          n   prop
#>    <dbl> <chr> <chr>     <int>  <dbl>
#>  1  1880 F     Mary       7065 0.0724
#>  2  1880 F     Anna       2604 0.0267
#>  3  1880 F     Emma       2003 0.0205
#>  4  1880 F     Elizabeth  1939 0.0199
#>  5  1880 F     Minnie     1746 0.0179
#>  6  1880 F     Margaret   1578 0.0162
#>  7  1880 F     Ida        1472 0.0151
#>  8  1880 F     Alice      1414 0.0145
#>  9  1880 F     Bertha     1320 0.0135
#> 10  1880 F     Sarah      1288 0.0132
#> # … with 1,924,655 more rows


# example 2
library(nycflights13)
nycflights13::airlines
#> # A tibble: 16 × 2
#>    carrier name                       
#>    <chr>   <chr>                      
#>  1 9E      Endeavor Air Inc.          
#>  2 AA      American Airlines Inc.     
#>  3 AS      Alaska Airlines Inc.       
#>  4 B6      JetBlue Airways            
#>  5 DL      Delta Air Lines Inc.       
#>  6 EV      ExpressJet Airlines Inc.   
#>  7 F9      Frontier Airlines Inc.     
#>  8 FL      AirTran Airways Corporation
#>  9 HA      Hawaiian Airlines Inc.     
#> 10 MQ      Envoy Air                  
#> 11 OO      SkyWest Airlines Inc.      
#> 12 UA      United Air Lines Inc.      
#> 13 US      US Airways Inc.            
#> 14 VX      Virgin America             
#> 15 WN      Southwest Airlines Co.     
#> 16 YV      Mesa Airlines Inc.

Created on 2022-05-10 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.0 (2022-04-22 ucrt)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Spanish_Spain.utf8
#>  ctype    Spanish_Spain.utf8
#>  tz       America/Santiago
#>  date     2022-05-10
#>  pandoc   2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  babynames    * 1.0.1   2021-04-12 [1] CRAN (R 4.2.0)
#>  cli            3.3.0   2022-04-25 [1] CRAN (R 4.2.0)
#>  crayon         1.5.1   2022-03-26 [1] CRAN (R 4.2.0)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.2.0)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.2.0)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.2.0)
#>  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.0)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.0)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.2.0)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.0)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.2.0)
#>  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.2.0)
#>  knitr          1.39    2022-04-26 [1] CRAN (R 4.2.0)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.2.0)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.0)
#>  modeldata    * 0.1.1   2021-07-14 [1] CRAN (R 4.2.0)
#>  nycflights13 * 1.0.2   2021-04-12 [1] CRAN (R 4.2.0)
#>  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.2.0)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.0)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.2.0)
#>  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.2.0)
#>  rmarkdown      2.14    2022-04-25 [1] CRAN (R 4.2.0)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.2.0)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.0)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.2.0)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.2.0)
#>  tibble         3.1.6   2021-11-07 [1] CRAN (R 4.2.0)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.0)
#>  vctrs          0.4.1   2022-04-13 [1] CRAN (R 4.2.0)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun           0.30    2022-03-02 [1] CRAN (R 4.2.0)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.2.0)
#> 
#>  [1] C:/Users/jbkun/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.0/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Thanks so much in advance.
Kind regards,

URL for rcompanion is currently bad

In the docs for crickets, we have this:

modeldata/R/crickets.R

Lines 13 to 14 in df553cb

#' @source Mangiafico, S. 2015. "An R Companion for the Handbook of Biological
#' Statistics." \url{https://rcompanion.org/handbook/}.

That URL currently gives:

Error: Could not resolve host: rcompanion.org

I don't know at this point if it is down for good or not, but let's check it out before the next release.

Description of stem in okc dataset

Please consider adding a description of what the variable stem in the okc dataset stands for. I've looked at the paper and description but cannot figure it out.

Release modeldata 1.3.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Release modeldata 1.0.1

Prepare for release:

Submit to CRAN:

usethis::use_version('patch')
devtools::submit_cran()
Approve email

Wait for CRAN...

Accepted 🎉
usethis::use_github_release()
usethis::use_dev_version()

Release modeldata 0.2.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Wrong linkage to tidymodels/tidymodels

The DESCRIPTION file refers bug reports to tidymodels/tidymodels

modeldata/DESCRIPTION

Line 15 in 7774fd1

BugReports: https://github.com/tidymodels/tidymodels/issues

Some of the links (including the GitHub icon at the top right) also point to tidymodels/tidymodels

bivariate data is missing

The entry is there in R/bivariate.R but the data is not there in data/

Release modeldata 0.1.1

Prepare for release:

Submit to CRAN:

usethis::use_version('patch')
devtools::submit_cran()
Approve email

Wait for CRAN...

Accepted 🎉
usethis::use_github_release()
usethis::use_dev_version()

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

Help us firm up the list of targeted repositories
Make sure all maintainers are aware of what's coming
Give us an issue to close when the job is done
Give us a place to put advice for collaborators re: how to adapt

message id: euphoric_snowdog

Use LazyData!

Setting LazyData to FALSE (https://github.com/tidymodels/modeldata/blob/master/DESCRIPTION#L11) means that users have to write an extra line of code that they are probably out of the habit of writing.
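The extra line in question is the explicit `data()` call. A minimal sketch, guarded so that it only runs when modeldata is installed:

```r
# When a package does not lazy-load its data, each data set has to be
# loaded explicitly before it can be used.
if (requireNamespace("modeldata", quietly = TRUE)) {
  data("ames", package = "modeldata")  # the extra line
  dim(ames)
}
```

With lazy loading enabled, the `data()` call becomes unnecessary and `modeldata::ames` works directly.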

Consider a dataset to demonstrate fairness

Possible datasets to consider include:

HMDA mortgage data, like Erin's h2o demonstration
COMPAS predictive policing dataset, included in fairness package

Release modeldata 1.4.0

Prepare for release:

Submit to CRAN:

usethis::use_version('minor')
devtools::submit_cran()
Approve email

Wait for CRAN...

Fix description of "grants" dataset

I've also noticed that the description of the "grants" dataset (https://modeldata.tidymodels.org/reference/grants.html) states "Ames Housing Data".

Originally posted by @jromanowska in #17 (comment)
