tidymodels / modeldata
Data Sets Used by tidymodels Packages
Home Page: https://modeldata.tidymodels.org/
License: Other
I've been working with the {themis} documentation, and I had a hard time finding a good example data set that
For now, I'm using the credit_data data set with the Home variable, but I need to do some prep work:
library(dplyr)
library(modeldata)
# Drop the "ignore" level and store Home as a character column
credit_data0 <- credit_data %>%
  filter(Home != "ignore") %>%
  mutate(Home = as.character(Home))
For illustration, it works fine, but I worry about sending the wrong signal by balancing over an obvious non-response level.
From a quick look through the data sets, I couldn't find any that had categorical variables encoded as character variables. It is good practice to have these variables encoded as factors, but it does make it harder to create good examples of turning character variables into factor variables.
I don't know the fix, since changing these variables could cause annoying downstream problems.
Prepare for release:
git pull
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::cloud_check()
cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
git push
usethis::use_github_release()
usethis::use_dev_version()
git push
To get around CRAN's package size limit, we could try and have URLs that point to data sets which would live on github in this repo, and then cache them on the user's machine.
I imagine it would look like:
data_ames <- function() {
  if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}
We could follow the lead of pak, which uses the following function to determine where R's global permanent cache is:
https://github.com/r-lib/pak/blob/e65de1e9630dbfcaf1044718b742bf806486b107/R/utils.R#L84
and then we could save into <cache-path>/model-data/ames.rds
To be even faster, we would only load the data once per R session. Once we load it from the cache directory, we would store it in an environment internal to modeldata and pull it from there each time data_ames() is called. So it might look more like:
data_ames <- function() {
  if (has_data_in_internal_environment("ames")) {
    get_data_from_internal_environment("ames")
  } else if (has_data_in_cache("ames")) {
    get_data_from_cache("ames")
  } else {
    get_data_from_url_and_cache_it("ames")
  }
}
The datasets themselves would actually live in a folder in this repo that would be .Rbuildignore-d. For example: inst/data/ames.rds, and then ignore inst/data.
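The three-tier lookup described above could be sketched roughly as follows. This is only a sketch under stated assumptions, not the package's implementation: cached_data(), fetch_from_url(), .data_env, and the base_url placeholder are hypothetical names, and tools::R_user_dir() (available in R >= 4.0) is swapped in for the pak helper linked above.

```r
# Session-level store: an environment private to the package (hypothetical name).
.data_env <- new.env(parent = emptyenv())

# Assumption: tools::R_user_dir() stands in for pak's cache-path helper.
cache_dir <- function() {
  file.path(tools::R_user_dir("modeldata", which = "cache"), "model-data")
}

# Look up `name` in the session environment, then the on-disk cache,
# and only call `fetch(name)` when neither has it.
cached_data <- function(name, fetch, dir = cache_dir()) {
  if (exists(name, envir = .data_env, inherits = FALSE)) {
    return(get(name, envir = .data_env, inherits = FALSE))
  }
  path <- file.path(dir, paste0(name, ".rds"))
  if (!file.exists(path)) {
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    saveRDS(fetch(name), path)
  }
  data <- readRDS(path)
  assign(name, data, envir = .data_env)
  data
}

# Hypothetical downloader; `base_url` is a placeholder, not a real location.
fetch_from_url <- function(name, base_url = "https://example.com/data") {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  utils::download.file(paste0(base_url, "/", name, ".rds"), tmp,
                       mode = "wb", quiet = TRUE)
  readRDS(tmp)
}

data_ames <- function() cached_data("ames", fetch_from_url)
```

Passing the fetch function as an argument keeps the lookup logic testable without network access, since a stub can be supplied in place of fetch_from_url().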
Prepare for release:
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Prepare for release:
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::cloud_check()
cran-comments.md
Submit to CRAN:
usethis::use_version('major')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
2023
Necessary:
person(given = "Posit Software, PBC", role = c("cph", "fnd"))
use_mit_license()
use_tidy_logo()
usethis::use_tidy_coc()
usethis::use_tidy_github_actions()
Optional:
pak::pak("org/pkg") over devtools::install_github("org/pkg") in README
use_tidy_dependencies() and/or replace compat files with use_standalone()
use_standalone("r-lib/rlang", "types-check") instead of home grown argument checkers
Prepare for release:
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Prepare for release:
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::cloud_check()
cran-comments.md
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Mainly to be used in examples and unit tests.
It would be nice if {Matrix} didn't need to be added to Suggests/Imports.
I preface this issue by acknowledging that this package should be kept as small as possible.
It would be nice if all the data sets had str(dataset) in their examples, like Chicago does. It makes it a little easier to quickly find a data set with the features you need.
Prepare for release:
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::cloud_check()
cran-comments.md
Submit to CRAN:
usethis::use_version('major')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Pre-history
usethis::use_readme_rmd()
usethis::use_roxygen_md()
usethis::use_github_links()
usethis::use_pkgdown_github_pages()
usethis::use_tidy_github_labels()
usethis::use_tidy_style()
usethis::use_tidy_description()
urlchecker::url_check()
2020
usethis::use_package_doc()
@importFrom directives here. usethis::use_import_from() is handy for this.
usethis::use_testthat(3) and upgrade to 3e, testthat 3e vignette
R/ files and test/ files for workflow happiness. usethis::rename_files() can be helpful.
2021
usethis::use_tidy_dependencies()
usethis::use_tidy_github_actions() and update artisanal actions to use setup-r-dependencies
cran-comments.md
2022
usethis::use_tidy_coc()
development is mode: auto in pkgdown config
master --> main issues
There was some controversy with this dataset and OkCupid data in general (https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release and https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/), and in some ways okc_text feels a little icky to use.
In addition, the data doesn't lend itself to modeling tasks well.
It would be nice if we could find a clearly public dataset that doesn't have these problems.
As far as I can see, {textrecipes} is the only package using it: https://github.com/search?p=1&q=okc_text&type=Code. So the repercussions from this change will mostly be on my shoulders.
CC @juliasilge
This wouldn't make much of a difference when viewing help files, but it would make the pkgdown site much more informative. E.g., when looking for a dataset with a date column or with a certain range of rows, I could quickly click through links in the Reference page rather than calling str() on each dataset in the console.
as the title says. See https://modeldata.tidymodels.org/reference/grants.html
Hi,
Thanks for this package and all the work put in tmwr!
I noticed I can't access a data set using ::. Is there a particular reason why the datasets cannot be accessed this way, as in other data packages?
library(modeldata)
modeldata::ames
#> Error: 'ames' is not an exported object from 'namespace:modeldata'
modeldata::credit_data
#> Error: 'credit_data' is not an exported object from 'namespace:modeldata'
# example 1
library(babynames)
babynames::babynames
#> # A tibble: 1,924,665 × 5
#> year sex name n prop
#> <dbl> <chr> <chr> <int> <dbl>
#> 1 1880 F Mary 7065 0.0724
#> 2 1880 F Anna 2604 0.0267
#> 3 1880 F Emma 2003 0.0205
#> 4 1880 F Elizabeth 1939 0.0199
#> 5 1880 F Minnie 1746 0.0179
#> 6 1880 F Margaret 1578 0.0162
#> 7 1880 F Ida 1472 0.0151
#> 8 1880 F Alice 1414 0.0145
#> 9 1880 F Bertha 1320 0.0135
#> 10 1880 F Sarah 1288 0.0132
#> # … with 1,924,655 more rows
# example 2
library(nycflights13)
nycflights13::airlines
#> # A tibble: 16 × 2
#> carrier name
#> <chr> <chr>
#> 1 9E Endeavor Air Inc.
#> 2 AA American Airlines Inc.
#> 3 AS Alaska Airlines Inc.
#> 4 B6 JetBlue Airways
#> 5 DL Delta Air Lines Inc.
#> 6 EV ExpressJet Airlines Inc.
#> 7 F9 Frontier Airlines Inc.
#> 8 FL AirTran Airways Corporation
#> 9 HA Hawaiian Airlines Inc.
#> 10 MQ Envoy Air
#> 11 OO SkyWest Airlines Inc.
#> 12 UA United Air Lines Inc.
#> 13 US US Airways Inc.
#> 14 VX Virgin America
#> 15 WN Southwest Airlines Co.
#> 16 YV Mesa Airlines Inc.
Created on 2022-05-10 by the reprex package (v2.0.1)
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.0 (2022-04-22 ucrt)
#> os Windows 10 x64 (build 22000)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Spanish_Spain.utf8
#> ctype Spanish_Spain.utf8
#> tz America/Santiago
#> date 2022-05-10
#> pandoc 2.17.1.1 @ C:/Program Files/RStudio/bin/quarto/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> babynames * 1.0.1 2021-04-12 [1] CRAN (R 4.2.0)
#> cli 3.3.0 2022-04-25 [1] CRAN (R 4.2.0)
#> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.2.0)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.2.0)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.2.0)
#> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.2.0)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.2.0)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.2.0)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.2.0)
#> knitr 1.39 2022-04-26 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> modeldata * 0.1.1 2021-07-14 [1] CRAN (R 4.2.0)
#> nycflights13 * 1.0.2 2021-04-12 [1] CRAN (R 4.2.0)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.2.0)
#> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.2.0)
#> rmarkdown 2.14 2022-04-25 [1] CRAN (R 4.2.0)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.2.0)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.2.0)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.2.0)
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.2.0)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0)
#> vctrs 0.4.1 2022-04-13 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.30 2022-03-02 [1] CRAN (R 4.2.0)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.2.0)
#>
#> [1] C:/Users/jbkun/AppData/Local/R/win-library/4.2
#> [2] C:/Program Files/R/R-4.2.0/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Thanks so much in advance.
Kind regards,
In the docs for crickets, we have this:
Lines 13 to 14 in df553cb
That URL currently gives:
Error: Could not resolve host: rcompanion.org
I don't know at this point if it is down for good or not, but let's check it out before the next release.
Please consider adding a description of what the variable stem in the okc dataset stands for. I've looked at the paper and description but cannot figure it out.
Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
revdepcheck::cloud_check()
cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
Prepare for release:
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::cloud_check()
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
Prepare for release:
git pull
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::cloud_check()
cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
git push
usethis::use_github_release()
usethis::use_dev_version()
git push
The DESCRIPTION refers bug reports to tidymodels/tidymodels
Line 15 in 7774fd1
And some of the links (including the GitHub icon at the top right) also link to tidymodels/tidymodels.
The entry is there in R/bivariate.R, but the data is not there in data/.
Prepare for release:
devtools::build_readme()
urlchecker::url_check()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
rhub::check_for_cran()
revdepcheck::revdep_check(num_workers = 4)
cran-comments.md
Submit to CRAN:
usethis::use_version('patch')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version()
The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.
That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.
The purpose of this issue is to:
message id: euphoric_snowdog
Setting LazyData to FALSE (https://github.com/tidymodels/modeldata/blob/master/DESCRIPTION#L11) means that users have to write an extra line of code that they are probably out of the habit of writing.
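For context, the extra line in question is an explicit data() call. A minimal sketch, assuming modeldata is installed and ships the ames data set mentioned elsewhere in these issues; the block is skipped when the package is not available:

```r
# Assumes modeldata is installed; nothing runs otherwise.
if (requireNamespace("modeldata", quietly = TRUE)) {
  # Without lazy loading, the data set must be loaded explicitly before use.
  data("ames", package = "modeldata")
  print(dim(ames))
}
```

With lazy loading enabled, the data() call becomes unnecessary after library(modeldata), and modeldata::ames also resolves.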
Possible datasets to consider include:
Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
revdepcheck::cloud_check()
cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
I've also noticed that the description of the "grants" dataset (https://modeldata.tidymodels.org/reference/grants.html) states "Ames Housing Data".
Originally posted by @jromanowska in #17 (comment)