strapgod's Introduction

strapgod

Lifecycle: experimental

Introduction

The goal of strapgod is to create virtual groups on top of a tibble or grouped_df as a way of resampling the original data frame. You can then efficiently perform various dplyr operations on this resampled_df, such as summarise(), do(), group_map(), and more, to easily compute bootstrapped and resampled statistics.

Installation

You can install the released version of strapgod from CRAN with:

install.packages("strapgod")

Install the development version from GitHub with:

devtools::install_github("DavisVaughan/strapgod")

Learning about strapgod

If you aren’t already on the pkgdown site, I would encourage starting there. From the site, you can open these two vignettes to learn about working with resampled tibbles.

  • vignette("virtual-bootstraps", "strapgod")

  • vignette("dplyr-support", "strapgod")

Example

Create resampled data frames with bootstrapify() or samplify(). Notice that the result is grouped by the virtual column, .bootstrap, and that there are still only 150 rows even though we bootstrapped this dataset 10 times.

library(strapgod)
library(dplyr)
set.seed(123)

bootstrapify(iris, 10)
#> # A tibble: 150 x 5
#> # Groups:   .bootstrap [10]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows
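
To make the idea of a virtual group more concrete, here is a minimal base R sketch. It is not strapgod's implementation: the helpers virtual_bootstrap() and summarise_virtual() are hypothetical names, and the sketch assumes that a virtual group is just a vector of row indices kept next to the untouched data, with rows materialized only when a statistic is computed.

```r
# Hypothetical helper: keep the data once, plus one index vector per resample.
virtual_bootstrap <- function(data, times) {
  n <- nrow(data)
  indices <- lapply(
    seq_len(times),
    function(i) sample.int(n, n, replace = TRUE)
  )
  list(data = data, indices = indices)
}

# Hypothetical helper: apply `f` to each resample, materializing rows on demand.
summarise_virtual <- function(vb, f) {
  vapply(
    vb$indices,
    function(idx) f(vb$data[idx, , drop = FALSE]),
    numeric(1)
  )
}

set.seed(123)
vb <- virtual_bootstrap(iris, times = 10)

# The stored data still has 150 rows; only the indices were resampled.
nrow(vb$data)
length(vb$indices)

# One statistic per virtual bootstrap.
per_strap_mean <- summarise_virtual(vb, function(d) mean(d$Petal.Width))
per_strap_mean
```

The design choice illustrated here is that memory grows with the number of index vectors, not with copies of the data frame.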

You can feed a resampled_df into summarise() or group_map() to perform efficient bootstrapped computations.

iris %>%
  bootstrapify(10) %>%
  summarise(per_strap_mean = mean(Petal.Width))
#> # A tibble: 10 x 2
#>    .bootstrap per_strap_mean
#>         <int>          <dbl>
#>  1          1           1.20
#>  2          2           1.22
#>  3          3           1.23
#>  4          4           1.13
#>  5          5           1.20
#>  6          6           1.15
#>  7          7           1.18
#>  8          8           1.13
#>  9          9           1.31
#> 10         10           1.19

The original data can be grouped as well, and the bootstraps will be created for each group.

iris %>%
  group_by(Species) %>%
  bootstrapify(10) %>%
  summarise(per_strap_per_species_mean = mean(Petal.Width))
#> # A tibble: 30 x 3
#> # Groups:   Species [3]
#>    Species .bootstrap per_strap_per_species_mean
#>    <fct>        <int>                      <dbl>
#>  1 setosa           1                      0.25 
#>  2 setosa           2                      0.246
#>  3 setosa           3                      0.24 
#>  4 setosa           4                      0.238
#>  5 setosa           5                      0.252
#>  6 setosa           6                      0.274
#>  7 setosa           7                      0.238
#>  8 setosa           8                      0.258
#>  9 setosa           9                      0.252
#> 10 setosa          10                      0.256
#> # … with 20 more rows

Plotting bootstrapped results

A fun example of using strapgod is to create bootstrapped visualizations quickly and easily for hypothetical outcome plots.

set.seed(123)
library(ggplot2)

# without bootstrap
mtcars %>%
  ggplot(aes(hp, mpg)) + 
  geom_smooth(se = FALSE) +
  ylim(y = c(0, 40))

# with bootstrap
mtcars %>%
  bootstrapify(10) %>%
  collect() %>%
  ggplot(aes(hp, mpg, group = .bootstrap)) + 
  geom_smooth(se = FALSE) +
  ylim(y = c(0, 40))

In the wild

  • Claus Wilke has used strapgod to power some pieces of his ungeviz package for visualizing uncertainty.

  • You can watch Claus’s rstudio::conf 2019 talk to see ungeviz and strapgod in action.

strapgod's People

Contributors

davisvaughan, romainfrancois

strapgod's Issues

Release strapgod 0.0.1

Prepare for release:

  • Check that description is informative
  • Check licensing of included files
  • usethis::use_cran_comments()
  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news()
  • Update install instructions in README
  • Tweet

Generate indices 1 resample at a time

Would it be possible to add an argument or option that changes how the indices are generated? Instead of generating indices for all resamples ahead of time, generate indices for each resample just-in-time.

The motivation is to reduce the maximum memory usage, at the expense of increased memory churn. This is useful for memory-constrained resources, where out-of-memory errors occur.

This would be similar to setting simple = TRUE in boot::boot(). Their documentation states this might affect reproducibility - I'm not sure if that would also be an issue here?
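
The trade-off described in this issue can be sketched in base R. This is only an illustration under stated assumptions, not how strapgod or boot::boot() work internally; boot_upfront() and boot_jit() are hypothetical names introduced for the comparison.

```r
# Up-front: all `times` index vectors exist in memory at the same time.
boot_upfront <- function(x, times, f) {
  n <- length(x)
  indices <- replicate(
    times,
    sample.int(n, n, replace = TRUE),
    simplify = FALSE
  )
  vapply(indices, function(idx) f(x[idx]), numeric(1))
}

# Just-in-time: only one index vector exists at any moment.
boot_jit <- function(x, times, f) {
  n <- length(x)
  out <- numeric(times)
  for (i in seq_len(times)) {
    idx <- sample.int(n, n, replace = TRUE)
    out[i] <- f(x[idx])
  }
  out
}

set.seed(123)
a <- boot_upfront(iris$Petal.Width, times = 100, f = mean)

set.seed(123)
b <- boot_jit(iris$Petal.Width, times = 100, f = mean)

# Compare the two strategies under the same seed.
identical(a, b)
```

In this sketch both versions draw from the random number stream in the same order, so the same seed gives the same results as long as f itself does not use random numbers. If f did draw random numbers, the just-in-time version would interleave those draws with the index generation and the two results could differ.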

Bootstrapped confidence intervals?

suppressPackageStartupMessages({
  library(dplyr)
  library(strapgod)
  library(tidyr)
  library(rsample) # devtools::install_github("rsample", ref = "confidence_intervals")
})
#> Warning: package 'tidyr' was built under R version 3.5.2

# the thing you want to compute at each replicate
# we delay computation of this and it gets passed to summarise()
# at the right time
boot_compute <- function(data, ...) {
  dots <- rlang::enquos(..., .named = TRUE)
  attr(data, "boot_dots") <- dots
  data
}

# how you want to summarise those things
# dots not used. only for explicit naming
boot_pctl <- function(data, ..., alpha = 0.05, times = 1000) {
  dots <- attr(data, "boot_dots")

  estimates <- data %>%
    bootstrapify(times = times) %>%
    summarise(!!! dots)

  estimate_cols <- setdiff(
    colnames(estimates),
    c(group_vars(estimates), ".bootstrap")
  )

  # using fanny's core code
  pctl_single_wrapper <- function(stats, alpha = 0.05) {
    list(rsample:::pctl_single(stats, alpha))
  }

  out_raw <- dplyr::summarise_at(
    estimates,
    estimate_cols,
    list(~pctl_single_wrapper(.))
  )

  # optional but good practice i think
  estimate_syms <- rlang::syms(estimate_cols)

  out <- out_raw %>%
    tidyr::gather(key = ".statistic", value = "value", !!!estimate_syms) %>%
    tidyr::unnest()

  out
}

# Compute bootstrapped estimates
# of anything you want using summarise()
# ish semantics
iris %>%
  boot_compute(
    mean = mean(Sepal.Width),
    mean2 = mean(Sepal.Length)
  ) %>%
  boot_pctl(times = 1000)
#> # A tibble: 2 x 6
#>   .statistic lower estimate upper alpha .method   
#>   <chr>      <dbl>    <dbl> <dbl> <dbl> <chr>     
#> 1 mean        2.99     3.06  3.13  0.05 percentile
#> 2 mean2       5.71     5.85  5.97  0.05 percentile

# immediate group support
iris %>%
  group_by(Species) %>%
  boot_compute(
    mean = mean(Sepal.Width),
    mean2 = mean(Sepal.Length)
  ) %>%
  boot_pctl(times = 1000)
#> # A tibble: 6 x 7
#>   Species    .statistic lower estimate upper alpha .method   
#>   <fct>      <chr>      <dbl>    <dbl> <dbl> <dbl> <chr>     
#> 1 setosa     mean        3.32     3.43  3.53  0.05 percentile
#> 2 versicolor mean        2.69     2.77  2.85  0.05 percentile
#> 3 virginica  mean        2.89     2.97  3.07  0.05 percentile
#> 4 setosa     mean2       4.90     5.00  5.1   0.05 percentile
#> 5 versicolor mean2       5.81     5.94  6.06  0.05 percentile
#> 6 virginica  mean2       6.41     6.59  6.77  0.05 percentile

Created on 2019-03-16 by the reprex package (v0.2.1.9000)
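
For readers without the development branch of rsample, the percentile interval can be approximated in base R. This is a rough sketch, not the rsample or strapgod code: boot_pctl_base() is a hypothetical name, the interval bounds come from quantile(), and using the mean of the bootstrap estimates as the point estimate is an assumption of this sketch.

```r
# Hypothetical helper: percentile bootstrap interval for a single statistic.
boot_pctl_base <- function(x, statistic, times = 1000, alpha = 0.05) {
  n <- length(x)

  # One estimate per bootstrap resample.
  estimates <- vapply(
    seq_len(times),
    function(i) statistic(x[sample.int(n, n, replace = TRUE)]),
    numeric(1)
  )

  # Lower and upper percentiles of the bootstrap distribution.
  bounds <- quantile(
    estimates,
    probs = c(alpha / 2, 1 - alpha / 2),
    names = FALSE
  )

  data.frame(
    lower = bounds[1],
    estimate = mean(estimates), # assumption: mean of bootstrap estimates
    upper = bounds[2],
    alpha = alpha,
    .method = "percentile"
  )
}

set.seed(123)
ci <- boot_pctl_base(iris$Sepal.Width, mean, times = 1000)
ci
```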

Issue with bootstrapify

Hi,
I'm getting an error when trying to run the example bootstrapify command.


> iris %>%
+     bootstrapify(10) %>%
+     summarise(
+         mean_length = mean(Sepal.Length)
+     )
Error in if (tail(names(groups), 1L) != ".rows") { : 
  missing value where TRUE/FALSE needed

Here's my version information, since I suspect it's some incompatibility there. I have the latest version of tidyverse (as of a week ago) and the devtools version of strapgod.

sessionInfo()
R version 4.2.0 (2022-04-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3

locale:
[1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2 tidyr_1.2.0
[7] tibble_3.1.7 ggplot2_3.3.6 tidyverse_1.3.1 strapgod_0.0.4.9000

loaded via a namespace (and not attached):
[1] cellranger_1.1.0 pillar_1.7.0 compiler_4.2.0 dbplyr_2.1.1 tools_4.2.0 jsonlite_1.8.0 lubridate_1.8.0
[8] lifecycle_1.0.1 gtable_0.3.0 pkgconfig_2.0.3 rlang_1.0.2 reprex_2.0.1 rstudioapi_0.13 DBI_1.1.2
[15] cli_3.3.0 haven_2.5.0 xml2_1.3.3 withr_2.5.0 httr_1.4.3 fs_1.5.2 generics_0.1.2
[22] vctrs_0.4.1 hms_1.1.1 grid_4.2.0 tidyselect_1.1.2 glue_1.6.2 R6_2.5.1 fansi_1.0.3
[29] readxl_1.4.0 tzdb_0.3.0 modelr_0.1.8 magrittr_2.0.3 backports_1.4.1 scales_1.2.0 ellipsis_0.3.2
[36] rvest_1.0.2 assertthat_0.2.1 colorspace_2.0-3 utf8_1.2.2 stringi_1.7.6 munsell_0.5.0 broom_0.8.0
[43] crayon_1.5.1

dplyr 1.0.0 rev dep

gives this:

[master] 90.6 MiB ❯ revdepcheck::revdep_details(revdep = "strapgod")
══ Reverse dependency check ══════════════════════════════════ strapgod 0.0.4 ══

Status: BROKEN

── Newly failing

✖ checking tests ...

── Before ──────────────────────────────────────────────────────────────────────
0 errors ✔ | 0 warnings ✔ | 0 notes ✔

── After ───────────────────────────────────────────────────────────────────────
❯ checking tests ...
  See below...

── Test failures ───────────────────────────────────────────────── testthat ────

> library(testthat)
> library(strapgod)

Attaching package: 'strapgod'

The following object is masked from 'package:stats':

    filter

>
> test_check("strapgod")
── 1. Error: add_count() (@test-dplyr-compat.R#293)  ───────────────────────────
replacement has 300 rows, data has 150
Backtrace:
 1. testthat::expect_equal(nrow(add_count(x)), 300)
 5. dplyr::add_count(x)
 9. base::`[[<-.data.frame`(...)

── 2. Failure: bind_rows() fails sadly (@test-dplyr-compat.R#341)  ─────────────
`bind_rows(x, iris)` did not throw an error.

── 3. Failure: bind_cols() works (@test-dplyr-compat.R#354)  ───────────────────
`x_bc_1` inherits from `tbl_df/tbl/data.frame` not `resampled_df`.

── 4. Failure: bind_cols() works (@test-dplyr-compat.R#366)  ───────────────────
nrow(collect(x_bc_1)) not equal to 300.
1/1 mismatches
[1] 150 - 300 == -150

── 5. Failure: bind_cols() works (@test-dplyr-compat.R#374)  ───────────────────
"tbl_df" %in% class(x_bc_2) isn't false.

══ testthat results  ═══════════════════════════════════════════════════════════
[ OK: 150 | SKIPPED: 0 | WARNINGS: 1 | FAILED: 5 ]
1. Error: add_count() (@test-dplyr-compat.R#293)
2. Failure: bind_rows() fails sadly (@test-dplyr-compat.R#341)
3. Failure: bind_cols() works (@test-dplyr-compat.R#354)
4. Failure: bind_cols() works (@test-dplyr-compat.R#366)
5. Failure: bind_cols() works (@test-dplyr-compat.R#374)

Error: testthat unit tests failed
Execution halted

1 error ✖ | 0 warnings ✔ | 0 notes ✔

Release strapgod 0.0.4

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

Release strapgod 0.0.3

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version('patch')
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

Support for sampling?

I have been thinking that it would be nice to have the equivalent to bootstrapify() for sampling. This would work like dplyr::sample_n() but with two modifications:

  1. Take a times argument for repeated sampling, just like bootstrapify().
  2. Set up virtual samples like bootstrapify().

Not sure what to call the function. samplefy()? samplify()?
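
As a rough illustration of the two modifications listed above, here is a base R sketch. It is not the function the package provides: virtual_sample() is a hypothetical name, and the sketch assumes that each virtual sample is just a vector of row indices drawn without replacement by default.

```r
# Hypothetical helper: `times` repeated samples of `size` rows each,
# stored as index vectors rather than as copies of the data.
virtual_sample <- function(data, times, size, replace = FALSE) {
  n <- nrow(data)
  indices <- lapply(
    seq_len(times),
    function(i) sample.int(n, size, replace = replace)
  )
  list(data = data, indices = indices)
}

set.seed(123)
vs <- virtual_sample(iris, times = 5, size = 20)

# One statistic per virtual sample.
sample_means <- vapply(
  vs$indices,
  function(idx) mean(vs$data$Petal.Width[idx]),
  numeric(1)
)
sample_means
```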

`group_split()` lifecycle warnings

When I run bootstrapify() I get the warning:

 The `keep` argument of `group_split()` is deprecated as of dplyr 1.0.0.
Please use the `.keep` argument instead.

After looking at the source code, I believe this happens because the new dplyr methods pass the argument using the deprecated name.
I can submit a PR soon, I just wanted to flag this as an issue.

Release strapgod 0.0.2

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

Status of virtual bootstraps?

I was wondering what the status of virtual bootstraps is. The C++ check that prevented this code from running in the first place seems to have been removed in the currently released dplyr (0.7.8), but something else has changed and virtual groups still don't work. Are there plans to revisit this or is this currently not on the roadmap? I'm asking because I'd like to figure out how much I should push my own bootstrap solution forward.

library(strapgod)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>%
  bootstrapify(10)
#> # A tibble: 150 x 5
#> # Groups:   [?]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # ... with 140 more rows

iris %>%
  bootstrapify(10) %>%
  summarise(per_strap_mean = mean(Petal.Width))
#> # A tibble: 1 x 1
#> # Groups:   [?]
#>   per_strap_mean
#>            <dbl>
#> 1           1.20

iris %>%
  group_by(Species) %>%
  bootstrapify(10) %>%
  summarise(per_strap_species_mean = mean(Petal.Width))
#> Error: 'group_data' is not an exported object from 'namespace:dplyr'

Created on 2018-12-19 by the reprex package (v0.2.1)
