tidyverse / funs

Collection of low-level functions for working with vctrs

License: Other

R 100.00%

funs's Introduction

tidyverse

Overview

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

If you’d like to learn how to use the tidyverse effectively, the best place to start is R for Data Science (2e).

Installation

# Install from CRAN
install.packages("tidyverse")

# Install the development version from GitHub
# install.packages("pak")
pak::pak("tidyverse/tidyverse")

If you’re compiling from source, you can run pak::pkg_system_requirements("tidyverse") to see the complete set of system packages needed on your machine.

Usage

library(tidyverse) will load the core tidyverse packages:

ggplot2, for data visualisation.
dplyr, for data manipulation.
tidyr, for data tidying.
readr, for data import.
purrr, for functional programming.
tibble, for tibbles, a modern re-imagining of data frames.
stringr, for strings.
forcats, for factors.
lubridate, for date/times.

You also get a condensed summary of conflicts with other packages you have loaded:

library(tidyverse)
#> ── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0.9000 ──
#> ✔ dplyr     1.1.3     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You can see conflicts created later with tidyverse_conflicts():

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
tidyverse_conflicts()
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ✖ MASS::select()  masks dplyr::select()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

And you can check that all tidyverse packages are up-to-date with tidyverse_update():

tidyverse_update()
#> The following packages are out of date:
#>  * broom (0.4.0 -> 0.4.1)
#>  * DBI   (0.4.1 -> 0.5)
#>  * Rcpp  (0.12.6 -> 0.12.7)
#>  
#> Start a clean R session then run:
#> install.packages(c("broom", "DBI", "Rcpp"))

Packages

As well as the core tidyverse, installing this package also installs a selection of other packages that you’re likely to use frequently, but probably not in every analysis. This includes packages for:

Working with specific types of vectors:
- hms, for times.
Importing other types of data:
- feather, for sharing with Python and other languages.
- haven, for SPSS, SAS and Stata files.
- httr, for web APIs.
- jsonlite, for JSON.
- readxl, for .xls and .xlsx files.
- rvest, for web scraping.
- xml2, for XML.
Modelling:
- modelr, for modelling within a pipeline.
- broom, for turning models into tidy data.

Code of Conduct

Please note that the tidyverse project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

funs's Issues

Make sure all vector tidyr functions have a pure vector equivalent

Like fill() and replace_na()

cc @jennybc

Implement near

`vec_recode()`

Following https://github.com/lionel-/recode/blob/master/R/recode.R

Specify the mapping of values with a tibble of .new and .old columns (here created with keys(), not sure about this helper). The new column of keys acts as a generalisation of names which preserves the types. Missing values can be encoded in the spec:

keys(0:2, c(4L, 6L, 8L))
#> # A tibble: 3 x 2
#>    .new  .old
#>   <int> <int>
#> 1     0     4
#> 2     1     6
#> 3     2     8

tibble::tribble(
  ~ .new, ~ .old,
  0L, 4L,
  1L, 6L,
  2L, 8L
)
#> # A tibble: 3 x 2
#>    .new  .old
#>   <int> <int>
#> 1     0     4
#> 2     1     6
#> 3     2     8

Basic usage:

vec_recode(mtcars$cyl, keys(0:2, c(4L, 6L, 8L)))
#>  [1] 1 1 0 1 2 1 2 0 0 1 1 2 2 2 2 2 2 0 0 0 0 2 2 2 2 0 0 0 2 1 2 0

vec_recode(mtcars$cyl, keys(0:1, c(4L, 6L)))
#>  [1] 1 1 0 1 8 1 8 0 0 1 1 8 8 8 8 8 8 0 0 0 0 8 8 8 8 0 0 0 8 1 8 0

vec_recode(mtcars$cyl, keys(0:1, c(4L, 6L)), default = 1.5)
#>  [1] 1.0 1.0 0.0 1.0 1.5 1.0 1.5 0.0 0.0 1.0 1.0 1.5 1.5 1.5 1.5 1.5 1.5 0.0
#> [19] 0.0 0.0 0.0 1.5 1.5 1.5 1.5 0.0 0.0 0.0 1.5 1.0 1.5 0.0

vec_recode(mtcars$vs, keys(c("zero", "one"), 0:1))
#>  [1] "zero" "zero" "one"  "one"  "zero" "one"  "zero" "one"  "one"  "one"
#> [11] "one"  "zero" "zero" "zero" "zero" "zero" "zero" "one"  "one"  "one"
#> [21] "one"  "zero" "zero" "zero" "zero" "one"  "zero" "one"  "zero" "zero"
#> [31] "zero" "one"

spec <- keys(c("FOO", "missing"), c("foo", NA))
vec_recode( c("foo", "bar", NA, "foo"), spec, default = "default")
#> [1] "FOO"     "default" "missing" "FOO"

# Corresponding dplyr code:
dplyr::recode(mtcars$cyl, `4` = 0, `6` = 1, `8` = 2)
dplyr::recode(mtcars$cyl, `4` = 0, `6` = 1)
dplyr::recode(mtcars$cyl, `4` = 0, `6` = 1, .default = 1.5)
dplyr::recode(mtcars$vs, `0` = "zero", `1` = "one")
dplyr::recode(c("foo", "bar", NA, "foo"), `foo` = "FOO", .default = "default", .missing = "missing")
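
The matching behaviour shown in these examples can be approximated in base R with match(). This is only a sketch under stated assumptions, not the proposed implementation: recode_sketch() is a hypothetical name, it takes the new and old values as two plain vectors instead of a spec tibble, and it relies on base R coercion rather than vctrs casting.

```r
# Hypothetical helper: a base R approximation of the recoding shown above.
recode_sketch <- function(x, new, old, default = NULL) {
  stopifnot(length(new) == length(old))
  loc <- match(x, old)        # position of each element of x in `old`, NA if absent
  out <- new[loc]
  unmatched <- is.na(loc)
  if (is.null(default)) {
    out[unmatched] <- x[unmatched]   # keep unmatched values as they are
  } else {
    out[unmatched] <- default        # or replace them with the default
  }
  out
}

recode_sketch(c(6, 4, 8), new = 0:1, old = c(4, 6))
#> [1] 1 0 8

# match() treats NA as a matchable value, so missing values can sit in `old`
recode_sketch(
  c("foo", "bar", NA, "foo"),
  new = c("FOO", "missing"),
  old = c("foo", NA),
  default = "default"
)
#> [1] "FOO"     "default" "missing" "FOO"
```

Because NA is matched like any other value, no separate .missing argument is needed, which is the point the spec-based design makes above.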

You can recode multiple values to the same key by supplying a list column in .old:

spec <- tibble::tribble(
  ~ .new, ~ .old,
  0,      c(4, 6),
  1,      8
)
vec_recode(mtcars$cyl, spec)
#>  [1] 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0

You can recode vectors to a tibble:

spec <- tibble::tibble(
  .new = tibble::tibble(
    x = c("foo", "bar"),
    y = c("quux", "foofy")
  ),
  .old = c(4L, 6L)
)
vec_recode(mtcars$cyl, spec, default = tibble::tibble(x = "plop", y = "plip"))
#> # A tibble: 32 x 2
#>    x     y
#>  * <chr> <chr>
#>  1 bar   foofy
#>  2 bar   foofy
#>  3 foo   quux
#>  4 bar   foofy
#>  5 plop  plip
#>  6 bar   foofy
#>  7 plop  plip
#>  8 foo   quux
#>  9 foo   quux
#> 10 bar   foofy
#> # … with 22 more rows

And you can recode tibbles to vectors:

spec <- tibble::tibble(
  .new = c("foo", "bar"),
  .old = tibble::tibble(
    x = c(1L, 2L),
    y = c(TRUE, FALSE)
  )
)
x <- tibble::tibble(x = c(1, 2, 2, 1), y = c(TRUE, TRUE, FALSE, TRUE))
vec_recode(x, spec, default = "baz")
#> [1] "foo" "baz" "bar" "foo"

In a data cleaning script, all specs can be neatly kept at the top of the file, then we use mutate() and mapping variants to recode variables one by one or in bulk.

Quantile variant that returns a tibble

tibble::as_tibble(as.list(quantile(1:5)))
#> # A tibble: 1 x 5
#>    `0%` `25%` `50%` `75%` `100%`
#>   <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1     1     2     3     4      5

Created on 2019-02-08 by the reprex package (v0.2.1.9000)

Will need to think carefully about how the columns should be named.
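
One possible naming scheme, sketched in base R: prefix each percentage with a letter so the column names are syntactic and need no backticks. Both the quantile_df() name and the "q" prefix are assumptions made for illustration, and the sketch returns a plain data frame rather than a tibble.

```r
# Hypothetical helper: one row, one column per requested quantile.
quantile_df <- function(x, probs = seq(0, 1, 0.25), prefix = "q") {
  values <- stats::quantile(x, probs = probs, names = FALSE)
  names(values) <- paste0(prefix, probs * 100)   # "q0", "q25", ...
  as.data.frame(as.list(values))
}

quantile_df(1:5)
```

Non-integer percentages such as 2.5 would still produce names like q2.5, so the naming question is not fully settled by this scheme.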

modify_vector()

A vector version of modifyList() function would be very handy for, e.g., combining default http request headers with user-specified ones, where user-specified headers should trump the defaults. This is somewhat related to keep_last(), seen (twice, in fact!) in httr's utils.R, which might also be useful.
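
A minimal base R sketch of the idea, assuming both inputs are named atomic vectors. The modify_vector() signature below is a guess at what such a function could look like, not an existing API.

```r
# Hypothetical sketch: entries of `y` override same-named entries of `x`;
# names that appear only in `y` are appended at the end.
modify_vector <- function(x, y) {
  if (length(y) == 0L) {
    return(x)
  }
  stopifnot(!is.null(names(y)), all(nzchar(names(y))))
  x[names(y)] <- y
  x
}

defaults <- c(Accept = "application/json", `User-Agent` = "default-agent")
user <- c(`User-Agent` = "my-app", Authorization = "token")

modify_vector(defaults, user)
```

If a name is repeated in `y`, assignment by name keeps the last value, which is close in spirit to the keep_last() behaviour mentioned above.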

rthis()

From @jennybc:

guess you could call it rthis(), in the spirit of rnorm() et al. where input is a numeric vector of observed data. Then it generates n observations from some reasonable def'n of the empirical distribution. It could literally resample or do convex bootstrap or fit a kernel density estimate, etc.
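
A rough sketch of two of the options mentioned (literal resampling and a kernel density estimate), using only base R. The argument names and the `smooth` switch are assumptions, and the convex bootstrap is not covered.

```r
# Hypothetical sketch of rthis(): draw n values from the empirical
# distribution of the observed numeric data x.
rthis <- function(n, x, smooth = FALSE) {
  x <- x[!is.na(x)]
  stopifnot(is.numeric(x), length(x) >= 2L)

  # Literal resampling with replacement
  draws <- x[sample.int(length(x), size = n, replace = TRUE)]

  if (smooth) {
    # Adding Gaussian noise to resampled points is equivalent to sampling
    # from a Gaussian kernel density estimate with bandwidth `bw`.
    bw <- stats::bw.nrd0(x)
    draws <- draws + stats::rnorm(n, mean = 0, sd = bw)
  }

  draws
}

set.seed(1)
observed <- c(2.1, 3.4, 3.9, 5.0, 7.2)
rthis(3, observed)
rthis(3, observed, smooth = TRUE)
```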

Look into stata's recode

https://www.stata.com/manuals13/drecode.pdf#drecode

Hat tip @leeper

Implement cumany() and cumall()

vec_between()

I think vctrs has all the tools (vec_proxy_compare, ...) for a generic implementation of between, e.g.

library(vctrs)

vec_between <- function(x, left, right) {
  vec_compare(x, left) >= 0 & vec_compare(x, right) <= 0
}

vec_between(1:10, 0, 11)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
vec_between(1:10, 0, 11:20)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
vec_between(1:10, -(1:10), 11)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
vec_between(1:10, -(1:10), 11:20)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

vec_between(letters[11:20], "a", "z")
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Created on 2018-12-18 by the reprex package (v0.2.1.9000)

Revisit `dplyr::coalesce` with `across`

With dplyr 1.0.0 introducing c_across and across I was wondering if it was possible to revisit tidyverse/dplyr#3548, by allowing dplyr::coalesce to work more naturally with the new across or c_across functions.

After reading the row-wise article, I expected dplyr::coalesce to work like rowSums since it naturally works across rows, or at worst it would work like rowwise => sum.

However, coalesce doesn't seem to work with the across family at all, as you can see in the code below.

Would it be possible to make coalesce compatible with the new across workflow?

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:5, 
  w = c(10, NA, NA, NA, 14), 
  x = c(NA, 21, 22, 23, NA), 
  y = c(NA, NA, 32, 33, NA), 
  z = c(NA, NA, NA, 43, 44)
)

## Does coalesce work like rowSums, because
## they both naturally work across rows?
df %>%
  mutate(a = rowSums(across(-id), na.rm = TRUE))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    54
#> 4     4    NA    23    33    43    99
#> 5     5    14    NA    NA    44    58

# No: coalesce doesn't work like rowSums
df %>%
  mutate(a = coalesce(across(-id)))
#> # A tibble: 5 x 6
#>      id     w     x     y     z   a$w    $x    $y    $z
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10    NA    NA    NA
#> 2     2    NA    21    NA    NA    NA    21    NA    NA
#> 3     3    NA    22    32    NA    NA    22    32    NA
#> 4     4    NA    23    33    43    NA    23    33    43
#> 5     5    14    NA    NA    44    14    NA    NA    44



## Maybe it works like sum, since coalesce's argument is `...`
df %>%
  rowwise() %>%
  mutate(a = sum(c_across(-id), na.rm = TRUE))
#> # A tibble: 5 x 6
#> # Rowwise: 
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    54
#> 4     4    NA    23    33    43    99
#> 5     5    14    NA    NA    44    58

# No: coalesce doesn't work with rowwise
df %>%
  rowwise() %>%
  mutate(a = coalesce(c_across(-id)))
#> Error: `mutate()` argument `a` must be recyclable.
#> ℹ `a` is `coalesce(c_across(-id))`.
#> ℹ The error occured in row 1.
#> x `a` can't be recycled to size 1.
#> ℹ `a` must be size 1, not 4.
#> ℹ Did you mean: `a = list(coalesce(c_across(-id)))` ?



## coalesce works if you write out each by hand,
## but that goes against the spirit of the new `across` family
df %>%
  mutate(a = coalesce(w, x, y, z))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

# There is a workaround suggested in tidyverse/dplyr#3548, but it's not very user friendly
# and requires a different package
library(tidyselect)
df %>%
  mutate(a = coalesce(!!!syms(vars_select(names(.), -id))))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

Created on 2020-04-14 by the reprex package (v0.3.0)

Option to coalesce by column with data frames?

Using the vctrs definition of a "missing row" being a missing value for data frames, coalesce() might not do what you expect. Here, only the row with all missing values is updated. It might be nice to have a way to update each column separately.

You could map2() over the data frames, but that would require that you had already cast them to the same data frame type, and I don't think it generalizes that nicely to more than two data frames.

It is possible that we need an idea of vec_coalesce() and df_coalesce() for this new case

# devtools::install_github("r-lib/funs")

library(funs)

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))

df1
#>    x  y
#> 1 NA  1
#> 2  1 NA
#> 3 NA NA

coalesce(df1, df2)
#>    x  y
#> 1 NA  1
#> 2  1 NA
#> 3  2  2

Created on 2020-04-24 by the reprex package (v0.3.0)

Inspired by
https://github.com/tidyverse/dplyr/pull/5142/files#diff-3680f0191de36a0e61d4b24cdb1ab150R149

rows_patch.data.frame <- function(x, y, by = NULL, ..., copy = FALSE, inplace = NULL) {
  y <- auto_copy(x, y, copy = copy)
  y_key <- df_key(y, by)
  x_key <- df_key(x, names(y_key))
  df_inplace(inplace)

  idx <- vctrs::vec_match(y[y_key], x[x_key])
  # FIXME: Check key in x? https://github.com/r-lib/vctrs/issues/1032

  # FIXME: Do we need vec_coalesce()
  new_data <- map2(x[idx, names(y)], y, coalesce)

  x[idx, names(y)] <- new_data
  x
}
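
For comparison, a column-wise coalesce could look roughly like the base R sketch below. df_coalesce() here is a hypothetical illustration of the idea: it handles any number of data frames but only the columns they share with the first, and it uses base R coercion rather than vctrs casting.

```r
# Hypothetical sketch: fill missing cells of `x` column by column,
# taking values from each later data frame in turn.
df_coalesce <- function(x, ...) {
  stopifnot(is.data.frame(x))
  for (y in list(...)) {
    stopifnot(is.data.frame(y), nrow(y) %in% c(1L, nrow(x)))
    for (col in intersect(names(x), names(y))) {
      is_missing <- is.na(x[[col]])
      replacement <- rep_len(y[[col]], nrow(x))
      x[[col]][is_missing] <- replacement[is_missing]
    }
  }
  x
}

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))

df_coalesce(df1, df2)
#>   x y
#> 1 2 1
#> 2 1 2
#> 3 2 2
```

Unlike the row-wise result shown earlier, every missing cell is filled, not just the row where all values are missing.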

prop

Shorter version of prop.table() with na.rm = TRUE

prop <- function(x) x / sum(x, na.rm = TRUE)

Extract ranking functions from dplyr

*_along

tidyverse/purrr#183

And extract out rep_along() and list_along().

Mode, in the statistical sense, at least for categorical variable

Mode of a categorical variable, in the statistical sense. I always feel embarrassed when I explain that R has no built-in way to compute the most frequent level of a factor. Here's one implementation from stackoverflow:

Mode <- function(x, na.rm = TRUE) {
  if(na.rm) {
    x = x[!is.na(x)]
  }
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}
(x <- rep(1:5, c(1,2,3,2,1)))
#> [1] 1 2 2 3 3 3 4 4 5
Mode(x)
#> [1] 3
x[3] <- NA
Mode(x)
#> [1] 3

migrate plyr::mapvalues() to vctrs?

mapvalues() is very useful. I use it often. And I don't know of a good replacement.

As a diehard tidyverse user, this gets awkward; there are lots of posts about headaches from incorrectly loading plyr and dplyr together, and mapvalues currently stands officially outside the tidyverse as library(tidyverse) does not get you access to that function.

As plyr is slowly fading out and has been replaced by dplyr, increasingly more people will find it clunky to call that one great function from an otherwise deprecated package.

Would vctrs be the place for mapvalues, or a similar function, in the tidyverse?

Complete matrix and parallel functions

Vector	Summary	Cumulative	Parallel	Matrix
`+`	`sum`	`cumsum`		`rowSums`
`*`	`prod`	`cumprod`
`&`	`all`	`cumall`
`|`	`any`	`cumany`
`smallest()`	`min`	`cummin`	`pmin`
`greatest()`	`max`	`cummax`	`pmax`

smallest <- function(x, y) if (x <= y) x else y
greatest <- function(x, y) if (x >= y) x else y

cf http://adv-r.had.co.nz/Functionals.html#function-family

It may be possible to avoid the matrix/row family by automatically vectorising over data frames and rows of matrices. OTOH that may be unappealing since it would mean the function sometimes summarised and sometimes transformed.
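
To make the family relationship concrete, the empty "Parallel" cells could be filled by folding the binary operator over the arguments. The names psum(), pprod(), pall() and pany() are hypothetical, and the sketch uses base R recycling rather than vctrs rules.

```r
# Hypothetical helpers: each folds a binary operator over its arguments,
# element by element, in the same way pmin() and pmax() do.
psum  <- function(...) Reduce(`+`, list(...))
pprod <- function(...) Reduce(`*`, list(...))
pall  <- function(...) Reduce(`&`, list(...))
pany  <- function(...) Reduce(`|`, list(...))

psum(1:3, 4:6, 7:9)
#> [1] 12 15 18
pall(c(TRUE, FALSE), c(TRUE, TRUE))
#> [1]  TRUE FALSE

# The summary and cumulative columns come from the same operator:
Reduce(`+`, 1:4)                      # same value as sum(1:4)
Reduce(`+`, 1:4, accumulate = TRUE)   # same values as cumsum(1:4)
```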

Implement ilag() and ilead()

Related to #34

These are variations on lead() and lag() that require an order_by argument, but also respect the "spacing" between order_by observations.

This is very useful for time series, and is a neat feature in Stata. See slides 10-13 https://www.princeton.edu/~otorres/TS101.pdf

Also think about

idiff()
Difference vs seasonal difference operator

Implementation:

library(vctrs)
library(rlang)

ilag_ilead_impl <- function(x, order_by, n, default, fn) {
  vec_assert(x)
  vec_assert(order_by)

  vec_assert(n, size = 1L)
  n <- vec_cast(n, integer(), x_arg = "n")

  x_size <- vec_size(x)
  order_by_size <- vec_size(order_by)

  if (x_size != order_by_size) {
    abort("`x` and `order_by` must have the same size.")
  }

  # vec_any_na()! vctrs#544
  if (any(vec_equal_na(order_by))) {
    abort("`order_by` cannot have `NA` values.")
  }

  if (x_size == 0L) {
    return(x)
  }

  order_by_shift <- fn(order_by, n)

  loc <- vec_match(order_by_shift, order_by)

  out <- vec_slice(x, loc)

  if (!is.null(default)) {
    na_loc <- vec_equal_na(loc)
    default <- vec_cast(default, x, x_arg = "default", to_arg = "x")

    vec_slice(out, na_loc) <- default
  }

  out
}

ilag <- function(x, order_by, n = 1L, default = NULL) {
  ilag_ilead_impl(x, order_by, n, default, `-`)
}

ilead <- function(x, order_by, n = 1L, default = NULL) {
  ilag_ilead_impl(x, order_by, n, default, `+`)
}

Usage:

library(dplyr)

df <- tibble(
  x = c(5, 6, 7, 8),
  i = as.Date("2019-01-01") + c(0, 1, 3, 4)
)

# Notice how the temporal spacing is respected
# We get an `NA` at 2019-01-04 because 2019-01-03 doesn't exist
df %>%
  mutate(
    x_lag = lag(x),
    x_ilag = ilag(x, i)
  )
#> # A tibble: 4 x 4
#>       x i          x_lag x_ilag
#>   <dbl> <date>     <dbl>  <dbl>
#> 1     5 2019-01-01    NA     NA
#> 2     6 2019-01-02     5      5
#> 3     7 2019-01-04     6     NA
#> 4     8 2019-01-05     7      7


# - lag()'s default doesn't respect ordering of any variable
# - lag(order_by) respects ordering but not spacing
# - ilag(order_by) respects ordering and spacing
df_rev <- arrange(df, desc(i))

df_rev %>%
  mutate(
    x_lag = lag(x),
    x_lag_ob = lag(x, order_by = i),
    x_ilag = ilag(x, i)
  )
#> # A tibble: 4 x 5
#>       x i          x_lag x_lag_ob x_ilag
#>   <dbl> <date>     <dbl>    <dbl>  <dbl>
#> 1     8 2019-01-05    NA        7      7
#> 2     7 2019-01-04     8        6     NA
#> 3     6 2019-01-02     7        5      5
#> 4     5 2019-01-01     6       NA     NA

One thought was to let lag() have a respect_spacing parameter, rather than creating a new function. But I think it needs to be a new function, because there are restrictions on the order_by of ilag() that require that it has to be integerish under the hood, which is not a restriction on lag(). Practically, if we had a respect_spacing parameter, a problem would show up with character order_by variables. It would be strange for the usage of respect_spacing to stop this from working:

lag(1:3, order_by = c("a", "b", "c"))
# [1] NA  1  2

lag(1:3, order_by = c("a", "b", "c"), respect_spacing = TRUE)
# Error in order_by - n : non-numeric argument to binary operator

CC @earowang for the original inspiration of the functions. I think you could keep keyed_lag(), which could call this internally. I was excited by your implementation, and thought that it could be useful outside of the tsibble / time series context as well.
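
The idiff() idea listed under "Also think about" could follow the same matching approach. The base R sketch below is an assumption about its shape, not part of the implementation above: it uses match() instead of vctrs, and `n` doubles as the seasonal lag.

```r
# Hypothetical sketch of idiff(): difference against the observation
# exactly `n` index units earlier, or NA when that index is absent.
idiff <- function(x, order_by, n = 1L) {
  stopifnot(length(x) == length(order_by), !anyNA(order_by))
  previous <- match(order_by - n, order_by)
  x - x[previous]
}

x <- c(5, 6, 7, 8)
i <- c(1, 2, 4, 5)    # index 3 is missing

idiff(x, i)
#> [1] NA  1 NA  1

idiff(x, i, n = 3)    # lag-3 ("seasonal") difference
#> [1] NA NA  2  2
```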

Extract vector functions from dplyr

Window functions: lead/lag, rank etc
Cumulative: cumany cumall
Vectorised: if_else recode case_when near

all_same() - check all elements in vector are the same

Ran into this a few times in the last week and thought it might belong here:

all_same(1:10)         # FALSE
all_same(rep(1,10))    # TRUE

There is some old discussion of this on the mailing list. I don't know how efficient the suggested all(x == x[1]) is, or whether it is suitable across all modes, but it has been very handy in dplyr chains when trying to work out whether all the elements of a variable in a grouped data frame are the same:

all_same <- function(x) all(x == x[1])

mtcars %>% 
  group_by(carb) %>%
  summarise(all_same(am))

# A tibble: 6 x 2
#    carb `all_same(am)`
#   <dbl>          <lgl>
# 1     1          FALSE
# 2     2          FALSE
# 3     3           TRUE
# 4     4          FALSE
# 5     6           TRUE
# 6     8           TRUE

colwise functions

To match #5

Would only be usable in mutate-like contexts, but would take a variable selection as the first argument.

df %>% summarise(col_mean(everything()))

Return a tibble, to work with tidyverse/dplyr#2326

Implement vec_fill

Single vector component of tidyr::fill()
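
As an illustration, a downward fill of a single atomic vector can be written in a few lines of base R. fill_down() is a hypothetical name; a real vec_fill() would presumably also cover the other fill directions and go through vctrs.

```r
# Hypothetical sketch: carry the last non-missing value forward.
fill_down <- function(x) {
  present <- !is.na(x)
  idx <- cumsum(present)     # index of the most recent non-missing value
  idx[idx == 0L] <- NA       # leading NAs have nothing to carry forward
  x[present][idx]
}

fill_down(c(NA, 1, NA, NA, 2, NA))
#> [1] NA  1  1  1  2  2
```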

sum(is.na()) helpers

n_absent() and n_present()
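
Assuming "absent" simply means NA, the two helpers could be as small as this:

```r
# Minimal sketch of the two proposed helpers.
n_absent  <- function(x) sum(is.na(x))
n_present <- function(x) sum(!is.na(x))

x <- c(1, NA, 3, NA)
n_absent(x)
#> [1] 2
n_present(x)
#> [1] 2
```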

README, lifecycle, and CRAN availability expectations

Hello,

It would be very helpful for this package to have a README describing its purpose, and how it relates to other R packages, and when one could expect the functionality to be available. The DESCRIPTION file only says "Useful vectorised function" which is not that clear.

For example, it seems like the work done in this repository is some manner of future dependency for dplyr. I was directed to this repo when asking about dplyr::between support for character vectors (tidyverse/dplyr#5122). It looks like this has been implemented already and the associated issue is closed (#26), but it's not clear when this will make it back to dplyr.

Will this package make it to CRAN, and if so, roughly when?

Additionally, it would be helpful if there was a lifecycle designation like what the other tidyverse packages use.

Thank you!

group_map(x, g, f)

Something like this:

group_map <- function(x, g, f, ..., .ptype = NULL) {
  out <- vec_init(list(), length(g))
  for (i in seq_along(g)) {
    out[[i]] <- f(x[g[[i]]], ...)
  }
  vec_c(!!!out, .ptype = .ptype)
}

Most important would be to have a C++ version that would avoid allocation of intermediate vectors, assigning scalars directly into out. Probably could get away with just providing int and double versions for now.

sample() and diag()

Both sample() and diag() are so "flexible" they are hard to program with.

Re: sample(): this might be connected to rthis() (#13). Maybe the smooth bootstrap described there is rsmooth() and the simple resampling discussed here is rthis()?
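
The best-known trap is that sample() treats a single number as "sample from 1:x". The wrapper below follows the pattern suggested in the base R documentation for sample(); the name resample() comes from there, not from this package.

```r
# sample() switches behaviour when given a single number >= 1
length(sample(10))   # a permutation of 1:10, so length 10, not 1

# Safer: always index into x, whatever its length
resample <- function(x, ...) x[sample.int(length(x), ...)]

resample(10)         # always 10
resample(c(3, 5, 7), size = 2)
```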

Implement if_else

Start at https://vctrs.r-lib.org/articles/stability.html#ifelse

Atomic constructors

I'm not sure if {funs} is the right place for it, but it seems like the vec() and dbl() constructors at the very least could live here. Not sure about flatten_vec() and as_double().

I think we are all mainly on the same page about what dbl() should do, but I wanted to outline implementations for it, and how it would connect to map(). Essentially:

map_dbl() == as_double(map())

flat_map_dbl() == dbl(map())

I had implemented a rough draft of a new flatten() here, but I've since realized it is essentially rlang::flatten() in 99% of the cases, so I've used that below instead.

The semantics of dbl() here seem to be exactly the same as with rlang::dbl(), but it goes through vctrs.

There are 2 issues that need to be fixed first. I've added them at the end. One with {rlang} and one with {vctrs}.

library(purrr)
library(vctrs)
library(rlang, warn.conflicts = FALSE)

as_vector <- function(x, ptype) {
  vec_cast(x, ptype)
}

as_double <- function(x) {
  as_vector(x, double())
}

flatten_vec <- function(x, ptype = NULL) {
  x <- flatten(x)
  vec_c(!!! x, .ptype = ptype)
}

vec <- function(..., .ptype = NULL) {
  x <- list2(...)
  flatten_vec(x, .ptype)
}

dbl <- function(...) {
  vec(..., .ptype = double())
}


as_double(c(1L, 2L))
#> [1] 1 2

as_double(list(1, 2, 3))
#> [1] 1 2 3

as_double(list(1:2, 3))
#> Error: Lossy cast from <list> to <double>.
#> * Locations: 1


dbl(1:2, 3)
#> [1] 1 2 3

dbl(list(1, 2, 3))
#> [1] 1 2 3

dbl(list(1:2, 3))
#> [1] 1 2 3


# map_dbl() is map() + as_double()
as_double(map(1:5, ~.x))
#> [1] 1 2 3 4 5

# it is strict, elements must be size 1
as_double(map(1:5, ~c(.x, .x)))
#> Error: Lossy cast from <list> to <double>.
#> * Locations: 1, 2, 3, 4, 5


# flat_map_dbl() is map() + dbl()
# it is less strict on the element size constraint
dbl(map(1:5, ~c(.x, .x)))
#>  [1] 1 1 2 2 3 3 4 4 5 5


# This will be disallowed by:
# https://github.com/r-lib/rlang/issues/885
flatten_vec(data.frame(x = 1), integer())
#> x 
#> 1

# This will be disallowed by:
# https://github.com/r-lib/vctrs/issues/738
# We only want 1 layer of list auto-splicing
dbl(1, list(list(1)))
#> [1] 1 1

Created on 2020-01-09 by the reprex package (v0.3.0.9000)

Implement case_when

Common prefix?

Do we expect functions to have a common prefix? My sense is no: this is sort of a dplyr equivalent for functions.

Faster quantile

tidyverse/dplyr#1183

Implement lead and lag

Start from @DavisVaughan

lag <- function (x, n = 1L) {
  vec_assert(x)
  n <- check_n(n)
  
  if (n == 0L) {
    return(x)
  }
  
  size <- vec_size(x)
  n <- pmin(n, size)
  
  new <- vec_init(x, n)
  old <- vec_slice(x, seq_len(size - n))
  
  vec_c(new, old)
}

#' @export
#' @rdname lag
lead <- function (x, n = 1L) {
  vec_assert(x)
  n <- check_n(n)
  
  if (n == 0L) {
    return(x)
  }
  
  size <- vec_size(x)
  n <- pmin(n, size)
  
  new <- vec_init(x, n)
  old <- vec_slice(x, -seq_len(n))
  
  vec_c(old, new)
}

check_n <- function(n) {
  n <- vec_cast(n, integer())
  vec_assert(n, size = 1L)
  
  if (n < 0L) {
    abort("`n` must be non-negative.")
  }
  n
}

Overall categorisation

Transformation

Flexible

if_else
recode/plyr::revalue
case_when
plyr::mapvalues

Combine

vec_modify (list_modify)
vec_merge

Continuous -> discrete

cut_interval
cut_number
cut_width

Numeric

near
between

trim
prop

Families

roll_
cum_
par_
row_ - is this still needed? or would just be row-vectorised?
i.e. what does min(data.frame) return?
or can it be parallel + splat?

Position

lead
lag

sample
rep_along

interleave

Ranking

row_number
ntile
min_rank
dense_rank
percent_rank
cume_dist

Equality

obj_equal
vec_equal
obj_identical
vec_identical

Missing

fill
replace_na
na_along

String

extract
separate

Summary

Position

first
last
nth

mode

Equality

vec_same == all(vec_equal(x, x[[1]]))

vec_expand()

Potentially useful vec_expand() for inserting NA (or other) values. It is like vec_slice(x, c(1:2, NA_integer_, 3:4)), but the positions are a bit easier to specify:

library(tibble)
library(vctrs)
library(rlang)

vec_expand <- function(x, i, fill = NULL) {
  vec_assert(x)
  
  i <- vec_cast(i, integer())
  i <- vec_unique(i)
  
  if (any(i <= 0L)) {
    abort("`i` must be positive.")
  }
  
  size_i <- vec_size(i)
  size_x <- vec_size(x)
  size_out <- size_x + size_i
  
  slicer <- vec_init(integer(), size_out)
  pos_x_in_out <- seq_len(size_out)[-i]
  vec_slice(slicer, pos_x_in_out) <- seq_len(size_x)
  
  out <- vec_slice(x, slicer)
  
  if (is.null(fill)) {
    return(out)
  }
  
  vec_slice(out, i) <- fill
  
  out
}

vec_expand(1:5, c(2, 5))
#> [1]  1 NA  2  3 NA  4  5

df <- tibble(
  x = 1:5,
  y = 6:10
)

vec_expand(df, 2)
#> # A tibble: 6 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    NA    NA
#> 3     2     7
#> 4     3     8
#> 5     4     9
#> 6     5    10

vec_expand(df, -2)
#> `i` must be positive.

vec_expand(df, c(2, 2, 7, 4))
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    NA    NA
#> 3     2     7
#> 4    NA    NA
#> 5     3     8
#> 6     4     9
#> 7    NA    NA
#> 8     5    10

vec_expand(df, c(2, 7, 4), fill = data.frame(x = -1, y = -2))
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    -1    -2
#> 3     2     7
#> 4    -1    -2
#> 5     3     8
#> 6     4     9
#> 7    -1    -2
#> 8     5    10

vec_expand(df, c(2, 4), fill = data.frame(x = c(-1, -1), y = c(-2, -3)))
#> # A tibble: 7 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    -1    -2
#> 3     2     7
#> 4    -1    -3
#> 5     3     8
#> 6     4     9
#> 7     5    10

Created on 2019-10-04 by the reprex package (v0.2.1)

Think about equality

Vectorised identical()? Vectorised all.equal()?

Bring back plyr::round_any()
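
For reference, the numeric behaviour can be sketched in one line. This mirrors how the plyr function is understood to work for plain numbers (divide, round, multiply back); treat the exact signature as an assumption.

```r
# Sketch: round x to the nearest multiple of `accuracy`, using rounding
# function `f` (round, floor, ceiling, ...).
round_any <- function(x, accuracy, f = round) {
  f(x / accuracy) * accuracy
}

round_any(135, 10)
#> [1] 140
round_any(135, 100, floor)
#> [1] 100
```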

cut helpers

Extract from ggplot2

dplyr:::replace_with function

Moving this from tidyverse/dplyr#2040

I think dplyr:::replace_with can be useful and should be exported...
if_else uses it, coalesce uses it and probably other internal functions as well. It's a useful way to recode data even though similar things can be achieved with if_else.

It should work with typed NAs though but #2038 covers that, I think.

Provide assignment versions of extraction functions

e.g. last<- - tidyverse/dplyr#3022
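
A replacement function in R is an ordinary function whose name ends in `<-` and whose final argument is called `value`. The sketch below shows what an assignment version of last() might look like for non-empty vectors; both definitions are illustrative only.

```r
# Hypothetical extraction function and its assignment counterpart.
last <- function(x) x[[length(x)]]

`last<-` <- function(x, value) {
  stopifnot(length(x) > 0L, length(value) == 1L)
  x[[length(x)]] <- value
  x
}

x <- c(1, 2, 3)
last(x) <- 10
x
#> [1]  1  2 10
```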

Shortcuts for particularly important/useful vctrs functions

v = vec_c()
s = simplify() (so that s(map()) is our sapply())

Implement first, last, nth

e.g.

first <- function(x, default = NA) {
  if (vec_size(x) == 0) {
    vec_assert(default, size = 1)
    vec_cast(default, x)
  } else {
    vec_slice(x, 1L)
  }
}

Implement n_distinct()

As simple wrapper around vec_unique_count(), that can easily take multiple vectors.

Note that tibble() is slow so need to think about this, and probably use a lower-level constructor from vctrs. Will need to use same approach in #40

Implement na_if()

Needs to be implemented in such a way that you can replace NaN values:

dplyr::na_if(c(NA, NaN, 1), 1)
#> [1]  NA NaN  NA

(from tidyverse/dplyr#4627)
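
Plain `==` cannot do this, because comparing anything with NaN yields NA. A base R sketch that treats NaN as equal to NaN, restricted to numeric input and a length-one `y`, might look like the following; na_if_sketch() is a hypothetical name.

```r
# Hypothetical sketch: like na_if(), but NaN in `x` matches `y = NaN`.
na_if_sketch <- function(x, y) {
  stopifnot(is.numeric(x), is.numeric(y), length(y) == 1L)
  hit <- (x == y) | (is.nan(x) & is.nan(y))
  x[!is.na(hit) & hit] <- NA
  x
}

na_if_sketch(c(NA, NaN, 1), NaN)
#> [1] NA NA  1
na_if_sketch(c(NA, NaN, 1), 1)
#> [1]  NA NaN  NA
```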

Interleaving vectors

Just discovered this package. A package for handling vectors sounds very useful. Here’s a feature request (with code) for a vector operation I occasionally find use for.

A function for interleaving vectors would be nice. Here’s some code for such a function:

library(purrr)
interleave = function(...) {
  vecs = list(...)
  n_vecs = length(vecs)                    # Number of arguments/vectors
  max_n = vecs %>% map_int(length) %>% max # Max number of elements in a vector
  n_out = n_vecs * max_n                   # Number of elements in output vector
  x = vector(mode = mode(vecs[[1]]), n_vecs * max_n)
  if (n_out > 0) {
    for (i in seq_along(vecs))
    {
      x[seq(1, n_out, by = n_vecs) + i - 1] =
        rep(vecs[[i]], length.out = max_n)
    }
  }
  x
}

A few examples:

> # Some test data
> x = 1:4
> y = 10 * x
> z = 100 * (1:5)
> 
> # Interleaving two vectors
> interleave(x, y)
[1]  1 10  2 20  3 30  4 40
> 
> # Interleaving vectors with different
> # number of elements causes short
> # vectors to be recycled
> interleave(x, y, z)
 [1]   1  10 100   2  20 200   3  30 300   4  40 400   1  10 500
> 
> # Interleaving vectors of different classes/modes
> # causes class coercion
> interleave(x, LETTERS[1:4])
[1] "1" "A" "2" "B" "3" "C" "4" "D"
> 
> 
> ## A few edge cases
> 
> # Interleaving a single vector
> interleave(x)
[1] 1 2 3 4
> 
> # Interleaving empty vectors
> interleave(numeric(), y, 99)
 [1] NA 10 99 NA 20 99 NA 30 99 NA 40 99
> 
> # Interleaving a single empty vector
> interleave(logical())
logical(0)

vectorise()

vectorise <- function(fn, .ptype = NULL) {
  function(.x, ...) {
    map_vec(.x, fn, ..., .ptype = .ptype)
  }
}

coalesce

From #17, by @DavisVaughan

library(rlang)
library(vctrs)

vec_coalesce <- function(..., .ptype = NULL) {
  args <- list2(...)
  
  n_args <- vec_size(args) 
  
  if (n_args == 0L) {
    return(NULL)
  }
  
  if (n_args == 1L) {
    out <- args[[1L]]
    return(out)
  }
  
  args <- vec_cast_common(!!! args, .to = .ptype)
  args <- vec_recycle_common(!!! args)
  
  out <- args[[1L]]
  args <- args[-1L]
  
  for (arg in args) {
    is_na <- vec_equal_na(out)
    
    if (!any(is_na)) {
      break
    }
    
    vec_slice(out, is_na) <- vec_slice(arg, is_na)
  }
  
  out
}

vec_coalesce()
#> NULL

vec_coalesce(1, 0)
#> [1] 1

vec_coalesce(1, FALSE, .ptype = logical())
#> [1] TRUE

vec_coalesce(NA, 1)
#> [1] 1

vec_coalesce(c(1, NA, 2), 0L)
#> [1] 1 0 2

vec_coalesce(
  data.frame(x = c(1, NA, 3)), 
  data.frame(x = 2)
)
#>   x
#> 1 1
#> 2 2
#> 3 3

# a bit odd, but this technically makes sense
vec_coalesce(
  data.frame(x = c(1, NA)), 
  data.frame(x = 2, y = 3)
)
#>   x  y
#> 1 1 NA
#> 2 2  3

vec_coalesce(
  factor(c("x", "y", NA, "x", NA)),
  factor("MISSING!")
)
#> [1] x        y        MISSING! x        MISSING!
#> Levels: x y MISSING!

# Common size is used as the reference
vec_coalesce(
  1,
  c(1, 2, 3)
)
#> [1] 1 1 1

vec_coalesce(
  NA,
  c(1, 2, 3)
)
#> [1] 1 2 3

Should this repo live in tidyverse?

And does it need a better name?

Provide trim()

trim: a variant of pmin(pmax(x, min_val), max_val)

Original issue: tidyverse/dplyr#2108, CC @einarhjorleifsson
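
A minimal sketch of that variant, with argument names taken from the expression above and an added check that the bounds are ordered:

```r
# Sketch: clamp x to the interval [min_val, max_val].
trim <- function(x, min_val, max_val) {
  stopifnot(all(min_val <= max_val))
  pmin(pmax(x, min_val), max_val)
}

trim(c(-5, 2, 10), 0, 5)
#> [1] 0 2 5
```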

Optional vectorised arithmetic, comparison, and logical operators

Provide some way to use vec_compare(), vec_arith(), etc within a session.

Would use vctrs recycling and coercion rules.

need weighted_mean() to avoid common problem

I teach classes where students learn and use the tidyverse, and I've been noticing that a large proportion are getting wrong results from weighted.mean() without knowing it.

The students learn to use dplyr::count() with the wt argument. When they use weighted.mean(), they also use the wt argument. The weights argument for weighted.mean() is w, not wt, and since it uses ..., those who use wt don't receive an error message. Instead, weighted.mean() returns the unweighted mean, which is almost certainly not what the user intended.

May I suggest a new tidyverse function weighted_mean() that uses wt for its weights argument to be consistent with count() and that gives an error message when used incorrectly.
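
The problem and one possible fix can be sketched as follows. The weighted_mean() below is a hypothetical illustration, not an existing function: it has no `...`, so a misnamed argument is an error instead of being silently ignored.

```r
x <- c(1, 2, 3)
w <- c(3, 1, 1)

# `wt` is swallowed by `...`, so this is the unweighted mean
weighted.mean(x, wt = w)
#> [1] 2

# Hypothetical replacement: `wt` is the weights argument and there is no `...`
weighted_mean <- function(x, wt, na.rm = FALSE) {
  stopifnot(is.numeric(x), is.numeric(wt), length(x) == length(wt))
  if (na.rm) {
    keep <- !is.na(x)
    x <- x[keep]
    wt <- wt[keep]
  }
  sum(x * wt) / sum(wt)
}

weighted_mean(x, wt = w)
#> [1] 1.6

# A misnamed argument now fails loudly ("unused argument"):
# weighted_mean(x, weights = w)
```

One caveat: R's partial argument matching means `w = w` would still be accepted as `wt`, which happens to give the intended result here.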
