
funs's Introduction

tidyverse


Overview

The tidyverse is a set of packages that work in harmony because they share common data representations and API design. The tidyverse package is designed to make it easy to install and load core packages from the tidyverse in a single command.

If you’d like to learn how to use the tidyverse effectively, the best place to start is R for Data Science (2e).

Installation

# Install from CRAN
install.packages("tidyverse")
# Install the development version from GitHub
# install.packages("pak")
pak::pak("tidyverse/tidyverse")

If you’re compiling from source, you can run pak::pkg_system_requirements("tidyverse") to see the complete set of system packages needed on your machine.

Usage

library(tidyverse) will load the core tidyverse packages: dplyr, forcats, ggplot2, lubridate, purrr, readr, stringr, tibble and tidyr.

You also get a condensed summary of conflicts with other packages you have loaded:

library(tidyverse)
#> ── Attaching core tidyverse packages ─────────────────── tidyverse 2.0.0.9000 ──
#> ✔ dplyr     1.1.3     ✔ readr     2.1.4
#> ✔ forcats   1.0.0     ✔ stringr   1.5.0
#> ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

You can see conflicts created later with tidyverse_conflicts():

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
tidyverse_conflicts()
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ✖ MASS::select()  masks dplyr::select()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

And you can check that all tidyverse packages are up-to-date with tidyverse_update():

tidyverse_update()
#> The following packages are out of date:
#>  * broom (0.4.0 -> 0.4.1)
#>  * DBI   (0.4.1 -> 0.5)
#>  * Rcpp  (0.12.6 -> 0.12.7)
#>  
#> Start a clean R session then run:
#> install.packages(c("broom", "DBI", "Rcpp"))

Packages

As well as the core tidyverse, installing this package also installs a selection of other packages that you’re likely to use frequently, but probably not in every analysis. This includes packages for:

  • Working with specific types of vectors:

    • hms, for times.
  • Importing other types of data:

    • feather, for sharing with Python and other languages.
    • haven, for SPSS, SAS and Stata files.
    • httr, for web APIs.
    • jsonlite, for JSON.
    • readxl, for .xls and .xlsx files.
    • rvest, for web scraping.
    • xml2, for XML.
  • Modelling:

    • modelr, for modelling within a pipeline.
    • broom, for turning models into tidy data.

Code of Conduct

Please note that the tidyverse project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

funs's People

Contributors

davisvaughan, hadley, romainfrancois

funs's Issues

`vec_recode()`

Following https://github.com/lionel-/recode/blob/master/R/recode.R

Specify the mapping of values with a tibble of .new and .old columns (here created with keys(), not sure about this helper). The new column of keys acts as a generalisation of names which preserves the types. Missing values can be encoded in the spec:

keys(0:2, c(4L, 6L, 8L))
#> # A tibble: 3 x 2
#>    .new  .old
#>   <int> <int>
#> 1     0     4
#> 2     1     6
#> 3     2     8

tibble::tribble(
  ~ .new, ~ .old,
  0L, 4L,
  1L, 6L,
  2L, 8L
)
#> # A tibble: 3 x 2
#>    .new  .old
#>   <int> <int>
#> 1     0     4
#> 2     1     6
#> 3     2     8

Basic usage:

vec_recode(mtcars$cyl, keys(0:2, c(4L, 6L, 8L)))
#>  [1] 1 1 0 1 2 1 2 0 0 1 1 2 2 2 2 2 2 0 0 0 0 2 2 2 2 0 0 0 2 1 2 0

vec_recode(mtcars$cyl, keys(0:1, c(4L, 6L)))
#>  [1] 1 1 0 1 8 1 8 0 0 1 1 8 8 8 8 8 8 0 0 0 0 8 8 8 8 0 0 0 8 1 8 0

vec_recode(mtcars$cyl, keys(0:1, c(4L, 6L)), default = 1.5)
#>  [1] 1.0 1.0 0.0 1.0 1.5 1.0 1.5 0.0 0.0 1.0 1.0 1.5 1.5 1.5 1.5 1.5 1.5 0.0
#> [19] 0.0 0.0 0.0 1.5 1.5 1.5 1.5 0.0 0.0 0.0 1.5 1.0 1.5 0.0

vec_recode(mtcars$vs, keys(c("zero", "one"), 0:1))
#>  [1] "zero" "zero" "one"  "one"  "zero" "one"  "zero" "one"  "one"  "one"
#> [11] "one"  "zero" "zero" "zero" "zero" "zero" "zero" "one"  "one"  "one"
#> [21] "one"  "zero" "zero" "zero" "zero" "one"  "zero" "one"  "zero" "zero"
#> [31] "zero" "one"

spec <- keys(c("FOO", "missing"), c("foo", NA))
vec_recode( c("foo", "bar", NA, "foo"), spec, default = "default")
#> [1] "FOO"     "default" "missing" "FOO"

# Corresponding dplyr code:
dplyr::recode(mtcars$cyl, `4` = 0, `6` = 1, `8` = 2)
dplyr::recode(mtcars$cyl, `4` = 0, `6` = 1)
dplyr::recode(mtcars$cyl, `4` = 0, `6` = 1, .default = 1.5)
dplyr::recode(mtcars$vs, `0` = "zero", `1` = "one")
dplyr::recode(c("foo", "bar", NA, "foo"), `foo` = "FOO", .default = "default", .missing = "missing")

You can recode multiple values to the same key by supplying a list column in .old:

spec <- tibble::tribble(
  ~ .new, ~ .old,
  0,      c(4, 6),
  1,      8
)
vec_recode(mtcars$cyl, spec)
#>  [1] 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0

You can recode vectors to a tibble:

spec <- tibble::tibble(
  .new = tibble::tibble(
    x = c("foo", "bar"),
    y = c("quux", "foofy")
  ),
  .old = c(4L, 6L)
)
vec_recode(mtcars$cyl, spec, default = tibble::tibble(x = "plop", y = "plip"))
#> # A tibble: 32 x 2
#>    x     y
#>  * <chr> <chr>
#>  1 bar   foofy
#>  2 bar   foofy
#>  3 foo   quux
#>  4 bar   foofy
#>  5 plop  plip
#>  6 bar   foofy
#>  7 plop  plip
#>  8 foo   quux
#>  9 foo   quux
#> 10 bar   foofy
#> # … with 22 more rows

And you can recode tibbles to vectors:

spec <- tibble::tibble(
  .new = c("foo", "bar"),
  .old = tibble::tibble(
    x = c(1L, 2L),
    y = c(TRUE, FALSE)
  )
)
x <- tibble::tibble(x = c(1, 2, 2, 1), y = c(TRUE, TRUE, FALSE, TRUE))
vec_recode(x, spec, default = "baz")
#> [1] "foo" "baz" "bar" "foo"

In data cleaning scripts, all specs can be neatly kept at the top of the file, then we use mutate() and mapping variants to recode variables one by one or in bulk.

Quantile variant that returns a tibble

tibble::as_tibble(as.list(quantile(1:5)))
#> # A tibble: 1 x 5
#>    `0%` `25%` `50%` `75%` `100%`
#>   <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1     1     2     3     4      5

Created on 2019-02-08 by the reprex package (v0.2.1.9000)

Will need to think carefully about how the columns should be named.

modify_vector()

A vector version of modifyList() function would be very handy for, e.g., combining default http request headers with user-specified ones, where user-specified headers should trump the defaults. This is somewhat related to keep_last(), seen (twice, in fact!) in httr's utils.R, which might also be useful.
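
A minimal base R sketch of the idea, assuming both inputs are named atomic vectors and that a value in the second vector should win over a value with the same name in the first. The name modify_vector() comes from the issue title; the argument names and header values are illustrative only, and names are matched case-sensitively here.

```r
# Hypothetical sketch: overwrite or append elements of `x` by name,
# in the spirit of modifyList().
modify_vector <- function(x, y) {
  if (length(y) == 0L) {
    return(x)
  }
  stopifnot(!is.null(names(y)))
  # Assigning by name replaces existing elements and appends new ones
  x[names(y)] <- y
  x
}

defaults <- c(Accept = "*/*", `User-Agent` = "default-agent")
user <- c(`User-Agent` = "my-agent", Authorization = "token")
merged <- modify_vector(defaults, user)
merged
```

Here the user-supplied User-Agent replaces the default, Accept is kept, and Authorization is appended at the end.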

rthis()

From @jennybc:

guess you could call it rthis(), in the spirit of rnorm() et al. where input is a numeric vector of observed data. Then it generates n observations from some reasonable def'n of the empirical distribution. It could literally resample or do convex bootstrap or fit a kernel density estimate, etc.
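
One way this could look in base R, as a rough sketch only. The signature rthis(n, x, smooth) is an assumption, and the smoothed variant assumes a Gaussian kernel with the default bandwidth from stats::bw.nrd0(), which needs at least two observations. Drawing from a Gaussian kernel density estimate is the same as resampling the data and adding kernel noise.

```r
# Hypothetical sketch: draw `n` values from the empirical distribution of `x`.
rthis <- function(n, x, smooth = FALSE) {
  x <- x[!is.na(x)]
  stopifnot(is.numeric(x), length(x) > 0L)

  # Resample by position so a length-1 `x` is not treated as 1:x
  out <- x[sample.int(length(x), size = n, replace = TRUE)]

  if (smooth) {
    # Smoothed bootstrap: add Gaussian noise with the KDE bandwidth
    out <- out + stats::rnorm(n, mean = 0, sd = stats::bw.nrd0(x))
  }
  out
}

set.seed(1)
obs <- c(2.1, 3.4, 3.9, 5.0, 7.2)
plain <- rthis(100, obs)
smoothed <- rthis(100, obs, smooth = TRUE)
```

The plain draw only ever returns observed values, while the smoothed draw can return values between and slightly beyond them.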

vec_between()

I think vctrs has all the tools (vec_proxy_compare, ...) for a generic implementation of between, e.g.

library(vctrs)

vec_between <- function(x, left, right) {
  vec_compare(x, left) >= 0 & vec_compare(x, right) <= 0
}

vec_between(1:10, 0, 11)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
vec_between(1:10, 0, 11:20)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
vec_between(1:10, -(1:10), 11)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
vec_between(1:10, -(1:10), 11:20)
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

vec_between(letters[11:20], "a", "z")
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Created on 2018-12-18 by the reprex package (v0.2.1.9000)

Revisit `dplyr::coalesce` with `across`

With dplyr 1.0.0 introducing c_across and across I was wondering if it was possible to revisit tidyverse/dplyr#3548, by allowing dplyr::coalesce to work more naturally with the new across or c_across functions.

After reading the row-wise article, I expected dplyr::coalesce to work like rowSums since it naturally works across rows, or at worst it would work like rowwise => sum.

However, coalesce doesn't seem to work with the across family at all, as you can see in the code below.

Would it be possible to make coalesce compatible with the new across workflow?

library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  id = 1:5, 
  w = c(10, NA, NA, NA, 14), 
  x = c(NA, 21, 22, 23, NA), 
  y = c(NA, NA, 32, 33, NA), 
  z = c(NA, NA, NA, 43, 44)
)

## Does coalesce work like rowSums, because
## they both naturally work across rows?
df %>%
  mutate(a = rowSums(across(-id), na.rm = TRUE))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    54
#> 4     4    NA    23    33    43    99
#> 5     5    14    NA    NA    44    58

# No: coalesce doesn't work like rowSums
df %>%
  mutate(a = coalesce(across(-id)))
#> # A tibble: 5 x 6
#>      id     w     x     y     z   a$w    $x    $y    $z
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10    NA    NA    NA
#> 2     2    NA    21    NA    NA    NA    21    NA    NA
#> 3     3    NA    22    32    NA    NA    22    32    NA
#> 4     4    NA    23    33    43    NA    23    33    43
#> 5     5    14    NA    NA    44    14    NA    NA    44



## Maybe it works like sum, since coalesce's argument is `...`
df %>%
  rowwise() %>%
  mutate(a = sum(c_across(-id), na.rm = TRUE))
#> # A tibble: 5 x 6
#> # Rowwise: 
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    54
#> 4     4    NA    23    33    43    99
#> 5     5    14    NA    NA    44    58

# No: coalesce doesn't work with rowwise
df %>%
  rowwise() %>%
  mutate(a = coalesce(c_across(-id)))
#> Error: `mutate()` argument `a` must be recyclable.
#> ℹ `a` is `coalesce(c_across(-id))`.
#> ℹ The error occured in row 1.
#> x `a` can't be recycled to size 1.
#> ℹ `a` must be size 1, not 4.
#> ℹ Did you mean: `a = list(coalesce(c_across(-id)))` ?



## coalesce works if you write out each by hand,
## but that goes against the spirit of the new `across` family
df %>%
  mutate(a = coalesce(w, x, y, z))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

# there is a work around suggested in tidyverse/dplyr#3548, but it's not very user friendly
# and requires a different package
library(tidyselect)
df %>%
  mutate(a = coalesce(!!!syms(vars_select(names(.), -id))))
#> # A tibble: 5 x 6
#>      id     w     x     y     z     a
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1    10    NA    NA    NA    10
#> 2     2    NA    21    NA    NA    21
#> 3     3    NA    22    32    NA    22
#> 4     4    NA    23    33    43    23
#> 5     5    14    NA    NA    44    14

Created on 2020-04-14 by the reprex package (v0.3.0)
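
For comparison, the intended result can be expressed without tidy evaluation as a left fold over the selected columns. This is only a base R sketch: coalesce2() is a hypothetical two-argument helper, not a dplyr function, and it assumes all columns have the same type and length.

```r
df <- data.frame(
  id = 1:5,
  w = c(10, NA, NA, NA, 14),
  x = c(NA, 21, 22, 23, NA),
  y = c(NA, NA, 32, 33, NA),
  z = c(NA, NA, NA, 43, 44)
)

# Hypothetical helper: fill the missing values of `a` from `b`
coalesce2 <- function(a, b) {
  miss <- is.na(a)
  a[miss] <- b[miss]
  a
}

# Fold the helper over every column except `id`, left to right
df$a <- Reduce(coalesce2, df[setdiff(names(df), "id")])
df$a
```

The fold gives the same column `a` as writing coalesce(w, x, y, z) by hand.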

Option to coalesce by column with data frames?

Because vctrs treats a data frame row as missing only when every value in that row is missing, coalesce() might not do what you expect with data frames. Here, only the row with all missing values is updated. It might be nice to have a way to update each column separately.

You could map2() over the data frames, but that would require that you had already cast them to the same data frame type, and I don't think it generalizes that nicely to more than two data frames.

We may need separate notions of vec_coalesce() and df_coalesce() for this new case.

# devtools::install_github("r-lib/funs")

library(funs)

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))

df1
#>    x  y
#> 1 NA  1
#> 2  1 NA
#> 3 NA NA

coalesce(df1, df2)
#>    x  y
#> 1 NA  1
#> 2  1 NA
#> 3  2  2

Created on 2020-04-24 by the reprex package (v0.3.0)

Inspired by
https://github.com/tidyverse/dplyr/pull/5142/files#diff-3680f0191de36a0e61d4b24cdb1ab150R149

rows_patch.data.frame <- function(x, y, by = NULL, ..., copy = FALSE, inplace = NULL) {
  y <- auto_copy(x, y, copy = copy)
  y_key <- df_key(y, by)
  x_key <- df_key(x, names(y_key))
  df_inplace(inplace)

  idx <- vctrs::vec_match(y[y_key], x[x_key])
  # FIXME: Check key in x? https://github.com/r-lib/vctrs/issues/1032

  # FIXME: Do we need vec_coalesce()
  new_data <- map2(x[idx, names(y)], y, coalesce)

  x[idx, names(y)] <- new_data
  x
}
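
The column-by-column update described in this issue could be sketched in base R as follows. The name df_coalesce() is taken from the discussion above, but this body is only an assumption about what it might do: it requires two data frames with identical column names and the same number of rows, and it does no type casting.

```r
# Hypothetical sketch: coalesce two data frames one column at a time.
df_coalesce <- function(x, y) {
  stopifnot(
    is.data.frame(x), is.data.frame(y),
    identical(names(x), names(y)),
    nrow(x) == nrow(y)
  )
  coalesce_col <- function(a, b) {
    miss <- is.na(a)
    a[miss] <- b[miss]
    a
  }
  # `x[] <-` keeps the data frame structure while replacing every column
  x[] <- Map(coalesce_col, x, y)
  x
}

df1 <- data.frame(x = c(NA, 1, NA), y = c(1, NA, NA))
df2 <- data.frame(x = c(2, 2, 2), y = c(2, 2, 2))
res <- df_coalesce(df1, df2)
res
```

Unlike the row-wise result shown earlier, every missing cell is filled here, not just the row that is entirely missing.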

prop

Shorter version of prop.table() with na.rm = TRUE

prop <- function(x) x / sum(x, na.rm = TRUE)
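
A short illustration of how this differs from prop.table() when the input contains a missing value (the input vector is made up for the example):

```r
prop <- function(x) x / sum(x, na.rm = TRUE)

x <- c(1, 1, 2, NA)

# The missing value is ignored in the total but stays missing in the result
p <- prop(x)
p

# prop.table() divides by sum(x), which is NA here, so every element is NA
pt <- prop.table(x)
pt
```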

Mode, in the statistical sense, at least for categorical variable

Mode of a categorical variable, in the statistical sense. I always feel embarrassed when I explain that R has no built-in way to compute the most frequent level of a factor. Here's one implementation from stackoverflow:

Mode <- function(x, na.rm = TRUE) {
  if(na.rm) {
    x = x[!is.na(x)]
  }
  ux <- unique(x)
  return(ux[which.max(tabulate(match(x, ux)))])
}
(x <- rep(1:5, c(1,2,3,2,1)))
#> [1] 1 2 2 3 3 3 4 4 5
Mode(x)
#> [1] 3
x[3] <- NA
Mode(x)
#> [1] 3

migrate plyr::mapvalues() to vctrs?

mapvalues() is very useful. I use it often. And I don't know of a good replacement.

As a diehard tidyverse user, this gets awkward; there are lots of posts about headaches from incorrectly loading plyr and dplyr together, and mapvalues currently stands officially outside the tidyverse as library(tidyverse) does not get you access to that function.

As plyr is slowly fading out and has been replaced by dplyr, increasingly more people will find it clunky to call that one great function from an otherwise deprecated package.

Would vctrs be the place for mapvalues, or a similar function, in the tidyverse?
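
For reference, the core behaviour being asked for can be sketched in a few lines of base R. The function below is a hypothetical stand-in loosely modelled on the from/to interface of mapvalues(), not the plyr implementation; it assumes `x` is a plain atomic vector (not a factor) and leaves unmatched values untouched.

```r
# Hypothetical sketch: replace each value found in `from` with the
# value at the same position in `to`.
map_values <- function(x, from, to) {
  stopifnot(length(from) == length(to))
  loc <- match(x, from)
  hit <- !is.na(loc)
  x[hit] <- to[loc[hit]]
  x
}

recoded <- map_values(c("a", "b", "c", "a"), from = c("a", "b"), to = c("A", "B"))
recoded
```

Note that if `to` has a different type from `x`, the usual base R coercion rules apply, which is one of the things a vctrs-based version could make stricter.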

Complete matrix and parallel functions

Vector      Summary  Cumulative  Parallel  Matrix
+           sum      cumsum      (none)    rowSums
*           prod     cumprod     (none)    (none)
&           all      cumall      (none)    (none)
|           any      cumany      (none)    (none)
smallest()  min      cummin      pmin      (none)
greatest()  max      cummax      pmax      (none)

smallest <- function(x, y) if (x <= y) x else y
greatest <- function(x, y) if (x >= y) x else y

cf http://adv-r.had.co.nz/Functionals.html#function-family

It may be possible to avoid the matrix/row family by automatically vectorising over data frames and rows of matrices. OTOH that may be unappealing since it would mean the function sometimes summarised and sometimes transformed.
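
To make the relationship between the columns concrete, here is a small base R sketch showing how the summary, cumulative and parallel variants all fall out of one scalar binary function. The input vectors are arbitrary, and Reduce() and mapply() are used purely for illustration; they are far slower than the built-in versions.

```r
smallest <- function(x, y) if (x <= y) x else y

x <- c(3, 1, 4, 1, 5)
y <- c(2, 7, 1, 8, 2)

# Summary: fold the binary function over the vector
s <- Reduce(smallest, x)

# Cumulative: keep every intermediate result of the fold
cm <- Reduce(smallest, x, accumulate = TRUE)

# Parallel: apply the binary function to matching pairs of elements
pm <- mapply(smallest, x, y)
```

These three results agree with min(x), cummin(x) and pmin(x, y) respectively.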

Implement ilag() and ilead()

Related to #34

These are variations on lead() and lag() that require an order_by argument, but also respect the "spacing" between order_by observations.

This is very useful for time series, and is a neat feature in Stata. See slides 10-13 https://www.princeton.edu/~otorres/TS101.pdf

Also think about

  • idiff()
  • Difference vs seasonal difference operator

Implementation:

library(vctrs)
library(rlang)

ilag_ilead_impl <- function(x, order_by, n, default, fn) {
  vec_assert(x)
  vec_assert(order_by)

  vec_assert(n, size = 1L)
  n <- vec_cast(n, integer(), x_arg = "n")

  x_size <- vec_size(x)
  order_by_size <- vec_size(order_by)

  if (x_size != order_by_size) {
    abort("`x` and `order_by` must have the same size.")
  }

  # vec_any_na()! vctrs#544
  if (any(vec_equal_na(order_by))) {
    abort("`order_by` cannot have `NA` values.")
  }

  if (x_size == 0L) {
    return(x)
  }

  order_by_shift <- fn(order_by, n)

  loc <- vec_match(order_by_shift, order_by)

  out <- vec_slice(x, loc)

  if (!is.null(default)) {
    na_loc <- vec_equal_na(loc)
    default <- vec_cast(default, x, x_arg = "default", to_arg = "x")

    vec_slice(out, na_loc) <- default
  }

  out
}

ilag <- function(x, order_by, n = 1L, default = NULL) {
  ilag_ilead_impl(x, order_by, n, default, `-`)
}

ilead <- function(x, order_by, n = 1L, default = NULL) {
  ilag_ilead_impl(x, order_by, n, default, `+`)
}

Usage:

library(dplyr)

df <- tibble(
  x = c(5, 6, 7, 8),
  i = as.Date("2019-01-01") + c(0, 1, 3, 4)
)

# Notice how the temporal spacing is respected
# We get an `NA` at 2019-01-04 because 2019-01-03 doesn't exist
df %>%
  mutate(
    x_lag = lag(x),
    x_ilag = ilag(x, i)
  )
#> # A tibble: 4 x 4
#>       x i          x_lag x_ilag
#>   <dbl> <date>     <dbl>  <dbl>
#> 1     5 2019-01-01    NA     NA
#> 2     6 2019-01-02     5      5
#> 3     7 2019-01-04     6     NA
#> 4     8 2019-01-05     7      7


# - lag()'s default doesn't respect ordering of any variable
# - lag(order_by) respects ordering but not spacing
# - ilag(order_by) respects ordering and spacing
df_rev <- arrange(df, desc(i))

df_rev %>%
  mutate(
    x_lag = lag(x),
    x_lag_ob = lag(x, order_by = i),
    x_ilag = ilag(x, i)
  )
#> # A tibble: 4 x 5
#>       x i          x_lag x_lag_ob x_ilag
#>   <dbl> <date>     <dbl>    <dbl>  <dbl>
#> 1     8 2019-01-05    NA        7      7
#> 2     7 2019-01-04     8        6     NA
#> 3     6 2019-01-02     7        5      5
#> 4     5 2019-01-01     6       NA     NA

One thought was to let lag() have a respect_spacing parameter, rather than creating a new function. But I think it needs to be a new function, because ilag() requires its order_by to be integerish under the hood, which is not a restriction on lag(). Practically, if we had a respect_spacing parameter, a problem would show up with character order_by variables. It would be strange for the usage of respect_spacing to stop this from working:

lag(1:3, order_by = c("a", "b", "c"))
# [1] NA  1  2

lag(1:3, order_by = c("a", "b", "c"), respect_spacing = TRUE)
# Error in order_by - n : non-numeric argument to binary operator

CC @earowang for the original inspiration of the functions. I think you could keep keyed_lag(), which could call this internally. I was excited by your implementation, and thought that it could be useful outside of the tsibble / time series context as well.
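
The idiff() mentioned in the list above could follow the same matching idea. The sketch below is base R only and is not the vctrs implementation from this issue: ilag_base() is a hypothetical stripped-down lag with no default handling, and both helpers assume that order_by has no missing or duplicated values and supports subtraction.

```r
# Hypothetical sketch: spacing-aware lag using match() on the shifted index
ilag_base <- function(x, order_by, n = 1L) {
  stopifnot(length(x) == length(order_by), !anyNA(order_by))
  x[match(order_by - n, order_by)]
}

# Hypothetical sketch: difference against the spacing-aware lag
idiff <- function(x, order_by, n = 1L) {
  x - ilag_base(x, order_by, n)
}

x <- c(5, 6, 7, 8)
i <- as.Date("2019-01-01") + c(0, 1, 3, 4)

d <- idiff(x, i)
d
```

With the gap at 2019-01-03, the difference is missing for 2019-01-04 instead of silently comparing observations two days apart.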

all_same() - check all elements in vector are the same

Ran into this a few times in the last week and thought it might belong here:

all_same(1:10)         # FALSE
all_same(rep(1,10))    # TRUE

There is some old discussion on the mailing list regarding this. I don't know how efficient the suggested all(x == x[1]) is, or whether it is suitable across all modes, but it has been very handy in dplyr chains when checking whether all the elements of a variable in a grouped data frame are the same:

all_same <- function(x) all(x == x[1])

mtcars %>% 
  group_by(carb) %>%
  summarise(all_same(am))

# # A tibble: 6 x 2
#    carb `all_same(am)`
#   <dbl>          <lgl>
# 1     1          FALSE
# 2     2          FALSE
# 3     3           TRUE
# 4     4          FALSE
# 5     6           TRUE
# 6     8           TRUE
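
On the question of suitability, all(x == x[1]) returns NA as soon as the vector mixes missing and non-missing values. One possible alternative, shown here only as a sketch, counts distinct values instead. It treats NA as a value in its own right and calls an empty vector "all the same", both of which are assumptions about the desired behaviour.

```r
# Alternative sketch: at most one distinct value means "all the same"
all_same <- function(x) length(unique(x)) <= 1L

all_same(1:10)         # FALSE
all_same(rep(1, 10))   # TRUE
all_same(c(1, NA))     # FALSE, whereas all(c(1, NA) == 1) is NA
all_same(c(NA, NA))    # TRUE
all_same(character())  # TRUE
```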

colwise functions

To match #5

Would only be usable in mutate-like contexts, but would take a variable selection as the first argument.

df %>% summarise(col_mean(everything()))

Return a tibble, to work with tidyverse/dplyr#2326

README, lifecycle, and CRAN availability expectations

Hello,

It would be very helpful for this package to have a README describing its purpose, and how it relates to other R packages, and when one could expect the functionality to be available. The DESCRIPTION file only says "Useful vectorised function" which is not that clear.

For example, it seems like the work done in this repository is some manner of future dependency for dplyr. I was directed to this repo when asking about dplyr::between support for character vectors (tidyverse/dplyr#5122). It looks like this has been implemented already and the associated issue is closed (#26), but it's not clear when this will make it back to dplyr.

Will this package make it to CRAN, and if so, roughly when?

Additionally, it would be helpful if there was a lifecycle designation like what the other tidyverse packages use.

Thank you!

group_map(x, g, f)

Something like this:

group_map <- function(g, x, f, ..., .ptype = NULL) {
  out <- vec_init(list(), length(g))
  for (i in seq_along(g)) {
    out[[i]] <- f(x[g[[i]]], ...)
  }
  vec_c(!!!out, .ptype = .ptype)
}

Most important would be to have a C++ version that would avoid allocation of intermediate vectors, assigning scalars directly into out. Probably could get away with just providing int and double versions for now.
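
As a point of reference for the double-only case, the same loop can be written in base R with vapply(). This is a sketch under assumptions: group_map_dbl() is a hypothetical name, `g` is taken to be a list of integer position vectors (one per group), and `f` must return a single double.

```r
# Hypothetical sketch: apply `f` to each group of `x` and return a double vector
group_map_dbl <- function(x, g, f, ...) {
  vapply(g, function(idx) f(x[idx], ...), numeric(1), USE.NAMES = FALSE)
}

# Build the group positions from a grouping variable
g <- unname(split(seq_along(mtcars$mpg), mtcars$cyl))

res <- group_map_dbl(mtcars$mpg, g, mean)
res
```

The result holds the mean mpg for the 4, 6 and 8 cylinder groups, in that order.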

sample() and diag()

Both sample() and diag() are so "flexible" they are hard to program with.

Re: sample(): this might be connected to rthis() (#13). Maybe the smooth bootstrap described there is rsmooth() and the simple resampling discussed here is rthis()?
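
The kind of flexibility in question can be seen in a few lines of base R. The helper sample_elements() is a hypothetical name used only to show a stricter alternative that always samples by position.

```r
# sample() treats a single number n >= 1 as "sample from 1:n"
x <- 10
n_drawn <- length(sample(x))   # 10 values drawn from 1:10, not the value 10

# Hypothetical helper: always sample the elements of `x` by position
sample_elements <- function(x, size = length(x), replace = FALSE) {
  x[sample.int(length(x), size = size, replace = replace)]
}
one <- sample_elements(x)      # always the single value 10

# diag() changes meaning with the type and shape of its input
dim(diag(3))                   # 3 3: a 3 x 3 identity matrix
dim(diag(c(2, 3)))             # 2 2: a matrix with 2 and 3 on the diagonal
diag(matrix(1:4, nrow = 2))    # 1 4: extracts the diagonal of a matrix
```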

Atomic constructors

I'm not sure if {funs} is the right place for it, but it seems like the vec() and dbl() constructors at the very least could live here. Not sure about flatten_vec() and as_double().

I think we are all mainly on the same page about what dbl() should do, but I wanted to outline implementations for it, and how it would connect to map(). Essentially:

map_dbl() == as_double(map())

flat_map_dbl() == dbl(map())

I had implemented a rough draft of a new flatten() here, but I've since realized it is essentially rlang::flatten() in 99% of the cases, so I've used that below instead.

The semantics of dbl() here seem to be exactly the same as with rlang::dbl(), but it goes through vctrs.

There are 2 issues that need to be fixed first. I've added them at the end. One with {rlang} and one with {vctrs}.

library(purrr)
library(vctrs)
library(rlang, warn.conflicts = FALSE)

as_vector <- function(x, ptype) {
  vec_cast(x, ptype)
}

as_double <- function(x) {
  as_vector(x, double())
}

flatten_vec <- function(x, ptype = NULL) {
  x <- flatten(x)
  vec_c(!!! x, .ptype = ptype)
}

vec <- function(..., .ptype = NULL) {
  x <- list2(...)
  flatten_vec(x, .ptype)
}

dbl <- function(...) {
  vec(..., .ptype = double())
}


as_double(c(1L, 2L))
#> [1] 1 2

as_double(list(1, 2, 3))
#> [1] 1 2 3

as_double(list(1:2, 3))
#> Error: Lossy cast from <list> to <double>.
#> * Locations: 1


dbl(1:2, 3)
#> [1] 1 2 3

dbl(list(1, 2, 3))
#> [1] 1 2 3

dbl(list(1:2, 3))
#> [1] 1 2 3


# map_dbl() is map() + as_double()
as_double(map(1:5, ~.x))
#> [1] 1 2 3 4 5

# it is strict, elements must be size 1
as_double(map(1:5, ~c(.x, .x)))
#> Error: Lossy cast from <list> to <double>.
#> * Locations: 1, 2, 3, 4, 5


# flat_map_dbl() is map() + dbl()
# it is less strict on the element size constraint
dbl(map(1:5, ~c(.x, .x)))
#>  [1] 1 1 2 2 3 3 4 4 5 5


# This will be disallowed by:
# https://github.com/r-lib/rlang/issues/885
flatten_vec(data.frame(x = 1), integer())
#> x 
#> 1

# This will be disallowed by:
# https://github.com/r-lib/vctrs/issues/738
# We only want 1 layer of list auto-splicing
dbl(1, list(list(1)))
#> [1] 1 1

Created on 2020-01-09 by the reprex package (v0.3.0.9000)

Common prefix?

Do we expect functions to have a common prefix? My sense is no: this is sort of a dplyr equivalent for functions.

Implement lead and lag

Start from @DavisVaughan

lag <- function (x, n = 1L) {
  vec_assert(x)
  n <- check_n(n)
  
  if (n == 0L) {
    return(x)
  }
  
  size <- vec_size(x)
  n <- pmin(n, size)
  
  new <- vec_init(x, n)
  old <- vec_slice(x, seq_len(size - n))
  
  vec_c(new, old)
}

#' @export
#' @rdname lag
lead <- function (x, n = 1L) {
  vec_assert(x)
  n <- check_n(n)
  
  if (n == 0L) {
    return(x)
  }
  
  size <- vec_size(x)
  n <- pmin(n, size)
  
  new <- vec_init(x, n)
  old <- vec_slice(x, -seq_len(n))
  
  vec_c(old, new)
}

check_n <- function(n) {
  n <- vec_cast(n, integer())
  vec_assert(n, size = 1L)
  
  if (n < 0L) {
    abort("`n` must be positive.")
  }
  n
}

Overall categorisation

Transformation

Flexible

if_else
recode/plyr::revalue
case_when
plyr::mapvalues

Combine

vec_modify (list_modify)
vec_merge

Continuous -> discrete

cut_interval
cut_number
cut_width

Numeric

near
between

trim
prop

Families

roll_
cum_
par_
row_ - is this still needed? or would just be row-vectorised?
i.e. what does min(data.frame) return?
or can it be parallel + splat?

Position

lead
lag

sample
rep_along

interleave

Ranking

row_number
ntile
min_rank
dense_rank
percent_rank
cume_dist

Equality

obj_equal
vec_equal
obj_identical
vec_identical

Missing

fill
replace_na
na_along

String

extract
separate

Summary

Position

first
last
nth

mode

Equality

vec_same == all(vec_equal(x, x[[1]]))

vec_expand()

Potentially useful vec_expand() for inserting NA (or other) values. It is like vec_slice(x, c(1:2, NA_integer_, 3:4)) but the way you specify it is a bit easier

library(tibble)
library(vctrs)
library(rlang)

vec_expand <- function(x, i, fill = NULL) {
  vec_assert(x)
  
  i <- vec_cast(i, integer())
  i <- vec_unique(i)
  
  if (any(i <= 0L)) {
    abort("`i` must be positive.")
  }
  
  size_i <- vec_size(i)
  size_x <- vec_size(x)
  size_out <- size_x + size_i
  
  slicer <- vec_init(integer(), size_out)
  pos_x_in_out <- seq_len(size_out)[-i]
  vec_slice(slicer, pos_x_in_out) <- seq_len(size_x)
  
  out <- vec_slice(x, slicer)
  
  if (is.null(fill)) {
    return(out)
  }
  
  vec_slice(out, i) <- fill
  
  out
}

vec_expand(1:5, c(2, 5))
#> [1]  1 NA  2  3 NA  4  5

df <- tibble(
  x = 1:5,
  y = 6:10
)

vec_expand(df, 2)
#> # A tibble: 6 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    NA    NA
#> 3     2     7
#> 4     3     8
#> 5     4     9
#> 6     5    10

vec_expand(df, -2)
#> `i` must be positive.

vec_expand(df, c(2, 2, 7, 4))
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    NA    NA
#> 3     2     7
#> 4    NA    NA
#> 5     3     8
#> 6     4     9
#> 7    NA    NA
#> 8     5    10

vec_expand(df, c(2, 7, 4), fill = data.frame(x = -1, y = -2))
#> # A tibble: 8 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    -1    -2
#> 3     2     7
#> 4    -1    -2
#> 5     3     8
#> 6     4     9
#> 7    -1    -2
#> 8     5    10

vec_expand(df, c(2, 4), fill = data.frame(x = c(-1, -1), y = c(-2, -3)))
#> # A tibble: 7 x 2
#>       x     y
#>   <int> <int>
#> 1     1     6
#> 2    -1    -2
#> 3     2     7
#> 4    -1    -3
#> 5     3     8
#> 6     4     9
#> 7     5    10

Created on 2019-10-04 by the reprex package (v0.2.1)

dplyr:::replace_with function

Moving this from tidyverse/dplyr#2040

I think dplyr:::replace_with can be useful and should be exported...
if_else uses it, coalesce uses it and probably other internal functions as well. It's a useful way to recode data even though similar things can be achieved with if_else.

It should work with typed NAs though but #2038 covers that, I think.

Implement first, last, nth

e.g.

first <- function(x, default = NA) {
  if (vec_size(x) == 0) {
    vec_assert(default, size = 1)
    vec_cast(default, x)
  } else {
    vec_slice(x, 1L)
  }
}
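
The other two functions could follow the same pattern. The sketch below uses base R instead of vctrs to stay self-contained, so it skips the casting of `default` done above; the convention that a negative `n` counts from the end is an assumption.

```r
# Hypothetical sketch: n-th element, counting from the end when `n` is negative
nth <- function(x, n, default = NA) {
  size <- length(x)
  if (n < 0L) {
    n <- size + n + 1L
  }
  if (n < 1L || n > size) {
    return(default)
  }
  x[[n]]
}

# last() is then the first element counted from the end
last <- function(x, default = NA) {
  nth(x, -1L, default)
}

nth(1:5, 2L)     # 2
nth(1:5, -1L)    # 5
nth(1:5, 10L)    # NA
last(integer())  # NA
```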

Implement n_distinct()

As a simple wrapper around vec_unique_count() that can easily take multiple vectors.

Note that tibble() is slow so need to think about this, and probably use a lower-level constructor from vctrs. Will need to use same approach in #40
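
To pin down the intended semantics for multiple vectors, here is a base R sketch that counts distinct combinations across the inputs. It goes through data.frame(), which is exactly the kind of overhead the note above wants to avoid, so treat it as a behavioural reference only and not as the proposed implementation.

```r
# Reference sketch: number of distinct rows formed by the input vectors
n_distinct <- function(...) {
  sum(!duplicated(data.frame(...)))
}

n_distinct(c(1, 1, 2))                   # 2
n_distinct(c(1, 1, 2), c("a", "a", "b")) # 2
n_distinct(c(1, 1, 2), c("a", "b", "b")) # 3
```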

Interleaving vectors

Just discovered this package. A package for handling vectors sounds very useful. Here’s a feature request (with code) for a vector operation I occasionally find use for.

A function for interleaving vectors would be nice. Here’s some code for such a function:

library(purrr)
interleave = function(...) {
  vecs = list(...)
  n_vecs = length(vecs)                    # Number of arguments/vectors
  max_n = vecs %>% map_int(length) %>% max # Max number of elements in a vector
  n_out = n_vecs * max_n                   # Number of elements in output vector
  x = vector(mode = mode(vecs[[1]]), n_vecs * max_n)
  if (n_out > 0) {
    for (i in seq_along(vecs))
    {
      x[seq(1, n_out, by = n_vecs) + i - 1] =
        rep(vecs[[i]], length.out = max_n)
    }
  }
  x
}

A few examples:

> # Some test data
> x = 1:4
> y = 10 * x
> z = 100 * (1:5)
> 
> # Interleaving two vectors
> interleave(x, y)
[1]  1 10  2 20  3 30  4 40
> 
> # Interleaving vectors with different
> # number of elements causes short
> # vectors to be recycled
> interleave(x, y, z)
 [1]   1  10 100   2  20 200   3  30 300   4  40 400   1  10 500
> 
> # Interleaving vectors of different classes/modes
> # causes class coercion
> interleave(x, LETTERS[1:4])
[1] "1" "A" "2" "B" "3" "C" "4" "D"
> 
> 
> ## A few edge cases
> 
> # Interleaving a single vector
> interleave(x)
[1] 1 2 3 4
> 
> # Interleaving empty vectors
> interleave(numeric(), y, 99)
 [1] NA 10 99 NA 20 99 NA 30 99 NA 40 99
> 
> # Interleaving a single empty vector
> interleave(logical())
logical(0)

vectorise()

vectorise <- function(fn, .ptype = NULL) {
  function(.x, ...) {
    map_vec(.x, fn, ..., .ptype = .ptype)
  }
}

coalesce

From #17, by @DavisVaughan

library(rlang)
library(vctrs)

vec_coalesce <- function(..., .ptype = NULL) {
  args <- list2(...)
  
  n_args <- vec_size(args) 
  
  if (n_args == 0L) {
    return(NULL)
  }
  
  if (n_args == 1L) {
    out <- args[[1L]]
    return(out)
  }
  
  args <- vec_cast_common(!!! args, .to = .ptype)
  args <- vec_recycle_common(!!! args)
  
  out <- args[[1L]]
  args <- args[-1L]
  
  for (arg in args) {
    is_na <- vec_equal_na(out)
    
    if (!any(is_na)) {
      break
    }
    
    vec_slice(out, is_na) <- vec_slice(arg, is_na)
  }
  
  out
}

vec_coalesce()
#> NULL

vec_coalesce(1, 0)
#> [1] 1

vec_coalesce(1, FALSE, .ptype = logical())
#> [1] TRUE

vec_coalesce(NA, 1)
#> [1] 1

vec_coalesce(c(1, NA, 2), 0L)
#> [1] 1 0 2

vec_coalesce(
  data.frame(x = c(1, NA, 3)), 
  data.frame(x = 2)
)
#>   x
#> 1 1
#> 2 2
#> 3 3

# a bit odd, but this technically makes sense
vec_coalesce(
  data.frame(x = c(1, NA)), 
  data.frame(x = 2, y = 3)
)
#>   x  y
#> 1 1 NA
#> 2 2  3

vec_coalesce(
  factor(c("x", "y", NA, "x", NA)),
  factor("MISSING!")
)
#> [1] x        y        MISSING! x        MISSING!
#> Levels: x y MISSING!

# Common size is used as the reference
vec_coalesce(
  1,
  c(1, 2, 3)
)
#> [1] 1 1 1

vec_coalesce(
  NA,
  c(1, 2, 3)
)
#> [1] 1 2 3

need weighted_mean() to avoid common problem

I teach classes where students learn and use the tidyverse, and I've been noticing that a large proportion are getting wrong results from weighted.mean() without knowing it.

The students learn to use dplyr::count() with the wt argument. When they use weighted.mean(), they also use the wt argument. The weights argument for weighted.mean() is w, not wt, and since it uses ..., those who use wt don't receive an error message. Instead, weighted.mean() returns the unweighted mean, which is almost certainly not what the user intended.

May I suggest a new tidyverse function weighted_mean() that uses wt for its weights argument to be consistent with count() and that gives an error message when used incorrectly.
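
The problem and one possible shape of the fix can be shown in base R. The weighted_mean() below is only a sketch of the suggestion: its signature is an assumption, and the key design choice is that it has no `...` argument, so a misspelled weights argument is an error instead of being silently ignored.

```r
x <- c(1, 2, 3)
wt <- c(3, 1, 1)

# `wt` is swallowed by `...`, so this is the plain mean
silent <- weighted.mean(x, wt = wt)

# The correctly named argument gives the weighted mean
correct <- weighted.mean(x, w = wt)

# Hypothetical sketch of the proposed function
weighted_mean <- function(x, wt, na.rm = FALSE) {
  stopifnot(is.numeric(x), is.numeric(wt), length(x) == length(wt))
  if (na.rm) {
    keep <- !is.na(x)
    x <- x[keep]
    wt <- wt[keep]
  }
  sum(x * wt) / sum(wt)
}

proposed <- weighted_mean(x, wt = wt)

# A wrong argument name now fails loudly
bad <- try(weighted_mean(x, weights = wt), silent = TRUE)
```

In this example the silent call returns 2 while the weighted results are 1.6, and the call with the wrong argument name produces an error.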
