
forcats's Introduction

forcats


Overview

R uses factors to handle categorical variables, variables that have a fixed and known set of possible values. Factors are also helpful for reordering character vectors to improve display. The goal of the forcats package is to provide a suite of tools that solve common problems with factors, including changing the order of levels or the values. Some examples include:

  • fct_reorder(): Reordering a factor by another variable.
  • fct_infreq(): Reordering a factor by the frequency of values.
  • fct_relevel(): Changing the order of a factor by hand.
  • fct_lump(): Collapsing the least/most frequent values of a factor into “other”.

You can learn more about each of these in vignette("forcats"). If you’re new to factors, the best place to start is the chapter on factors in R for Data Science.

Installation

# The easiest way to get forcats is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just forcats:
install.packages("forcats")

# Or the development version from GitHub:
# install.packages("pak")
pak::pak("tidyverse/forcats")

Cheatsheet

Getting started

forcats is part of the core tidyverse, so you can load it with library(tidyverse) or library(forcats).

library(forcats)
library(dplyr)
library(ggplot2)
starwars %>% 
  filter(!is.na(species)) %>%
  count(species, sort = TRUE)
#> # A tibble: 37 × 2
#>    species      n
#>    <chr>    <int>
#>  1 Human       35
#>  2 Droid        6
#>  3 Gungan       3
#>  4 Kaminoan     2
#>  5 Mirialan     2
#>  6 Twi'lek      2
#>  7 Wookiee      2
#>  8 Zabrak       2
#>  9 Aleena       1
#> 10 Besalisk     1
#> # ℹ 27 more rows
starwars %>%
  filter(!is.na(species)) %>%
  mutate(species = fct_lump(species, n = 3)) %>%
  count(species)
#> # A tibble: 4 × 2
#>   species     n
#>   <fct>   <int>
#> 1 Droid       6
#> 2 Gungan      3
#> 3 Human      35
#> 4 Other      39
ggplot(starwars, aes(x = eye_color)) + 
  geom_bar() + 
  coord_flip()

starwars %>%
  mutate(eye_color = fct_infreq(eye_color)) %>%
  ggplot(aes(x = eye_color)) + 
  geom_bar() + 
  coord_flip()

More resources

For a history of factors, I recommend stringsAsFactors: An unauthorized biography by Roger Peng and stringsAsFactors = <sigh> by Thomas Lumley. If you want to learn more about other approaches to working with factors and categorical data, I recommend Wrangling categorical data in R, by Amelia McNamara and Nicholas Horton.

Getting help

If you encounter a clear bug, please file a minimal reproducible example on GitHub. For questions and other discussion, please use community.rstudio.com.

forcats's People

Contributors

808sandbr, batpigandme, eipi10, gtm19, hadley, jennybc, johngoldin, jonocarroll, jtr13, kbodwin, krlmlr, lionel-, lwjohnst86, malcolmbarrett, mgacc0, mitchelloharawild, monicagerber, pursuitofdatascience, riinuots, robinsones, rtaph, ryo-n7, salim-b, sinarueeger, tklebel, tslumley, vincentguyader, yimingli, yutannihilation, zhiiiyang


forcats's Issues

Package not available as binary

The instructions suggest that the package can be loaded as a binary, however only source files were available in several French and US repositories I explored. Was I mistaken in the first place or is there a problem with the binary? The compatibility with older R versions should be OK according to the CRAN (R >= 2.10).

Feature Suggestion - Function to Remove Specific Unused Levels

Sometimes you want to drop one or two specific unused factor levels but you don't want to drop all of the unused factor levels. For example, you might have the following factor from survey data

    vote_intention <- factor(x = c("Democrat", "Republican"), 
                             levels = c("Democrat", "Republican", 
                                        "Independent", "Undecided"))

where you'd be interested in removing the 'Undecided' level but not the 'Independent' level, even though both are unused. In such a case, you would essentially want to use a version of fct_drop() that doesn't drop every unused level but instead allows finer control over which levels are dropped.

A simple example of such a function would be the following:

fct_drop_specific <- function(f, l) {
          factor(x = f, 
                 levels = setdiff(levels(f), l),
                 ordered = is.ordered(f))
}

fct_drop_specific(f = vote_intention, l = "Undecided")

Forcats seems like the best package to have a function that would clearly accomplish this. Unless this is already implemented elsewhere, perhaps this could be implemented using arguments to fct_drop() or as a new function in forcats.
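For comparison, recent forcats releases expose an only argument on fct_drop() that restricts which unused levels are removed. The following is a minimal sketch of the request above, assuming the installed forcats version provides that argument:

```r
library(forcats)

vote_intention <- factor(
  c("Democrat", "Republican"),
  levels = c("Democrat", "Republican", "Independent", "Undecided")
)

# Drop only the unused "Undecided" level; "Independent" is also unused
# but is kept because it is not listed in `only`.
trimmed <- fct_drop(vote_intention, only = "Undecided")
levels(trimmed)
```

Levels that are listed in only but still in use are left alone, so the call cannot silently turn values into NA.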

Turn levels into numbers

levels(gss_cat$rincome)
#> [1] "No answer"      "Don't know"     "Refused"        "$25000 or more" "$20000 - 24999" "$15000 - 19999"
#> [7] "$10000 - 14999" "$8000 to 9999"  "$7000 to 7999"  "$6000 to 6999"  "$5000 to 5999"  "$4000 to 4999" 
#> [13] "$3000 to 3999"  "$1000 to 2999"  "Lt $1000"       "Not applicable"

Would be useful to have some moderately easy way to turn into a numeric vector. It wouldn't be magic (i.e. it wouldn't compute mean of values) - it would work like fct_revalue(), but create numeric vector, and unmentioned levels would become NAs.
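One way to picture the request is a named lookup vector indexed by level label. The sketch below uses base R only; fct_to_number() and the midpoint values are hypothetical and are not part of forcats:

```r
# Hypothetical helper (not part of forcats): map selected levels to numbers.
fct_to_number <- function(f, values) {
  stopifnot(is.factor(f), is.numeric(values), !is.null(names(values)))
  # Index the lookup by label; labels missing from `values` become NA
  unname(values[as.character(f)])
}

income <- factor(c("Lt $1000", "$1000 to 2999", "Refused", "$3000 to 3999"))

# Illustrative midpoints chosen by hand; "Refused" is deliberately unmentioned
midpoints <- c("Lt $1000" = 500, "$1000 to 2999" = 2000, "$3000 to 3999" = 3500)

income_num <- fct_to_number(income, midpoints)
income_num
```

Nothing is computed from the labels themselves; the caller supplies every number, which matches the "it wouldn't be magic" constraint.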

Renaming `fct_collapse()`?

I find the name fct_collapse() somewhat misleading. For collapsing levels, one would typically use fct_recode(), not fct_collapse().

Example: You have a factor for ‘Daily number of cigarettes smoked’, with these levels:

None
1–5
6–10
11–20
> 20

To collapse them into non-smoker, light smoker and heavy smoker (however you define those terms), or into non-smoker and smoker, you would use fct_recode().

Suggestion: Rename fct_collapse() to fct_lump() or fct_lump_other(). This would also be consistent with its description: ‘lump rarest (or most common) levels into "other"’.
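For reference, in current forcats releases fct_collapse() merges hand-picked levels into named groups, while frequency-based lumping is handled by the fct_lump family. A minimal sketch of the smoking example with fct_collapse(), using ASCII hyphens in the labels and made-up data:

```r
library(forcats)

cigs <- factor(
  c("None", "1-5", "6-10", "11-20", "> 20", "None"),
  levels = c("None", "1-5", "6-10", "11-20", "> 20")
)

# Each argument is new_level = c(old levels to merge)
smoker <- fct_collapse(cigs,
  "Non-smoker"   = "None",
  "Light smoker" = c("1-5", "6-10"),
  "Heavy smoker" = c("11-20", "> 20")
)
fct_count(smoker)
```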

fct_recode and NULL vector

Currently fct_recode() does this if you give a vector of values for NULL:

x <- factor(c("apple", "bear", "banana", "dear"))
fct_recode(x, NULL = c("apple", "dear"), fruit = "banana")

[1] NULL1 bear  fruit NULL2
Levels: NULL1 fruit bear NULL2

To set several levels to NULL, it appears that you have to do this:
fct_recode(x, NULL = "apple", NULL = "dear", fruit = "banana")

Wouldn't it make a lot more sense if this just worked?
fct_recode(x, NULL = c("apple", "dear"), fruit = "banana")

Feature request: argument in fct_recode to reorder based on sequence

I struggled to find a clear title for this feature request, but here is my idea. I really like using fct_recode() to change the levels of factors (especially for plotting purposes), but I noticed that the recoded factor keeps the same level order as the original, even though I specified a different order in the arguments to fct_recode(). It would be nice to have an extra parameter in fct_recode() that lets the order be dictated by how the levels are given in the call. Here's a minimal example using the forcats CRAN release 0.1.1:

library(forcats)
library(dplyr)    # for %>% and mutate()
library(ggplot2)  # provides the diamonds dataset
data(diamonds)

# show order of cut
levels(diamonds$cut)
# [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"  

# recode levels using fct_recode
diamonds2 <- diamonds %>%
  mutate(cut2 = fct_recode(cut, "Very Great" = "Very Good", "Great" = "Good", "Really Ideal" = "Ideal", "Really Premium" = "Premium", "Bad" = "Fair"))

# show order of cut2
levels(diamonds2$cut2)
# [1] "Bad"            "Great"          "Very Great"     "Really Premium" "Really Ideal"  
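Until such an argument exists, the behaviour can be approximated by recoding first and then rebuilding the level order from the names used in the call. This is only a sketch: fct_recode_in_order() is a hypothetical helper, not a forcats function, and it does not handle NULL = "level" removals.

```r
library(forcats)

# Hypothetical helper (not part of forcats): recode, then put the new labels
# first, in the order in which they were written in the call.
fct_recode_in_order <- function(f, ...) {
  new_labels <- unique(names(c(...)))   # argument names are the new labels
  out <- fct_recode(f, ...)
  # Levels that were not recoded keep their relative order at the end
  factor(out, levels = c(new_labels, setdiff(levels(out), new_labels)))
}

cut_f <- factor(c("Fair", "Good", "Ideal", "Good"),
                levels = c("Fair", "Good", "Ideal"))

cut2 <- fct_recode_in_order(cut_f,
  "Really Ideal" = "Ideal", "Great" = "Good", "Bad" = "Fair")
levels(cut2)
```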

Return an ordered factor with fct_inorder and fct_infreq

When using fct_inorder() or fct_infreq() to create a factor ordered by appearance or frequency, the functions create a factor with levels in the correct order, but the factor is not an official ordered factor, so the ordering does not carry over to ggplot plots, ordered logistic regression models, or any other situations where ordered factors matter.

Right now, these two functions have to be wrapped in ordered() to make them actual ordered factors. It would be nice if there were an ordered=TRUE/FALSE argument in fct_inorder() and fct_infreq() to avoid this extra step.

f <- factor(c("b", "b", "a", "c", "c", "c"))
f
#> [1] b b a c c c
#> Levels: a b c

# Unordered
fct_inorder(f)
#> [1] b b a c c c
#> Levels: b a c
fct_infreq(f)
#> [1] b b a c c c
#> Levels: c b a

# Ordered
ordered(fct_inorder(f))
#> [1] b b a c c c
#> Levels: b < a < c
ordered(fct_infreq(f))
#> [1] b b a c c c
#> Levels: c < b < a

# Would be nice
fct_inorder(f, ordered=TRUE)
#> Error in fct_inorder(f, ordered = TRUE): unused argument (ordered = TRUE)
fct_infreq(f, ordered=TRUE)
#> Error in fct_infreq(f, ordered = TRUE): unused argument (ordered = TRUE)

create flags and create from flags

For some situations, the categories are actual flags (0/1 variables, either integer or logical) and one needs to take them and turn them into a single factor variable. Sometimes one needs to go the other way. This is actually a very specific implementation of dcast/melt and can be useful for going between modeling (each level is a dummy variable in a model) and ggplot. Sometimes one also needs to combine some less commonly used levels into an "other" group when doing this transformation.
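Both directions can be sketched in base R. The example assumes that exactly one flag is set per observation; rows with several flags, or none, would need extra handling.

```r
f <- factor(c("cat", "dog", "cat", "bird"))

# Factor -> flags: one logical column per level (columns: bird, cat, dog)
flags <- sapply(levels(f), function(lvl) f == lvl)

# Flags -> factor: take the column that is set in each row.
# Multiplying by 1 converts the logical matrix to numeric for max.col().
col_idx <- max.col(flags * 1, ties.method = "first")
back <- factor(colnames(flags)[col_idx], levels = colnames(flags))
back
```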

Feature request: addNA with a name

It would be very useful to have a function fct_addNA that would add missing data to a named category.

For example

fct_addNA(var, 'Missing')

would reproduce

var <- addNA(var)
levels(var)[length(levels(var))] <- 'Missing'
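For comparison, forcats 1.0.0 and later provide fct_na_value_to_level(), which covers this use case. A minimal sketch, assuming that version or newer is installed:

```r
library(forcats)

var <- factor(c("a", NA, "b", NA))

# Turn NA values into a real level with a custom label
named <- fct_na_value_to_level(var, level = "Missing")
fct_count(named)
```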

Feature request: Lump biggest possible minority into `Other` category

Here’s a use case I sometimes encounter. I have a factor with a large number of levels (e.g. medical diagnoses), but most of them occur very infrequently, and a few occur frequently (e.g. heart disease and cancer). Showing all of them (e.g. on a bar chart) takes up too much space, so I want to lump the infrequent ones into an Other category.

However, the number of high-frequency levels varies, and so does the proportion (sometimes it’s 1%, sometimes 5%), so I can’t use fct_collapse(). Here’s a method that works well for me: lump all the low-count levels into an Other level so that the count of the Other level is less than (or perhaps less than or equal to) the lowest count of the rest of the levels. In other words, lump the biggest possible minority into the Other category. That’s a sensible definition of ‘Other’, and the bar of the Other category in the bar chart will always be shorter than (or equal to) the rest of the bars, which looks good.

Example:

x = factor(rep(LETTERS[1:9], times=c(2, 82, 1, 51, 5, 62, 11, 1,90)))
count = sort(table(x))
count
#> x
#>  C  H  A  E  G  D  F  B  I 
#>  1  1  2  5 11 51 62 82 90

Here we would lump CHAEG into Other, because their total count is 20, which is less than the count of D, but if we add D, the count will increase to 20 + 51 = 71, which is greater than the count of the next level, F (62).

Here’s a simple proof-of-concept code to find the levels that need to be merged into Other:

less = cumsum(head(count, -1)) <= count[-1] # Or <
names(count)[which(less)]                   # Lump these into ‘Other’ category
#> [1] "C" "H" "A" "E" "G"

(It needs to be refined to work with some corner cases, e.g. when the input or output has length 1, or when there are ties in the counts, but it shows the idea.)
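A slightly more defensive sketch of the same idea stops at the first level that no longer fits, so the lumped levels always form one contiguous run, and then hands the result to fct_other(). The variable names are illustrative, and the case where only a single level would be lumped is still not treated specially.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(2, 82, 1, 51, 5, 62, 11, 1, 90)))

# Named counts per level, sorted in ascending order
count <- sort(setNames(tabulate(x), levels(x)))

# TRUE while the running total of the small levels is <= the next level's count
fits <- cumsum(head(count, -1)) <= count[-1]

# Stop at the first failure so the lumped levels are a contiguous prefix
n_lump <- if (all(fits)) length(fits) else which(!fits)[1] - 1
lump <- names(count)[seq_len(n_lump)]

lumped <- fct_other(x, drop = lump, other_level = "Other")
table(lumped)
```

Recent forcats releases also ship fct_lump_lowfreq(), which is documented as lumping the least frequent levels while keeping "Other" the smallest level, so it may already cover this request.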

Feature request: anonymise levels

Often in real-world data a set of levels are identifying (licence number, SSN, etc...) and need to be anonymised prior to data release - could we have a re-leveling that takes in a common prefix (perhaps with some sensible default) or a hash function, shuffles the levels, then recodes them to anonymous levels? Internally, it's potentially a combination of fct_shuffle() and fct_recode() plus a way to generate the new levels.

Care needs to be taken that the new levels don't follow an identifiable connecting pattern to the original levels (e.g. alphabetical -> sequential). sample() should take care of this just fine.
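For reference, current forcats releases include fct_anon(), which replaces levels with arbitrary numeric identifiers in a shuffled order. A minimal sketch with made-up identifiers, assuming the installed version provides it:

```r
library(forcats)

set.seed(42)  # only to make the shuffled assignment reproducible
ids <- factor(c("AB-123", "ZX-987", "AB-123", "QQ-555"))

# Replace each level with a prefixed, randomly assigned number
anon <- fct_anon(ids, prefix = "id_")
anon
```

Equal inputs still map to equal outputs, so counts and joins on the anonymised factor keep working.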

I'd do this as a PR but I get the feeling you're on a roll with the internal self-consistency you've got going.

Typo in main description

In the opening page of forcats' homepage it says:

# Alternatively, install just forcats:
install.packages("readr")

readr should be changed to forcats.

Improve fct_lump() behavior for factors with many rare levels

Example:

library(magrittr)
N <- 5
M <- 2 ** N
exp_factor <- factor(rep(1:(N + M), c(2 ** (N:1), rep(1, M))))
table(exp_factor)
## exp_factor
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
## 32 16  8  4  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 26 27 28 29 30 31 32 33 34 35 36 37 
##  1  1  1  1  1  1  1  1  1  1  1  1
forcats::fct_lump(exp_factor) %>% table
## .
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
## 32 16  8  4  2  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
## 26 27 28 29 30 31 32 33 34 35 36 37 
##  1  1  1  1  1  1  1  1  1  1  1  1
forcats::fct_lump(exp_factor, prop = 0.1) %>% table
## .
##     1     2 Other 
##    32    16    46

The automatic mode works as advertised (and therefore doesn't lump at all). When specifying prop, there's only a loose relation between the prop value and the actual mass of the non-Other and Other levels. Not sure how best to tackle that; perhaps a new cumprop argument?

Feature request: Convert specified levels into NA values

Sometimes missing data isn’t properly coded as NA. Here’s a feature request for a function to convert one or more specified levels into NA values (and remove the levels). Proposed code:

fct_make_na = function(f, na_vals="NA") {
  f = forcats:::check_factor(f) # Also handle character vectors
  f[f %in% na_vals] = NA
  levs = levels(f)
  factor(f, levels = levs[!(levs %in% na_vals)])
}

Example:

> x = c("foo","foo","(missing)", "bar", "unknown", "-", "baz")
> x = factor(x, levels=c(unique(x), "extra"))
> x
[1] foo       foo       (missing) bar       unknown   -         baz      
Levels: foo (missing) bar unknown - baz extra

> fct_make_na(x, na_vals=c("-", "(missing)", "unknown"))
[1] foo  foo  <NA> bar  <NA> <NA> baz 
Levels: foo bar baz extra

Note that the ‘missing’ levels are removed, but the extra level is retained, even if no element has the level extra.
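For comparison, forcats 1.0.0 and later provide fct_na_level_to_value(), whose extra_levels argument names additional levels to treat as missing. A minimal sketch, assuming that version or newer is installed:

```r
library(forcats)

x <- factor(c("foo", "foo", "(missing)", "bar", "unknown", "-", "baz"))

# Convert the listed levels into proper NA values
cleaned <- fct_na_level_to_value(x, extra_levels = c("-", "(missing)", "unknown"))
cleaned
```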

Add new factor type to allow adding levels on assignment

You asked on the blog announcement about things that forcats doesn't fix, so here's one. It would be useful to offer a factor type different from the standard one and which would accept new levels without a warning. As far as I know, this isn't handled by forcats currently. Silly example which is terribly annoying in practice:

> df = data.frame(x=factor(c("1", "2")), y=factor(c("A", "B")))
> mutate(df, new=if_else(x == "1", x, y))
  x y  new
1 1 A    1
2 2 B <NA>
Warning message:
In `[<-.factor`(`*tmp*`, i, value = 2L) :
  invalid factor level, NA generated

(A special factor type would also allow protecting against the common mistake as.integer(x) vs. as.integer(as.character(x)), but that's a different issue.)

fct_c() should not require a list as an argument

Related to this issue.

If you have the following data,

ds <- factor(c(rep("Male", times=2), rep("Female", times=4), rep("Gender non-conforming", times=2)))
test <- ds[1:4]
train <- ds[5:8]

The base c() does not work as expected:

c(test, train)
[1] 3 3 1 1 1 1 2 2

but you can list() and unlist()

unlist(list(test, train))
[1] Male                  Male                  Female                Female                Female                Female               
[7] Gender non-conforming Gender non-conforming
Levels: Female Gender non-conforming Male

That's annoying! I like fct_c() better because then I don't have to type unlist(), but I still have to give a list.

fct_c(list(test, train))
[1] Male                  Male                  Female                Female                Female                Female               
[7] Gender non-conforming Gender non-conforming
Levels: Female Gender non-conforming Male

I wish I could just fct_c(test, train).
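Current forcats releases accept the factors directly through dots, so the wished-for call works as written. A minimal sketch, assuming an installed version where fct_c() takes individual factors:

```r
library(forcats)

ds <- factor(c(rep("Male", times = 2), rep("Female", times = 4),
               rep("Gender non-conforming", times = 2)))
test <- ds[1:4]
train <- ds[5:8]

# Combine factors directly; the levels are the union of the input levels
combined <- fct_c(test, train)
combined
```

Base R has since changed as well: from R 4.1.0 onwards, c() on factors returns a combined factor rather than integer codes.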

Documentation for fct_reorder

Seems something went awry in the doc for fct_reorder and the x, y arguments aren't documented very clearly

#' @param x,y fun The levels of \code{f} will be reordered so that \code{fun}
#'    apply to each group is in ascending order.
#' @param fun An summary function. It should take one vector for
#'   \code{fct_reorder}, and two vectors for \code{fct_reorder2}.

(I think I get that they're the vectors used in fun). A quick grammar sweep wouldn't hurt. Looks like it's untouched since first written.

modify fct_collapse

Modify fct_collapse() so it's possible to keep n/prop least common factors? Or even a range like prop=c(0.1,0.9)? Would be helpful if/when rare instances are of interest and other factors could be lumped (eg, spp abundance data).

Set factor levels

There are two forms:

  • Reordering existing levels, e.g. factor(x, levels = f)
  • Modifying existing levels, e.g. levels(x) <- f

So they need evocative names.

Any ideas @jennybc?

Automatic variable name in return value of fct_count()

fct_count() is a welcome implementation of dplyr::count() for a factor that does not live in a data frame. I wonder if it could do a bit better at naming the variable that holds the factor levels?

library(forcats)
library(dplyr)

## the very best
count(iris, Species)
#> # A tibble: 3 × 2
#>      Species     n
#>       <fctr> <int>
#> 1     setosa    50
#> 2 versicolor    50
#> 3  virginica    50

## base can do it! with a lot of help
as.data.frame(table(iris$Species, dnn = "Species"), responseName = "n")
#>      Species  n
#> 1     setosa 50
#> 2 versicolor 50
#> 3  virginica 50

## can fct_count do better?
fct_count(iris$Species)
#> # A tibble: 3 × 2
#>            f     n
#>       <fctr> <int>
#> 1     setosa    50
#> 2 versicolor    50
#> 3  virginica    50

fct_reorder fails with unused factor levels

If a factor to reorder contains unused levels, fct_reorder() fails to reorder correctly and returns NA values.

library(dplyr)
library(forcats)

dat <- data.frame(f = factor(c("a", "b", "c", "d"), 
                            levels = c("c", "b", "a", "d")), 
                 x = c(1, 2, 3, 4)) %>% 
  filter(f != "b") %>%
  mutate(f = fct_reorder(f, x)) 

dat$f

## [1] a    c    <NA>
## Levels: b c a

The reason is a factor(f) call inside tapply(), which drops unused levels, while the factor() call that builds the return value uses the levels of the original f.

Feature suggestion: fct_others that moves all non-given levels to other

I often have to recode factors with a lot of uninteresting levels that I want to replace with a single "Others" level. Because I'm lazy, I would like to give only the levels I want to keep...

I've made a prototype which does the job:

fct_others <- function(f, ..., other_level = "Others") {
  f <- forcats:::check_factor(f)
  levels <- levels(f)
  new_levels <- if_else(levels %in% list(...), levels, other_level)
  f <- lvls_revalue(f, new_levels)

  # Place other at end
  levels <- levels(f)
  other_back <- c(setdiff(levels, other_level), intersect(levels, other_level))
  lvls_reorder(f, match(other_back, levels))
}
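For comparison, current forcats releases ship fct_other() with a keep argument, which does the same job as the prototype. A minimal sketch with made-up data, assuming the installed version provides it:

```r
library(forcats)

species <- factor(c("Human", "Droid", "Gungan", "Human", "Wookiee", "Droid"))

# Keep the listed levels and replace everything else with one level,
# which is placed at the end of the level order
kept <- fct_other(species, keep = c("Human", "Droid"), other_level = "Others")
fct_count(kept)
```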

Feature Request: Sort text with combined character numeric values

I work with factors that often contain text that mixes numeric and text data. Often the data is related to dosing medicines, so it may look like:

"Drug A 1 mg"
"Drug B 1 mg"
"Drug A 10 mg"
"Drug B 10 mg"
"Drug A 3 mg"
"Drug B 3 mg"

I have a set of functions I use to sort this by separating the text and numeric parts and sort the character strings by parts. As an example, the above would be split into the below parts (note the numbers are not quoted)

"Drug A ", 1, " mg"
"Drug B ", 1, " mg"
"Drug A ", 10, " mg"
"Drug B ", 10, " mg"
"Drug A ", 3, " mg"
"Drug B ", 3, " mg"

then sorted by each part separately, either as a number or as text, and I'd end up with the following for my ordered factor levels:

"Drug A 1 mg"
"Drug A 3 mg"
"Drug A 10 mg"
"Drug B 1 mg"
"Drug B 3 mg"
"Drug B 10 mg"

The code I use handles cases with an uneven number of groupings with un-aligned types of data (numeric vs. character) and with potentially missing data. Would you be interested in these functions as an extension of the fct_reorder function?
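The simplest version of this split-and-sort idea fits in a few lines of base R. The sketch assumes every label has the shape "text number unit" with a single number; it does not cover the uneven or missing parts mentioned above.

```r
doses <- c("Drug A 1 mg", "Drug B 1 mg", "Drug A 10 mg",
           "Drug B 10 mg", "Drug A 3 mg", "Drug B 3 mg")

# Text before the first number, e.g. "Drug A"
drug <- sub("\\s*[0-9.]+.*$", "", doses)

# The first number in the label, parsed as numeric, e.g. 10
amount <- as.numeric(sub("^[^0-9]*([0-9.]+).*$", "\\1", doses))

# Sort by text first, then numerically by dose
lvls <- unique(doses[order(drug, amount)])
dose_f <- factor(doses, levels = lvls)
levels(dose_f)
```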

fct_lump prop to consider NA

Currently, prop_n is calculated as the count divided by sum of counts, which excludes NA.

If prop_n is calculated as count divided by length(f), the proportion would take into account NA values and reflect the true proportion of the level, not the proportion of the level among non-NA values.

Thanks for the great work.

fct_count() accept integer-ish input

Could fct_count() get a little more flexible?

set.seed(4561)
x <- sample(4, 20, replace = TRUE)
table(x)
#> x
#> 1 2 3 4 
#> 4 7 6 3
forcats::fct_count(x)
#> Error: `f` must be a factor (or character vector).
forcats::fct_count(as.character(x))
#> # A tibble: 4 × 2
#>        f     n
#>   <fctr> <int>
#> 1      1     4
#> 2      2     7
#> 3      3     6
#> 4      4     3

fct_reorder chks example

Another super helpful package, Hadley. The chks example doesn't seem to have any effect (at least after the addition of labs(col = 'Chick')).

as_factor: Namespace conflict with haven

Since forcats 0.2.0, there is a conflict between the haven and forcats packages over the as_factor function; this is not desirable within the tidyverse packages. forcats::as_factor doesn't support the labelled and data.frame classes.

Enhance semantics of prop in fct_lump()

I think having a min number of occurrences rather than the minimal proportion is more useful.

So I would propose generalizing the meaning of the prop argument: if -1 < prop < 1, it means the minimal proportion; if prop >= 1, preserve the values that appear at least prop times?
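For comparison, current forcats releases offer count-based lumping as a separate function, fct_lump_min(), rather than by overloading prop. A minimal sketch with made-up counts, assuming the installed version provides it:

```r
library(forcats)

x <- factor(rep(LETTERS[1:5], times = c(40, 10, 5, 1, 1)))

# Keep levels that appear at least 5 times; lump the rest into "Other"
lumped <- fct_lump_min(x, min = 5)
table(lumped)
```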

Error in fct_lump when providing large n

When there are ties at the tail, an error occurs.

# Expect to return Levels: A B C D Others
> x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))
> x %>% fct_lump(n=6)
Error: `idx` must contain one integer for each level of `f`

Sometimes, e.g. inside lapply(), the number of levels is unknown. It would be great if the original factor were returned when n is greater than the number of levels.

# Expect to return Levels: A B C D E F G H I
> x %>% fct_lump(n=10)
Error: `idx` must contain one integer for each level of `f`

Make comparable factors

i.e. take a list of factors and return a list of factors all of which have the same levels
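For reference, current forcats releases provide fct_unify() for this. A minimal sketch, assuming the installed version provides it and relying on its default of using the union of all input levels:

```r
library(forcats)

fs <- list(factor(c("a", "b")), factor(c("b", "c")), factor("d"))

# Give every factor in the list the same set of levels
unified <- fct_unify(fs)
lapply(unified, levels)
```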

Feature request: Order factor by (multiple functions of) multiple variables

For plotting and tables, it’s useful to reorder levels of a factor according to other variables, first by one variable, and then by other variables to break any ties. The summary function used for each variable may be different. Example:

d = data.frame(
  name = factor(c("A", "A", "B", "B", "C", "C", "C", "D"),
                levels = LETTERS[4:1]),
  quality = c(5, 3, 4, 4, 7, 7, 7, 7),
  year = c(2000, 2001, 2013, 2014, 2015, 2015, 2015, 2015),
  weight = c(50, 45, 60, 57, 47, 50, 500, 63)
)

I want to order the name factor by 1) average quality, then 2) the first year the product appeared, and then 3) its median weight. If any ties remain, keep the original label order for these ties. In this example, the levels would be ordered ABCD.

To do this reordering, I have to think backwards, reordering by the last tie-breaker first, using either fct_reorder() or reorder():

d$name = fct_reorder(d$name, d$weight, median)
d$name = fct_reorder(d$name, d$year, min)
d$name = fct_reorder(d$name, d$quality, mean)

It would be very convenient to be able to do this in one go, using something like this (I’m dropping the d$ prefix to make the code clearer):

name = fct_reordern(name,
                    vars = list(quality, year, weight),
                    funs = list(mean, min, median))

It would be even better if the functions were shown along with the variable names, e.g. something like this (if possible, or perhaps using some form of formula syntax?):

name = fct_reordern(name,
                    mean(quality), min(year), median(weight))

For descending order, perhaps by using a desc() function, like this?

name = fct_reordern(name,
                    desc(mean(quality)), min(year), median(weight))
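In the meantime, the multi-key ordering can be written in forward order with base R: compute one summary per level, then let order() break ties from left to right. This is a sketch of the idea, not a proposed implementation of fct_reordern():

```r
d <- data.frame(
  name = factor(c("A", "A", "B", "B", "C", "C", "C", "D"),
                levels = LETTERS[4:1]),
  quality = c(5, 3, 4, 4, 7, 7, 7, 7),
  year = c(2000, 2001, 2013, 2014, 2015, 2015, 2015, 2015),
  weight = c(50, 45, 60, 57, 47, 50, 500, 63)
)

# One summary value per level, each in the factor's current level order
by_quality <- tapply(d$quality, d$name, mean)
by_year    <- tapply(d$year,    d$name, min)
by_weight  <- tapply(d$weight,  d$name, median)

# order() sorts by the first key and uses later keys only to break ties
new_order <- order(by_quality, by_year, by_weight)
d$name <- factor(d$name, levels = levels(d$name)[new_order])
levels(d$name)
```

Negating a numeric key, e.g. -by_quality, gives descending order for that key.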

Turn specific levels into NA values

There is not an easy way to coerce factor levels into NAs. For example:

> library(forcats)
> animals <- factor(c("dog", "cat", "missing"))
> fct_recode(animals, NA = "missing")
Error: unexpected '=' in "fct_recode(animals, NA ="

The desired effect can be achieved by making the desired levels explicit:

> factor(animals, levels = c("dog", "cat"))
[1] dog  cat  <NA>
Levels: dog cat

Or by removing the offending level:

> animals %>% factor(levels = setdiff(., "missing"))
[1] dog  cat  <NA>
Levels: dog cat

wrap text on factor levels

Hi Hadley, I'm really enjoying using the package. Thanks for making it!

I routinely find the need to wrap text on my factors before plotting. Sometimes the factor levels are long and extend the legend margins.

Here's an example of what I mean: Original compared to wrapped text.

Is this something you'd be interested in adding? I'd be happy to make a pull request.
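The wrapping itself can be sketched with base R's strwrap(). fct_wrap() below is a hypothetical helper name, not a forcats function:

```r
# Hypothetical helper (not part of forcats): wrap long level labels
fct_wrap <- function(f, width = 20) {
  stopifnot(is.factor(f))
  levels(f) <- vapply(
    levels(f),
    # Break each label into lines and rejoin them with newlines
    function(lvl) paste(strwrap(lvl, width = width), collapse = "\n"),
    character(1),
    USE.NAMES = FALSE
  )
  f
}

f <- factor(c("A rather long category label", "Short"))
wrapped <- fct_wrap(f, width = 15)
cat(levels(wrapped), sep = "\n---\n")
```

Because only the labels change, the level order and the underlying integer codes stay the same, so existing orderings carry over to the plot.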

Feature suggestion - extend fct_relevel( ) with after= argument

Hi Hadley, nice work on this package.

I'd like to suggest an extension to fct_relevel() - an after= argument (a single level specified as a character string or default NA) which specifies a level after which the other levels are to be placed, in the specified order. This would be much more flexible than only moving levels to the beginning of the level set.

I have a movebefore function for re-ordering data frame columns which is surprisingly useful - often the desired ordering can be obtained with minimal typing with this function. (I use 'before' rather than 'after', but 'after' might fit in better with the current behaviour of fct_relevel()...).
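For reference, current forcats releases do have an after argument on fct_relevel(), but it takes a numeric position rather than a level name. A minimal sketch, assuming the installed version provides it:

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"), levels = c("b", "c", "d", "a"))

# Move "a" so that it comes after the second of the remaining levels
moved <- fct_relevel(f, "a", after = 2)
levels(moved)
```

Passing after = Inf moves the named levels to the end, which helps when the number of levels is not known in advance.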

Cheers,
Alec

fct_recode() with vectors of length > 1

Is this behavior intended? It's not documented in the examples.

fct_recode(letters, "vowel" = c("a", "e", "i", "o", "u"))
##  [1] vowel1 b      c      d      vowel2 f      g      h      vowel3 j      k      l     
## [13] m      n      vowel4 p      q      r      s      t      vowel5 v      w      x     
## [25] y      z     
## Levels: vowel1 b c d vowel2 f g h vowel3 j k l m n vowel4 p q r s t vowel5 v w x y z

If not, we should warn/throw, and perhaps direct the user to fct_collapse().

Wrong package to install on forcats website

Just noticed that there is a wrong package to install mentioned on the main page of the forcats website.

It says

# Alternatively, install just forcats:
install.packages("readr")

This should obviously be

# Alternatively, install just forcats:
install.packages("forcats")

Feature request: fct_stringsAsFactors() (Serious)

I came to this package expecting to find a convenient method equivalent to:

library(dplyr)
dat %>%
  mutate_if(is.character, as.factor)

The situation where you want to do this is just after you have finished preprocessing your data and it's time to funnel it into a bunch of ML methods and start scoring some models. Unfortunately models like rf are lazy and don't convert strings to factors automagically.

I would have made a pull but you don't import or suggest dplyr here.

fct_reorder fails if fun equals col-name

Toy example/reprex:

library(forcats)
library(tidyverse)
#> Loading tidyverse: ggplot2
#> Loading tidyverse: tibble
#> Loading tidyverse: tidyr
#> Loading tidyverse: readr
#> Loading tidyverse: purrr
#> Loading tidyverse: dplyr
#> Conflicts with tidy packages ----------------------------------------------
#> filter(): dplyr, stats
#> lag():    dplyr, stats

test_df <- data_frame(
  group = rep(c("A", "B"), each = 2),
  var = 1:4
)

# works
test_df %>% 
  mutate(mean = mean(var),
         group = fct_reorder(group, mean, fun = median))
#> # A tibble: 4 × 3
#>    group   var  mean
#>   <fctr> <int> <dbl>
#> 1      A     1   2.5
#> 2      A     2   2.5
#> 3      B     3   2.5
#> 4      B     4   2.5

# fails
test_df %>% 
  mutate(mean = mean(var),
         group = fct_reorder(group, mean, fun = mean))
#> Error: Objekt 'fun' mit Modus 'function' nicht gefunden
#> translation: object 'fun' with mode 'function' not found

factor lump for ordinal factors

Hi Hadley,

Do you have thoughts on creating an analogue (or generic) of fct_lump() for ordinal factors?

The utility I am looking for is the ability to keep contiguous levels together. A simple solution for this could be to lump ordinal levels directionally (either from the left or right), instead of by frequency.

Here is an example of the current behaviour:

set.seed(6)
f <- ordered(sample(letters[1:5], 10, replace = TRUE))
fct_count(f)
#> # A tibble: 5 × 2
#>       f     n
#>   <ord> <int>
#> 1     a     1
#> 2     b     2
#> 3     c     1
#> 4     d     2
#> 5     e     4
fct_lump(f, 3) # non-contiguous lumping
#>  [1] d     e     b     b     e     e     e     d     Other Other
#> Levels: b < d < e < Other

I imagine that in many cases, lumping categories a and c such that they are placed above b, d, and e will not create the desired result.

Here is what I have in mind for ord_lump():

ord_lump(f, 3, from = "left")
#>  [1] Other Other b b Other Other Other Other c a
#> Levels: a < b < c < Other

If you think this is worth it, I can propose a PR. An alternative would be to work the change into fct_lump by adding type = c("freq", "left", "right") into the formals. Let me know what you think.
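Directional lumping can be sketched in base R, because assigning duplicated labels through levels<- merges those levels and keeps the factor ordered. ord_lump_sketch() is hypothetical, and its semantics are an assumption made for illustration: n is the number of levels to keep, and from names the end of the scale that gets lumped.

```r
# Hypothetical helper (not part of forcats): lump contiguous ordinal levels
ord_lump_sketch <- function(f, n, from = c("left", "right"),
                            other_level = "Other") {
  stopifnot(is.ordered(f), n >= 0, n <= nlevels(f))
  from <- match.arg(from)
  lv <- levels(f)
  k <- length(lv) - n                      # number of levels to lump
  idx <- if (from == "left") seq_len(k) else rev(seq_along(lv))[seq_len(k)]
  lv[idx] <- other_level
  levels(f) <- lv                          # duplicated labels merge into one
  f
}

f <- ordered(c("a", "b", "c", "d", "e", "e"))
left  <- ord_lump_sketch(f, 3, from = "left")
right <- ord_lump_sketch(f, 3, from = "right")
levels(left)
levels(right)
```

The lumped level stays at the end of the scale it came from, so the remaining levels keep their relative order.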

Cheers,

Rafael

Feature request: Name NA values in factors (i.e. turn them into levels)

I frequently have to rename missing data (NA values) in character or factor levels into something meaningful, so that they have nice labels in graphs and tables. For character vectors, the following works fine:

x[is.na(x)] = "(Missing)"

But if x is a factor, I get this scary warning (and the operation fails):

Warning:
In `[<-.factor`(`*tmp*`, is.na(x), value = "(Missing)") :
  invalid factor level, NA generated

So this is a feature request for a ‘turn NA values into a real level, with a custom name, and place it as the last (or optionally first) level’ function.

Proposed code:

fct_name_na = function(f,
                       na_name = "(Missing)",
                       pos = c("last", "first")) {
  f = forcats:::check_factor(f) # Also handle character vectors
  levs = levels(f)

  # Add the new level name if needed
  if (!(na_name %in% levs)) {
    f = lvls_expand(f, c(levs, na_name))
  } else if (any(is.na(f))) {
    warning(paste0("'", na_name, "' already exists as a level"))
  } # Only give a warning if the already existing level will be a *problem*

  # Replace NA values with the new level
  f[is.na(f)] = na_name

  # Optionally move the level to the front
  if (pos[1] == "first") {
    f = fct_relevel(f, na_name)
  }
  f
}

Example of use:

> set.seed(1)
> x = sample(iris$Species, 5)
> x[3:4] = NA
> x
[1] setosa     versicolor <NA>       <NA>       setosa    
Levels: setosa versicolor virginica

> x = fct_name_na(x)
> x
[1] setosa     versicolor (Missing)  (Missing)  setosa    
Levels: setosa versicolor virginica (Missing)

> x = fct_name_na(x) # No warning
> x
[1] setosa     versicolor (Missing)  (Missing)  setosa    
Levels: setosa versicolor virginica (Missing)

> x[5] = NA
> x = fct_name_na(x) # Warning, but still works
Warning message:
In fct_name_na(x) : '(Missing)' already exists as a level
> x
[1] setosa     versicolor (Missing)  (Missing)  (Missing) 
Levels: setosa versicolor virginica (Missing)

> x = fct_name_na(x, pos="first")
> x
[1] setosa     versicolor (Missing)  (Missing)  (Missing) 
Levels: (Missing) setosa versicolor virginica

fct_drop not to add NA as a factor level?

I'm working with factors a lot, so forcats is a very welcome new package to me; nice and easy-to-use functions for describing and visualizing factors. However, when manipulating factors for other types of use, I'm finding it difficult that NA is sometimes added as an additional factor level. E.g., I have a factor with multiple different codes for missingness that I'd like to convert into a numerical variable, combining all missing values into NA. After recoding of missing values, if I use fct_drop for dropping unused levels, it adds NA as a factor level, which is then given a number with as.numeric.

> my_factor <- factor(c(letters[1:4], NA, NA), levels = letters[1:6])
> my_factor
[1] a    b    c    d    <NA> <NA>
Levels: a b c d e f
> as.numeric(my_factor)
[1]  1  2  3  4 NA NA
> fct_drop(my_factor)
[1] a    b    c    d    <NA> <NA>
Levels: a b c d <NA>
> as.numeric(fct_drop(my_factor))
[1] 1 2 3 4 5 5

There's already fct_explicit_na for adding missing values as a factor level, so would you consider removing this behavior from fct_drop, or adding an argument for choosing either option?

If you'd like to keep the current behavior, then it would be better to change the description, as droplevels and fct_drop do not behave the same way when there are missing values (the former uses exclude = NA, the latter exclude = NULL).
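
The effect of the two exclude settings can be reproduced in base R alone. The sketch below reuses my_factor from the example above and calls factor() directly with exclude = NULL to imitate the reported behavior; it does not call fct_drop itself.

```r
my_factor <- factor(c(letters[1:4], NA, NA), levels = letters[1:6])

# droplevels() removes the unused levels e and f and leaves NA as a missing value
dropped <- droplevels(my_factor)
levels(dropped)       # "a" "b" "c" "d"
as.integer(dropped)   # 1 2 3 4 NA NA

# Rebuilding with exclude = NULL turns NA into a level of its own,
# so the missing values receive an integer code
with_na_level <- factor(my_factor, exclude = NULL)
levels(with_na_level)      # "a" "b" "c" "d" NA
as.integer(with_na_level)  # 1 2 3 4 5 5
```

When the factor is later converted to numbers, only the first form keeps missing values as NA.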

Feature request: fct_as_factor()/as_factor()

The base as.factor() returns levels in a platform/locale dependent manner due to sorting the levels. See: Why does as.factor() on unicode strings return different results for every operating system?.

This can get nasty in the context of reproducible RMarkdown reports, dashboards etc. I can provide entertaining anecdotal evidence of how this can and did go wrong, if you can't see what I mean.

As such, I propose a new version of as.factor() be included in forcats that returns a platform independent ordering of factor levels. It'll become the standard factor conversion for people who care about platform independence and reproducibility (everyone?). The choice of ordering I'm not fussy about, although I reckon the most obvious is as per fct_inorder().
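
A minimal base R sketch of the order-of-appearance idea follows. The helper name as_factor_inorder is hypothetical and only illustrates the proposal; the point is that the levels come from unique(), which involves no sorting and therefore no locale-dependent collation.

```r
# Hypothetical helper: levels follow first appearance rather than sort order
as_factor_inorder <- function(x) {
  x <- as.character(x)
  # unique() preserves first-appearance order; factor() drops NA from the levels by default
  factor(x, levels = unique(x))
}

x <- c("b", "a", "c", "a")
levels(as_factor_inorder(x))  # "b" "a" "c"
levels(as.factor(x))          # sorted, so the result depends on the collation in use
```

Since the result depends only on the data, the same input should give the same level order on every platform.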
