
janitor's People

Contributors

anthonytyler27, bfgray3, billdenney, chrishaid, danchaltiel, daniel-barnett, fernandovmacedo, francisbarton, garthtarr, gdutz, henryn218, jasonaizkalns, jjsteph, jonleslie, josiahparry, jsta, juba, jzadra, kgilds, khueyama, krlmlr, lionel-, matanhakim, mattroumaya, nelson-gon, olivroy, rgknight, romainfrancois, sfirke, tazinho


janitor's Issues

create use_first_valid_of function

For cases like: use A if not NA, otherwise use B if not NA, otherwise... This could just be a big case_when call (edit: case_when is probably not right given the variable number of arguments; a for loop may be simplest). Optionally create a new variable indicating which input was used.

Besides being more readable than nested ifelse() calls, it would work for dates, which are vexing if you run into this: http://stackoverflow.com/questions/6668963/how-to-prevent-ifelse-from-turning-date-objects-into-numeric-objects

In fact, have it print a warning if you feed it at least one input vector with class Date, suggesting that the user specify force_class = "date".
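
A minimal sketch of what this could look like, assuming the use_first_valid_of() name and the take-vectors-via-... signature proposed above; the loop approach also sidesteps the ifelse()/Date problem:

# sketch: fill missing values from each subsequent input, left to right;
# subset-assignment preserves the Date class where ifelse() would not
use_first_valid_of <- function(...) {
  inputs <- list(...)
  result <- inputs[[1]]
  for (i in seq_along(inputs)[-1]) {
    still_missing <- is.na(result)
    result[still_missing] <- inputs[[i]][still_missing]
  }
  result
}

use_first_valid_of(c(NA, "b", NA), c("x", "y", NA), c("p", "q", "r"))
# [1] "x" "b" "r"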

factors to character

Sam, one thing I've run into recently is output from other packages (especially the stuff that reads in Access files... ugh) that doesn't give fine control over the read.csv parameters. I've been having to take the data frame as it comes and then flip factors back to character.

Is that something that should live in janitor? If I made a pull request, would you be likely to accept?
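
For reference, the flip itself is a small base-R operation; a sketch of what a janitor wrapper might look like (the function name here is hypothetical):

# hypothetical helper: convert every factor column of a data.frame to character
convert_factors_to_character <- function(dat) {
  factor_cols <- vapply(dat, is.factor, logical(1))
  dat[factor_cols] <- lapply(dat[factor_cols], as.character)
  dat
}

str(convert_factors_to_character(iris))
# Species becomes character; the numeric columns are untouched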

improve inputs of tabyl() - are dots needed?

I take the variable name via the ... parameter so that, when calling tabyl() on a data.frame, the variable name doesn't need to be quoted and autocomplete works in RStudio; e.g., typing iris %>% tabyl(Sep brings up autocomplete suggestions.

But this SO answer states:

It's not a good idea to use the ... when you know each parameter in advance, however, as it adds some ambiguity and further complication to the argument string (and makes the function signature unclear to any other user).

So maybe I'm using it unnecessarily?
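
For comparison, a single named argument can still accept an unquoted column name via substitute(), so dropping the dots wouldn't necessarily lose the convenience. A rough sketch (tabyl2 is just an illustrative name):

library(janitor)

# sketch: take one unquoted column name as a regular argument instead of via ...
tabyl2 <- function(dat, var) {
  var_name <- deparse(substitute(var))
  tabyl(dat[[var_name]])
}

mtcars %>% tabyl2(cyl)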

Add totals column and row with adorn_crosstab()

Again, mimicking Excel PivotTables.

I'm most of the way there, but unsure on:

  • Should I create add_totals_column and add_totals_row functions that will be exported for a user to call on any data.frame?
  • Right now in the dev branch, the totals column is added for row-wise %s. In Excel, the column added with row-wise %s is just the sum of the %s in each row, i.e., 100% in every row (intuitive, but useless). But for row-wise %s Excel also adds a meaningful row at the bottom showing the % that each column represents of the total, which is useful.

So per that 2nd bullet, I think I need to switch which totals column or row is displayed, and maybe stick with displaying just one by default; the pure-100%s column that Excel adds is not useful.
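
A rough sketch of an exported add_totals_row(), assuming column 1 holds the category labels and the remaining columns are numeric counts (the name and behavior are just the proposal):

# sketch: append a "Total" row that sums each numeric column
add_totals_row <- function(dat) {
  dat[[1]] <- as.character(dat[[1]])   # so "Total" can sit in the label column
  total_row <- dat[1, ]
  total_row[[1]] <- "Total"
  total_row[-1] <- lapply(dat[-1], sum, na.rm = TRUE)
  rbind(dat, total_row)
}

add_totals_row(data.frame(cyl = c(4, 6, 8), n = c(11, 7, 14)))
#     cyl  n
# 1     4 11
# 2     6  7
# 3     8 14
# 4 Total 32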

canonical dirty data data.frame

This really should exist. This paper links to dirty_iris, available on GitHub.

That is somewhat dirty, but it doesn't have a lot of the 'in the wild' features that I see all the time (off the top of my head: data that is supposed to be numeric but has an NA / missing code that R interprets as character).

The tests you have here are great, but it would also be valuable to have examples run against a nasty data file that looks like something seen in the wild.

You can assign me this issue if you'd like 😄

Truncate group names on top_2

Ex: top_2(as.factor(mtcars$wt)) has an enormously long 2nd row name for the middle group because there are so many categories. Truncate at ~20 chars? And/or have it prefaced with "middle group:", or have it always be "Middle group (N categories)" where N is dynamic.
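
A tiny sketch of the truncation piece, using the ~20-character threshold floated above (the helper name is just illustrative):

# sketch: shorten an over-long group label, e.g. the middle group's long list of levels
truncate_label <- function(label, max_chars = 20) {
  ifelse(nchar(label) > max_chars,
         paste0(substr(label, 1, max_chars), "..."),
         label)
}

truncate_label("a very, very long category label")
# [1] "a very, very long ca..."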

leading whitespace in Ns on adorn_crosstab()

my call is yielding:

     webinar Began Training   Withdrew
1   Attended     72.2% (13) 27.8% ( 5)
2 Registered     25.0% ( 3) 75.0% ( 9)
3       <NA>     27.4% (29) 72.6% (77)

Changing paste_ns from n_matrix <- as.matrix(n_df) to n_matrix <- as.matrix(data.frame(lapply(n_df, as.character))) gets me this:

     webinar Began Training   Withdrew
1   Attended     72.2% (13)  27.8% (5)
2 Registered      25.0% (3)  75.0% (9)
3       <NA>     27.4% (29) 72.6% (77)

Which I don't like since it's crooked. I want:

     webinar Began Training   Withdrew
1   Attended     72.2% (13) 27.8%  (5)
2 Registered     25.0%  (3) 75.0%  (9)
3       <NA>     27.4% (29) 72.6% (77)

Basically, the spaces should move left, to outside the parentheses. I'm thinking I'll write a little regex-replacement function that counts the spaces after "(" and puts them before the "(" instead.
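
A sketch of that fix (the helper name is just illustrative):

# sketch: move padding spaces from inside "( 5)" to before the parenthesis,
# so columns still line up but the number hugs its parentheses
fix_paren_padding <- function(x) {
  gsub("\\(( +)(\\d+\\))", "\\1(\\2", x)
}

fix_paren_padding("27.8% ( 5)")
# [1] "27.8%  (5)"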

let crosstab be called in a pipeline

  • Update crosstab() code
  • Update vignette

Idea from @chrishaid

This could be done by creating another function. Here is a crude mockup:

crosstab_df <- function(dat, ...){
  # capture the unquoted column names supplied via ...
  names <- as.list(substitute(list(...)))[-1L]
  names <- unlist(lapply(names, deparse))
  # keep only those columns, then hand the two vectors to crosstab()
  trimmed <- dat %>% select_(.dots = names)
  crosstab(trimmed[[1]], trimmed[[2]])
}

Which works: mtcars %>% crosstab_df(cyl, am) (sort of; the variable name is lost in the result).

Do this for tabyl() too, as tabyl_df(). Right now it's tedious to have a dplyr pipeline with a filter, etc., that I have to interrupt in order to use tabyl() (or to use use_series(), which isn't even an option for crosstab()).

create get_not_dupes() function

Ran into this today, where we wanted to see if anyone with the same ID had specified different values for the race columns. I used get_dupes(), then looked at IDs that were not in the duplicated table.

The use case is cleaning data where all records should have a duplicate. I'm not sure how to handle records where there's only one instance of the unique ID. Should it return all rows whose combination of the specified variables is unique, and thus include those? Then for this use case you'd have to start by filtering the table to records where the ID appears at least twice. That makes for a simpler function, but if you always have to pre-filter for it to be useful, maybe I should bake that in.

Let's start simple: it takes a df and variable names and returns a df of the rows that didn't share those variable combinations. The opposite of get_dupes(), which is nice.
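
A sketch of that simple version, built as the complement of get_dupes() (the get_not_dupes() name follows the proposal):

library(dplyr)
library(janitor)

# sketch: return rows whose combination of the given variables appears only once
get_not_dupes <- function(dat, ...) {
  var_names <- unlist(lapply(as.list(substitute(list(...)))[-1], deparse))
  dupes <- get_dupes(dat, ...)
  anti_join(dat, dupes, by = var_names)
}

mtcars %>% get_not_dupes(wt, cyl)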

create a two-variable crosstab function

with percentages row-wise, col-wise, total, or neither.

If it can extend the current tabyl without making it too complex, great.

Otherwise have it take two vectors instead of a data.frame?
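
For reference, base R's table() and prop.table() cover the row-wise/col-wise/total options; a sketch of the core logic for a two-vector version:

# sketch: counts for two vectors, optionally converted to proportions
crosstab_sketch <- function(x, y, percent = c("none", "row", "col", "total")) {
  percent <- match.arg(percent)
  counts <- table(x, y)
  switch(percent,
         none  = counts,
         row   = prop.table(counts, margin = 1),
         col   = prop.table(counts, margin = 2),
         total = prop.table(counts))
}

crosstab_sketch(mtcars$cyl, mtcars$gear, percent = "row")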

crosstab() takes lists and weird things happen

Can we somehow disable feeding lists as arguments to crosstab()? Not a huge deal in practice, as it's not an intended use, but I'd rather it error than produce a nonsensical result. Look into it.
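
One possible guard, sketched here with hypothetical argument names (vec1, vec2) and meant to sit at the top of the default method:

# sketch: fail fast on list inputs instead of returning a nonsensical table
if (is.list(vec1) || is.list(vec2)) {
  stop("crosstab() expects atomic vectors, not lists")
}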

crosstab.data.frame produces factor result, while crosstab.default produces character

Compare:

z_df <- crosstab(dat, v3, v1)
z <- crosstab(dat$v3, dat$v1)

Causing this test to fail:

test_that("crosstab.data.frame dispatches", {
  expect_equal(z_df,
               z %>% setNames(., c("v3", names(.)[-1]))) # compare to regular z above - they have different names[1] due to piping
})

Looks like it's due to needing stringsAsFactors = FALSE in crosstab.data.frame.

have tabyl() take multiple variables and compare them?

This might be better off as a separate function. The idea would be that mtcars %>% tabyl(cyl, gear) returns the result of:

full_join(tabyl(mtcars$cyl), tabyl(mtcars$gear), by = c("mtcars_cyl" = "mtcars_gear")) %>%
  setNames(c("value", "cyl_n", "cyl_percent", "gear_n", "gear_percent"))

# A tibble: 5 x 5
  value cyl_n cyl_percent gear_n gear_percent
  <dbl> <int>       <dbl>  <int>        <dbl>
1     4    11     0.34375     12      0.37500
2     6     7     0.21875     NA           NA
3     8    14     0.43750     NA           NA
4     3    NA          NA     15      0.46875
5     5    NA          NA      5      0.15625

Would it return n? %? Both? Have that be user-specified? If just n, it could take advantage of the adorn_crosstab function I'm working on.

Can't crosstab() a variable with itself when piped

mtcars %>% crosstab(cyl, cyl) throws error:

Error in [.data.frame(x, 2) : undefined columns selected

But you can do crosstab(mtcars$cyl, mtcars$cyl), and frankly, I don't think this is a meaningful operation; use tabyl() instead. Not sure it's worth producing a more helpful error message.

tabyl() should further clean variable names that it returns

For instance tabyl(nchar(combined$`Associate's Year Awarded`)) yields:

   nchar(combined_`Associate's Year Awarded`)     n      percent valid_percent
                                        <int> <int>        <dbl>         <dbl>
1                                           1     1 8.212203e-05    0.01388889
2                                           3     9 7.390983e-04    0.12500000
3                                           4    27 2.217295e-03    0.37500000
4                                           5    29 2.381539e-03    0.40277778
5                                           9     1 8.212203e-05    0.01388889
6                                          11     2 1.642441e-04    0.02777778

IMO that should be combined_Associates_Year_Awarded for easier subsequent reference
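
A sketch of that cleanup, applying clean_names()-style substitutions to the generated column name (the helper name is just for illustration):

# sketch: tidy the machine-generated first column name
clean_result_name <- function(name) {
  name <- gsub("'", "", name)               # drop apostrophes entirely
  name <- gsub("[^A-Za-z0-9]+", "_", name)  # other non-alphanumerics become _
  gsub("^_|_$", "", name)                   # trim leading/trailing underscores
}

clean_result_name("nchar(combined_`Associate's Year Awarded`)")
# [1] "nchar_combined_Associates_Year_Awarded"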

sort parameter for tabyl() not working with factors

x <- factor(c("aaaaaaaa", "bbbbbbbb", "ccccccccddddddddd", "dddddddd", NA, "hhhhhhhh", "bbbbbbbb"), levels = c("dddddddd", "aaaaaaaa", "ccccccccddddddddd", "bbbbbbbb", "hhhhhhhh"))

tabyl(x, sort = TRUE)

Have crosstab print both percent and n?

  • Add appropriate number of 0s in percentages, e.g., 0.0% instead of 0%

Not useful for subsequent analysis, but handy for examining and sometimes for sharing. So something like specifying percent = "row", mode = "combined", and each cell would hold a value like 10.4% (31).

Don't want to excessively complicate things, and it raises the issue of truncation preferences. Having parameters for percent, mode, and digits seems excessive.

I could have a separate function that handles all % related calculations, like:
crosstab(mtcars$cyl, mtcars$gear) %>% table_percentages(., percent = "row", show = "combined", digits = 1)
Though I don't like taking the simple percentage option out of crosstab(). But if I don't, there's redundancy.

Might just write this as a personal function. But I have to imagine this kind of display is widely used.
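
A sketch of the cell formatting itself, which also covers the 0.0%-vs-0% point from the checklist above (the function name is hypothetical):

# sketch: build "10.4% (31)"-style cell values from a proportion and a count
format_combined_cell <- function(proportion, n, digits = 1) {
  sprintf(paste0("%.", digits, "f%% (%s)"), 100 * proportion, n)
}

format_combined_cell(c(0.104, 0), c(31, 0))
# [1] "10.4% (31)" "0.0% (0)"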

add clean_NA_variants function

Takes a data.frame, turns all instances of "N/A" and "#N/A" and "NA" into true NA values. Maybe it takes a parameter to either clean those exact terms, or grep them, filtering out say "N/A- I have not yet used or received this support/tool."
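
A sketch of the exact-match version (function name per the proposal; the default set of strings is an assumption):

# sketch: replace exact matches of common NA spellings with true NA values
clean_NA_variants <- function(dat, na_strings = c("N/A", "#N/A", "NA")) {
  dat[] <- lapply(dat, function(col) {
    if (is.character(col)) col[col %in% na_strings] <- NA
    col
  })
  dat
}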

replace NA values in tabyl() counts with 0?

Only affects calls where there's an unrepresented level in a factor variable, so this is a niche case.

Instead of:

> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
  sorted_with_na_and_fac$grp  n percent valid_percent
1                          c  2    0.50     0.6666667
2                          a  1    0.25     0.3333333
3                          b NA      NA            NA
4                       <NA>  1    0.25            NA

Return:

> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
  sorted_with_na_and_fac$grp n percent valid_percent
1                          c 2    0.50     0.6666667
2                          a 1    0.25     0.3333333
3                          b 0    0.00     0.0000000
4                       <NA> 1    0.25     0.0000000

Could likely be done with:

  # replace NA values with zeroes
  result[is.na(result)] <- 0

Though I think I'd want to retain the NA representation for row = <NA> and column = valid_percent, to show that the share of missing values is not in fact 0% once they've been filtered out.
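
A sketch of that more selective fill: zero out n and percent, but leave valid_percent as NA for the <NA> row (column names as in the output above):

# sketch: fill NA counts and percents with 0, keeping NA in valid_percent for the <NA> row
result$n[is.na(result$n)] <- 0
result$percent[is.na(result$percent)] <- 0
result$valid_percent[is.na(result$valid_percent) & !is.na(result[[1]])] <- 0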

handle apostrophe-s in clean_names

The variable name employee's name should become employees_name, not employee_s_name ... a small thing, but the latter is annoying to type.
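
The fix could be as small as stripping apostrophes before the existing punctuation-to-underscore step, e.g. (dat standing in for the data.frame being cleaned):

# sketch: drop apostrophes first, so "employee's name" becomes employees_name downstream
names(dat) <- gsub("'", "", names(dat), fixed = TRUE)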

have tabyl() look for attr "label" and print it?

would be swell for SurveyMonkey -> SPSS -> R workflow, to know that q0038_0001 is the question you think it is.

Something like this at the start of the function:

  # print label attribute, if it exists - does not work
  if(!is.null(attr(dat %>% select(...), "label"))) {print(attr(dat %>% select(...), "label"))}

Maybe tabyl() - either the general function or one specialized for SPSS work - should take a labelled class vector and work with that, rather than a data.frame.
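
For what it's worth, the snippet above checks the attribute on the selected data.frame, but the "label" attribute typically lives on the column itself; a sketch of a version that might work, still assuming the column comes in via ...:

# sketch: look for the label attribute on the column rather than the data.frame
selected <- dat %>% select(...)
col_label <- attr(selected[[1]], "label")
if (!is.null(col_label)) print(col_label)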

tabyl() throws warning with some named vectors

j <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
                 TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
               .Names = c("104", "114", "116", "120", "124", "126", "133",
                          "144", "146", "149", "156", "160", "163", "167",
                          "173", "174", "177", "181", "186"))

j %>% tabyl

104 n percent
1 TRUE 19 1
Warning message:
In names(result)[1] <- var_name :
number of items to replace is not a multiple of replacement length

clean_NA_vec() should be exported

Right now it's in a middle ground: you can't call it, but you can pull up its help page with ?clean_NA_vec. I think it should be exported, in case someone only wants to clean a single vector.

function to compare data.frames that should be the same - variable names and col type consistency

Here's my use case: I am writing functions to suck up n state export files.

I can't depend on read.csv's type hinting because I have situations where the nth data file will contain a data type (say, 'K' for grade, where grade had always been integer on the first 10 files) that doesn't play nicely.

To solve that, I'm reading the raw files in as character, and then doing type conversion myself.

But this seems like a janitor kind of job.

Is this in scope or out of scope? Any thoughts about how to move forward? @chrishaid, would love your thoughts here as well.
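
A starting-point sketch of the comparison (compare_df_cols is just a placeholder name): line up column names and classes across a list of data.frames so mismatches like an integer-vs-character grade column stand out.

# sketch: tabulate each column's class across a list of data.frames;
# NA means that column is missing from that file
compare_df_cols <- function(df_list) {
  all_cols <- unique(unlist(lapply(df_list, names)))
  classes <- sapply(df_list, function(df) {
    vapply(all_cols, function(nm) {
      if (nm %in% names(df)) class(df[[nm]])[1] else NA_character_
    }, character(1))
  })
  data.frame(column = all_cols, classes, stringsAsFactors = FALSE, row.names = NULL)
}

compare_df_cols(list(file1 = mtcars, file2 = iris))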
