
janitor's People

Contributors

anthonytyler27, bfgray3, billdenney, chrishaid, danchaltiel, daniel-barnett, fernandovmacedo, francisbarton, garthtarr, gdutz, henryn218, jasonaizkalns, jjsteph, jonleslie, josiahparry, jsta, juba, jzadra, kgilds, khueyama, krlmlr, lionel-, matanhakim, mattroumaya, nelson-gon, olivroy, rgknight, romainfrancois, sfirke, tazinho


janitor's Issues

create use_first_valid_of function

For cases like: use A if not NA, otherwise use B if not NA, otherwise... This could just be a big case_when call (edit: case_when is probably not right given the variable number of arguments; a for loop may be simplest). Optionally create a new variable indicating which input was used.

Besides being more readable than nested ifelse() calls, it would work for dates, which are vexing if you run into this: http://stackoverflow.com/questions/6668963/how-to-prevent-ifelse-from-turning-date-objects-into-numeric-objects

In fact, have it print a warning if you feed it at least one input vector with class Date, suggesting that the user specify force_class = "date".
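
A minimal sketch of what this could look like, assuming the use_first_valid_of() name and the take-vectors-via-... signature proposed above; the loop approach also sidesteps the ifelse()/Date problem:

# sketch: fill missing values from each subsequent input, left to right;
# subset-assignment preserves the Date class where ifelse() would not
use_first_valid_of <- function(...) {
  inputs <- list(...)
  result <- inputs[[1]]
  for (i in seq_along(inputs)[-1]) {
    still_missing <- is.na(result)
    result[still_missing] <- inputs[[i]][still_missing]
  }
  result
}

use_first_valid_of(c(NA, "b", NA), c("x", "y", NA), c("p", "q", "r"))
# [1] "x" "b" "r"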

factors to character

Sam, one thing I've run into recently is output from other packages (especially the stuff that reads in Access files... ugh) that doesn't give fine control over the read.csv parameters. I've been having to take the data frame as it comes and then flip factors back to character.

Is that something that should live in janitor? If I made a pull request, would you be likely to accept?
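
For reference, the flip itself is a small base-R operation; a sketch of what a janitor wrapper might look like (the function name here is hypothetical):

# hypothetical helper: convert every factor column of a data.frame to character
convert_factors_to_character <- function(dat) {
  factor_cols <- vapply(dat, is.factor, logical(1))
  dat[factor_cols] <- lapply(dat[factor_cols], as.character)
  dat
}

str(convert_factors_to_character(iris))
# Species becomes character; the numeric columns are untouched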

improve inputs of tabyl() - are dots needed?

I take the variable name via the ... parameter so that, when calling tabyl() on a data.frame, the variable name doesn't need to be quoted and autocomplete works in RStudio; e.g., typing iris %>% tabyl(Sep brings up autocomplete suggestions.

But this SO answer states:

It's not a good idea to use the ... when you know each parameter in advance, however, as it adds some ambiguity and further complication to the argument string (and makes the function signature unclear to any other user).

So maybe I'm using it unnecessarily?
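
For comparison, a single named argument can still accept an unquoted column name via substitute(), so dropping the dots wouldn't necessarily lose the convenience. A rough sketch (tabyl2 is just an illustrative name):

library(janitor)

# sketch: take one unquoted column name as a regular argument instead of via ...
tabyl2 <- function(dat, var) {
  var_name <- deparse(substitute(var))
  tabyl(dat[[var_name]])
}

mtcars %>% tabyl2(cyl)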

Add totals column and row with adorn_crosstab()

Again, mimicking Excel PivotTables.

I'm most of the way there, but unsure on:

  • Should I create add_totals_column and add_totals_row functions that will be exported for a user to call on any data.frame?
  • Right now in the dev branch, the totals column is added for row-wise %s. In Excel, the column added with row-wise %s is just the sum of the %s in each row, i.e., 100% in every row (intuitive, but useless). But for row-wise %s Excel also adds a meaningful row at the bottom showing the % that each column represents of the total, which is useful.

So per that 2nd bullet, I think I need to switch which totals column or row is displayed, and maybe stick with displaying just one by default; the pure-100%s column that Excel adds is not useful.
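
A rough sketch of an exported add_totals_row(), assuming column 1 holds the category labels and the remaining columns are numeric counts (the name and behavior are just the proposal):

# sketch: append a "Total" row that sums each numeric column
add_totals_row <- function(dat) {
  dat[[1]] <- as.character(dat[[1]])   # so "Total" can sit in the label column
  total_row <- dat[1, ]
  total_row[[1]] <- "Total"
  total_row[-1] <- lapply(dat[-1], sum, na.rm = TRUE)
  rbind(dat, total_row)
}

add_totals_row(data.frame(cyl = c(4, 6, 8), n = c(11, 7, 14)))
#     cyl  n
# 1     4 11
# 2     6  7
# 3     8 14
# 4 Total 32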

canonical dirty data data.frame

This really should exist. This paper links to dirty_iris, available on GitHub.

That is somewhat dirty, but it doesn't have a lot of the 'in the wild' features that I see all the time (off the top of my head: data that is supposed to be numeric but has an NA / missing code that R interprets as character).

The tests you have here are great, but it would also be valuable to have examples run against a nasty data file that looks like something seen in the wild.

You can assign me this issue if you'd like 😄

Truncate group names on top_2

Ex: top_2(as.factor(mtcars$wt)) has an enormously long 2nd row name for the middle group because there are so many categories. Truncate at ~20 chars? And/or have it prefaced with "middle group:", or have it always be "Middle group (N categories)" where N is dynamic.
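
A tiny sketch of the truncation piece, using the ~20-character threshold floated above (the helper name is just illustrative):

# sketch: shorten an over-long group label, e.g. the middle group's long list of levels
truncate_label <- function(label, max_chars = 20) {
  ifelse(nchar(label) > max_chars,
         paste0(substr(label, 1, max_chars), "..."),
         label)
}

truncate_label("a very, very long category label")
# [1] "a very, very long ca..."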

leading whitespace in Ns on adorn_crosstab()

my call is yielding:

     webinar Began Training   Withdrew
1   Attended     72.2% (13) 27.8% ( 5)
2 Registered     25.0% ( 3) 75.0% ( 9)
3       <NA>     27.4% (29) 72.6% (77)

Changing paste_ns from n_matrix <- as.matrix(n_df) to n_matrix <- as.matrix(data.frame(lapply(n_df, as.character))) gets me this:

     webinar Began Training   Withdrew
1   Attended     72.2% (13)  27.8% (5)
2 Registered      25.0% (3)  75.0% (9)
3       <NA>     27.4% (29) 72.6% (77)

Which I don't like since it's crooked. I want:

     webinar Began Training   Withdrew
1   Attended     72.2% (13) 27.8%  (5)
2 Registered     25.0%  (3) 75.0%  (9)
3       <NA>     27.4% (29) 72.6% (77)

Basically, the spaces should move left, to outside the parentheses. I'm thinking I'll write a little regex-replacement function that counts the spaces after "(" and puts them before the "(" instead.
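
A sketch of that fix (the helper name is just illustrative):

# sketch: move padding spaces from inside "( 5)" to before the parenthesis,
# so columns still line up but the number hugs its parentheses
fix_paren_padding <- function(x) {
  gsub("\\(( +)(\\d+\\))", "\\1(\\2", x)
}

fix_paren_padding("27.8% ( 5)")
# [1] "27.8%  (5)"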

let crosstab be called in a pipeline

  • Update crosstab() code
  • Update vignette

Idea from @chrishaid

This could be done by creating another function. Here is a crude mockup:

crosstab_df <- function(dat, ...){
  # capture the unquoted column names supplied via ...
  names <- as.list(substitute(list(...)))[-1L]
  names <- unlist(lapply(names, deparse))
  # keep only those columns, then hand the two vectors to crosstab()
  trimmed <- dat %>% select_(.dots = names)
  crosstab(trimmed[[1]], trimmed[[2]])
}

Which works: mtcars %>% crosstab_df(cyl, am) (sort of; the variable name is lost in the result).

Do this for tabyl() too, as tabyl_df(). Right now it's tedious to have a dplyr pipeline with a filter, etc., that I have to interrupt in order to use tabyl() (or to use use_series(), which isn't even an option for crosstab()).

create get_not_dupes() function

Ran into this today, where we wanted to see if anyone with the same ID had specified different values for the race columns. I used get_dupes(), then looked at IDs that were not in the duplicated table.

The use case is cleaning data where all records should have a duplicate. I'm not sure how to handle records where there's only one instance of the unique ID. Should it return all rows whose combination of the specified variables is unique, and thus include those? Then for this use case you'd have to start by filtering the table to records where the ID appears at least twice. That makes for a simpler function, but if you always have to pre-filter for it to be useful, maybe I should bake that in.

Let's start simple: it takes a df and variable names and returns a df of the rows that didn't share those variable combinations. The opposite of get_dupes(), which is nice.
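
A sketch of that simple version, built as the complement of get_dupes() (the get_not_dupes() name follows the proposal):

library(dplyr)
library(janitor)

# sketch: return rows whose combination of the given variables appears only once
get_not_dupes <- function(dat, ...) {
  var_names <- unlist(lapply(as.list(substitute(list(...)))[-1], deparse))
  dupes <- get_dupes(dat, ...)
  anti_join(dat, dupes, by = var_names)
}

mtcars %>% get_not_dupes(wt, cyl)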

create a two-variable crosstab function

with percentages row-wise, col-wise, total, or neither.

If it can extend the current tabyl without making it too complex, great.

Otherwise have it take two vectors instead of a data.frame?
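
For reference, base R's table() and prop.table() cover the row-wise/col-wise/total options; a sketch of the core logic for a two-vector version:

# sketch: counts for two vectors, optionally converted to proportions
crosstab_sketch <- function(x, y, percent = c("none", "row", "col", "total")) {
  percent <- match.arg(percent)
  counts <- table(x, y)
  switch(percent,
         none  = counts,
         row   = prop.table(counts, margin = 1),
         col   = prop.table(counts, margin = 2),
         total = prop.table(counts))
}

crosstab_sketch(mtcars$cyl, mtcars$gear, percent = "row")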

crosstab() takes lists and weird things happen

Can we somehow disable feeding lists as arguments to crosstab()? Not a huge deal in practice, as it's not an intended use, but I'd rather it error than produce a nonsensical result. Look into it.
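
One possible guard, sketched here with hypothetical argument names (vec1, vec2) and meant to sit at the top of the default method:

# sketch: fail fast on list inputs instead of returning a nonsensical table
if (is.list(vec1) || is.list(vec2)) {
  stop("crosstab() expects atomic vectors, not lists")
}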

crosstab.data.frame produces factor result, while crosstab.default produces character

Compare:

z_df <- crosstab(dat, v3, v1)
z <- crosstab(dat$v3, dat$v1)

Causing this test to fail:

test_that("crosstab.data.frame dispatches", {
  expect_equal(z_df,
               z %>% setNames(., c("v3", names(.)[-1]))) # compare to regular z above - they have different names[1] due to piping
})

Looks like it's due to needing stringsAsFactors = FALSE in crosstab.data.frame.

have tabyl() take multiple variables and compare them?

This might be better off as a separate function. The idea would be that mtcars %>% tabyl(cyl, gear) returns the result of:

full_join(tabyl(mtcars$cyl), tabyl(mtcars$gear), by = c("mtcars_cyl" = "mtcars_gear")) %>%
  setNames(c("value", "cyl_n", "cyl_percent", "gear_n", "gear_percent"))

# A tibble: 5 x 5
  value cyl_n cyl_percent gear_n gear_percent
  <dbl> <int>       <dbl>  <int>        <dbl>
1     4    11     0.34375     12      0.37500
2     6     7     0.21875     NA           NA
3     8    14     0.43750     NA           NA
4     3    NA          NA     15      0.46875
5     5    NA          NA      5      0.15625

Would it return n? %? Both? Have that be user-specified? If just n, it could take advantage of the adorn_crosstab function I'm working on.

Can't crosstab() a variable with itself when piped

mtcars %>% crosstab(cyl, cyl) throws error:

Error in [.data.frame(x, 2) : undefined columns selected

But you can do crosstab(mtcars$cyl, mtcars$cyl), and frankly, I don't think this is a meaningful operation; use tabyl() instead. Not sure it's worth producing a more helpful error message.

tabyl() should further clean variable names that it returns

For instance tabyl(nchar(combined$`Associate's Year Awarded`)) yields:

   nchar(combined_`Associate's Year Awarded`)     n      percent valid_percent
                                        <int> <int>        <dbl>         <dbl>
1                                           1     1 8.212203e-05    0.01388889
2                                           3     9 7.390983e-04    0.12500000
3                                           4    27 2.217295e-03    0.37500000
4                                           5    29 2.381539e-03    0.40277778
5                                           9     1 8.212203e-05    0.01388889
6                                          11     2 1.642441e-04    0.02777778

IMO that should be combined_Associates_Year_Awarded for easier subsequent reference
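
A sketch of that cleanup, applying clean_names()-style substitutions to the generated column name (the helper name is just for illustration):

# sketch: tidy the machine-generated first column name
clean_result_name <- function(name) {
  name <- gsub("'", "", name)               # drop apostrophes entirely
  name <- gsub("[^A-Za-z0-9]+", "_", name)  # other non-alphanumerics become _
  gsub("^_|_$", "", name)                   # trim leading/trailing underscores
}

clean_result_name("nchar(combined_`Associate's Year Awarded`)")
# [1] "nchar_combined_Associates_Year_Awarded"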

sort parameter for tabyl() not working with factors

x <- factor(c("aaaaaaaa", "bbbbbbbb", "ccccccccddddddddd", "dddddddd", NA, "hhhhhhhh", "bbbbbbbb"), levels = c("dddddddd", "aaaaaaaa", "ccccccccddddddddd", "bbbbbbbb", "hhhhhhhh"))

tabyl(x, sort = TRUE)

Have crosstab print both percent and n?

  • Add appropriate number of 0s in percentages, e.g., 0.0% instead of 0%

Not useful for subsequent analysis, but handy for examining and sometimes for sharing. So something like specifying percent = "row", mode = "combined", and each cell would hold a value like 10.4% (31).

Don't want to excessively complicate things, and it raises the issue of truncation preferences. Having parameters for percent, mode, and digits seems excessive.

I could have a separate function that handles all % related calculations, like:
crosstab(mtcars$cyl, mtcars$gear) %>% table_percentages(., percent = "row", show = "combined", digits = 1)
Though I don't like taking the simple percentage option out of crosstab(). But if I don't, there's redundancy.

Might just write this as a personal function. But I have to imagine this kind of display is widely used.
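
A sketch of the cell formatting itself, which also covers the 0.0%-vs-0% point from the checklist above (the function name is hypothetical):

# sketch: build "10.4% (31)"-style cell values from a proportion and a count
format_combined_cell <- function(proportion, n, digits = 1) {
  sprintf(paste0("%.", digits, "f%% (%s)"), 100 * proportion, n)
}

format_combined_cell(c(0.104, 0), c(31, 0))
# [1] "10.4% (31)" "0.0% (0)"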

add clean_NA_variants function

Takes a data.frame, turns all instances of "N/A" and "#N/A" and "NA" into true NA values. Maybe it takes a parameter to either clean those exact terms, or grep them, filtering out say "N/A- I have not yet used or received this support/tool."
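
A sketch of the exact-match version (function name per the proposal; the default set of strings is an assumption):

# sketch: replace exact matches of common NA spellings with true NA values
clean_NA_variants <- function(dat, na_strings = c("N/A", "#N/A", "NA")) {
  dat[] <- lapply(dat, function(col) {
    if (is.character(col)) col[col %in% na_strings] <- NA
    col
  })
  dat
}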

replace NA values in tabyl() counts with 0?

Only affects calls where there's an unrepresented level in a factor variable, so this is a niche case.

Instead of:

> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
  sorted_with_na_and_fac$grp  n percent valid_percent
1                          c  2    0.50     0.6666667
2                          a  1    0.25     0.3333333
3                          b NA      NA            NA
4                       <NA>  1    0.25            NA

Return:

> tabyl(sorted_with_na_and_fac$grp, sort = TRUE)
  sorted_with_na_and_fac$grp n percent valid_percent
1                          c 2    0.50     0.6666667
2                          a 1    0.25     0.3333333
3                          b 0    0.00     0.0000000
4                       <NA> 1    0.25     0.0000000

Could likely be done with:

  # replace NA values with zeroes
  result[is.na(result)] <- 0

Though I think I'd want to retain the NA representation for row = <NA> and column = valid_percent, to show that the share of missing values is not in fact 0% once they've been filtered out.
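
A sketch of that more selective fill: zero out n and percent, but leave valid_percent as NA for the <NA> row (column names as in the output above):

# sketch: fill NA counts and percents with 0, keeping NA in valid_percent for the <NA> row
result$n[is.na(result$n)] <- 0
result$percent[is.na(result$percent)] <- 0
result$valid_percent[is.na(result$valid_percent) & !is.na(result[[1]])] <- 0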

handle apostrophe-s in clean_names

The variable name employee's name should become employees_name, not employee_s_name ... a small thing, but the latter is annoying to type.
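
The fix could be as small as stripping apostrophes before the existing punctuation-to-underscore step, e.g. (dat standing in for the data.frame being cleaned):

# sketch: drop apostrophes first, so "employee's name" becomes employees_name downstream
names(dat) <- gsub("'", "", names(dat), fixed = TRUE)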

have tabyl() look for attr "label" and print it?

would be swell for SurveyMonkey -> SPSS -> R workflow, to know that q0038_0001 is the question you think it is.

Something like this at the start of the function:

  # print label attribute, if it exists - does not work
  if(!is.null(attr(dat %>% select(...), "label"))) {print(attr(dat %>% select(...), "label"))}

Maybe tabyl() - either the general function or one specialized for SPSS work - should take a labelled class vector and work with that, rather than a data.frame.
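
For what it's worth, the snippet above checks the attribute on the selected data.frame, but the "label" attribute typically lives on the column itself; a sketch of a version that might work, still assuming the column comes in via ...:

# sketch: look for the label attribute on the column rather than the data.frame
selected <- dat %>% select(...)
col_label <- attr(selected[[1]], "label")
if (!is.null(col_label)) print(col_label)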

tabyl() throws warning with some named vectors

j <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
                 TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE),
               .Names = c("104", "114", "116", "120", "124", "126", "133",
                          "144", "146", "149", "156", "160", "163", "167",
                          "173", "174", "177", "181", "186"))

j %>% tabyl

104 n percent
1 TRUE 19 1
Warning message:
In names(result)[1] <- var_name :
number of items to replace is not a multiple of replacement length

clean_NA_vec() should be exported

Right now it's in a middle ground: you can't call it, but you can pull up its help page with ?clean_NA_vec. I think it should be exported, in case someone only wants to clean a single vector.

function to compare data.frames that should be the same - variable names and col type consistency

Here's my use case: I am writing functions to suck up n state export files.

I can't depend on read.csv's type hinting because I have situations where the nth data file will contain a data type (say, 'K' for grade, where grade had always been integer on the first 10 files) that doesn't play nicely.

To solve that, I'm reading the raw files in as character, and then doing type conversion myself.

But this seems like a janitor kind of job.

Is this in scope or out of scope? Any thoughts about how to move forward? @chrishaid, would love your thoughts here as well.
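
A starting-point sketch of the comparison (compare_df_cols is just a placeholder name): line up column names and classes across a list of data.frames so mismatches like an integer-vs-character grade column stand out.

# sketch: tabulate each column's class across a list of data.frames;
# NA means that column is missing from that file
compare_df_cols <- function(df_list) {
  all_cols <- unique(unlist(lapply(df_list, names)))
  classes <- sapply(df_list, function(df) {
    vapply(all_cols, function(nm) {
      if (nm %in% names(df)) class(df[[nm]])[1] else NA_character_
    }, character(1))
  })
  data.frame(column = all_cols, classes, stringsAsFactors = FALSE, row.names = NULL)
}

compare_df_cols(list(file1 = mtcars, file2 = iris))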
