mde: Missing Data Explorer

Home Page: https://nelson-gon.github.io/mde

License: GNU General Public License v3.0

2022-01-31

The goal of mde is to ease exploration of missingness.

Installation

CRAN release

install.packages("mde")

Stable Development version

devtools::install_github("Nelson-Gon/mde")


devtools::install_github("Nelson-Gon/mde",  build_vignettes=TRUE)

Unstable Development version

devtools::install_github("Nelson-Gon/mde@develop")

Loading the package

library(mde)
#> Welcome to mde. This is mde version 0.3.2.
#>  Please file issues and feedback at https://www.github.com/Nelson-Gon/mde/issues
#> Turn this message off using 'suppressPackageStartupMessages(library(mde))'
#>  Happy Exploration :)

Exploring missingness

To get a simple missingness report, use na_summary:

na_summary(airquality)
#>   variable missing complete percent_complete percent_missing
#> 1      Day       0      153        100.00000        0.000000
#> 2    Month       0      153        100.00000        0.000000
#> 3    Ozone      37      116         75.81699       24.183007
#> 4  Solar.R       7      146         95.42484        4.575163
#> 5     Temp       0      153        100.00000        0.000000
#> 6     Wind       0      153        100.00000        0.000000
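The columns in this report are related by simple arithmetic. As a rough base R sketch (an illustration only, not mde's actual implementation; na_summary_sketch is a hypothetical helper name), similar figures can be derived with is.na and colSums. Unlike the report above, the sketch keeps variables in dataset order.

```r
# Hypothetical helper: a base R approximation of the report above.
na_summary_sketch <- function(df) {
  missing <- colSums(is.na(df))    # NA count per column
  complete <- nrow(df) - missing   # non-NA count per column
  data.frame(
    variable = names(df),
    missing = unname(missing),
    complete = unname(complete),
    percent_complete = unname(100 * complete / nrow(df)),
    percent_missing = unname(100 * missing / nrow(df))
  )
}

report <- na_summary_sketch(datasets::airquality)
report[report$variable == "Ozone", ]
```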

To sort this summary by a given column:

na_summary(airquality,sort_by = "percent_complete")
#>   variable missing complete percent_complete percent_missing
#> 3    Ozone      37      116         75.81699       24.183007
#> 4  Solar.R       7      146         95.42484        4.575163
#> 1      Day       0      153        100.00000        0.000000
#> 2    Month       0      153        100.00000        0.000000
#> 5     Temp       0      153        100.00000        0.000000
#> 6     Wind       0      153        100.00000        0.000000

To reset (drop) row names, set reset_rownames to TRUE. This is especially useful when row names are simply numeric and carry no additional meaning.

na_summary(airquality,sort_by = "percent_complete", reset_rownames = TRUE)
#>   variable missing complete percent_complete percent_missing
#> 1    Ozone      37      116         75.81699       24.183007
#> 2  Solar.R       7      146         95.42484        4.575163
#> 3      Day       0      153        100.00000        0.000000
#> 4    Month       0      153        100.00000        0.000000
#> 5     Temp       0      153        100.00000        0.000000
#> 6     Wind       0      153        100.00000        0.000000

To sort by percent_missing instead:

na_summary(airquality, sort_by = "percent_missing")
#>   variable missing complete percent_complete percent_missing
#> 1      Day       0      153        100.00000        0.000000
#> 2    Month       0      153        100.00000        0.000000
#> 5     Temp       0      153        100.00000        0.000000
#> 6     Wind       0      153        100.00000        0.000000
#> 4  Solar.R       7      146         95.42484        4.575163
#> 3    Ozone      37      116         75.81699       24.183007

To sort the above in descending order:

na_summary(airquality, sort_by="percent_missing", descending = TRUE)
#>   variable missing complete percent_complete percent_missing
#> 3    Ozone      37      116         75.81699       24.183007
#> 4  Solar.R       7      146         95.42484        4.575163
#> 1      Day       0      153        100.00000        0.000000
#> 2    Month       0      153        100.00000        0.000000
#> 5     Temp       0      153        100.00000        0.000000
#> 6     Wind       0      153        100.00000        0.000000

To exclude certain columns from the analysis:

na_summary(airquality, exclude_cols = c("Day", "Wind"))
#>   variable missing complete percent_complete percent_missing
#> 1    Month       0      153        100.00000        0.000000
#> 2    Ozone      37      116         75.81699       24.183007
#> 3  Solar.R       7      146         95.42484        4.575163
#> 4     Temp       0      153        100.00000        0.000000

To include or exclude via regex match:

na_summary(airquality, regex_kind = "inclusion",pattern_type = "starts_with", pattern = "O|S")
#>   variable missing complete percent_complete percent_missing
#> 1    Ozone      37      116         75.81699       24.183007
#> 2  Solar.R       7      146         95.42484        4.575163

na_summary(airquality, regex_kind = "exclusion",pattern_type = "regex", pattern = "^[O|S]")
#>   variable missing complete percent_complete percent_missing
#> 1      Day       0      153              100               0
#> 2    Month       0      153              100               0
#> 3     Temp       0      153              100               0
#> 4     Wind       0      153              100               0
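One detail worth noting: inside a character class, | is a literal character, so ^[O|S] would also match a name starting with |. Writing ^[OS] or ^(O|S) states the intent more precisely. The base R lines below only mimic the column selection step; they are an assumption about how the matching behaves, not mde's code.

```r
cols <- names(datasets::airquality)

# Inclusion: keep names that start with O or S
included <- grep("^(O|S)", cols, value = TRUE)

# Exclusion: keep names that do NOT start with O or S
excluded <- grep("^[OS]", cols, value = TRUE, invert = TRUE)

included
excluded
```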

To get this summary by group:

test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = c(rep(NA,4),"No"),ID2 = c("E","E","D","E","D"))

na_summary(test2,grouping_cols = c("ID","ID2"))
#> # A tibble: 2 x 7
#>   ID    ID2   variable missing complete percent_complete percent_missing
#>   <chr> <chr> <chr>      <dbl>    <dbl>            <dbl>           <dbl>
#> 1 B     D     Vals           1        1               50              50
#> 2 A     E     Vals           3        0                0             100

na_summary(test2, grouping_cols="ID")
#> Warning in na_summary.data.frame(test2, grouping_cols = "ID"): All non grouping
#> values used. Using select non groups is currently not supported
#> # A tibble: 4 x 6
#>   ID    variable missing complete percent_complete percent_missing
#>   <chr> <chr>      <dbl>    <dbl>            <dbl>           <dbl>
#> 1 A     Vals           3        0                0             100
#> 2 A     ID2            0        3              100               0
#> 3 B     Vals           1        1               50              50
#> 4 B     ID2            0        2              100               0

get_na_counts

This provides a convenient way to show the number of missing values column-wise. It is relatively fast (tests on about 400,000 rows took a few microseconds).

To get the number of missing values in each column of airquality, we can use the function as follows:

get_na_counts(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    37       7    0    0     0   0

The above might be less useful if one would like to get the results by group. In that case, one can provide a grouping vector of names in grouping_cols.

test <- structure(list(Subject = structure(c(1L, 1L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), res = c(NA, 1, 2, 3), ID = structure(c(1L, 
1L, 2L, 2L), .Label = c("1", "2"), class = "factor")), class = "data.frame", row.names = c(NA, 
-4L))

get_na_counts(test, grouping_cols = "ID")
#> # A tibble: 2 x 3
#>   ID    Subject   res
#>   <fct>   <int> <int>
#> 1 1           0     1
#> 2 2           0     0
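For comparison, a grouped NA count can be approximated in base R with tapply. This is only a sketch on a small made-up data frame (test_df is an illustrative name), not how mde computes the result.

```r
# Small illustrative data frame mirroring the res and ID columns above
test_df <- data.frame(ID = c("1", "1", "2", "2"), res = c(NA, 1, 2, 3))

# Count NAs in res within each ID group
na_by_group <- tapply(is.na(test_df$res), test_df$ID, sum)
na_by_group
```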

percent_missing

This is a simple, quick way to look at the percentage of data that is missing column-wise.

percent_missing(airquality)
#>      Ozone  Solar.R Wind Temp Month Day
#> 1 24.18301 4.575163    0    0     0   0

We can get the results by group by providing an optional grouping_cols character vector.

percent_missing(test, grouping_cols = "Subject")
#> # A tibble: 2 x 3
#>   Subject   res    ID
#>   <fct>   <dbl> <dbl>
#> 1 A          50     0
#> 2 B           0     0

To exclude some columns from the above exploration, one can provide an optional character vector in exclude_cols.

percent_missing(airquality,exclude_cols = c("Day","Temp"))
#>      Ozone  Solar.R Wind Month
#> 1 24.18301 4.575163    0     0

sort_by_missingness

This provides a very simple but relatively fast way to sort variables by missingness. Unless otherwise stated, this does not currently support arranging grouped percents.

Usage:

sort_by_missingness(airquality, sort_by = "counts")
#>   variable percent
#> 1     Wind       0
#> 2     Temp       0
#> 3    Month       0
#> 4      Day       0
#> 5  Solar.R       7
#> 6    Ozone      37

To sort in descending order:

sort_by_missingness(airquality, sort_by = "counts", descend = TRUE)
#>   variable percent
#> 1    Ozone      37
#> 2  Solar.R       7
#> 3     Wind       0
#> 4     Temp       0
#> 5    Month       0
#> 6      Day       0

To use percentages instead:

sort_by_missingness(airquality, sort_by = "percents")
#>   variable   percent
#> 1     Wind  0.000000
#> 2     Temp  0.000000
#> 3    Month  0.000000
#> 4      Day  0.000000
#> 5  Solar.R  4.575163
#> 6    Ozone 24.183007
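The same ordering idea can be sketched in base R by computing the percentage missing per column and sorting it. This is an illustration rather than mde's implementation, and the order of tied columns (those with no missing values) may differ from the output above.

```r
# Percent missing per column, then sorted both ways
pct_missing <- 100 * colMeans(is.na(datasets::airquality))
ascending <- sort(pct_missing)
descending <- sort(pct_missing, decreasing = TRUE)

descending
```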

Recoding as NA

recode_as_na

As the name implies, this converts any value or vector of values to NA, i.e. it takes a value such as “missing” or “NA” (a string, not a real NA according to R) and converts it to R’s own representation of missing values (NA).

To use the function out of the box (with default arguments), one simply does something like:

dummy_test <- data.frame(ID = c("A","B","B","A"), 
                         values = c("n/a",NA,"Yes","No"))
# Convert "n/a" and "No" to NA
head(recode_as_na(dummy_test, value = c("n/a","No")))
#>   ID values
#> 1  A   <NA>
#> 2  B   <NA>
#> 3  B    Yes
#> 4  A   <NA>
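To make the behaviour concrete, here is a hedged base R equivalent of the call above. It is a sketch of the idea (match values with %in% and replace them), not the package's internals.

```r
dummy <- data.frame(ID = c("A", "B", "B", "A"),
                    values = c("n/a", NA, "Yes", "No"),
                    stringsAsFactors = FALSE)

# Replace every occurrence of the listed placeholder values with NA
placeholders <- c("n/a", "No")
dummy[] <- lapply(dummy, function(col) replace(col, col %in% placeholders, NA))
dummy
```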

Great, but I want to do so for specific columns not the entire dataset. You can do this by providing column names to subset_cols.

another_dummy <- data.frame(ID = 1:5, Subject = 7:11, 
Change = c("missing","n/a",2:4 ))
# Only change values at the column Change
head(recode_as_na(another_dummy, subset_cols = "Change", value = c("n/a","missing")))
#>   ID Subject Change
#> 1  1       7   <NA>
#> 2  2       8   <NA>
#> 3  3       9      2
#> 4  4      10      3
#> 5  5      11      4

To recode columns using RegEx, one can provide pattern_type and a target pattern. Currently supported pattern_types are starts_with, ends_with, contains and regex. See the docs for more details:

# only change at columns that start with Solar
head(recode_as_na(airquality,value=190,pattern_type="starts_with",pattern="Solar"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41      NA  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

# recode at columns that start with O or S (case sensitive)
head(recode_as_na(airquality,value=c(67,118),pattern_type="starts_with",pattern="S|O"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36      NA  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

# use my own RegEx
head(recode_as_na(airquality,value=c(67,118),pattern_type="regex",pattern="(?i)^(s|o)"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36      NA  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

recode_as_na_if

This function allows one to deliberately introduce missing values if a column meets a certain threshold of missing values. This is similar to amputation but is much more basic. It is only provided here because it is hoped it may be useful to someone for whatever reason.

head(recode_as_na_if(airquality,sign="gt", percent_na=20))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    NA     190  7.4   67     5   1
#> 2    NA     118  8.0   72     5   2
#> 3    NA     149 12.6   74     5   3
#> 4    NA     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    NA      NA 14.9   66     5   6
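Conceptually, the call above blanks out every column whose share of missing values exceeds the threshold. A base R sketch of that idea follows, assuming "gt" means strictly greater than; it is not mde's implementation.

```r
aq <- datasets::airquality
pct_missing <- 100 * colMeans(is.na(aq))

# Columns with more than 20% missing values are set entirely to NA
too_sparse <- names(pct_missing)[pct_missing > 20]
aq[too_sparse] <- NA
head(aq)
```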

recode_as_na_str

This allows recoding as NA based on a string match.

partial_match <- data.frame(A=c("Hi","match_me","nope"), B=c(NA, "not_me","nah"))

recode_as_na_str(partial_match,"ends_with","ME", case_sensitive=FALSE)
#>      A    B
#> 1   Hi <NA>
#> 2 <NA> <NA>
#> 3 nope  nah
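A case-insensitive "ends with" match like the one above can be sketched in base R with grepl. This is only an approximation of the behaviour shown, under the assumption that the match is applied to every column.

```r
partial_match <- data.frame(A = c("Hi", "match_me", "nope"),
                            B = c(NA, "not_me", "nah"),
                            stringsAsFactors = FALSE)

# Values ending in "me" (any case) become NA; grepl returns FALSE for existing NAs
partial_match[] <- lapply(partial_match, function(col) {
  replace(col, grepl("me$", col, ignore.case = TRUE), NA)
})
partial_match
```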

recode_as_na_for

Can all values greater than, less than, greater than or equal to, or less than or equal to some value be converted to NA?

Yes, they can! All we have to do is use recode_as_na_for:

head(recode_as_na_for(airquality,criteria="gt",value=25))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    NA      NA  7.4   NA     5   1
#> 2    NA      NA  8.0   NA     5   2
#> 3    12      NA 12.6   NA     5   3
#> 4    18      NA 11.5   NA     5   4
#> 5    NA      NA 14.3   NA     5   5
#> 6    NA      NA 14.9   NA     5   6

To do so at specific columns, pass an optional subset_cols character vector:

head(recode_as_na_for(airquality, value=40,subset_cols=c("Solar.R","Ozone"), criteria="gt"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    NA      NA  7.4   67     5   1
#> 2    36      NA  8.0   72     5   2
#> 3    12      NA 12.6   74     5   3
#> 4    18      NA 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6
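For readers who want to see the logic spelled out, the subset example can be approximated in base R as follows. It is a sketch that assumes "gt" is a strict greater-than comparison, not the package's code.

```r
aq <- datasets::airquality
cols <- c("Solar.R", "Ozone")

# In the chosen columns only, values above 40 become NA
aq[cols] <- lapply(aq[cols], function(x) replace(x, !is.na(x) & x > 40, NA))
head(aq)
```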

Recoding NA as

recode_na_as

Sometimes, for whatever reason, one would like to replace NAs with whatever value they would like. recode_na_as provides a very simple way to do just that.

head(recode_na_as(airquality))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5     0       0 14.3   56     5   5
#> 6    28       0 14.9   66     5   6

# use NaN

head(recode_na_as(airquality, value=NaN))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5   NaN     NaN 14.3   56     5   5
#> 6    28     NaN 14.9   66     5   6

As a “bonus”, you can manipulate the data only at specific columns as shown here:

head(recode_na_as(airquality, value=0, subset_cols="Ozone"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5     0      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6
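The single-column case has a short base R counterpart, shown here purely as an illustration of what the replacement does:

```r
aq <- datasets::airquality

# Replace NAs in Ozone with 0 and leave every other column untouched
aq$Ozone[is.na(aq$Ozone)] <- 0
head(aq)
```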

The above also supports pattern-based column selection, similar to recode_as_na:

head(mde::recode_na_as(airquality, value=0, pattern_type="starts_with",pattern="Solar"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA       0 14.3   56     5   5
#> 6    28       0 14.9   66     5   6

column_based_recode

Ever needed to change values in a given column based on the proportion of NAs in other columns (row-wise)? The goal of column_based_recode is to achieve just that. Let’s see how we could do this with a simple example:

head(column_based_recode(airquality, values_from = "Wind", values_to="Wind", pattern_type = "regex", pattern = "Solar|Ozone"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA  0.0   56     5   5
#> 6    28      NA 14.9   66     5   6
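Judging from the output above, Wind is set to 0 only in rows where every matched column (Solar.R and Ozone) is missing. A base R sketch under that assumption, which may not reflect the function's full set of options:

```r
aq <- datasets::airquality

# Columns whose names match the pattern
target <- grep("Solar|Ozone", names(aq), value = TRUE)

# Rows where all matched columns are NA
all_missing <- rowSums(is.na(aq[target])) == length(target)
aq$Wind[all_missing] <- 0
head(aq)
```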

custom_na_recode

This allows recoding NA values with common stats functions such as mean, max, min and sd.

To use default values:

head(custom_na_recode(airquality))
#>      Ozone  Solar.R Wind Temp Month Day
#> 1 41.00000 190.0000  7.4   67     5   1
#> 2 36.00000 118.0000  8.0   72     5   2
#> 3 12.00000 149.0000 12.6   74     5   3
#> 4 18.00000 313.0000 11.5   62     5   4
#> 5 42.12931 185.9315 14.3   56     5   5
#> 6 28.00000 185.9315 14.9   66     5   6
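The output above suggests the default replaces each NA with its column mean. A base R sketch of that mean replacement (an illustration, not mde's implementation):

```r
aq <- datasets::airquality

# Replace NAs in each column with that column's mean, ignoring NAs
filled <- aq
filled[] <- lapply(aq, function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))
head(filled)
```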

To use select columns:

head(custom_na_recode(airquality,func="mean",across_columns=c("Solar.R","Ozone")))
#>      Ozone  Solar.R Wind Temp Month Day
#> 1 41.00000 190.0000  7.4   67     5   1
#> 2 36.00000 118.0000  8.0   72     5   2
#> 3 12.00000 149.0000 12.6   74     5   3
#> 4 18.00000 313.0000 11.5   62     5   4
#> 5 42.12931 185.9315 14.3   56     5   5
#> 6 28.00000 185.9315 14.9   66     5   6

To use a function from another package to perform replacements:

To replace NAs using values from later rows with dplyr’s lead:

# use dplyr::lag to take values from earlier rows instead
head(custom_na_recode(airquality,func=dplyr::lead ))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    23      99 14.3   56     5   5
#> 6    28      19 14.9   66     5   6

To perform replacement by group:

some_data <- data.frame(ID=c("A1","A1","A1","A2","A2", "A2"),A=c(5,NA,0,8,3,4),B=c(10,0,0,NA,5,6),C=c(1,NA,NA,25,7,8))

head(custom_na_recode(some_data,func = "mean", grouping_cols = "ID"))
#> # A tibble: 6 x 4
#>   ID        A     B     C
#>   <chr> <dbl> <dbl> <dbl>
#> 1 A1      5    10       1
#> 2 A1      2.5   0       1
#> 3 A1      0     0       1
#> 4 A2      8     5.5    25
#> 5 A2      3     5       7
#> 6 A2      4     6       8
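Grouped mean replacement can be sketched in base R with ave, which applies a function within each group. The example uses only column A from the data above and is an approximation, not the package's code.

```r
grouped <- data.frame(ID = c("A1", "A1", "A1", "A2", "A2", "A2"),
                      A = c(5, NA, 0, 8, 3, 4))

# Within each ID, replace NA with the mean of that group's non-missing values
grouped$A <- ave(grouped$A, grouped$ID, FUN = function(x) {
  replace(x, is.na(x), mean(x, na.rm = TRUE))
})
grouped
```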

Across specific columns:

head(custom_na_recode(some_data,func = "mean", grouping_cols = "ID", across_columns = c("C", "A")))
#> # A tibble: 6 x 4
#>   ID        A     B     C
#>   <chr> <dbl> <dbl> <dbl>
#> 1 A1      5      10     1
#> 2 A1      2.5     0     1
#> 3 A1      0       0     1
#> 4 A2      8      NA    25
#> 5 A2      3       5     7
#> 6 A2      4       6     8

recode_na_if

Given a data.frame object, one can recode NAs as another value based on a grouping variable. In the example below, we replace all NAs in all columns with 0s if the ID is A2 or A3.

some_data <- data.frame(ID=c("A1","A2","A3", "A4"), 
                        A=c(5,NA,0,8), B=c(10,0,0,1),
                        C=c(1,NA,NA,25))
                        
head(recode_na_if(some_data,grouping_col="ID", target_groups=c("A2","A3"),
           replacement= 0))   
#> # A tibble: 4 x 4
#>   ID        A     B     C
#>   <chr> <dbl> <dbl> <dbl>
#> 1 A1        5    10     1
#> 2 A2        0     0     0
#> 3 A3        0     0     0
#> 4 A4        8     1    25
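The same conditional replacement can be written out step by step in base R. This sketch assumes every non-grouping column should be treated and is not mde's implementation.

```r
some_df <- data.frame(ID = c("A1", "A2", "A3", "A4"),
                      A = c(5, NA, 0, 8), B = c(10, 0, 0, 1),
                      C = c(1, NA, NA, 25))

in_target <- some_df$ID %in% c("A2", "A3")
value_cols <- setdiff(names(some_df), "ID")

# Replace NA with 0 only in rows belonging to the target groups
for (col in value_cols) {
  hit <- in_target & is.na(some_df[[col]])
  some_df[[col]][hit] <- 0
}
some_df
```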

Dropping NAs

drop_na_if

Suppose you wanted to drop any column that has a percentage of NAs greater than or equal to a certain value? drop_na_if does just that.

We can drop any columns that have greater than or equal to (gteq) 24% of the values missing from airquality:

head(drop_na_if(airquality, sign="gteq",percent_na = 24))
#>   Solar.R Wind Temp Month Day
#> 1     190  7.4   67     5   1
#> 2     118  8.0   72     5   2
#> 3     149 12.6   74     5   3
#> 4     313 11.5   62     5   4
#> 5      NA 14.3   56     5   5
#> 6      NA 14.9   66     5   6

The above also supports less than or equal to (lteq), equal to (eq), greater than (gt) and less than (lt).

To keep certain columns despite fitting the target percent_na criteria, one can provide an optional keep_columns character vector.

head(drop_na_if(airquality, percent_na = 24, keep_columns = "Ozone"))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

Compare the above result to the following:

head(drop_na_if(airquality, percent_na = 24))
#>   Solar.R Wind Temp Month Day
#> 1     190  7.4   67     5   1
#> 2     118  8.0   72     5   2
#> 3     149 12.6   74     5   3
#> 4     313 11.5   62     5   4
#> 5      NA 14.3   56     5   5
#> 6      NA 14.9   66     5   6
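The effect of keep_columns can be seen in a base R sketch of the same filtering. It assumes the default comparison is "greater than or equal to", as the outputs above suggest, and is not the package's code.

```r
aq <- datasets::airquality
pct_missing <- 100 * colMeans(is.na(aq))
keep_columns <- "Ozone"  # columns to retain regardless of missingness

# Drop columns with at least 24% missing values
dropped <- aq[, pct_missing < 24, drop = FALSE]

# Same rule, but columns listed in keep_columns survive
kept_ozone <- aq[, pct_missing < 24 | names(aq) %in% keep_columns, drop = FALSE]

names(dropped)
names(kept_ozone)
```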

To drop groups that meet a set missingness criterion, we proceed as follows.

grouped_drop <- structure(list(ID = c("A", "A", "B", "A", "B"), 
          Vals = c(4, NA,  NA, NA, NA), Values = c(5, 6, 7, 8, NA)), 
          row.names = c(NA, -5L), class = "data.frame")
# Drop groups whose percent missingness is greater than or
# equal to 67
drop_na_if(grouped_drop,percent_na = 67,sign="gteq",
                                    grouping_cols = "ID")
#> # A tibble: 3 x 3
#>   ID     Vals Values
#>   <chr> <dbl>  <dbl>
#> 1 A         4      5
#> 2 A        NA      6
#> 3 A        NA      8

drop_row_if

This is similar to drop_na_if but does operations rowwise not columnwise. Compare to the example above:

# Drop rows with at least two NAs
head(drop_row_if(airquality, sign="gteq", type="count" , value = 2))
#> Dropped 2 rows.
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 6    28      NA 14.9   66     5   6
#> 7    23     299  8.6   65     5   7

To drop based on percentages:

# Drops 42 rows
head(drop_row_if(airquality, type="percent", value=16, sign="gteq",
                 as_percent=TRUE))
#> Dropped 42 rows.
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 7    23     299  8.6   65     5   7
#> 8    19      99 13.8   59     5   8
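Both row-wise rules above reduce to counting NAs per row. A base R sketch that reproduces the number of dropped rows (an illustration, not mde's implementation):

```r
aq <- datasets::airquality
na_per_row <- rowSums(is.na(aq))

# Count rule: drop rows with at least two NAs
kept_by_count <- aq[na_per_row < 2, ]

# Percent rule: drop rows where at least 16% of the values are NA
kept_by_percent <- aq[100 * na_per_row / ncol(aq) < 16, ]

nrow(aq) - nrow(kept_by_count)
nrow(aq) - nrow(kept_by_percent)
```

With six columns, a single NA already amounts to about 16.7% of a row, which is why the percent rule removes every incomplete row.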

For more details, please see the documentation of drop_row_if.

drop_na_at

This provides a simple way to drop missing values only at specific columns. It currently only returns those columns with their missing values removed. See usage below. Further details are given in the documentation. It is currently case sensitive.

head(drop_na_at(airquality,pattern_type = "starts_with","O"))
#>   Ozone
#> 1    41
#> 2    36
#> 3    12
#> 4    18
#> 5    28
#> 6    23

drop_all_na

This drops columns where all values are missing.

test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = c(4,rep(NA, 4))) 
drop_all_na(test2, grouping_cols="ID")
#> # A tibble: 3 x 2
#>   ID     Vals
#>   <chr> <dbl>
#> 1 A         4
#> 2 A        NA
#> 3 A        NA

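Alternatively, we can drop groups where all variables are entirely NA. In the example below, which uses the test data frame defined earlier, no group is entirely missing, so every row is kept.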

test2 <- data.frame(ID= c("A","A","B","A","B"), Vals = rep(NA, 5)) 

head(drop_all_na(test, grouping_cols = "ID"))
#> # A tibble: 4 x 3
#>   Subject   res ID   
#>   <fct>   <dbl> <fct>
#> 1 A          NA 1    
#> 2 A           1 1    
#> 3 B           2 2    
#> 4 B           3 2

dict_recode

If one would like to recode column values using a “dictionary”, dict_recode provides a simple way to do that. For example, if one would like to convert NA values in Solar.R to 520 and those in Ozone to 42, one simply calls the following:

head(dict_recode(airquality, use_func="recode_na_as",
                 patterns = c("solar", "ozone"),
                 pattern_type="starts_with", values = c(520,42)))
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    42     520 14.3   56     5   5
#> 6    28     520 14.9   66     5   6
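The dictionary idea amounts to pairing each column with its own replacement value. A base R sketch using a named vector (replacements is an illustrative name, and exact column names are used here instead of patterns):

```r
aq <- datasets::airquality

# Column name -> value used to replace NAs in that column
replacements <- c(Solar.R = 520, Ozone = 42)

for (col in names(replacements)) {
  aq[[col]][is.na(aq[[col]])] <- replacements[[col]]
}
head(aq)
```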

Please note that the mde project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

For further exploration, please run browseVignettes("mde").

To raise an issue, please do so at https://www.github.com/Nelson-Gon/mde/issues

Thank you, feedback is always welcome :)

mde's Issues

Control over warning messages

In recode_as_na and other functions that force coercion, there should be an option to turn off the coercion warning (not pretty, but someone might prefer not to see the warnings).

Exclude columns by RegEx match.

Description

For functions that use exclusion, it would be great to exclude via a regular expression or wildcard.

Similar Features

This is similar to pattern_type plus pattern, except that it would exclude rather than include.

Feature Details

In shinymde, there's an exclude columns option when summarising missingness. This is tedious if you have thousands of columns. A simple RegEx match would save much more time.

Proposed Implementation

Support regular expression entries in the exclude_columns argument.

Rethink na_summary output

The current output from na_summary looks messy especially with respect to naming.

Provide warning and convert factors to character in recode_as_na_str

Description

When using recode_as_na_str, the result will return factor levels.

Similar Features

This is similar to recode_as_na_str

Feature Details

I would like to have a warning that tells me that factors have been converted to character during the recoding process.

Proposed Implementation

if (is.factor(x)) warning("X has been converted to character"). Proceed as usual.
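The one-line proposal above could be fleshed out roughly as follows. This is a sketch of the suggestion only; to_character_with_warning is a hypothetical helper name and not part of mde.

```r
# Hypothetical helper: warn when a factor is converted, then carry on
to_character_with_warning <- function(x) {
  if (is.factor(x)) {
    warning("x has been converted to character")
    x <- as.character(x)
  }
  x
}

converted <- suppressWarnings(to_character_with_warning(factor(c("a", "b"))))
converted
```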

Support dates in missingness reports

Description

Extend functionality for objects of class POSIXct or Date.

Similar Features

get_na_*

Feature Details

Sufficiently described.

Proposed Implementation

Write functions such as get_na_means.POSIXct or "inherit" from other classes.

Exclude certain columns when dropping NAs

In drop_na_if, one should be able to apply the drop only to certain columns, i.e. subset the columns and drop at those.

This can probably be done using drop_na_at as the backend (behind the scenes).

Conditionally drop rows

Description

Drop rows with x% missing

Similar Features

This may be similar to column_based_recode or drop_na_if but for rows not columns.

Feature Details

Given a data.frame:


df <- data.frame(A=rep(NA,4), B=c(rep(NA,3),1))

I would like to keep rows that only have an x% of observed values.

Proposed Implementation

None yet.

Conditional NA recoding based on other columns

See this post for more.

Moving to `dplyr` 1.0.0.

Please see this issue

Support grouping in drop_na_if

Description

I would like to drop groups that have x% missing.

Similar Features

This is similar to drop_na_if which currently doesn't support grouping.

Feature Details

The provided detail is sufficient.

Proposed Implementation

Add a grouping_cols argument to drop_na_if
Calculate percent missingness and drop as required.

Support mean, mode, max, sd, sem, etc

In recoding values, I would like to be able to use common functions like mean, mode, sd, max, min or my own user defined equation.

Allow recoding based on rows

Dealing with row names in na_summary

Description

I would like to preserve or "reorder" row names when sorting in na_summary.

Similar Features

This is related to na_summary when sorted.

Feature Details

Given a data.frame object, running na_summary on this data works as expected, except that the returned rows keep their original row names after sorting. Example:

df <- data.frame(A=1:5,B=c(NA,NA,25,24,53), C=c(NA,1,2,3,4))

na_summary(df,sort_by="variable",descending=TRUE)                 
  variable missing complete percent_complete percent_missing
3        C       1        4               80              20
2        B       2        3               60              40
1        A       0        5              100               0

In the above result, we could instead renumber the rows so that 3 becomes 1 and 1 becomes 3, as per the new ordering.

Proposed Implementation

Change row.names to 1:nrow(df). This might be fine for numeric rownames but not non-numeric indices. Say we had some names, it might be problematic to change these to numeric indices. Perhaps add a warning/argument to ask users what they would like to do with the indices?

Recode character NA as another value

Description

Given a character NA such as "na", I would like to convert this to 420 for example.

Similar Features

This is similar to recode_as_na, recode_na_as and recode_as_na_str. The only difference is that we would take any value and convert it to some other value.

Feature Details

Admittedly, the above may slightly deviate from the package's name as that is a recode that may not necessarily be missing. However, for "na" it should be possible to do that.

Proposed Implementation

Currently, this can be achieved via a two-step process.

to_na <- recode_as_na(df, value = c("na", "null"), ...)
from_na <- recode_na_as(to_na, value = 0, ...)

This was inspired by the Stackoverflow question https://stackoverflow.com/q/70385221/10323798

Support multiple patterns

In recode_*_*, I would like to be able to use multiple patterns and/or use regex. For example, I would like to do something like:
recode_na_as(df,value=0,pattern_type="starts_with",pattern="this|col") as opposed to:
recode_na_as(df,value=0,pattern_type="starts_with",pattern="this") %>% recode_na_as(value=0,pattern_type="starts_with",pattern="col")

Conditionally Recode NA based on Percent Missingness

Description

I would like to recode_as_na if and only if the percentage of missing values meets some target criterion.

Similar Features

This is similar to drop_na_if except we're looking to keep not drop values.

Feature Details

I have described it in enough detail above.

Proposed Implementation

I currently have no proposed implementation.

Recode as NA if a value satisfies some condition

In recode_na_as and recode_as_na, is it possible to recode values that meet some condition, for example gteq, lteq, lt, etc.?

Topic Based Vignettes

Description

I would like to have vignettes that deal with a single topic.

Similar Features

This is similar to the general package vignette.

Feature Details

I would like to have different vignettes for the following topics

Exploring missingness
Recoding

Proposed Implementation

Extend vignettes as proposed above.

drop_na_if returns percentages data frame instead

When using drop_na_if, it returns the percentages/decimals data.frame object instead of the original data.frame with columns dropped.

This is clearly a bug and is not what one would expect to happen.

Support functions in recode_na_if

drop_na_at should return the entire dataset

drop_na_at currently drops NAs and returns only columns for which missing values have been dropped. This might be less useful if one would like to do the analysis at once.

The package does not focus on imputation (just exploration), so it would be great to keep the entire dataset intact. Stated differently, one should use drop_na_at only if such a drop leaves an equal number of rows (highly unlikely).

Improve grouped_sort in na_summary

na_counts and na_summary fail for logical vectors

When a data frame contains logical vectors, both na_counts and na_summary will fail with the error:

No applicable methods for na_counts applied to an object of class logical

To Reproduce

xdupe <- as.logical(c("T", "F", "F", "F", "T", "T", "F"))
ydupe <- as.logical(c("T", "F", "F", "F", "F", "T", "T"))
cities <- c("Knox", "Whiteville", "Madison", "York", "Paris", "Corona", "Bakersfield")
df <- data.frame(cities, xdupe, ydupe)
df$cities <- as.character(df$cities)
mde::na_summary(df)

Expected behavior

Expected to get a summary of missingness

Unexpected behavior

See above

System Details

R 4.3.1 mde 0.3.2

Support grouped recoding

Given a data.frame object, it is possible that one would like to recode_na_as a given value only for specific "individuals"/groups. Is there a way to support grouped replacements and, importantly, "subsetting" these groups?

Example:

some_data <- data.frame(ID=c("A1","A2","A3", "A4"), 
                        A=c(5,NA,0,8), B=c(10,0,0,1),
                        C=c(1,NA,NA,25))

For the above data, I would like to replace NAs with the value 233 or 420 only for IDs corresponding to A1 and A2

Support case (in)sensitivity in recoding

Issues with descending order

na_summary's descending argument ascends instead.

Dictionary Style Missing Data Recode

Description

I would like to replace NAs for several columns at once.

Similar Features

This could be done several times with recode_na_as but I would rather do it once.

Feature Details

Given a dataset:

df <- structure(list(A = c(3L, 4L, NA, 9L, NA), B = c(9L, NA, NA, 2L, 
12L)), class = "data.frame", row.names = c(NA, -5L))

Recode A as 0 and B as 2020. Do this at once instead of:

mde::recode_na_as(df, pattern_type = "ends_with", pattern="A",value=0)
mde::recode_na_as(df, pattern_type = "ends_with", pattern="B",value=2020)

Proposed Implementation

None yet.
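
One possible shape for such a feature, sketched in base R under the assumption that the "dictionary" is a named list mapping column names to replacement values. The name replacements is hypothetical and nothing here reflects an existing mde interface.

```r
df <- structure(list(A = c(3L, 4L, NA, 9L, NA), B = c(9L, NA, NA, 2L, 12L)),
                class = "data.frame", row.names = c(NA, -5L))

# Named list acting as the dictionary: column name -> replacement value.
replacements <- list(A = 0L, B = 2020L)

recoded <- df
for (col in names(replacements)) {
  # Overwrite only the missing entries of the named column.
  recoded[[col]][is.na(recoded[[col]])] <- replacements[[col]]
}
recoded
```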

Issues with recoding as NA

In using recode_as_na, character vectors are coerced to ~~factor levels~~ integer instead.

Example:

df <- data.frame(col_1 = c(45, 23, 89, "this", "and"),
                 col_2 = c(5,6,7,0,"this")) 
  col_1 col_2
1    45     5
2    23     6
3    89     7
4  this     0
5   and  this

Unexpected behavior


mde::recode_as_na(df,c("this","and","this"))
  col_1 col_2
1     2     2
2     1     3
3     3     4
4    NA     1
5    NA    NA

Compared to using characters:

df <- data.frame(col_1 = c(45, 23, 89, "this", "and"),
                col_2 = c(5,6,7,0,"this"), stringsAsFactors = FALSE)
 mde::recode_as_na(df,c("this","and","this"))
  col_1 col_2
1    45     5
2    23     6
3    89     7
4  <NA>     0
5  <NA>  <NA>
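
A likely explanation, assuming the columns in the first example were created as factors (the default for data.frame() before R 4.0.0): converting a factor to integer yields its internal level codes rather than the original values. The short base R sketch below reproduces the codes seen in col_1 above.

```r
# A factor stores integer codes that index into its sorted levels.
f <- factor(c("45", "23", "89", "this", "and"))
levels(f)

# as.integer() exposes the codes, not the original values.
codes <- as.integer(f)
codes

# as.character() recovers the original strings.
as.character(f)
```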

Recode as NA based on a partial match

Description

I would like to recode_as_na based on string matching.

Similar Features

This is similar to recode_as_na

Feature Details

Given a data.frame:

partial_match <- data.frame(A=c("Hi","match_me"), B=c(NA, "not_me"))

I would like to change all values that contain "me" to NA.

Proposed Implementation

None, yet.
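
As a starting point, the behaviour could be sketched with grepl() in base R. This is an illustration only, and it assumes character (not factor) columns and a literal, case-sensitive match.

```r
partial_match <- data.frame(A = c("Hi", "match_me"), B = c(NA, "not_me"),
                            stringsAsFactors = FALSE)

recoded <- partial_match
recoded[] <- lapply(recoded, function(col) {
  # grepl() returns FALSE for NA input, so existing NAs are left alone.
  col[grepl("me", col, fixed = TRUE)] <- NA
  col
})
recoded
```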

Expand tests

A large portion of the functions seem to have no tests at all, or only sketchy ones; recode_na_for, for instance. This will make future updates time-consuming. Perhaps use coverage?

Allow grouping in na_summary

In na_summary, I would like to get this summary by group.
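
To illustrate what a grouped summary might contain, the base R sketch below counts NAs per Month in airquality. It is not an mde call, and the choice of Ozone and Solar.R as the summarised columns is just for the example.

```r
# Logical matrix marking the missing cells of the columns of interest.
is_missing <- is.na(airquality[c("Ozone", "Solar.R")])

# Sum the TRUE values within each Month to get per-group NA counts.
by_month <- aggregate(is_missing, by = list(Month = airquality$Month), FUN = sum)
by_month
```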

na_summary should be able to specify columns to include or exclude

Description

The function na_summary() should be able to specify a subset of columns to include or exclude.

Similar Features

This would enhance the na_summary() function. The percent_missing() function already has this feature in its exclude_cols argument. The ... argument of the na_summary() function might offer the same feature, but there is no documented example of its usage.

Feature Details

The documented usage is
na_summary(df, grouping_cols = NULL, sort_by = NULL, descending = FALSE, ...)
where ... is Arguments to other functions.
But here ... fails. That is, the exclude_cols argument, which is an argument to the percent_missing() function, fails with an error when used in the na_summary() function. Here is an MWE:

> library(mde)
Welcome to mde. This is mde version 0.2.1.
 Please file issues and feedback at https://www.github.com/Nelson-Gon/mde/issues
Turn this message off using 'suppressPackageStartupMessages(library(mde))'
 Happy Exploration :)
> na_summary(airquality)
  variable missing complete percent_complete percent_missing
1      Day       0      153        100.00000        0.000000
2    Month       0      153        100.00000        0.000000
3    Ozone      37      116         75.81699       24.183007
4  Solar.R       7      146         95.42484        4.575163
5     Temp       0      153        100.00000        0.000000
6     Wind       0      153        100.00000        0.000000
> percent_missing(airquality, exclude_cols = c("Day","Temp"))
     Ozone  Solar.R Wind Month
1 24.18301 4.575163    0     0
> na_summary(airquality, exclude_cols = c("Day","Temp"))
Error in na_summary.data.frame(airquality, exclude_cols = c("Day", "Temp")) : 
  Binding of datasets failed. Please check using percent_missing and get_na_counts first

Proposed Implementation

I can think of at least four possible implementations:
A. Provide example documentation of how to use the argument exclude_cols of the percent_missing() function as the argument ... in the na_summary() function.
or
B. Add the argument exclude_cols or include_cols to the na_summary() function for variables you want to exclude or include.
or
C. Add a boolean argument for missingness versus completeness so that you will either get a completeness report (with the statistics columns complete and percent_complete) or a minimal missingness report (with the statistics columns missing and percent_missing).
or
D. Add the argument exclude_stats or include_stats to the na_summary() function for statistics you want to exclude or include.

I propose A. if the issue is only a matter of improving the documentation. Otherwise it is a feature request which can be implemented in several different ways such as B, C, or D.

Support tidy select in recode_na_as and recode_as_na

Currently, the subset_cols argument of recode_na_as and recode_as_na is not very versatile. Could support be extended to tidy select like features? (Think contains, starts_with, ends_with.)

`na_summary` fails for `data.frame` objects with logical columns

Describe the bug

na_summary fails for logical columns.

To Reproduce

test_df <- data.frame(A = 1:4, B = as.logical(c(1, NA, 2, 4)))
mde::na_summary(test_df)

Expected behavior

An output from na_summary.

Unexpected behavior

Error: Problem with summarise() input ..1.
i ..1 = across(everything(), ~get_na_means(.)).
x no applicable method for 'get_na_means' applied to an object of class "logical"

System Details

Developer version https://github.com/Nelson-Gon/mde/tree/221eeab7c0ba19dd66a3186bfeb344c3a4032cb9

Allow multiple values as replacements

Given a data.frame object with NAs, I would like to be able to use recode_na_as to replace these with my own defined values, with one twist: these values should be recycled across the data.frame. Currently, this is not possible and only the first value is used, with the warning:

In x[list] <- values :
number of items to replace is not a multiple of replacement length
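
The requested recycling can be sketched for a single vector in base R with rep_len(), which stretches the replacements to exactly the number of NAs. The values 233 and 420 are arbitrary, and applying this column by column to a data.frame is left as an assumption about how the feature would work.

```r
x <- c(1, NA, 3, NA, NA, 6)
values <- c(233, 420)

n_missing <- sum(is.na(x))
# rep_len() recycles the replacements to match the number of NAs,
# which avoids the length-mismatch warning.
x[is.na(x)] <- rep_len(values, n_missing)
x
```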

Drop rows based on missingness counts

Description

I would like to drop rows that contain missing values based on counts.

Similar Features

This is similar to drop_row_if except it would use counts not percents.

Feature Details

Given an example data set:

df <- data.frame(A=1:5, B=c(1,NA,NA,2, 3), C= c(1,NA,NA,2,3))

I would like to drop rows that have x number of NAs.

Proposed Implementation

Use drop_row_if but provide an argument for counts too.
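
For illustration, a count-based drop can be sketched in base R with rowSums(). The threshold name min_nas is hypothetical and does not correspond to an existing drop_row_if argument.

```r
df <- data.frame(A = 1:5, B = c(1, NA, NA, 2, 3), C = c(1, NA, NA, 2, 3))

# Hypothetical threshold: drop rows holding at least this many NAs.
min_nas <- 2

# Number of missing values in each row.
na_per_row <- rowSums(is.na(df))
kept <- df[na_per_row < min_nas, , drop = FALSE]
kept
```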

Support sorted output in na_summary

Fix CRAN

Patch for checks Nelson-Gon/manymodelr#17 and r-lib/testthat#1051
