eltoulemonde / dataPreparation
Data preparation for data science projects.
License: GNU General Public License v3.0
tcl/tk is only used to display the progress bar. We use RStudio Server, where tk is not installed (we never establish X connections to the server), so the package is unusable there. There are many packages, such as https://github.com/r-lib/progress, for creating progress bars in text mode.
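For reference, a text-mode bar with that package needs no X display at all (a sketch, assuming the progress package is installed):

```r
library(progress)

# a console progress bar that requires no X display / tcl-tk
pb <- progress_bar$new(total = 100)
for (i in 1:100) {
  pb$tick()  # advance one step; renders on the text console
}
```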
Thanks.
We have the function "fastScale", which uses "build_scales" to transform the data and standardise it.
Is there another function that scales the transformed variable back?
Since your code works by reference, most of your functions change the original object.
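For reference, since standardisation is affine, scaling back only needs the stored (mean, sd) pair; a minimal hand-rolled sketch, not using the package's API:

```r
library(data.table)

dt <- data.table(x = c(1, 5, 9))
m <- mean(dt$x)  # 5
s <- sd(dt$x)    # 4

dt[, x_scaled := (x - m) / s]     # forward: standardise
dt[, x_back := x_scaled * s + m]  # inverse: recover the original values
```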
Hi Emmanuel,
It could also be interesting to add a new set of functionality to your package to discretise variables. Most of the packages I know that include this feature are quite slow, so integrating it on top of data.table, like the rest of your package, could be a real advantage.
I remember seeing something similar in another package that in the end did not make it to CRAN; it could be a good reference to start with, or you could even reuse their code. That other package is here:
https://github.com/awstringer/modellingTools/blob/master/vignettes/modellingTools.md
As you will see, that project has been inactive for quite a long time... more than one year.
Some other features that package includes (descriptive statistics, etc.) could be incorporated into your package later.
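A minimal sketch of the kind of data.table-based binning meant here, using base cut() with equal-frequency breaks (all names hypothetical):

```r
library(data.table)

set.seed(1)
dt <- data.table(x = runif(1000, 0, 10))

# equal-frequency discretisation: quartile breakpoints, then cut into bins
breaks <- quantile(dt$x, probs = seq(0, 1, 0.25))
dt[, x_bin := cut(x, breaks = breaks, include.lowest = TRUE)]
dt[, .N, by = x_bin]  # roughly 250 rows per bin
```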
Thanks!
Carlos.
In the findAndTransformDates function of this package, I am not able to deal with dates in the format mm/dd/yy.
My file has date like
1/1/2012
2/1/2012
3/1/2012
4/1/2012
5/1/2012
6/1/2012
7/1/2012
8/1/2012
9/1/2012
which are getting read as
2001-01-12
2002-01-12
2003-01-12
......
So basically the year and month parts are getting exchanged here. Is there a way to tackle this already, or maybe I missed the solution?
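As a workaround until the package supports it, base R can parse this format when told explicitly (a sketch, not part of findAndTransformDates):

```r
dates <- c("1/1/2012", "2/1/2012", "9/1/2012")
parsed <- as.Date(dates, format = "%m/%d/%Y")  # force month/day/4-digit-year
parsed  # "2012-01-01" "2012-02-01" "2012-09-01"
```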
Hello Emmanuel-Lin,
How are you?
Long time without proposing new ideas to enhance your very good package.
This time what I would like to propose is a way to encode a categorical variable as numeric based on the values of a target column. The target column could be the one we use for binary classification, regression or multiclass problems.
The idea is reflected in this code:
library(data.table)
dt_train <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE),
target = sample( c(0,1), 15, replace = TRUE)
)
dt_test <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train
dt_train[ , te_x1 := mean(target), by=x1]
dt_train[ , te_x2 := mean(target), by=x2]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, te_x1)])
x2_vals <- unique(dt_train[, .(x2, te_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
The main issue is what you see for the dt_test set. Typically it doesn't have the target variable, so the idea is to use the encoding calculated on the dt_train set and apply it to the dt_test set. To apply this transformation, the function would therefore need the test set in the workspace.
After applying the encoding, dt_test would look something like this:
> dt_test
x2 x1 te_x1 te_x2
1: dd b 0.2500000 0.5000000
2: aa b 0.2500000 0.3333333
3: dd b 0.2500000 0.5000000
4: dd b 0.2500000 0.5000000
5: bb b 0.2500000 0.3333333
6: aa a 0.6666667 0.3333333
7: cc a 0.6666667 0.4000000
8: bb c 0.3750000 0.3333333
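One detail worth baking into such a function (my assumption, not part of the code above): test rows can carry levels never seen in training, which the inner merge would silently drop. A left join with a global-mean fallback keeps them:

```r
library(data.table)

set.seed(1)
dt_train <- data.table(x1 = sample(c("a", "b"), 20, replace = TRUE),
                       target = sample(c(0, 1), 20, replace = TRUE))
dt_test <- data.table(x1 = c("a", "b", "z"))  # "z" never seen in training

dt_train[, te_x1 := mean(target), by = x1]
x1_vals <- unique(dt_train[, .(x1, te_x1)])

# all.x = TRUE keeps test rows with unseen levels; fill them with the global mean
dt_test <- merge(dt_test, x1_vals, by = "x1", all.x = TRUE, sort = FALSE)
dt_test[is.na(te_x1), te_x1 := mean(dt_train$target)]
```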
Also, you can encode with several aggregation functions; I used mean().
This feature is also applied in many dataset transformations as a _feature engineering_ option. Very recently H2O included it in their framework:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/target-encoding.html
Thank you very much for your work on this package,
Carlos.
Hi Emmanuel-Lin,
These are the two enhancements I mentioned to you.
Basically the idea is to create two possible transformations for a categorical variable when the target variable is binary (1/0): the ratio P(1)/P(0), and the weight of evidence (WoE).
And this is reference code for each case.
For the ratio:
library(data.table)
set.seed(77777)
dt_train <- data.table(
x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
target = sample( c(0,1), 150, replace = TRUE)
)
dt_test <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train - Ratio P(1)/P(0)
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , re_x1 := ifelse(te_x1_0 == 0, 1, te_x1_1 / te_x1_0)]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , re_x2 := ifelse(te_x2_0 == 0, 1, te_x2_1 / te_x2_0)]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, re_x1)])
x2_vals <- unique(dt_train[, .(x2, re_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
And this for the WoE:
library(data.table)
set.seed(77777)
dt_train <- data.table(
x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
target = sample( c(0,1), 150, replace = TRUE)
)
dt_test <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train - WoE log(P(1)/P(0))
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , woe_x1 := ifelse(te_x1_0 == 0, 0, log(te_x1_1 / te_x1_0)) ]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , woe_x2 := ifelse(te_x2_0 == 0, 0, log(te_x2_1 / te_x2_0)) ]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, woe_x1)])
x2_vals <- unique(dt_train[, .(x2, woe_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
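One refinement worth considering (my assumption, not part of the proposal above): compute WoE from event/non-event proportions with add-one smoothing, so neither log(0) nor division by zero can occur for pure levels:

```r
library(data.table)

set.seed(42)
dt <- data.table(x = sample(c("a", "b"), 50, replace = TRUE),
                 target = sample(c(0, 1), 50, replace = TRUE))

tot1 <- sum(dt$target)   # total events
tot0 <- nrow(dt) - tot1  # total non-events

# per-level event / non-event counts, then smoothed WoE
dt[, `:=`(n1 = sum(target), n0 = sum(1 - target)), by = x]
dt[, woe_x := log(((n1 + 1) / (tot1 + 1)) / ((n0 + 1) / (tot0 + 1)))]
```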
Thanks!
Carlos.
Is there a reason why "one_hot_encoder" creates the columns in integer format instead of logical?
I need help finding an efficient way to convert the columns I created using "one_hot_encoder" into logical format without explicitly writing out the list of columns.
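A sketch of one way to do that, assuming the encoded columns are the only integer columns in the table (adjust the selector, e.g. by a name prefix, otherwise):

```r
library(data.table)

dt <- data.table(id = c("a", "b"), x_a = c(1L, 0L), x_b = c(0L, 1L))

# convert every integer column to logical in place, without listing names
cols <- names(dt)[sapply(dt, is.integer)]
dt[, (cols) := lapply(.SD, as.logical), .SDcols = cols]
```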
Thanks!
Hello Emmanuel-Lin,
I have just seen your very interesting package announced on CRAN.
Your package offers very interesting possibilities, and by using data.table it guarantees high processing speed with low RAM consumption.
Another typical transformation when dealing with datasets is reducing the cardinality of some variables (the number of levels they have). High cardinality makes it almost impossible to use some modelling techniques.
I have seen that one way to deal with that is to generate another column that represents the number of occurrences each level has in the variable.
This is a reproducible example:
library(data.table)
library(randNames)
dat_nam <- rand_names(100, gender = "male", nationality = "FR")
head(dat_nam)
> head(dat_nam)
# A tibble: 6 x 25
gender email dob registered phone cell nat name.title name.first name.last
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 male fabien.morel@example.com 1951-12-01 08:23:52 2005-09-24 05:52:23 01-89-87-09-90 06-63-54-70-63 FR mr fabien morel
2 male loan.henry@example.com 1987-08-31 16:42:02 2008-04-23 18:54:45 03-73-25-83-77 06-90-69-04-55 FR mr loan henry
3 male mathias.garnier@example.com 1983-06-17 03:06:36 2011-11-08 06:17:31 04-24-58-56-01 06-90-89-85-50 FR mr mathias garnier
4 male valentin.lucas@example.com 1955-01-02 20:43:15 2008-10-31 05:21:31 02-46-30-81-64 06-72-11-22-39 FR mr valentin lucas
5 male nathan.rey@example.com 1986-10-02 23:24:36 2010-10-13 02:35:53 05-03-40-18-08 06-44-25-63-64 FR mr nathan rey
6 male matthieu.petit@example.com 1984-10-22 03:10:07 2003-06-02 07:53:58 05-55-19-38-12 06-69-76-48-63 FR mr matthieu petit
# ... with 15 more variables: location.street <chr>, location.city <chr>, location.state <chr>, location.postcode <int>, login.username <chr>,
# login.password <chr>, login.salt <chr>, login.md5 <chr>, login.sha1 <chr>, login.sha256 <chr>, id.name <chr>, id.value <chr>,
# picture.large <chr>, picture.medium <chr>, picture.thumbnail <chr>
dat_nam <- as.data.table(dat_nam)
dat_red <- dat_nam[, .(name.first, name.last, location.city)]
head(dat_red)
name.first name.last location.city
1: fabien morel rueil-malmaison
2: loan henry orléans
3: mathias garnier montpellier
4: valentin lucas orléans
5: nathan rey aulnay-sous-bois
6: matthieu petit nantes
dat_red[, loc_num := .N, by=location.city]
head(dat_red)
name.first name.last location.city loc_num
1: fabien morel rueil-malmaison 2
2: loan henry orléans 3
3: mathias garnier montpellier 4
4: valentin lucas orléans 3
5: nathan rey aulnay-sous-bois 1
6: matthieu petit nantes 4
In the last data.table, a new column loc_num is added based on the number of occurrences of each location.city.
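The same frequency encoding, stripped of the randNames API dependency, is just one data.table line (a minimal sketch with made-up cities):

```r
library(data.table)

# frequency (count) encoding: number of occurrences of each level
dat_red <- data.table(city = c("paris", "lyon", "paris", "nantes", "paris"))
dat_red[, loc_num := .N, by = city]
dat_red$loc_num  # 3 1 3 1 3
```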
I think it could be a nice functionality to include in your package.
Thanks,
Carlos.
It looks like your fastHandleNa does not work when there are duplicate column names. It might be worth making NA handling possible for duplicate column names, or showing a warning message asking to change the column names. If it is only an issue on my computer, please let me know so that I can figure out what is happening. Thanks!
site <- c("A", "B", "C", "D", "E", "B")
D01 <- c(1, 0, 0, 0, 1, 0)
D01 <- c(1, 1, 0, 1, 1, 1)
D02 <- c(1, 0, 1, 0, 1, 0)
D02 <- c(0, 1, 0, 0, 1, 1)
D03 <- c(1, 1, 0, 0, 0, 1)
D03 <- c(0, 1, 0, 0, 1, 1)
D04 <- c(NA, NA, 0, 1, 1, NA)
D04 <- c(NA, 0, NA, 1, 1, 0)
df1 <- data.table(site, D01, D01, D02, D02, D03, D03, D04, D04, check.names = FALSE)
df2 <- data.table(site, D01, D01, D02, D02, D03, D03, D04, D04, check.names = TRUE)
df1 <- dataPreparation::fastHandleNa(df1, set_num = 0)
df2 <- dataPreparation::fastHandleNa(df2, set_num = 0)
How feasible would it be to create a function like findAndTransformDates() that handles badly structured datetime data?
Hello Emmanuel,
Another possible transformation would be to move from dates to integers.
The alternatives I am thinking of are:
It is an equivalent, for dates, of the high-cardinality transformation for strings.
These transformations would happen after unifying the formats, as you have already considered in the findAndTransformDates function.
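One way such a transformation could look (a sketch with hypothetical column names, not the package's API): encode each date as days elapsed since the earliest date in the column.

```r
library(data.table)

dt <- data.table(d = as.Date(c("2017-01-01", "2017-03-15", "2017-01-10")))
dt[, d_int := as.integer(d - min(d))]  # days since the earliest date
dt$d_int  # 0 73 9
```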
Thanks,
Carlos.
The function findAndTransformDates doesn't consider:
- single digits for day or month in a date, e.g. 1/12/2017 or 25/1/2017. Is it because of using the standard R date formats? This becomes a problem while reading dates from text files.
- the first letter of the month in lowercase, e.g. "01/jan/2017", "01/feb/2017", "01/mar/2017"; again, I think for the same reason.
Also, is there any reason why it reads the dates as character instead of as factors? If it read them in as factors, then aggregating by dates after reading would be easier, which is not so easy when they are read as character.
Hi
When I try to install the package on my machine (a CentOS server), it gives me this error:
"Warning in fun(libname, pkgname) : couldn't connect to display ":0""
and when I try to use the library to convert all numerical variables
DT <- findAndTransformNumerics(DT)
it gives me this error:
"Error in findAndTransformNumerics(DT1_first_launch_rewards) :
could not find function "findAndTransformNumerics"
On my Windows machine it works fine, so is it probably an error related to Linux?
Thanks
Paul
It would also be great to have outlier removal/imputation based on the columns:
# keep rows within mean ± 6 standard deviations of dv
trainData[, `:=`(mean_dv = mean(dv), sd_dv = sd(dv))]
trainData <- trainData[dv >= (mean_dv - 6 * sd_dv) & dv <= (mean_dv + 6 * sd_dv)]
trainData[, c('mean_dv', 'sd_dv') := NULL]
# drop rows outside the given quantile range for one column
removeOneOutliersFunc <- function(trainData, colName, outlierVec = c(0.0001, 0.9999)) {
  vec <- trainData[[colName]]
  values <- as.numeric(quantile(vec, outlierVec, na.rm = TRUE))
  trainData <- trainData[vec >= values[1] & vec <= values[2]]
  return(trainData)
}
How do we use one_hot_encoder on test data?
Let's say some new value was added in some column: it will add an extra column in the test dataset, which is not a problem. But let's say some values are missing in the test data: one_hot_encoder will then drop that column, and that might create a problem while scoring.
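A way this is often handled (my sketch, with hypothetical column names, not an existing dataPreparation API): align the test columns to the set seen at train time, adding missing ones as 0 and dropping unseen ones.

```r
library(data.table)

train_enc <- data.table(col_a = c(1L, 0L), col_b = c(0L, 1L))
test_enc  <- data.table(col_a = 1L, col_z = 1L)  # col_b missing, col_z unseen

# add train-time columns missing from test, filled with 0
missing_cols <- setdiff(names(train_enc), names(test_enc))
if (length(missing_cols) > 0) test_enc[, (missing_cols) := 0L]

# keep only (and order by) the train-time columns
test_enc <- test_enc[, names(train_enc), with = FALSE]
```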
A nice thing to build would be to make the package usable from Python.
Using the rpy2 interface (?), it would be nice to be able to call the package to prepare a data set and then go back to Python, using sklearn for example.
I don't know what form it should take: maybe a script or a tutorial.
A question would also be: what classes are accepted by pandas (I think of the R class factor, for example)?
If anyone has done something like that before, help would be more than appreciated :)
Hi,
It could be useful to have a vignette PDF available here on GitHub, even though you cannot include it in your package due to CRAN restrictions.
The old vignette you had was useful for understanding the different functionalities your package provides.
Thanks!
Carlos.