datapreparation's People

Contributors

earino, eltoulemonde, xavierfontaine

datapreparation's Issues

Scale Back

We have the function "fastScale", which uses "build_scales" to standardise the data.
Is there another function that scales the transformed variables back?

Since your code is written with references, most of your functions change the original object.
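
A minimal un-scaling sketch, assuming `build_scales` returns, for each column, a list with `mean` and `sd` elements (this `unscale` helper is not part of the package, just an illustration of the requested feature):

```r
library(data.table)

# invert the standardisation x' = (x - mean) / sd  =>  x = x' * sd + mean
unscale <- function(data_set, scales) {
  for (col in names(scales)) {
    set(data_set, j = col,
        value = data_set[[col]] * scales[[col]]$sd + scales[[col]]$mean)
  }
  data_set
}

dt <- data.table(x = c(10, 20, 30))
scales <- list(x = list(mean = mean(dt$x), sd = sd(dt$x)))
dt[, x := (x - scales$x$mean) / scales$x$sd]  # standardise, as fastScale does
dt <- unscale(dt, scales)                     # back to 10, 20, 30
```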

[New Feature]: Binning.... (Variable Discretization).

Hi Emmanuel,

It could also be interesting to add a new set of functionality to your package to discretize variables. Most of the packages I know that include this feature are quite slow, so integrating data.table as your package does could be a real advantage.

I remember seeing something similar in another package that, in the end, did not progress to CRAN; it could be a good reference to start with, or you could even reuse their code. This other package is here:

https://github.com/awstringer/modellingTools/blob/master/vignettes/modellingTools.md

As you will see, that project has been inactive for quite a long time... more than one year.

Also, some other features that this package includes (descriptive statistics, etc.) could be incorporated into your package later.
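
For illustration, an equal-width binning sketch with data.table and base R's cut() (a toy example of the proposed feature, not a full implementation):

```r
library(data.table)

dt <- data.table(x = c(1, 3, 7, 12, 18, 25))
dt[, x_bin := cut(x, breaks = 3)]  # 3 equal-width bins, returned as a factor
dt
```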

Thanks!
Carlos.

Not able to read mm/dd/yy dates correctly

With the findAndTransformDates function of this package, I am not able to deal with dates in the format
mm/dd/yy.

My file has date like
1/1/2012
2/1/2012
3/1/2012
4/1/2012
5/1/2012
6/1/2012
7/1/2012
8/1/2012
9/1/2012
which are getting read as
2001-01-12
2002-01-12
2003-01-12
......

So basically the year and month parts are getting swapped here. Is there already a way to tackle this, or maybe I missed the solution?
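
As a workaround, base R can parse these dates with an explicit format, since m/d/Y is otherwise ambiguous with d/m/Y (the format string here is an assumption about the data):

```r
dates_chr <- c("1/1/2012", "2/1/2012", "9/1/2012")
as.Date(dates_chr, format = "%m/%d/%Y")  # "2012-01-01" "2012-02-01" "2012-09-01"
```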

[New_Feature]: Target Encoding.

Hello Emmanuel-Lin,

How are you?
It has been a long time since I last proposed new ideas to enhance your very good package.
This time, I would like to propose a way to encode a categorical variable as numeric based on the values of a target column. The target column could be the one used for binary classification, regression or multiclass problems.

The idea is reflected in this code:

library(data.table)

dt_train <- data.table(
                  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
                  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE),
                  target = sample( c(0,1), 15, replace = TRUE)
)

dt_test <- data.table(
                  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
                  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train
dt_train[ , te_x1 := mean(target), by=x1]
dt_train[ , te_x2 := mean(target), by=x2]

dt_train

# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, te_x1)])
x2_vals <- unique(dt_train[, .(x2, te_x2)])

dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test

The main issue is what you see for the dt_test set. Typically it does not have the target variable, so the idea is to use the encoding calculated on the dt_train set and apply it to dt_test. To apply this transformation, the function should therefore require the test set in the workspace.

After applying the encoding the dt_test would be something like this:

> dt_test
    x2 x1     te_x1     te_x2
 1: dd  b 0.2500000 0.5000000
 2: aa  b 0.2500000 0.3333333
 3: dd  b 0.2500000 0.5000000
 4: dd  b 0.2500000 0.5000000
 5: bb  b 0.2500000 0.3333333
 6: aa  a 0.6666667 0.3333333
 7: cc  a 0.6666667 0.4000000
 8: bb  c 0.3750000 0.3333333

Also, you can encode with several aggregation functions; here I used mean().

This feature is also applied in many dataset transformations as a _feature engineering_ option. Very recently, H2O included it in their framework:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/target-encoding.html

Thank you very much for you work with your package,
Carlos.

[New_Feature]: Target Encoding - Binary - Ratio + WoE

Hi Emmanuel-Lin,

These are the two enhancements I announced to you.
Basically, the idea is to create two possible transformations for a categorical variable when the target variable is binary (1/0):

  1. A ratio encoding, P(1)/P(0).
  2. The WoE (Weight of Evidence), which is ln(P(1)/P(0)).

And this is a reference code for each case.

For the ratio:

library(data.table)
set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample( c(0,1), 150, replace = TRUE)
)

dt_test <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train - Ratio P(1)/P(0)
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , re_x1   := ifelse(te_x1_0 == 0, 1, te_x1_1 / te_x1_0)]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , re_x2   := ifelse(te_x2_0 == 0, 1, te_x2_1 / te_x2_0)]
dt_train

# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, re_x1)])
x2_vals <- unique(dt_train[, .(x2, re_x2)])

dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test

And this for the WoE:

library(data.table)

set.seed(77777)
dt_train <- data.table(
  x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
  target = sample( c(0,1), 150, replace = TRUE)
)

dt_test <- data.table(
  x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
  x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)

# Target Encoding Train - WoE ln(P(1)/P(0))
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , woe_x1  := ifelse(te_x1_0 == 0, 0, log(te_x1_1 / te_x1_0)) ]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , woe_x2  := ifelse(te_x2_0 == 0, 0, log(te_x2_1 / te_x2_0)) ]
dt_train

# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, woe_x1)])
x2_vals <- unique(dt_train[, .(x2, woe_x2)])

dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test

Thanks!
Carlos.

About One hot encoder

Is there a reason why "one_hot_encoder" creates the columns in integer format instead of logical?

I need help finding an efficient way to convert the columns I created using "one_hot_encoder" into logical format without explicitly writing out the list of columns.
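
One possible sketch (not package code; the 0/1-detection rule below is an assumption, so adjust it to match the names one_hot_encoder actually produces):

```r
library(data.table)

dt <- data.table(id = 1:3, col.a = c(1L, 0L, 0L), col.b = c(0L, 1L, 1L))

# find integer columns that only contain 0/1, then convert them in place
onehot_cols <- names(dt)[sapply(dt, function(x)
  is.integer(x) && all(x %in% c(0L, 1L)))]
for (col in onehot_cols) set(dt, j = col, value = as.logical(dt[[col]]))

str(dt)  # col.a and col.b are now logical; id is untouched
```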

Thanks!

[New_Feature]: Count occurences in high cardinality variable.

Hello Emmanuel-Lin,

I have just seen your very interesting package announced on CRAN.
Your package offers very interesting possibilities, and by using data.table it guarantees high processing speed with low RAM consumption.

Another typical transformation when dealing with datasets is reducing the cardinality of some variables (the number of levels they have). High cardinality makes it almost impossible to use some modeling techniques.

One way I have seen to deal with this is to generate another column that represents the number of occurrences each level has in the variable.

This is a reproducible example:

library(data.table)
library(randNames)

dat_nam <- rand_names(100, gender = "male", nationality = "FR")
> head(dat_nam)
# A tibble: 6 x 25
  gender                       email                 dob          registered          phone           cell   nat name.title name.first name.last
   <chr>                       <chr>               <chr>               <chr>          <chr>          <chr> <chr>      <chr>      <chr>     <chr>
1   male    fabien.morel@example.com 1951-12-01 08:23:52 2005-09-24 05:52:23 01-89-87-09-90 06-63-54-70-63    FR         mr     fabien     morel
2   male      loan.henry@example.com 1987-08-31 16:42:02 2008-04-23 18:54:45 03-73-25-83-77 06-90-69-04-55    FR         mr       loan     henry
3   male mathias.garnier@example.com 1983-06-17 03:06:36 2011-11-08 06:17:31 04-24-58-56-01 06-90-89-85-50    FR         mr    mathias   garnier
4   male  valentin.lucas@example.com 1955-01-02 20:43:15 2008-10-31 05:21:31 02-46-30-81-64 06-72-11-22-39    FR         mr   valentin     lucas
5   male      nathan.rey@example.com 1986-10-02 23:24:36 2010-10-13 02:35:53 05-03-40-18-08 06-44-25-63-64    FR         mr     nathan       rey
6   male  matthieu.petit@example.com 1984-10-22 03:10:07 2003-06-02 07:53:58 05-55-19-38-12 06-69-76-48-63    FR         mr   matthieu     petit
# ... with 15 more variables: location.street <chr>, location.city <chr>, location.state <chr>, location.postcode <int>, login.username <chr>,
#   login.password <chr>, login.salt <chr>, login.md5 <chr>, login.sha1 <chr>, login.sha256 <chr>, id.name <chr>, id.value <chr>,
#   picture.large <chr>, picture.medium <chr>, picture.thumbnail <chr>
dat_nam <- as.data.table(dat_nam)
dat_red <- dat_nam[, .(name.first, name.last, location.city)]
head(dat_red)
  name.first name.last    location.city
1:     fabien     morel  rueil-malmaison
2:       loan     henry          orléans
3:    mathias   garnier      montpellier
4:   valentin     lucas          orléans
5:     nathan       rey aulnay-sous-bois
6:   matthieu     petit           nantes
dat_red[, loc_num := .N, by=location.city]
head(dat_red)
   name.first name.last    location.city loc_num
1:     fabien     morel  rueil-malmaison       2
2:       loan     henry          orléans       3
3:    mathias   garnier      montpellier       4
4:   valentin     lucas          orléans       3
5:     nathan       rey aulnay-sous-bois       1
6:   matthieu     petit           nantes       4

In the last data.table, a new column loc_num is added based on the number of occurrences of each location.city.

I think it could be a nice functionality to include in your package.

Thanks,
Carlos.

Using fastHandleNA with duplicated column names

It looks like your fastHandleNa does not work when there are duplicate column names. It might be worth making NA handling work with duplicate column names, or showing a warning message asking to change the column names. If it is only an issue on my computer, please let me know so that I can figure out what is happening. Thanks!

library(data.table)

site <- c("A", "B", "C", "D", "E", "B")
# note: assigning D01 twice to one variable would overwrite the first vector,
# so the duplicated columns are passed as named arguments instead
df1 <- data.table(site,
                  D01 = c(1, 0, 0, 0, 1, 0),    D01 = c(1, 1, 0, 1, 1, 1),
                  D02 = c(1, 0, 1, 0, 1, 0),    D02 = c(0, 1, 0, 0, 1, 1),
                  D03 = c(1, 1, 0, 0, 0, 1),    D03 = c(0, 1, 0, 0, 1, 1),
                  D04 = c(NA, NA, 0, 1, 1, NA), D04 = c(NA, 0, NA, 1, 1, 0),
                  check.names = FALSE)
df2 <- data.table(site,
                  D01 = c(1, 0, 0, 0, 1, 0),    D01 = c(1, 1, 0, 1, 1, 1),
                  D02 = c(1, 0, 1, 0, 1, 0),    D02 = c(0, 1, 0, 0, 1, 1),
                  D03 = c(1, 1, 0, 0, 0, 1),    D03 = c(0, 1, 0, 0, 1, 1),
                  D04 = c(NA, NA, 0, 1, 1, NA), D04 = c(NA, 0, NA, 1, 1, 0),
                  check.names = TRUE)

df1 <- dataPreparation::fastHandleNa(df1, set_num = 0)
df2 <- dataPreparation::fastHandleNa(df2, set_num = 0)

[New Feature]: Transform dates in numeric.

Hello Emmanuel,

Another possible transformation would be to convert dates into integers.
The alternatives I am thinking of are:

  • Count the number of days between a fixed date in the past and each date.
  • Count the number of days between each date in the data.frame and today's date.

This is the equivalent, for dates, of the high-cardinality transformation for strings.
These transformations would happen after unifying the formats, as you have already considered in the findAndTransformDates function.
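
Both alternatives can be sketched in base R (the reference date and input values are illustrative):

```r
dates <- as.Date(c("2012-01-01", "2015-06-15", "2017-03-10"))

# days since a fixed date in the past
days_since <- as.numeric(dates - as.Date("2010-01-01"))

# days between each date and today's date
days_to_today <- as.numeric(Sys.Date() - dates)
```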

Thanks,
Carlos.

Enhancement issues while using findAndTransformDates

The findAndTransformDates function doesn't consider:

  1. Single digits for the day or month in a date, e.g. 1/12/2017 or 25/1/2017. Is it because of using the standard R date formats? This becomes a problem while reading dates from text files.

  2. The first letter of the month in lowercase, e.g. "01/jan/2017", "01/feb/2017", "01/mar/2017"; again, I think for the same reason.

Also, is there any reason why it reads the dates as character instead of as factors? If they are read as factors, aggregating by dates afterwards becomes easier, which is not so easy if they are read as character.
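
For reference, base R's date parsing does accept single-digit day and month fields when an explicit format is given, and lowercase month abbreviations usually parse as well (this is locale-dependent), so a pre-parsing workaround is possible:

```r
# single-digit day/month parse fine with an explicit format
as.Date(c("1/12/2017", "25/1/2017"), format = "%d/%m/%Y")

# lowercase month abbreviations are matched in most locales
as.Date("01/jan/2017", format = "%d/%b/%Y")
```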

Problem installing package - Linux

Hi

When I try to install the package on my machine (a CentOS server), it gives me this error:
"Warning in fun(libname, pkgname) : couldn't connect to display ":0""

and when I try to use the library to convert all numerical variables
DT <- findAndTransformNumerics(DT)
it gives me this error:
"Error in findAndTransformNumerics(DT1_first_launch_rewards) :
could not find function "findAndTransformNumerics""

On my Windows machine it works fine, so it's probably an error related to Linux?

Thanks
Paul

Suggestion

It would also be great to have outlier removal/imputation based on the columns:

  • 6 σ:

library(data.table)
# keep only rows within 6 standard deviations of the mean of dv
trainData[, `:=`(mean_dv = mean(dv), sd_dv = sd(dv))]
trainData <- trainData[dv >= (mean_dv - 6*sd_dv) & dv <= (mean_dv + 6*sd_dv)]
trainData[, c('mean_dv', 'sd_dv') := NULL]

  • percentile:

# keep only rows within the given quantiles of one column
removeOneOutliersFunc <- function(trainData, colName, outlierVec = c(0.0001, 0.9999)){
  vec       <- trainData[[colName]]
  values    <- as.numeric(quantile(vec, outlierVec, na.rm = TRUE))
  trainData <- trainData[vec >= values[1] & vec <= values[2]]
  return(trainData)
}

one_hot_encoder in Test

How do we use one_hot_encoder on test data?
Let's say some new value got added in some column:
it will add an extra column in the test dataset, which is not a problem.

But let's say some values are missing in the test data:
one_hot_encoder will then drop that column,
and that might create a problem while scoring.
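
A minimal alignment sketch (the column names are illustrative): add the training dummy columns that are missing from the test set as 0, and keep exactly the training columns, in the same order:

```r
library(data.table)

train_enc <- data.table(x.a = c(1, 0), x.b = c(0, 1), x.c = c(0, 0))
test_enc  <- data.table(x.a = c(1, 1), x.d = c(0, 1))  # x.b, x.c missing; x.d is new

# add missing dummy columns as 0, then select the training columns in order
missing_cols <- setdiff(names(train_enc), names(test_enc))
for (col in missing_cols) set(test_enc, j = col, value = 0)
test_enc <- test_enc[, names(train_enc), with = FALSE]
test_enc
```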

Python interface

A nice thing to build would be to make the package usable from Python.

Using the rpy2 interface (?), it would be nice to be able to call the package to prepare a data set and then go back to Python, using sklearn for example.

I don't know what form it should take: maybe a script or a tutorial.
Another question would be: which R classes are accepted by pandas (I am thinking of the R factor class, for example)?

If anyone did something like that before, help would be more than appreciated :)

Vignette availability....

Hi,

It could be useful to have a vignette PDF available here on GitHub, even though you cannot include it in your package due to CRAN restrictions.
The old vignette you had was useful for understanding the different functionalities your package provides.

Thanks!
Carlos.
