eltoulemonde / dataPreparation
Data preparation for data science projects.
License: GNU General Public License v3.0
tcl/tk is only used to display the progress bar. We use RStudio Server, where tk is not installed (we never establish X connections to the server), so the package is unusable there. There are many packages, such as https://github.com/r-lib/progress, for creating progress bars in text mode.
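For reference, a text-mode bar with that package needs no X display at all (a sketch, assuming the progress package is installed):

```r
library(progress)

# a console progress bar that requires no X display / tcl-tk
pb <- progress_bar$new(total = 100)
for (i in 1:100) {
  pb$tick()  # advance one step; renders on the text console
}
```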
Thanks.
We have the function "fastScale", which uses "build_scales" to transform the data and standardise it.
Is there another function that scales the transformed variable back?
Since your code works by reference, most of your functions change the original object.
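For reference, since standardisation is affine, scaling back only needs the stored (mean, sd) pair; a minimal hand-rolled sketch, not using the package's API:

```r
library(data.table)

dt <- data.table(x = c(1, 5, 9))
m <- mean(dt$x)  # 5
s <- sd(dt$x)    # 4

dt[, x_scaled := (x - m) / s]     # forward: standardise
dt[, x_back := x_scaled * s + m]  # inverse: recover the original values
```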
Hi Emmanuel,
It could also be interesting to add a new set of functionality to your package to discretise variables. Most of the packages I know that include this feature are quite slow, so integrating it on top of data.table, like the rest of your package, could be a real advantage.
I remember seeing something similar in another package that in the end did not make it to CRAN; it could be a good reference to start with, or you could even reuse their code. That other package is here:
https://github.com/awstringer/modellingTools/blob/master/vignettes/modellingTools.md
As you will see, that project has been inactive for quite a long time... more than one year.
Some other features that package includes (descriptive statistics, etc.) could be incorporated into your package later.
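A minimal sketch of the kind of data.table-based binning meant here, using base cut() with equal-frequency breaks (all names hypothetical):

```r
library(data.table)

set.seed(1)
dt <- data.table(x = runif(1000, 0, 10))

# equal-frequency discretisation: quartile breakpoints, then cut into bins
breaks <- quantile(dt$x, probs = seq(0, 1, 0.25))
dt[, x_bin := cut(x, breaks = breaks, include.lowest = TRUE)]
dt[, .N, by = x_bin]  # roughly 250 rows per bin
```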
Thanks!
Carlos.
In the findAndTransformDates function of this package, I am not able to deal with dates in the format mm/dd/yy.
My file has date like
1/1/2012
2/1/2012
3/1/2012
4/1/2012
5/1/2012
6/1/2012
7/1/2012
8/1/2012
9/1/2012
which are getting read as
2001-01-12
2002-01-12
2003-01-12
......
So basically the year and month parts are getting exchanged here. Is there a way to tackle this already, or maybe I missed the solution?
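As a workaround until the package supports it, base R can parse this format when told explicitly (a sketch, not part of findAndTransformDates):

```r
dates <- c("1/1/2012", "2/1/2012", "9/1/2012")
parsed <- as.Date(dates, format = "%m/%d/%Y")  # force month/day/4-digit-year
parsed  # "2012-01-01" "2012-02-01" "2012-09-01"
```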
Hello Emmanuel-Lin,
How are you?
Long time without proposing new ideas to enhance your very good package.
This time what I would like to propose is a way to encode a categorical variable as numeric based on the values of a target column. The target column could be the one we use for binary classification, regression or multiclass problems.
The idea is reflected in this code:
library(data.table)
dt_train <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE),
target = sample( c(0,1), 15, replace = TRUE)
)
dt_test <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train
dt_train[ , te_x1 := mean(target), by=x1]
dt_train[ , te_x2 := mean(target), by=x2]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, te_x1)])
x2_vals <- unique(dt_train[, .(x2, te_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
The main issue is what you see for the dt_test set. Typically it doesn't have the target variable, so the idea is to use the encoding calculated on the dt_train set and apply it to the dt_test set. To apply this transformation, the function would therefore need the test set in the workspace.
After applying the encoding, dt_test would look something like this:
> dt_test
x2 x1 te_x1 te_x2
1: dd b 0.2500000 0.5000000
2: aa b 0.2500000 0.3333333
3: dd b 0.2500000 0.5000000
4: dd b 0.2500000 0.5000000
5: bb b 0.2500000 0.3333333
6: aa a 0.6666667 0.3333333
7: cc a 0.6666667 0.4000000
8: bb c 0.3750000 0.3333333
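One detail worth baking into such a function (my assumption, not part of the code above): test rows can carry levels never seen in training, which the inner merge would silently drop. A left join with a global-mean fallback keeps them:

```r
library(data.table)

set.seed(1)
dt_train <- data.table(x1 = sample(c("a", "b"), 20, replace = TRUE),
                       target = sample(c(0, 1), 20, replace = TRUE))
dt_test <- data.table(x1 = c("a", "b", "z"))  # "z" never seen in training

dt_train[, te_x1 := mean(target), by = x1]
x1_vals <- unique(dt_train[, .(x1, te_x1)])

# all.x = TRUE keeps test rows with unseen levels; fill them with the global mean
dt_test <- merge(dt_test, x1_vals, by = "x1", all.x = TRUE, sort = FALSE)
dt_test[is.na(te_x1), te_x1 := mean(dt_train$target)]
```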
Also, you can encode with several aggregation functions; I used mean().
This feature is also applied in many dataset transformations as a _feature engineering_ option. Very recently H2O included it in their framework:
http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-munging/target-encoding.html
Thank you very much for your work on this package,
Carlos.
Hi Emmanuel-Lin,
These are the two enhancements I mentioned to you.
Basically the idea is to create two possible transformations for a categorical variable when the target variable is binary (1/0): the ratio P(1)/P(0), and the weight of evidence (WoE).
And this is reference code for each case.
For the ratio:
library(data.table)
set.seed(77777)
dt_train <- data.table(
x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
target = sample( c(0,1), 150, replace = TRUE)
)
dt_test <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train - Ratio P(1)/P(0)
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , re_x1 := ifelse(te_x1_0 == 0, 1, te_x1_1 / te_x1_0)]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , re_x2 := ifelse(te_x2_0 == 0, 1, te_x2_1 / te_x2_0)]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, re_x1)])
x2_vals <- unique(dt_train[, .(x2, re_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
And this for the WoE:
library(data.table)
set.seed(77777)
dt_train <- data.table(
x1 = sample(c('a','b', 'c'), 150, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 150, replace = TRUE),
target = sample( c(0,1), 150, replace = TRUE)
)
dt_test <- data.table(
x1 = sample(c('a','b', 'c'), 15, replace = TRUE),
x2 = sample(c('aa', 'bb', 'cc', 'dd'), 15, replace = TRUE)
)
# Target Encoding Train - WoE log(P(1)/P(0))
dt_train[ , te_x1_1 := mean(target), by = c('x1') ]
dt_train[ , te_x1_0 := 1-te_x1_1]
dt_train[ , woe_x1 := ifelse(te_x1_0 == 0, 0, log(te_x1_1 / te_x1_0)) ]
dt_train[ , te_x2_1 := mean(target), by = c('x2') ]
dt_train[ , te_x2_0 := 1-te_x2_1]
dt_train[ , woe_x2 := ifelse(te_x2_0 == 0, 0, log(te_x2_1 / te_x2_0)) ]
dt_train
# Target Encoding on Test
x1_vals <- unique(dt_train[, .(x1, woe_x1)])
x2_vals <- unique(dt_train[, .(x2, woe_x2)])
dt_test <- merge(dt_test, x1_vals, by.x = 'x1', by.y = 'x1', sort = FALSE)
dt_test <- merge(dt_test, x2_vals, by.x = 'x2', by.y = 'x2', sort = FALSE)
dt_test
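One refinement worth considering (my assumption, not part of the proposal above): compute WoE from event/non-event proportions with add-one smoothing, so neither log(0) nor division by zero can occur for pure levels:

```r
library(data.table)

set.seed(42)
dt <- data.table(x = sample(c("a", "b"), 50, replace = TRUE),
                 target = sample(c(0, 1), 50, replace = TRUE))

tot1 <- sum(dt$target)   # total events
tot0 <- nrow(dt) - tot1  # total non-events

# per-level event / non-event counts, then smoothed WoE
dt[, `:=`(n1 = sum(target), n0 = sum(1 - target)), by = x]
dt[, woe_x := log(((n1 + 1) / (tot1 + 1)) / ((n0 + 1) / (tot0 + 1)))]
```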
Thanks!
Carlos.
Is there a reason why "one_hot_encoder" creates the columns in integer format instead of logical?
I need help finding an efficient way to convert the columns I created using "one_hot_encoder" into logical format without explicitly writing out the list of columns.
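A sketch of one way to do that, assuming the encoded columns are the only integer columns in the table (adjust the selector, e.g. by a name prefix, otherwise):

```r
library(data.table)

dt <- data.table(id = c("a", "b"), x_a = c(1L, 0L), x_b = c(0L, 1L))

# convert every integer column to logical in place, without listing names
cols <- names(dt)[sapply(dt, is.integer)]
dt[, (cols) := lapply(.SD, as.logical), .SDcols = cols]
```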
Thanks!
Hello Emmanuel-Lin,
I have just seen your very interesting package announced on CRAN.
Your package offers very interesting possibilities, and by using data.table it guarantees high processing speed with low RAM consumption.
Another typical transformation when dealing with datasets is reducing the cardinality of some variables (the number of levels they have). High cardinality makes it almost impossible to use some modelling techniques.
I have seen that one way to deal with that is to generate another column that represents the number of occurrences each level has in the variable.
This is a reproducible example:
library(data.table)
library(randNames)
dat_nam <- rand_names(100, gender = "male", nationality = "FR")
head(dat_nam)
> head(dat_nam)
# A tibble: 6 x 25
gender email dob registered phone cell nat name.title name.first name.last
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 male fabien.morel@example.com 1951-12-01 08:23:52 2005-09-24 05:52:23 01-89-87-09-90 06-63-54-70-63 FR mr fabien morel
2 male loan.henry@example.com 1987-08-31 16:42:02 2008-04-23 18:54:45 03-73-25-83-77 06-90-69-04-55 FR mr loan henry
3 male mathias.garnier@example.com 1983-06-17 03:06:36 2011-11-08 06:17:31 04-24-58-56-01 06-90-89-85-50 FR mr mathias garnier
4 male valentin.lucas@example.com 1955-01-02 20:43:15 2008-10-31 05:21:31 02-46-30-81-64 06-72-11-22-39 FR mr valentin lucas
5 male nathan.rey@example.com 1986-10-02 23:24:36 2010-10-13 02:35:53 05-03-40-18-08 06-44-25-63-64 FR mr nathan rey
6 male matthieu.petit@example.com 1984-10-22 03:10:07 2003-06-02 07:53:58 05-55-19-38-12 06-69-76-48-63 FR mr matthieu petit
# ... with 15 more variables: location.street <chr>, location.city <chr>, location.state <chr>, location.postcode <int>, login.username <chr>,
# login.password <chr>, login.salt <chr>, login.md5 <chr>, login.sha1 <chr>, login.sha256 <chr>, id.name <chr>, id.value <chr>,
# picture.large <chr>, picture.medium <chr>, picture.thumbnail <chr>
dat_nam <- as.data.table(dat_nam)
dat_red <- dat_nam[, .(name.first, name.last, location.city)]
head(dat_red)
name.first name.last location.city
1: fabien morel rueil-malmaison
2: loan henry orléans
3: mathias garnier montpellier
4: valentin lucas orléans
5: nathan rey aulnay-sous-bois
6: matthieu petit nantes
dat_red[, loc_num := .N, by=location.city]
head(dat_red)
name.first name.last location.city loc_num
1: fabien morel rueil-malmaison 2
2: loan henry orléans 3
3: mathias garnier montpellier 4
4: valentin lucas orléans 3
5: nathan rey aulnay-sous-bois 1
6: matthieu petit nantes 4
In the last data.table, a new column loc_num is added based on the number of occurrences of each location.city.
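The same frequency encoding, stripped of the randNames API dependency, is just one data.table line (a minimal sketch with made-up cities):

```r
library(data.table)

# frequency (count) encoding: number of occurrences of each level
dat_red <- data.table(city = c("paris", "lyon", "paris", "nantes", "paris"))
dat_red[, loc_num := .N, by = city]
dat_red$loc_num  # 3 1 3 1 3
```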
I think it could be a nice functionality to include in your package.
Thanks,
Carlos.
It looks like your fastHandleNa does not work when there are duplicate column names. It might be worth making NA handling possible for duplicate column names, or showing a warning message asking to change the column names. If it is only an issue on my computer, please let me know so that I can figure out what is happening. Thanks!
site <- c("A", "B", "C", "D", "E", "B")
D01 <- c(1, 0, 0, 0, 1, 0)
D01 <- c(1, 1, 0, 1, 1, 1)
D02 <- c(1, 0, 1, 0, 1, 0)
D02 <- c(0, 1, 0, 0, 1, 1)
D03 <- c(1, 1, 0, 0, 0, 1)
D03 <- c(0, 1, 0, 0, 1, 1)
D04 <- c(NA, NA, 0, 1, 1, NA)
D04 <- c(NA, 0, NA, 1, 1, 0)
df1 <- data.table(site, D01, D01, D02, D02, D03, D03, D04, D04, check.names = FALSE)
df2 <- data.table(site, D01, D01, D02, D02, D03, D03, D04, D04, check.names = TRUE)
df1 <- dataPreparation::fastHandleNa(df1, set_num = 0)
df2 <- dataPreparation::fastHandleNa(df2, set_num = 0)
How feasible would it be to create a function like findAndTransformDates() that handles badly structured datetime data?
Hello Emmanuel,
Another possible transformation would be to move from dates to integers.
The alternatives I am thinking of are:
It is an equivalent, for dates, of the high-cardinality transformation for strings.
These transformations would happen after unifying the formats, as you have already considered in the findAndTransformDates function.
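One way such a transformation could look (a sketch with hypothetical column names, not the package's API): encode each date as days elapsed since the earliest date in the column.

```r
library(data.table)

dt <- data.table(d = as.Date(c("2017-01-01", "2017-03-15", "2017-01-10")))
dt[, d_int := as.integer(d - min(d))]  # days since the earliest date
dt$d_int  # 0 73 9
```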
Thanks,
Carlos.
The function findAndTransformDates doesn't consider:
- single digits for day or month in a date, e.g. 1/12/2017 or 25/1/2017. Is it because of using the standard R date formats? This becomes a problem while reading dates from text files.
- the first letter of the month in lowercase, e.g. "01/jan/2017", "01/feb/2017", "01/mar/2017"; again, I think for the same reason.
Also, is there any reason why it reads the dates as character instead of as factors? If it read them in as factors, then aggregating by dates after reading would be easier, which is not so easy when they are read as character.
Hi
When I try to install the package on my machine (a CentOS server), it gives me this error:
"Warning in fun(libname, pkgname) : couldn't connect to display ":0""
and when I try to use the library to convert all numerical variables
DT <- findAndTransformNumerics(DT)
it gives me this error:
"Error in findAndTransformNumerics(DT1_first_launch_rewards) :
could not find function "findAndTransformNumerics"
On my Windows machine it works fine, so is it probably an error related to Linux?
Thanks
Paul
It would also be great to have outlier removal/imputation based on the columns:
# keep rows within mean ± 6 standard deviations of dv
trainData[, `:=`(mean_dv = mean(dv), sd_dv = sd(dv))]
trainData <- trainData[dv >= (mean_dv - 6 * sd_dv) & dv <= (mean_dv + 6 * sd_dv)]
trainData[, c('mean_dv', 'sd_dv') := NULL]
# drop rows outside the given quantile range for one column
removeOneOutliersFunc <- function(trainData, colName, outlierVec = c(0.0001, 0.9999)) {
  vec <- trainData[[colName]]
  values <- as.numeric(quantile(vec, outlierVec, na.rm = TRUE))
  trainData <- trainData[vec >= values[1] & vec <= values[2]]
  return(trainData)
}
How do we use one_hot_encoder on test data?
Let's say some new value was added in some column: it will add an extra column in the test dataset, which is not a problem. But let's say some values are missing in the test data: one_hot_encoder will then drop that column, and that might create a problem while scoring.
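A way this is often handled (my sketch, with hypothetical column names, not an existing dataPreparation API): align the test columns to the set seen at train time, adding missing ones as 0 and dropping unseen ones.

```r
library(data.table)

train_enc <- data.table(col_a = c(1L, 0L), col_b = c(0L, 1L))
test_enc  <- data.table(col_a = 1L, col_z = 1L)  # col_b missing, col_z unseen

# add train-time columns missing from test, filled with 0
missing_cols <- setdiff(names(train_enc), names(test_enc))
if (length(missing_cols) > 0) test_enc[, (missing_cols) := 0L]

# keep only (and order by) the train-time columns
test_enc <- test_enc[, names(train_enc), with = FALSE]
```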
A nice thing to build would be to make the package usable from Python.
Using the rpy2 interface (?), it would be nice to be able to call the package to prepare a data set and then go back to Python, using sklearn for example.
I don't know what form it should take: maybe a script or a tutorial.
A question would also be: what classes are accepted by pandas (I think of the R class factor, for example)?
If anyone has done something like that before, help would be more than appreciated :)
Hi,
It could be useful to have a vignette PDF available here on GitHub, even though you cannot include it in your package due to CRAN restrictions.
The old vignette you had was useful for understanding the different functionalities your package provides.
Thanks!
Carlos.