tuanle618 / aeda

AEDA - Automated Data Exploratory Analysis in R

License: GNU General Public License v3.0

R 99.64% Rebol 0.36%

data-science eda eda-report exploratory-data-analysis r

aeda's Issues

Flaws in AEDA plots in datasets with many features

If a dataset has lots of features, some of the plots are messed up.
Dataset: https://www.openml.org/d/5
Features: 279

Priority of unit tests?

What priority do unit testing and setting up Travis have?

Check whether all analyses handle NAs in the data frame

I noticed that cluster analysis does not work with NAs. I will randomly insert NAs in the testthat/base_finishReport file and check whether every analysis can handle NAs.

In general: how should we handle missing values? Omit them, or do some kind of NA imputation, such as the column average?
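
The two options named above can be compared on a toy data frame. This is only a minimal sketch: imputeMean is a hypothetical helper written for illustration, not an existing AEDA function.

```r
# Toy data frame with missing values in both columns.
df = data.frame(a = c(1, NA, 3, 4), b = c(10, 20, NA, 40))

# Option 1: drop every row that contains at least one NA.
df.omit = na.omit(df)

# Option 2 (hypothetical helper): replace each NA by the mean of its column.
imputeMean = function(data) {
  for (col in names(data)) {
    x = data[[col]]
    if (is.numeric(x)) {
      x[is.na(x)] = mean(x, na.rm = TRUE)
      data[[col]] = x
    }
  }
  data
}
df.imputed = imputeMean(df)
```

Omitting rows shrinks the data (here from four rows to two), while mean imputation keeps all rows but only applies to numeric columns.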

Add dontrun to examples

Some examples take quite some time to execute. We should wrap them in \dontrun{}.
See: https://stackoverflow.com/a/12038225

Note: Check whether this causes a LaTeX error on Travis.
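
For illustration, a roxygen2 examples section using \dontrun{} could look like the sketch below. slowSummary is a hypothetical function, not part of AEDA; only the placement of \dontrun{} matters here.

```r
#' Slow summary (hypothetical function, used only for illustration)
#'
#' @param data A data.frame.
#' @examples
#' # Fast example: executed by R CMD check
#' slowSummary(head(iris))
#' \dontrun{
#' # Slow example: shown on the help page, but not executed by R CMD check
#' slowSummary(do.call(rbind, replicate(1000, iris, simplify = FALSE)))
#' }
#' @export
slowSummary = function(data) {
  summary(data)
}
```

Keeping one fast example outside the \dontrun{} block means the function is still exercised during the check.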

ggplot2 instead of plot for naSummary

https://github.com/ptl93/AEDA/blob/td_summaryNA/R/naSummary.R
Change plot to ggplot2
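
As a rough sketch of what a ggplot2 version might look like (this is not the code in naSummary.R; the data frame layout and column names below are assumptions), the NA counts per column can be drawn as a bar chart:

```r
# Count missing values per column of a built-in dataset.
na.counts = colSums(is.na(airquality))
na.df = data.frame(variable = names(na.counts), na.count = as.integer(na.counts))

# Build the bar chart only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  na.plot = ggplot2::ggplot(na.df, ggplot2::aes(x = variable, y = na.count)) +
    ggplot2::geom_col() +
    ggplot2::labs(x = "Variable", y = "Number of NAs",
      title = "Missing values per column")
}
```

Calling print(na.plot) draws the chart; returning the plot object instead of drawing it directly would also make it easier to embed in a report.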

makeCorr

For now the makeCorr function creates a new class
https://github.com/ptl93/AEDA/blob/94bce2688384269d11390b5fce11d76aa0e5d881/R/makeCorr.R#L27-L29
but it would probably be better to let the new class inherit from the task object.

Pimp AEDA GitHub Community Profile

We should start editing the GitHub page with examples and make it "beautiful".

I think we should create a function makeTask with features like those of the current makeCorrTask (basically only ID and data; we can add getDataTypes as well).
Finally, assign this the class ReportTask or something similar. All other tasks then inherit from it.
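
A minimal S3 sketch of this idea follows. The field names, makeCorrTaskSketch and the print method are assumptions made for illustration; only makeTask and ReportTask are names taken from the proposal above.

```r
# Hypothetical base constructor: stores only what every task needs.
makeTask = function(id, data) {
  stopifnot(is.character(id), length(id) == 1L, is.data.frame(data))
  task = list(id = id, data = data, size = nrow(data))
  class(task) = "ReportTask"
  task
}

# Hypothetical specialised task: adds its own field and prepends its class,
# so the object still inherits from "ReportTask".
makeCorrTaskSketch = function(id, data, method = "pearson") {
  task = makeTask(id, data)
  task$method = method
  class(task) = c("CorrTask", class(task))
  task
}

# A method defined once for the base class is available to every subclass.
print.ReportTask = function(x, ...) {
  cat("Task:", x$id, "| rows:", x$size, "| class:", class(x)[1L], "\n")
  invisible(x)
}

corr.task = makeCorrTaskSketch("corr.report", airquality)
print(corr.task)
```

With this layout, shared argument checks live in one place and each specialised task only adds what it needs.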

Make in-body args to formal-defaults

In some functions, like makeClusterAnalysis, there are arguments that should be accessible to the user (for the dbscan method, the eps argument), but since they are fixed inside the function body, the user can't override them. If we move them to the defaults (the function's formals), they become freely accessible.
So we should check the functions for such problems.
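
The difference can be sketched with stats::kmeans as a stand-in for the dbscan case described above. Both function names below are hypothetical.

```r
# Before: the tuning value is fixed inside the body and cannot be changed.
clusterFixed = function(data) {
  centers = 3  # hard-coded in the body
  stats::kmeans(data, centers = centers)
}

# After: same default, but exposed as a formal; `...` forwards further args.
clusterFlexible = function(data, centers = 3, ...) {
  stats::kmeans(data, centers = centers, ...)
}

num.data = iris[, 1:4]
set.seed(1)
res.default = clusterFlexible(num.data)
res.custom = clusterFlexible(num.data, centers = 2, nstart = 5)
```

The default behaviour stays the same, but a user can now override centers or pass additional arguments without editing the function.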

Add check for installed packages

At the moment packages are loaded via require to prevent loading the same package multiple times.
-> If a package is not installed, the user gets an error like: function foo not found. It would be better to check whether the package is available (installed) and, if not, throw a more meaningful error.
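
One possible shape for such a check uses base R's requireNamespace. checkPackage is a hypothetical helper name and the message wording is only a suggestion.

```r
# Hypothetical helper: stop with a readable message if a package is missing.
checkPackage = function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop(sprintf(
      "Package '%s' is required for this analysis. Install it with install.packages(\"%s\").",
      pkg, pkg), call. = FALSE)
  }
  invisible(TRUE)
}

# "stats" ships with R, so this passes silently.
checkPackage("stats")

# A package that does not exist yields the custom message instead.
msg = tryCatch(checkPackage("notARealPackage123"),
  error = function(e) conditionMessage(e))
```

requireNamespace only loads the namespace and does not attach the package, so functions would then be called as pkg::fun().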

Condense commands

At the moment we need multiple commands to produce one finished rmd file:

my.report.task = makeReportTask(id = "test.report", data = airquality, target = "Wind")
basic.report = makeBasicReport(my.report.task, data = airquality)

Here two commands are needed.

my.creport.task = makeCorrTask(id = "corr.report", data = airquality)
my.creport = makeCorr(my.creport.task)
corr.report = makeCorrReport(my.creport, type = "CorrPlot")

And here three commands are needed.
Should we condense this so that one command is always enough? Or does this have low priority?
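
One way to condense the calls would be a thin wrapper that runs the steps in sequence. The sketch below uses a hypothetical runSteps helper and stand-in step functions so that it runs without AEDA; in the package the steps would be the makeCorrTask, makeCorr and makeCorrReport calls shown above.

```r
# Hypothetical helper: feed `x` through the given functions, left to right.
runSteps = function(x, ...) {
  Reduce(function(acc, step) step(acc), list(...), x)
}

# Stand-ins for the task / analysis / report steps (not the AEDA functions).
toyTask = function(data) list(id = "corr.report", data = data)
toyAnalysis = function(task) {
  c(task, list(cor = cor(task$data, use = "complete.obs")))
}
toyReport = function(analysis) c(analysis, list(type = "CorrPlot"))

# One call instead of three.
corr.report = runSteps(airquality, toyTask, toyAnalysis, toyReport)
names(corr.report)
```

The separate functions remain available for users who want to inspect intermediate results, so a wrapper like this would be a convenience rather than a replacement.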

Integers are always numerics

https://github.com/ptl93/AEDA/blob/9e66cfdccdc088c157b2d6c4bdafc57d3e6762b3/R/getDataType.R#L36-L39

If you check for numeric first and for integer afterwards, you will never detect an integer, because every integer is already detected as numeric (all integers are numerics).
To detect integers you have to swap the order of the checks: first integer, then numeric.
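
The behaviour is easy to verify in base R. getTypeSketch below is a hypothetical stand-in for the check in getDataType.R, not its actual code.

```r
is.numeric(1L)   # TRUE: an integer vector also counts as numeric
is.integer(1.5)  # FALSE

# Hypothetical classifier: the more specific check has to come first.
getTypeSketch = function(x) {
  if (is.integer(x)) {
    "integer"
  } else if (is.numeric(x)) {
    "numeric"
  } else {
    "other"
  }
}

sapply(list(1:3, c(1.5, 2), "a"), getTypeSketch)
# [1] "integer" "numeric" "other"
```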

Add generic text to summary reports

I have started adding generic text to the numeric, categorical and cluster analysis reports.
Please run the text examples for all reports and think about what we might add. I believe we could even add some background on the methods applied (MDS, PCA, Factor Analysis).
@daryabusen Please add more generic text to the PCA report: the method applied and some background on what PCA does in general.

Setting target = NULL

https://github.com/ptl93/AEDA/blob/9e66cfdccdc088c157b2d6c4bdafc57d3e6762b3/R/getDataType.R#L16-L17

Why do you check whether target is NULL and, if it already is NULL, set it to NULL again?

Store datasets in AEDA package

Should we store a few datasets in our package for use in examples?

Child Structure + Plots

I think it would be better to organize the child so that each plot has its own section:

\```{r}
plot1
\```
\```{r}
plot2
\```

This way it is easier to add title and generic text.

We should think about a proper object to handle this issue
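
As one possible starting point, the child text could be generated from a list of plot calls, with one titled chunk per plot. makePlotSections and its arguments are hypothetical; the chunk fence is built with strrep so the function body contains no literal fence.

```r
# Hypothetical helper: build one titled Rmd section per plot call.
makePlotSections = function(plot.calls, titles, texts = "") {
  fence = strrep("`", 3)
  sections = mapply(function(call, title, text) {
    paste0("### ", title, "\n\n", text, "\n\n",
      fence, "{r}\n", call, "\n", fence, "\n")
  }, plot.calls, titles, texts, USE.NAMES = FALSE)
  paste(sections, collapse = "\n")
}

rmd = makePlotSections(
  plot.calls = c("plot1", "plot2"),
  titles = c("First plot", "Second plot"),
  texts = c("Generic text for plot 1.", "Generic text for plot 2.")
)
cat(rmd)
```

Because every plot gets its own section, a title and generic text can be attached per plot rather than per report.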

Opening tabs for html-tables when executing kable() function in rmd within loop

see Pull-Request #29

factor and ordered

https://github.com/ptl93/AEDA/blob/9e66cfdccdc088c157b2d6c4bdafc57d3e6762b3/R/getDataType.R#L40
Why don't you separate the checks for ordered and factor variables?
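
For context, an ordered factor also passes is.factor, so a separate check needs to test is.ordered first. getFactorTypeSketch is a hypothetical helper, not the code in getDataType.R.

```r
sizes = factor(c("low", "high", "mid"),
  levels = c("low", "mid", "high"), ordered = TRUE)

is.factor(sizes)          # TRUE: ordered factors are factors too
is.ordered(sizes)         # TRUE
is.ordered(iris$Species)  # FALSE: a plain, unordered factor

# Hypothetical classifier: test the more specific class first.
getFactorTypeSketch = function(x) {
  if (is.ordered(x)) "ordered" else if (is.factor(x)) "factor" else "other"
}
```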

INFO: Weird random.seed for MDS and PCA Reports

When running fastReport I noticed that the IDs of the PCA and MDS reports are identical. I believe the underlying functions (prcomp() for PCA, cmdscale() for MDS, and possibly isoMDS and the other methods in makeMDSTask()) leave the random seed in the same state after execution, because makeReport(pca.result) and makeReport(mds.result) both return the same report.id. I investigated further: when another report is created between MDS and PCA (numsum, for example) and a further report after PCA (catsum, for example), the IDs of numsum and catsum are also identical. This supports my suspicion that makeMDS and makePCA, which run right before the makeReport step, somehow set the seed to the same number.

Reproducible error:

#start with clean R-session CTRL+SHIFT+F10

devtools::load_all()

set.seed(1)

my.mds.task = makeMDSTask(id = "swiss", data = swiss)
mds.analysis = makeMDSAnalysis(my.mds.task)
mds.report = makeReport(mds.analysis)

cluster.task = makeClusterTask(id = "iris", data = iris,
  method = "cluster.kmeans")
cluster.analysis = makeClusterAnalysis(cluster.task)
cluster.report = makeReport(cluster.analysis)

pca.task = makePCATask(id = "iris.test", data = iris, center = TRUE, target = "Species")
pca.result = makePCA(pca.task)
pca.report = makeReport(pca.result)

#compare IDs
cluster.report$report.id
#[1] "T6cG3IC7CQJg3pcu"

mds.report$report.id
#[1] "oWz26cG3IC7CQJg3"

pca.report$report.id
#[1] "oWz26cG3IC7CQJg3"

#remove workspace
rm(list = ls())
#start with new session, CTRL+SHIFT+F10

devtools::load_all()

set.seed(1)

my.mds.task = makeMDSTask(id = "swiss", data = swiss)
mds.analysis = makeMDSAnalysis(my.mds.task)
mds.report = makeMDSAnalysisReport(mds.analysis)

pca.task = makePCATask(id = "iris.test", data = iris, center = TRUE, target = "Species")
pca.result = makePCA(pca.task)
pca.report = makePCAReport(pca.result)

cluster.task = makeClusterTask(id = "iris", data = iris,
  method = "cluster.kmeans")
cluster.analysis = makeClusterAnalysis(cluster.task)
cluster.report = makeClusterAnalysisReport(cluster.analysis)

mds.report$report.id
#[1] "oWz26cG3IC7CQJg3"

pca.report$report.id
#[1] "oWz26cG3IC7CQJg3"

cluster.report$report.id
#[1] "T6cG3IC7CQJg3pcu"

###try even more reports:
rm(list=ls())


#clean r session

devtools::load_all()

#try different seed
set.seed(10)

#for MDS try even another method
my.mds.task = makeMDSTask(id = "swiss", data = swiss, method = "isoMDS")
mds.analysis = makeMDSAnalysis(my.mds.task)
mds.report = makeReport(mds.analysis)

num.sum.task = makeNumSumTask("iris.test", iris, target = "Species")
num.sum = makeNumSum(num.sum.task)
num.sum.report = makeReport(num.sum)

pca.task = makePCATask(id = "iris.test", data = iris, center = TRUE, target = "Species")
pca.result = makePCA(pca.task)
pca.report = makeReport(pca.result)

cat.sum.task = makeCatSumTask("iris.test", iris, target = "Species")
cat.sum = makeCatSum(cat.sum.task)
cat.sum.report = makeReport(cat.sum)

cluster.task = makeClusterTask(id = "iris", data = iris,
  method = "cluster.kmeans")
cluster.analysis = makeClusterAnalysis(cluster.task)
cluster.report = makeReport(cluster.analysis)

mds.report$report.id
#[1] "oWz26cG3IC7CQJg3"

num.sum.report$report.id
#[1] "mcu73QN9ORHKrj73"

pca.report$report.id
#[1] "oWz26cG3IC7CQJg3" ---> SAME

cat.sum.report$report.id
#[1] "mcu73QN9ORHKrj73" ---> now catsum has the same report ID as numsum, which was called right after mds

cluster.report$report.id
#[1] "T6cG3IC7CQJg3pcu"

As a workaround, I now set the seed to 89 in makeReport.PCAObj and makePCAReport, so that the PCA report starts from a different seed.
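
One mechanism that would produce this symptom is an analysis step that calls set.seed internally, followed by an ID generator that draws from the global RNG. The sketch below only demonstrates that mechanism and one possible alternative to hard-coding a second seed. All function names are hypothetical and it is an assumption that AEDA generates its report IDs this way.

```r
# Hypothetical ID generator that draws 16 characters from the global RNG.
makeIdSketch = function(n = 16L) {
  paste(sample(c(letters, LETTERS, 0:9), n, replace = TRUE), collapse = "")
}

# Stand-in for an analysis step that fixes the seed internally.
analysisFixedSeed = function() {
  set.seed(123)
  invisible(NULL)
}

analysisFixedSeed()
id.a = makeIdSketch()
analysisFixedSeed()
id.b = makeIdSketch()
identical(id.a, id.b)  # TRUE: both IDs start from the same RNG state

# Possible alternative: restore the caller's RNG state when the step exits.
analysisRestoreSeed = function() {
  has.seed = exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  old.seed = if (has.seed) get(".Random.seed", envir = globalenv()) else NULL
  on.exit({
    if (has.seed) {
      assign(".Random.seed", old.seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(123)
  invisible(NULL)
}

set.seed(1)
analysisRestoreSeed()
id.c = makeIdSketch()
analysisRestoreSeed()
id.d = makeIdSketch()
identical(id.c, id.d)  # FALSE: the RNG stream continues between the calls
```

Restoring the RNG state keeps the analysis step reproducible while leaving the user's own set.seed call in control of everything that follows.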

makeNumSum output description

Maybe we should provide a description of the columns? mean, min, max, ... are clear, but lower/upper bound, for example, aren't clear without looking into the code.