zmjones / edarf

exploratory data analysis using random forests

License: MIT License
`randomForest(X, y)`, which calls `randomForest.default`, is broken, as can be seen in the tests. Maybe we should have a separate method for this? It needs to be added to the tests as well. I think `fix_columns` could fix this but I haven't done it yet.
```
library(devtools)
test() # run from the root of the package
```
In `var_est.RandomForest` I call `R_predictRF_weights`, which is a C function in party. `R CMD check` generates a note for this reason, which will mean we can't submit to CRAN even if randomForestCI is there and randomForest gets updated as well. If we want to fix this we can either write some C++ that does this given the ensemble (i.e., reimplement `R_predictRF_weights`, but one tree at a time, which might give the best performance) or modify the party code and hope Hothorn puts it in.
I'm not 100% sure, but I think in

```
pred <- predict(fit, newdata = df, outcome = "test")$predicted.oob
```

the `outcome` argument should be "train". The relevant part of the documentation is a bit cryptic:

"If outcome="test", the predictor is calculated by using y-outcomes from the test data (outcome information must be present). In this case, the terminal nodes from the grow-forest are recalculated using the y-outcomes from the test set. This yields a modified predictor in which the topology of the forest is based solely on the training data, but where the predicted value is based on the test data. Error rates and VIMP are calculated by bootstrapping the test data and using out-of-bagging to ensure unbiased estimates. See the examples for illustration"

But I don't think we want to recalculate the terminal nodes of the original forest; we just want to drop down the test data and get a prediction. I came to this because I got different and counterintuitive results with `outcome = "test"` but matching and intuitive results with `outcome = "train"`. I changed it in the package code to do some tests.
use the faster, stripped down predict methods
I think you can handle multivariate pd plots in the same way that you handle class probabilities or interactions. The only difficulty I can think of is when an interaction plot for multivariate outcomes is requested.

We need to give all the output from partial_dependence a simple S3 class, and then there should be a plot generic for the bivariate pd output. The output from partial_dependence should still operate like a normal data.frame though.
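A minimal sketch of that idea (the class name `pd`, the constructor, and the `interaction` attribute are assumptions for illustration, not edarf's actual API):

```r
## Sketch: tag partial_dependence output with a lightweight S3 class so
## plot() can dispatch on it, while data.frame behavior is preserved.
## The class name "pd" and the "interaction" attribute are assumptions.
make_pd <- function(df, interaction = FALSE) {
  stopifnot(is.data.frame(df))
  structure(df, class = c("pd", class(df)), interaction = interaction)
}

plot.pd <- function(x, ...) {
  ## a bivariate (interaction) result would get its own plotting path
  if (isTRUE(attr(x, "interaction")))
    message("bivariate partial dependence plot")
  else
    message("univariate partial dependence plot")
  invisible(x)
}

pd <- make_pd(data.frame(Education = 1:5, pred = (1:5) / 10))
inherits(pd, "data.frame")  # TRUE: still usable as an ordinary data.frame
```

Because `data.frame` stays in the class vector, subsetting, printing, and merging all keep working on the tagged object.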
I have been working on sorting all of this out the past day or two; see 5afbba8. One error I haven't solved yet is:

```
In addition: Warning messages:
1: replacing previous import by 'party::proximity' when loading 'non-package environment'
2: replacing previous import by 'party::varimp' when loading 'non-package environment'
3: replacing previous import by 'party::varimpAUC' when loading 'non-package environment'
```

The googling I've done suggests this comes from reimporting things, but so far as I can tell I am only doing so once. This problem was somehow introduced by me in the past day or so.
all of the tests are now passing. new tests were added too.
For the S3 methods to be consistent they all have to have the same arguments as the generic. There are some cases where this is nonsensical, e.g., one of the methods can take additional arguments but the other two can't, so the generic has `...`, which is unused in the two methods. Not sure how to document that.
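The constraint can be illustrated with a toy generic (all names here are hypothetical): every method's formals must include the generic's, so methods that take no extra arguments still declare `...` and silently ignore it. The usual documentation workaround is to document `...` once on the generic as "further arguments passed to or from other methods".

```r
## Toy example of the consistency rule: the generic has ..., so every
## method must too, even when it ignores it. Names are hypothetical.
describe <- function(x, ...) UseMethod("describe")

## this method actually uses an extra argument
describe.numeric <- function(x, digits = 2, ...) {
  round(mean(x), digits)
}

## these methods take no extras, but must still accept ... to pass
## R CMD check's method/generic consistency check
describe.character <- function(x, ...) length(unique(x))
describe.default   <- function(x, ...) NA

describe(c(1, 2, 3), digits = 1)  # 2
describe(c("a", "b", "a"))        # 2
```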
plot_pd especially needs this; it is very messy. We can probably make this simple, as in plotPartialPrediction in mlr. plot_imp probably needs a refactor as well. Additionally, contour/tile plots for regression and survival tasks should be implemented (code again in plotPartialPrediction). I am just going to put them all in one function with some simple switching. I may get to this today.
implement party survival in partial_dependence.RandomForest
it will return partial dependence on one predictor (an option in the call to randomForest), but doesn't do multivariate partial dependence.
there needs to be a multivariate example
I am pulling class probabilities from `predict.randomForest`. This is not consistent with the behavior of classification via party, and maybe not for randomForestSRC either. The package should either only produce one (argmax(pi_c)) or have options to choose. If the latter, then the plotting function has to have an option for this too. The current bar plot option for class probabilities is pretty lame.
https://github.com/openjournals/joss-reviews/files/525302/10.21105.joss.00092.pdf
Moreover, the JOSS paper is different from the paper in your GitHub repo.
Something with "gridsize" is wrong:

```
> library(randomForest)
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
> library(edarf)
>
> data(iris)
> data(swiss)
>
> ## classification
> fit <- randomForest(Species ~ ., iris)
> pd <- partial_dependence(fit, iris, c("Petal.Width", "Sepal.Length"))
Error in ifelse(gridsize >= nunique, nunique, gridsize) :
  object 'gridsize' not found
> pd_int <- partial_dependence(fit, iris, c("Petal.Width", "Sepal.Length"), interaction = TRUE)
Error in ifelse(gridsize >= nunique, nunique, gridsize) :
  object 'gridsize' not found
>
> ## regression
> fit <- randomForest(Fertility ~ ., swiss)
> pd <- partial_dependence(fit, swiss, "Education")
Error in ifelse(gridsize >= nunique, nunique, gridsize) :
  object 'gridsize' not found
> pd_int <- partial_dependence(fit, swiss, c("Education", "Catholic"), interaction = TRUE)
Error in ifelse(gridsize >= nunique, nunique, gridsize) :
  object 'gridsize' not found
```

(Error messages translated from German.)
When doing binary classification with `type = "prob"` in `partial_dependence`, we should drop one factor level (how we know which one, I don't know; an option?), set the `prob` attribute to FALSE, and assign the column for the other level the name of the target variable. This will get handled like regression in `plot_pd`.
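A sketch of the proposed transformation (the helper name, the `levels`/`keep` arguments, and the `prob` attribute are assumptions for illustration):

```r
## Hypothetical helper: for binary classification output, keep the
## probability column of one level (here the second, configurable),
## rename it after the target, and mark the result as non-prob output
## so downstream plotting treats it like regression.
collapse_binary_prob <- function(pd, target, levels, keep = levels[2]) {
  drop <- setdiff(levels, keep)
  pd[[drop]] <- NULL                      # drop one factor level's column
  names(pd)[names(pd) == keep] <- target  # rename the remaining column
  attr(pd, "prob") <- FALSE               # now handled like regression
  pd
}

pd <- data.frame(Petal.Width = c(0.5, 1.5, 2.5),
                 versicolor  = c(0.7, 0.6, 0.1),
                 virginica   = c(0.3, 0.4, 0.9))
collapse_binary_prob(pd, target = "Species",
                     levels = c("versicolor", "virginica"))
```

With the two probabilities summing to one, keeping either column loses no information; which one to keep could reasonably be an argument as the issue suggests.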
Some classes are still output incorrectly by `partial_dependence`: with the iris data, in some instances numeric variables are output as integers. I haven't figured out why this is the case.
Hi Zack!
Congratulations on your package! It's indeed very convenient.
I was testing the code today and found a bug. I ran the examples you provided here and I got this message:
```
plot_twoway_partial(pd_reg$Education, pd_reg$pred, smooth = TRUE)
Error in noNA(grid) : argument "grid" is missing, with no default
```
I'm pretty sure it's an easy fix, but I'm not sure how to do it (and maybe you want to update the example too). I'm using Revolution R Open 8.0 and R 3.1.1 on Ubuntu 14.04, if that matters.
Thanks!
Is simply outputting the chf sufficient? I don't think that it is; we should try to use every possible output. We should also do that with party.
should give easily understood errors, should check all arguments
multivariate regression/classification using party is not currently tested. should be easy to create a small simulated dataset and test it (esp. since it is only possible w/ party).
mclapply with detectCores does not work on Windows because it detects several cores but can't use them. I don't completely understand the stuff you wrote yet so I don't want to mess with it, but we could do:

```
if (.Platform$OS.type == "windows") CORES <- 1
```

That should work.
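A self-contained sketch of that guard (the variable name is an assumption): `mclapply` relies on forking, which Windows does not support, so fall back to one core there, and also when `detectCores()` returns NA.

```r
## Fall back to serial execution on Windows, where mclapply cannot fork
## even though detectCores() reports several cores.
library(parallel)

n_cores <- detectCores()
if (is.na(n_cores) || .Platform$OS.type == "windows") n_cores <- 1L

res <- mclapply(1:4, function(i) i^2, mc.cores = n_cores)
unlist(res)  # 1 4 9 16
```

With `mc.cores = 1`, `mclapply` simply runs serially, so the same call works on every platform.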
Just installed edarf on a new machine and the automatic installation of dependencies failed. Could however install the dependencies manually (specifically, in order: 'zoo', 'sandwich', 'strucchange' (Dependencies to party?)).
Not a big deal, but weird, and I'm not sure if it's system specific.
help apparently doesn't distinguish between the two, so only one shows up (RandomForest).
Some things I noticed while looking at the help files:
```
data(swiss)
fit <- randomForest(Fertility ~ ., swiss)
pd <- partial_dependence(fit, swiss, "Education")
plot(pd)
```
gives bad output
we should add the option to use error bars instead of a ribbon. it is somewhat misleading with the ribbon since it is a point-wise confidence interval.
I am not completely sure about this, but maybe you should look into it. If a variable has a lot of unique values and empirical = TRUE, then it can happen that pd is only calculated for a portion of the range.
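If the cause is what this suggests (a grid built from only the first `gridsize` observed values), a grid taken at evenly spaced positions across the sorted unique values would stay inside the data while spanning its full range. `make_grid` below is a hypothetical helper, not edarf's actual code:

```r
## Hypothetical grid builder: pick gridsize values at evenly spaced
## positions across the sorted unique observed values, so the grid
## covers the whole empirical range rather than a low-end snippet.
make_grid <- function(x, gridsize = 10) {
  ux <- sort(unique(x))
  if (length(ux) <= gridsize) return(ux)
  ux[round(seq(1, length(ux), length.out = gridsize))]
}

x <- rexp(500)
g <- make_grid(x, gridsize = 10)
range(g)  # spans min(x) .. max(x)
```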
Dear mr. Jones,
Thank you for developing edarf; it seems very useful but I am getting errors with both variable_importance and partial_dependence when using a model with categorical variables. It is possible that the problem has to do with the fact that several of the levels of these categorical variables are dropped from the model, because they do not have any observations in the training data. For variable_importance the error is:
```
variable_importance(ranger_model, vars = c("paper", "genus"), data = training_data)
Error in variable_importance(ranger_model, vars = c("paper", "genus"), :
  Assertion on 'y' failed: Must have length 1, but has length 3.
```
For partial_dependence, the error is:
```
Warning messages:
1: In names(mp)[ncol(mp)] = target :
  number of items to replace is not a multiple of replacement length
2: In names(mp)[ncol(mp)] = target :
  number of items to replace is not a multiple of replacement length
```
The graphical functions do not work in the presence of these errors. Is there anything I can do to prevent them?
Sincerely,
Caspar
Just a quick note that when the variable importance is underpinned by the party package (at least the current version, 1.0-25), the OOB parameter does not work, i.e. you cannot get the code to use out-of-bag samples.

This is because the party `predict` code that you (eventually) call (line 138 of https://github.com/cran/party/blob/R-3.0.3/R/RandomForest.R) forces OOB to FALSE if you are passing new data, which you are, since you permute the data. Modifying RandomForest.R in the party code to prevent this forced behaviour seems to fix this, but I've not looked into why the behaviour is there in the first place, so I can't be sure the code subsequently does the logically right thing. In any case, since this is not your code, perhaps it's just worth noting in the documentation that the OOB parameter doesn't work in this case?
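The mechanics being described can be sketched self-containedly (with `lm` standing in for `cforest`; `perm_importance` is a hypothetical helper): permutation importance re-predicts on a permuted copy of the data, so the permuted rows necessarily reach `predict()` through `newdata`, which is exactly the code path where party forces OOB off.

```r
## Sketch of permutation importance, with lm standing in for cforest.
## The permuted copy is passed via newdata = ; with party, that is the
## path on which OOB is forced to FALSE.
perm_importance <- function(fit, data, y, seed = 1) {
  set.seed(seed)
  base_mse <- mean((data[[y]] - predict(fit, newdata = data))^2)
  vars <- setdiff(names(data), y)
  sapply(vars, function(v) {
    d <- data
    d[[v]] <- sample(d[[v]])  # break the predictor-outcome association
    mean((data[[y]] - predict(fit, newdata = d))^2) - base_mse
  })
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
perm_importance(fit, mtcars[, c("mpg", "wt", "hp")], y = "mpg")
```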
the data must be checked for this
You have `extract_proximity` twice in the readme. Moreover, you have `randomforest_distance`, but the function is `randomforest_dist`.

You do not mention `randomforest_dist` in the paper, but maybe this is not a problem. I really do not know what happens in `randomforest_dist`; you could add a reference or a better explanation in the help. Also add references in the help of the other functions.

Sorry for being very critical, but I think this could really improve the documentation.
Using the example from the help:

```
library(randomForest); library(edarf)
data(iris); data(swiss)

## classification
fit = randomForest(Species ~ ., iris)

## WORKING ##
pd = partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, -ncol(iris)])

## NOT WORKING ##
pd = partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, -ncol(iris)], uniform = FALSE)
## Error in partial_dependence(fit, c("Sepal.Width", "Sepal.Length"), data = iris[, :
##   Assertion on 'y' failed: Must be of type 'data.frame', not 'double'.
```

Cheers, thanks for having written this useful package!
was thinking either the marijuana data or some of the human rights data
`partial_dependence` returns a data.frame of lists when the number of variables is greater than 1.
It is a very interesting approach to data analysis to use complex machine learning models for exploratory data analysis. There are some works that try to make tree ensembles interpretable by extracting simple and relevant rules from them. I think these works are also useful for exploratory data analysis. Are you interested in integrating these methods into the edarf package?
need to test arguments
To use the variance estimator with cforest we need the patched version of cforest. Either the patch to party needs to be accepted, we need to find a workaround, or this should be removed.
we should have the ability to discretize a variable that we want to use for facetting in the partial dependence plots. we could do this in the plot method or in the pd methods
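A sketch of the discretization step (`cut_bins` is a hypothetical helper, not part of edarf): bin the continuous facetting variable at quantiles, so each facet gets roughly the same number of observations.

```r
## Hypothetical helper: quantile-based binning of a continuous variable
## so it can be used as a facetting variable in partial dependence plots.
cut_bins <- function(x, n = 4) {
  cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = n + 1)),
      include.lowest = TRUE)
}

bins <- cut_bins(swiss$Catholic, n = 3)
table(bins)  # roughly equal-sized facets
```

Doing this in the pd methods (rather than the plot method) would have the advantage that the bins also appear in the returned data.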
for rfsrc the loop could be written in C++. this is in ci.R
Error when running the iris example with class probability output:

```
library(randomForest)
library(edarf)
data(iris)
fit <- randomForest(Species ~ ., iris)
pd <- partial_dependence(fit, iris, "Petal.Width", type = "prob")
plot(pd)
Error in as.character(x$label) :
  cannot coerce type 'closure' to vector of type 'character'
```
I have to check this:
- License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
You have a license file, but not with the contents of an OSI approved software license.
First of all, your package is awesome! I merely have a friendly feature request.
The ranger package, which you are probably familiar with, has huge speed improvements compared to randomForest and randomForestSRC. I often find the latter two to be too slow for large (or even medium-sized) data sets.
Do you have any plans to implement support for random forest fit by ranger, or do you know if anybody else is working on that?
Hi there - read your paper. Interesting stuff. Would like to play around with your package, but I'm getting an install error on my mac using devtools. Have you encountered this?
```
devtools::install_github('zmjones/edarf')
Downloading github repo zmjones/edarf@master
Installing edarf
Installing dependencies for edarf:
RcppArmadillo
[curl download progress omitted]
The downloaded binary packages are in
	/var/folders/29/dj4wm1714q7flrdgvg7src8h0000gn/T//RtmpxEpsRy/downloaded_packages
'/Library/Frameworks/R.framework/Resources/bin/R' --vanilla CMD INSTALL \
  '/private/var/folders/29/dj4wm1714q7flrdgvg7src8h0000gn/T/RtmpxEpsRy/devtoolsa4220b25ed7/zmjones-edarf-e39a4c5' \
  --library='/Library/Frameworks/R.framework/Versions/3.1/Resources/library' --install-tests
```
First of all: I ❤️ your package! :-)
When using plot_pd, no lines are drawn if the predictor is an ordered factor. See the interaction plot in this example using the `BreastCancer` data set from `mlbench`: section 5.2.

Or is this a feature and not a bug? That is, is it an assumption in the ggplot2 philosophy that points from ordered factors shouldn't be connected? This could possibly be addressed using the `group` argument in `aes`? See here.
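A minimal sketch of the `group` fix (illustrating standard ggplot2 behavior, not edarf's internals): with a discrete or ordered x, `geom_line()` treats each level as its own group and so draws nothing unless the rows are grouped explicitly, e.g. with `group = 1`.

```r
## With an ordered factor on x, geom_line() draws nothing by default;
## group = 1 tells ggplot2 to connect all the points as one series.
library(ggplot2)

df <- data.frame(
  x = ordered(c("low", "mid", "high"), levels = c("low", "mid", "high")),
  y = c(1, 3, 2)
)

p <- ggplot(df, aes(x, y, group = 1)) +
  geom_point() +
  geom_line()
```

For an interaction plot with one line per facet level, the facetting variable itself would be the natural `group` aesthetic instead of the constant `1`.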