zachmayer / caretEnsemble
caret models all the way down :turtle:
License: Other
I'd like to see us add a function that prunes the train models stored in the caretEnsemble object. There is a lot of stuff stored in these individual train objects that takes up a lot of space and is not necessary for doing predictions or diagnostics. This is especially a problem when training models with a large number of observations (50k+).
This is an example of how much can be saved from a glm object and an approach we might want to take here: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r
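For reference, a rough sketch of the kind of stripping the win-vector post describes for a plain glm (the function name stripGlm is hypothetical; prediction still works afterwards because predict.glm mainly needs the coefficients, terms, and family):

stripGlm <- function(fit) {
  # Drop the components predict.glm does not need
  fit$y <- NULL
  fit$model <- NULL
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$qr$qr <- NULL
  fit$linear.predictors <- NULL
  fit$weights <- NULL
  fit$prior.weights <- NULL
  fit$data <- NULL
  # Drop any large captured environment
  attr(fit$terms, ".Environment") <- globalenv()
  fit
}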
For some model methods the approximation used by setSeeds fails to produce a seed object of the correct dimensions. Here is a MWE:
set.seed(442)
library(caret)
library(hda)
library(caretEnsemble)
train <- twoClassSim(n = 1000, intercept = -8, linearVars = 3,
noiseVars = 10, corrVars = 4, corrValue = 0.6)
test <- twoClassSim(n = 1500, intercept = -7, linearVars = 3,
noiseVars = 10, corrVars = 4, corrValue = 0.6)
fitControl <- trainControl(method='cv', number = 5, savePredictions = FALSE,
classProbs=TRUE, summaryFunction = twoClassSummary)
out <- buildModels(methodList = c("hda", "multinom"), control = fitControl,
x = train[, -23],
y = train[ , "Class"], metric = "ROC",
tuneLength = 4, baseSeed = 1204)
Which produces this helpful output:
Error in train.default(x = x, y = y, method = i, trControl = ctrl,
tuneLength = tl, :
Bad seeds: the seed object should be a list of length 6 with 5 integer
vectors of size 48 and the last list element having a single integer
Which is absolutely correct. hda seems to expect seeds along its grid to follow the formula tuneLength * 3 * tuneLength, while usually the tuneLength^2 approximation is what works.
I should have this patched up rather quickly for hda, but I'm wondering if there is a more programmatic way to set seeds based on tuneLength using some caret functions?
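One possibility (a rough, untested sketch; the name setSeedsFromGrid is hypothetical): size each seed vector from the method's actual default grid via caret::getModelInfo, rather than approximating with tuneLength^2:

setSeedsFromGrid <- function(method, x, y, tuneLength, number, repeats = 1) {
  # Look up the method's default tuning grid; its row count is the number of
  # models evaluated per resample, which is how long each seed vector must be.
  grid <- caret::getModelInfo(method, regex = FALSE)[[1]]$grid(x, y, len = tuneLength)
  B <- number * repeats  # number of resamples
  seeds <- vector("list", B + 1)
  for (i in seq_len(B)) seeds[[i]] <- sample.int(10000, nrow(grid))
  seeds[[B + 1]] <- sample.int(10000, 1)  # the final single seed caret expects
  seeds
}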
I am applying caretEnsemble on a large dataset and have noticed the curious case where ensembling sometimes produces an ensembled prediction with an AUC that is lower than the AUC of any of the ensembled models.
I need to check a) that this is not a bug in the print or summary methods for caretEnsemble, and b) that we issue a warning when this is the case, so that the user is notified.
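A rough sketch of the warning check described in b) (names hypothetical):

checkEnsembleAUC <- function(ensembleAUC, modelAUCs) {
  # Warn if the ensemble does worse than its best component model
  if (ensembleAUC < max(modelAUCs)) {
    warning(sprintf(
      "Ensemble AUC (%.4f) is below the best individual model's AUC (%.4f)",
      ensembleAUC, max(modelAUCs)))
  }
  invisible(ensembleAUC)
}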
After computing your weights with greedOptRMSE, you are normalizing them to sum to one, so you wind up getting a convex combination. You could optimize the same loss function with those constraints directly by formulating the problem as a quadratic program.
This should work, but is untested:
qpOptRMSE <- function(x, y) {
  # Minimize ||x %*% w - y||^2 subject to sum(w) = 1 and w >= 0,
  # written as the quadratic program: min 1/2 w'Dw - d'w
  require(quadprog)
  D <- crossprod(x)     # x'x
  d <- crossprod(x, y)  # x'y
  # First column: the equality constraint sum(w) = 1; remaining columns: w_i >= 0
  A <- cbind(rep(1, ncol(x)), diag(ncol(x)))
  bvec <- c(1, rep(0, ncol(x)))
  solve.QP(Dmat = D, dvec = d, Amat = A, bvec = bvec, meq = 1)$solution
}
It's based on this (which is GPL-3 licensed).
Zachary,
I am trying to install caretEnsemble on Windows 7. Please tell me the exact procedure for installing on Windows. I have downloaded caretEnsemble-master.zip, but after installing it from the zip file it is not seen in the library list, so I can't load the library. The same happens with the caretEnsemble Dev version. I would appreciate a quick reply on this.
Regards,
Amit
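For what it's worth, the usual route is to install straight from GitHub with devtools rather than from the zip (a sketch; assumes Rtools is installed on Windows and that the branch names below still exist):

install.packages("devtools")
library(devtools)
install_github("zachmayer/caretEnsemble")               # master branch
install_github("zachmayer/caretEnsemble", ref = "Dev")  # Dev branch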
In some cases this might be more desirable than just letting caret do the parallelization, e.g. when we're fitting a bunch of different models, each of which has one set of tuning parameters.
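A rough sketch of what parallelizing across (rather than within) models could look like, using foreach with a shared fold index so the fits stay ensemble-compatible (assumes the doParallel package; the two-class subset of iris is just for illustration):

library(caret)
library(doParallel)
dat <- iris[iris$Species != "setosa", ]
dat$Species <- factor(dat$Species)
cl <- makeCluster(2)
registerDoParallel(cl)
folds <- createFolds(dat$Species, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds, savePredictions = TRUE,
                     allowParallel = FALSE)  # keep caret serial within each model
models <- foreach(m = c("glm", "rpart", "knn"), .packages = "caret") %dopar% {
  train(Species ~ ., data = dat, method = m, trControl = ctrl)
}
stopCluster(cl)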
I am using the mlbench data example for classification given on the site. While building the models I get the following error for all the models:
Error in train.default(X[train, ], Y[train], method = "mlpWeightDecay", :
final tuning parameters could not be determined
This message is for mlpWeightDecay, but the error "final tuning parameters could not be determined" occurs for all the algorithms. Can somebody help me with this?
This is the error while using glmnet:
"Error in train.default(X[train, ], Y[train], method = "gbm", trControl = myControl, :
final tuning parameters could not be determined"
Just a note to choose between PR #42 for S3 methods and PR #41 for S4 methods for the major caretStack and caretEnsemble methods in the package.
@zachmayer Just pull in one of the pull requests and you can reject the other, as well as PR #39, which is a subset of the other two.
If the new dataset has no incomplete cases, I'd like to suppress the message("Predictions being made only for cases with complete data").
Similarly, if all models have available data, I'd like to suppress: message("Predictions being made only from models with available data")
I think it's a little confusing to print those messages in cases where they don't apply. I think we could fix this by adding a newdata=NULL argument and passing it to predict.train.
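A rough sketch of the conditional messaging (illustrative only, not the package's actual internals):

maybeMessageCompleteCases <- function(newdata) {
  # Only mention complete-case filtering when there are incomplete cases
  if (!is.null(newdata) && any(!complete.cases(newdata))) {
    message("Predictions being made only for cases with complete data")
  }
}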
The datasets used in the test suite are undocumented. To pass R CMD check they need to either a) be documented, or b) be used only for testing and stored appropriately, as in the lme4 source (https://github.com/lme4/lme4/tree/master/inst/testdata).
Thanks, Zach, for the latest updates.
In my control I would like to substitute
index = createMultiFolds(Y[train], k = folds, times = repeats)
i.e.
myControl <- trainControl(method = "cv", number = folds, repeats = repeats, savePredictions = TRUE, classProbs = FALSE, returnResamp = "final", returnData = TRUE, allowParallel = TRUE, seeds = mseeds, index = createMultiFolds(Y[train], k = folds, times = repeats))
with
index = createTimeSlices(Y[train], initialWindow = length(Y[train]) - sum(train == FALSE), horizon = 1, fixedWindow = FALSE)
i.e.
myControl <- trainControl(method = "cv", number = folds, repeats = repeats, savePredictions = TRUE, classProbs = FALSE, returnResamp = "final", returnData = TRUE, allowParallel = TRUE, seeds = mseeds, index = createTimeSlices(Y[train], initialWindow = length(Y[train]) - sum(train == FALSE), horizon = 1, fixedWindow = FALSE))
This works fine; however, when attempting to run my control on a model I get the following error:
model2 <- train(X[train,], Y[train], method='bagEarth', trControl=myControl, preProcess=PP)
Error in -unique(training) :
invalid argument to unary operator
Is there a way to apply caret's createTimeSlices to the ensemble? Maybe I am missing something.
Any help welcomed.
Thank you,
Barnaby
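One thing worth checking (a rough guess at the cause): createTimeSlices returns separate train and test index lists, which map to trainControl's index and indexOut arguments rather than to index alone:

slices <- createTimeSlices(Y[train],
                           initialWindow = length(Y[train]) - sum(train == FALSE),
                           horizon = 1, fixedWindow = FALSE)
myControl <- trainControl(method = "cv", savePredictions = TRUE,
                          index = slices$train, indexOut = slices$test)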
Lets you change more of the stacking parameters by hand, if you so desire.
The code below constructs a number of (binary) classification models. Upon building an ensemble, an error is thrown about multiclass problems.
myControl=trainControl(
classProbs = TRUE,
summaryFunction=twoClassSummary,
method = "repeatedcv",
repeats = 10,
number=10
)
glmtrain = train(x,y, trControl = myControl, method='glm', metric='ROC')
gbmtrain = train(x,y, trControl = myControl, method='gbm', metric='ROC')
rftrain = train(x,y, trControl = myControl, method='rf', metric='ROC')
nbtrain = train(x,y, trControl = myControl, method='nb', metric='ROC')
nestedList <- list(glmtrain, gbmtrain, rftrain, nbtrain)
trainedEnsemble = caretEnsemble(nestedList, iter=1000)
Error in checkModels_extractTypes(list_of_models) :
Not yet implemented for multiclass problems
Todo in the code:
Probably a matter of writing 3 functions:
checkObserveds
checkRowIndexes
checkResamples
Each one should give an explanatory error message of what's wrong, which model(s) are the culprit, and why we can't make an ensemble in this situation.
This is pulled out of #3, which kind of grew into many separate bug reports.
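A rough sketch of what one of these, checkRowIndexes, might look like (names and details hypothetical):

checkRowIndexes <- function(list_of_models) {
  # All models must have been resampled on identical folds to be ensembled
  idx <- lapply(list_of_models, function(m) m$control$index)
  if (length(unique(idx)) != 1) {
    bad <- which(!sapply(idx, identical, idx[[1]]))
    stop("Model(s) ", paste(bad, collapse = ", "),
         " used different resampling indexes than model 1, ",
         "so out-of-sample predictions cannot be aligned for ensembling.")
  }
  invisible(TRUE)
}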
This can also reduce the object size. The user could call this after fitting the ensemble to reduce its size without changing its output.
Inclusion of n.minobsinnode in a GBM model gives an error:
model1 <- train(X, Y, method='gbm', trControl=myControl,tuneGrid=expand.grid(.n.trees=100, .interaction.depth=15, .shrinkage = 0.01, .n.minobsinnode = 10))
My query is: how do I pass n.minobsinnode to model1?
Thanks
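If it helps, a rough sketch: in caret versions whose gbm module includes n.minobsinnode, the grid columns are named without the leading dot, and all four parameters must be present:

myGrid <- expand.grid(n.trees = 100, interaction.depth = 15,
                      shrinkage = 0.01, n.minobsinnode = 10)
model1 <- train(X, Y, method = "gbm", trControl = myControl, tuneGrid = myGrid)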
Thanks for the code. I am really enjoying using it. I am seeing an issue pop up on only a subset of the data sets that I am training. Any ideas what I might look into to address this?
Error in foo(betas[[length(betas)]], newdata)[, 1] :
subscript out of bounds
The traceback looks like this.
11: predictionFunction(method, modelFit, tempX, custom = models[[i]]$control$custom$prediction)
10: extractPrediction(list(object), unkX = newdata, unkOnly = TRUE,
...)
9: predict.train(x, type = "raw", newdata = newdata, ...)
8: predict(x, type = "raw", newdata = newdata, ...) at helper_functions.R#137
7: FUN(X[[i]], ...)
6: pblapply(X, FUN, ...)
5: pbsapply(list_of_models, function(x) {
if (type == "Classification" & x$control$classProbs) {
predict(x, type = "prob", newdata = newdata, ...)[, 2]
}
else {
predict(x, type = "raw", newdata = newdata, ...)
}
}) at helper_functions.R#133
4: multiPredict(ensemble$models, type, ...) at caretEnsemble.R#75
3: predict.caretEnsemble(fit, newdata = data.x)
2: predict(fit, newdata = data.x) at deposits-models.R#130
1: trainThenPredict(by, data, data.id)
The base prediction is the same: make predictions from each model in the ensemble. The only difference is in one case we weight them and in the other case we run them through another caret model.
Maybe caretList needs its own class, and there should be a predict.caretList function, which returns a matrix. Both predict.caretEnsemble and predict.caretStack could call predict.caretList.
caretList would obviously return a caretList, and we'd need another helper function to turn arbitrary lists of caret models into a caretList. The models slots of caretEnsemble and caretStack would both be of class caretList.
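A rough sketch of that shared prediction layer (hypothetical method; it mirrors the logic visible in the traceback above):

predict.caretList <- function(object, newdata = NULL, ...) {
  # One column per model: class probabilities where available, raw otherwise
  preds <- sapply(object, function(m) {
    if (m$modelType == "Classification" && m$control$classProbs) {
      predict(m, newdata = newdata, type = "prob", ...)[, 2]
    } else {
      predict(m, newdata = newdata, type = "raw", ...)
    }
  })
  as.matrix(preds)
}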
Similar to safeOptAuc, stops ensembling if accuracy decreases. Consider making the safe functions the defaults, and renaming the old functions fastOpt*.
Then we could have a flag to caretEnsemble for safe=TRUE. If TRUE, we use the safe functions; if FALSE, we use the slightly faster non-safe functions.
If there's not a big difference between the safe vs non-safe speed, we get rid of the non-safe ones.
The new version of devtools looks pretty slick. It has functions for setting up testthat, putting data in the correct folders, and, even better, using Travis CI to automate testthat tests.
I'd really love to set up this project to use Travis for automated testing!
http://blog.rstudio.org/2014/10/02/devtools-1-6/
https://travis-ci.org
We should make a short package vignette showing the difference between non-ensembled train objects and caretEnsemble objects and briefly discussing the advantages of ensembles.
Maybe not for version 1.0, but 1.1 or so.
I see that the Dev branch has some more progress regarding multiclass classification ensemble stacking, but unfortunately it is not yet done.
Do you plan on implementing this, and/or could you point me in the right direction so I might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended).
Similar to caretEnsemble: pass plot through to plot.train on the ensemble's caret model.
The tests make some cool plots... I'd love to see a vignette that shows off some of the plot.caretEnsemble and plot.caretList plots.
@zachmayer this package is working in production for me in a couple of places. I was thinking it would be nice to finish up some of the current features and build unit tests to make sure the core feature -- ensembling of classification and regression caret objects -- stays stable and works. I'm a little nervous that some of the additions I have been making over in Dev cloud up the focus of the package more than they help.
With that in mind, I was thinking we could use this issue to brainstorm ideas for package extensions and prioritize them a bit based on needs. From my end I would say the top 4 things to improve are: caretEnsemble and caretList objects. Then some "nice to have" features would be:
What do you think?
While using caretEnsemble we come across two AUCs. One arises while computing the error, which is obtained by
error <- colAUC(as.matrix(temp) %*% weights, temp3)
(here as.matrix(temp) %*% weights gives us the combined predictions of all the models),
and the other is given by the predicted probabilities obtained from
return(list(preds = data.frame(pred = est, se = se.tmp), weight = conf))
which are then used to obtain the ROC, AUC, and accuracy.
The two AUCs differ. So which is the correct one? Which predictions should be used to obtain the accuracy?
Regards,
Amit
On Mac OS, doing a fresh install:
All methods in methodList NOT in tuneList get added with arguments = NULL.
All methods IN tuneList also get added with arguments = NULL.
So if you specify:
methodList='rf'
tuneList=list(rf=list(tuneLength=10), rpart=list(tuneLength=5))
3 models are fit:
rf with default train arguments
rf with tuneLength 10
rpart with tuneLength 10
This is post 1.0
If you ensemble a series of models using external packages (e.g. mda or glmnet), then clear your workspace and R session, load the caretEnsemble object, and attempt to use the predict method, it fails.
This is because the predict method for each individual model type is not necessarily in the namespace. predict.caretEnsemble should be able to identify and import the predict methods necessary to generate predictions for each model type being ensembled.
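A rough sketch of how that could work: each train object records its package dependencies in modelInfo$library, so the predict method could load those namespaces first (the helper name is hypothetical):

loadModelPackages <- function(list_of_models) {
  pkgs <- unique(unlist(lapply(list_of_models, function(m) m$modelInfo$library)))
  for (p in pkgs) {
    # Loading a namespace also registers that package's S3 predict methods
    requireNamespace(p, quietly = TRUE)
  }
}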
If the user decides the model needs some more iterations.
Maybe one of the graphs should be a "convergence" graph that shows the error decreasing over time? And if the error looks like it's still going down, the user could call update to run more iterations?
This will help us detect problems as they arise (e.g. next time there's another major update to caret).
When I run the following example:
library(caretEnsemble)
devtools::install_github("jknowles/EWStools") # to get the data
library(EWStools)
data(EWStestData)
ctrl <- trainControl(method = "repeatedcv",
repeats = 3, classProbs = TRUE, savePredictions = TRUE,
summaryFunction = twoClassSummary)
out <- buildModels(methodList = c("knn", "glm", "nb", "lda", "ctree"),
control = ctrl, x = modeldat$traindata$preds,
y = modeldat$traindata$class, tuneLength = 5)
# ensemble we will
out.ens <- caretEnsemble(out)
The model ensembles correctly and the object can produce predictions, but it is curious what summary reports:
> summary(out.ens)
The following models were ensembled: knn, glm, nb, ctree
They were weighted:
0.03 0.51 0.45 0.01
The resulting AUC is: 0.9857
The fit for each individual model on the AUC is:
method metric metricSD
1 knn 0.9402030 0.023904627
2 glm 0.9863390 0.008624078
3 nb NaN NA
4 ctree 0.8962804 0.036620901
It looks like the extractModRes function has an issue in how it selects the error parameters for the individual models.
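For comparison, a rough sketch of how the per-model metric could be pulled at each model's best tune (illustrative only, not the package's actual extractModRes):

extractBestRes <- function(list_of_models, metric = "ROC") {
  do.call(rbind, lapply(list_of_models, function(m) {
    # merge picks out the row of results matching the chosen tuning parameters
    best <- merge(m$results, m$bestTune)
    data.frame(method = m$method,
               metric = best[[metric]],
               metricSD = best[[paste0(metric, "SD")]])
  }))
}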
Currently, when training a nnet or a multinom or other forms of neural networks, the R console is filled with information about the individual iterations. Passing trace = FALSE to train when fitting a nnet suppresses this console clutter.
However, buildModels currently will not properly pass trace = FALSE through to train. This should be fixed!
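For reference, with plain caret the flag travels through train's ... to the fitting function, e.g.:

fit <- train(Species ~ ., data = iris, method = "nnet",
             trControl = trainControl(method = "cv", number = 5),
             trace = FALSE)  # silences nnet's per-iteration console output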
While trying to install caretEnsemble, I get the message:
install.packages('github')
Warning message:
package ‘github’ is not available (for R version 3.1.1)
What is the way to install caretEnsemble on Linux (Fedora 17)?
Here is the sessionInfo()
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] caret_6.0-35 ggplot2_1.0.0 lattice_0.20-29 devtools_1.5
loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-5 brglm_0.5-9 car_2.0-20
[4] codetools_0.2-8 colorspace_1.2-4 digest_0.6.4
[7] evaluate_0.5.5 foreach_1.4.2 grid_3.1.1
[10] gtable_0.1.2 gtools_3.4.1 httr_0.4
[13] iterators_1.0.7 lme4_1.1-7 MASS_7.3-33
[16] Matrix_1.1-4 memoise_0.2.1 minqa_1.2.3
[19] munsell_0.4.2 nlme_3.1-117 nloptr_1.0.4
[22] nnet_7.3-8 parallel_3.1.1 plyr_1.8.1
[25] proto_0.3-10 Rcpp_0.11.2 RCurl_1.95-4.3
[28] reshape2_1.4 scales_0.2.4 splines_3.1.1
[31] stringr_0.6.2 tools_3.1.1 whisker_0.3-2
Thanks
First of all, let me thank you for the much-needed ensemble package you wrote.
I have run into an issue when attempting to train a large number of models, all for regression. Pre-selecting the models that work, I get 60 to 70 functional models, all listed in caret and all working individually. Considering this, I attempt to run
greedy <- caretEnsemble(all.models, iter=1000L)
and I get the following message:
Error: length(unique(indexes)) == 1 is not TRUE
From your code I traced the origin of the issue to
makePredObsMatrix(all.models)
Error: length(unique(indexes)) == 1 is not TRUE
length(unique(indexes)) == 2
In one example the lengths for unique(indexes) are 212 in [[1]] and 218 in [[2]], while the length of the observations must correspond to one of these as per the description.
Do you know if there is a way to correct this error from within, so as to include all models? This seems to be a model- or group-of-models-specific issue (maybe related to caret 6.0) as far as I can see.
Any help will be welcomed. Thank you.
I am using the following code, but it results in Error: length(unique(indexes)) == 1 is not TRUE. As per a previous post I have included the seeding part, but I am still getting the error. Please help. I am using R 3.0.3.
library(mlbench)
data(Sonar)
library(caret)
set.seed(123)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- sample.int(1000, 3)
seeds[[11]] <- sample.int(1000, 1)
inTraining <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTraining, ]
testing <- Sonar[-inTraining, ]
fitControl <- trainControl(method = "cv", number = 10,repeats = 1,classProbs=TRUE, savePredictions=TRUE, seeds=seeds)
PP <- c('center', 'scale')
set.seed(1)
gbmFit1 <- train(Class ~ ., data = training,method = "svmRadial",trControl = fitControl,verbose = FALSE)
gbmFit2 <- train(Class ~ ., data = training,method = "gbm",trControl = fitControl,verbose = FALSE)
all.models <- list( gbmFit1 , gbmFit2 )
greedy <- caretEnsemble(all.models, iter=1000L)
sort(greedy$weights, decreasing=TRUE)
greedy$error
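A rough guess at a fix: besides seeds, the two models also need to share the same resampling indexes, since caretEnsemble requires every model to have been fit on identical folds. Something like:

set.seed(123)
myFolds <- createMultiFolds(training$Class, k = 10, times = 1)
fitControl <- trainControl(method = "cv", number = 10, index = myFolds,
                           classProbs = TRUE, savePredictions = TRUE, seeds = seeds)
# Then refit gbmFit1 and gbmFit2 with this fitControl before calling caretEnsemble.
# Note each seeds[[i]] must also have at least as many integers as the largest
# tuning grid among the models, or caret will raise a "Bad seeds" error.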
Say we want 2 random forests in the ensemble, one with mtry=2 and one with mtry=5. Currently we can't have duplicate methods.
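One possible interface (purely a hypothetical sketch, not the current API): let each tuneList entry carry its own method, so named entries allow the same method to appear twice with different settings:

tuneList <- list(
  rf1 = list(method = "rf", tuneGrid = data.frame(mtry = 2)),
  rf2 = list(method = "rf", tuneGrid = data.frame(mtry = 5))
)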
It's possible we don't need the setSeeds. I think #50 will fix the incorrect resampling indexes error when we go to ensemble the models.
I think seeds only need to be set when models are fit in parallel, because if you start 10 R processes (one for each of 10 CV folds) they will tend to have the same random seed, and therefore all fit the same model with no randomness.
I think pre-setting the re-sampling indexes will make the samples used for each model truly random, and I think the default caret logic will make sure different seeds are used for different models, fit in parallel.
I'd like to hide as much code as possible (other than caretList, caretEnsemble, and caretStack, of course!) from the 1.0 release. I don't want people to start using internal functions and then get mad when we refactor or replace them.
I think CRAN likes having examples
Hi Zach,
caretEnsemble is proving to be very useful for my work and it is working fine. However, I would like caretEnsemble to be installed on a high-end machine that is not connected to the internet. So my query is:
If I do not have an internet connection on my machine (but do on a different machine), is there a standalone version (tarball) of caretEnsemble which can be installed without an internet connection? And how do I install it (for an R package, is it R CMD INSTALL package.tar.gz)?
Thanks
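A rough sketch of the offline route: build the tarball on the connected machine (e.g. with devtools::build() on a checkout of the repo), copy it over, and then, with caret and its dependencies already installed on the target machine (the filename below is illustrative):

install.packages("caretEnsemble_1.0.tar.gz", repos = NULL, type = "source")

which does the same thing as R CMD INSTALL caretEnsemble_1.0.tar.gz from the shell.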
Goes through the models and removes data that isn't needed, e.g. resamples, out-of-sample predictions, the original dataset, and anything else we can think of.
After making a caretEnsemble or caretStack, we'd run clean.caretList to shrink the final models as much as possible.
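A rough sketch of what clean.caretList might strip (hypothetical; keeps what predict.train needs):

clean.caretList <- function(models) {
  lapply(models, function(m) {
    m$trainingData <- NULL      # the original dataset
    m$resample <- NULL          # per-resample performance
    m$pred <- NULL              # out-of-sample predictions (only needed at ensemble time)
    m$control$index <- NULL     # resampling indexes
    m$control$indexOut <- NULL
    m
  })
}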
As each model is fit, print the name of the model.
Building models with identical resampling profiles may be unclear to users. Allowing the user to specify the methods to build train objects from, along with the control parameters, and generating the model list as a named class, e.g. a caretList object, would make the ensemble approach much more approachable to users. It could look something like:
methodList <- c("rf", "nnet", "knn", "lm")
control <- trainControl(...)
buildModels <- function(x, y, methodList, control, ...) {
  # myseeds <- build seeds using the control parameters (omitted here)
  modelList <- vector("list", length(methodList))
  names(modelList) <- methodList
  for (i in methodList) {
    modelList[[i]] <- train(x = x, y = y, method = i, trControl = control, ...)
  }
  return(modelList)
}
Not sure if this is 'out of scope' for this package, but I think it would make the ensembling process more accessible.
If buildModels is not passed a trainControl object with pre-defined indexes for use with all models, we should raise a warning and try to set the resampling indexes ourselves, based on trainControl$method.
If we can't set the indexes, we should raise an error.
This probably requires a new function, setIndexes.
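A rough sketch of setIndexes (hypothetical helper; covers only a few trainControl methods):

setIndexes <- function(ctrl, y) {
  if (!is.null(ctrl$index)) return(ctrl)
  warning("No pre-defined indexes; generating them from trainControl$method")
  ctrl$index <- switch(ctrl$method,
    cv         = createFolds(y, k = ctrl$number, returnTrain = TRUE),
    repeatedcv = createMultiFolds(y, k = ctrl$number, times = ctrl$repeats),
    boot       = createResample(y, times = ctrl$number),
    stop("Unable to set resampling indexes for method '", ctrl$method, "'"))
  ctrl
}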