zachmayer / caretEnsemble
caret models all the way down :turtle:
License: Other
I'd like to see us add a function that prunes the train models stored in the caretEnsemble object. There is a lot of stuff stored in these individual train objects that takes up a lot of space and is not necessary for doing predictions or diagnostics. This is especially a problem when training models with a large number of observations (50k+).
This is an example of how much can be saved from a glm object and an approach we might want to take here: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r
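For reference, a rough sketch of the kind of stripping the win-vector post describes for a plain glm (the function name stripGlm is hypothetical; prediction still works afterwards because predict.glm mainly needs the coefficients, terms, and family):

stripGlm <- function(fit) {
  # Drop the components predict.glm does not need
  fit$y <- NULL
  fit$model <- NULL
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$qr$qr <- NULL
  fit$linear.predictors <- NULL
  fit$weights <- NULL
  fit$prior.weights <- NULL
  fit$data <- NULL
  # Drop any large captured environment
  attr(fit$terms, ".Environment") <- globalenv()
  fit
}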
For some model methods the approximation used by setSeeds fails to produce a seed object of the correct dimensions. Here is a MWE:
set.seed(442)
library(caret)
library(hda)
library(caretEnsemble)
train <- twoClassSim(n = 1000, intercept = -8, linearVars = 3,
noiseVars = 10, corrVars = 4, corrValue = 0.6)
test <- twoClassSim(n = 1500, intercept = -7, linearVars = 3,
noiseVars = 10, corrVars = 4, corrValue = 0.6)
fitControl <- trainControl(method='cv', number = 5, savePredictions = FALSE,
classProbs=TRUE, summaryFunction = twoClassSummary)
out <- buildModels(methodList = c("hda", "multinom"), control = fitControl,
x = train[, -23],
y = train[ , "Class"], metric = "ROC",
tuneLength = 4, baseSeed = 1204)
Which produces this helpful output:
Error in train.default(x = x, y = y, method = i, trControl = ctrl,
tuneLength = tl, :
Bad seeds: the seed object should be a list of length 6 with 5 integer
vectors of size 48 and the last list element having a single integer
Which is absolutely correct. hda seems to expect seeds along its grid to follow the formula tuneLength * 3 * tuneLength, while usually the tuneLength^2 approximation is what works.
I should have this patched up rather quickly for hda, but I'm wondering if there is a more programmatic way to set seeds based on tuneLength using some caret functions?
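One possibility (a rough, untested sketch; the name setSeedsFromGrid is hypothetical): size each seed vector from the method's actual default grid via caret::getModelInfo, rather than approximating with tuneLength^2:

setSeedsFromGrid <- function(method, x, y, tuneLength, number, repeats = 1) {
  # Look up the method's default tuning grid; its row count is the number of
  # models evaluated per resample, which is how long each seed vector must be.
  grid <- caret::getModelInfo(method, regex = FALSE)[[1]]$grid(x, y, len = tuneLength)
  B <- number * repeats  # number of resamples
  seeds <- vector("list", B + 1)
  for (i in seq_len(B)) seeds[[i]] <- sample.int(10000, nrow(grid))
  seeds[[B + 1]] <- sample.int(10000, 1)  # the final single seed caret expects
  seeds
}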
I am applying caretEnsemble on a large dataset and have noticed the curious case where ensembling sometimes produces an ensembled prediction with an AUC that is lower than the AUC of any of the ensembled models.
I need to check a) that this is not a bug in the print or summary methods for caretEnsemble, and b) that we issue a warning when this is the case, so that the user is notified.
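A rough sketch of the warning check described in b) (names hypothetical):

checkEnsembleAUC <- function(ensembleAUC, modelAUCs) {
  # Warn if the ensemble does worse than its best component model
  if (ensembleAUC < max(modelAUCs)) {
    warning(sprintf(
      "Ensemble AUC (%.4f) is below the best individual model's AUC (%.4f)",
      ensembleAUC, max(modelAUCs)))
  }
  invisible(ensembleAUC)
}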
After computing your weights with greedOptRMSE, you are normalizing them to sum to one, so you wind up getting a convex combination. You could optimize the same loss function with those constraints directly by formulating the problem as a quadratic program.
This should work, but is untested:
qpOptRMSE <- function(x, y) {
  # Minimize ||x %*% w - y||^2 subject to sum(w) = 1 and w >= 0,
  # written as the quadratic program: min 1/2 w'Dw - d'w
  require(quadprog)
  D <- crossprod(x)     # x'x
  d <- crossprod(x, y)  # x'y
  # First column: the equality constraint sum(w) = 1; remaining columns: w_i >= 0
  A <- cbind(rep(1, ncol(x)), diag(ncol(x)))
  bvec <- c(1, rep(0, ncol(x)))
  solve.QP(Dmat = D, dvec = d, Amat = A, bvec = bvec, meq = 1)$solution
}
It's based on this (which is GPL-3 licensed).
Zachary,
I am trying to install caretEnsemble on Windows 7. Please tell me the exact procedure for installing on Windows. I have downloaded caretEnsemble-master.zip, but after installing it from the zip file it is not seen in the library list, so I can't load the library. The same happens with the caretEnsemble Dev version. I would appreciate a quick reply on this.
Regards,
Amit
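For what it's worth, the usual route is to install straight from GitHub with devtools rather than from the zip (a sketch; assumes Rtools is installed on Windows and that the branch names below still exist):

install.packages("devtools")
library(devtools)
install_github("zachmayer/caretEnsemble")               # master branch
install_github("zachmayer/caretEnsemble", ref = "Dev")  # Dev branch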
In some cases this might be more desirable than just letting caret do the parallelization, e.g. when we're fitting a bunch of different models, each of which has one set of tuning parameters.
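A rough sketch of what parallelizing across (rather than within) models could look like, using foreach with a shared fold index so the fits stay ensemble-compatible (assumes the doParallel package; the two-class subset of iris is just for illustration):

library(caret)
library(doParallel)
dat <- iris[iris$Species != "setosa", ]
dat$Species <- factor(dat$Species)
cl <- makeCluster(2)
registerDoParallel(cl)
folds <- createFolds(dat$Species, k = 5, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = folds, savePredictions = TRUE,
                     allowParallel = FALSE)  # keep caret serial within each model
models <- foreach(m = c("glm", "rpart", "knn"), .packages = "caret") %dopar% {
  train(Species ~ ., data = dat, method = m, trControl = ctrl)
}
stopCluster(cl)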
I am using the mlbench data example for classification given on the site. While building the models I get the following error for all the models:
Error in train.default(X[train, ], Y[train], method = "mlpWeightDecay", :
final tuning parameters could not be determined
This message is for mlpWeightDecay, but the error "final tuning parameters could not be determined" occurs for all the algorithms. Can somebody help me with this?
This is the error while using glmnet:
"Error in train.default(X[train, ], Y[train], method = "gbm", trControl = myControl, :
final tuning parameters could not be determined"
Just a note to choose between PR #42 for S3 methods and PR #41 for S4 methods for the major caretStack and caretEnsemble methods in the package.
@zachmayer Just pull in one of the pull requests and you can reject the other, as well as PR #39, which is a subset of the other two.
If the new dataset has no incomplete cases, I'd like to suppress the message("Predictions being made only for cases with complete data").
Similarly, if all models have available data, I'd like to suppress: message("Predictions being made only from models with available data")
I think it's a little confusing to print those messages in cases where they don't apply. I think we could fix this by adding a newdata=NULL argument and passing it to predict.train.
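A rough sketch of the conditional messaging (illustrative only, not the package's actual internals):

maybeMessageCompleteCases <- function(newdata) {
  # Only mention complete-case filtering when there are incomplete cases
  if (!is.null(newdata) && any(!complete.cases(newdata))) {
    message("Predictions being made only for cases with complete data")
  }
}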
The datasets used in the test suite are undocumented. To pass R CMD check they need to either a) be documented, or b) be used only for testing and stored appropriately, as in the lme4 source (https://github.com/lme4/lme4/tree/master/inst/testdata).
Thanks, Zach, for the latest updates.
In my control I would like to substitute
index = createMultiFolds(Y[train], k = folds, times = repeats)
i.e.
myControl <- trainControl(method = "cv", number = folds, repeats = repeats, savePredictions = TRUE, classProbs = FALSE, returnResamp = "final", returnData = TRUE, allowParallel = TRUE, seeds = mseeds, index = createMultiFolds(Y[train], k = folds, times = repeats))
with
index = createTimeSlices(Y[train], initialWindow = length(Y[train]) - sum(train == FALSE), horizon = 1, fixedWindow = FALSE)
i.e.
myControl <- trainControl(method = "cv", number = folds, repeats = repeats, savePredictions = TRUE, classProbs = FALSE, returnResamp = "final", returnData = TRUE, allowParallel = TRUE, seeds = mseeds, index = createTimeSlices(Y[train], initialWindow = length(Y[train]) - sum(train == FALSE), horizon = 1, fixedWindow = FALSE))
This works fine; however, when attempting to run my control on a model I get the following error:
model2 <- train(X[train,], Y[train], method='bagEarth', trControl=myControl, preProcess=PP)
Error in -unique(training) :
invalid argument to unary operator
Is there a way to apply caret's createTimeSlices to the ensemble? Maybe I am missing something.
Any help welcomed.
Thank you,
Barnaby
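One thing worth checking (a rough guess at the cause): createTimeSlices returns separate train and test index lists, which map to trainControl's index and indexOut arguments rather than to index alone:

slices <- createTimeSlices(Y[train],
                           initialWindow = length(Y[train]) - sum(train == FALSE),
                           horizon = 1, fixedWindow = FALSE)
myControl <- trainControl(method = "cv", savePredictions = TRUE,
                          index = slices$train, indexOut = slices$test)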
Lets you change more of the stacking parameters by hand, if you so desire.
The code below constructs a number of (binary) classification models. Upon building an ensemble, an error is thrown about multiclass problems.
myControl=trainControl(
classProbs = TRUE,
summaryFunction=twoClassSummary,
method = "repeatedcv",
repeats = 10,
number=10
)
glmtrain = train(x,y, trControl = myControl, method='glm', metric='ROC')
gbmtrain = train(x,y, trControl = myControl, method='gbm', metric='ROC')
rftrain = train(x,y, trControl = myControl, method='rf', metric='ROC')
nbtrain = train(x,y, trControl = myControl, method='nb', metric='ROC')
nestedList <- list(glmtrain, gbmtrain, rftrain, nbtrain)
trainedEnsemble = caretEnsemble(nestedList, iter=1000)
Error in checkModels_extractTypes(list_of_models) :
Not yet implemented for multiclass problems
Todo in the code:
Probably a matter of writing 3 functions:
checkObserveds
checkRowIndexes
checkResamples
Each one should give an explanatory error message of what's wrong, which model(s) are the culprit, and why we can't make an ensemble in this situation.
This is pulled out of #3, which kind of grew into many separate bug reports.
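A rough sketch of what one of these, checkRowIndexes, might look like (names and details hypothetical):

checkRowIndexes <- function(list_of_models) {
  # All models must have been resampled on identical folds to be ensembled
  idx <- lapply(list_of_models, function(m) m$control$index)
  if (length(unique(idx)) != 1) {
    bad <- which(!sapply(idx, identical, idx[[1]]))
    stop("Model(s) ", paste(bad, collapse = ", "),
         " used different resampling indexes than model 1, ",
         "so out-of-sample predictions cannot be aligned for ensembling.")
  }
  invisible(TRUE)
}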
This can also reduce the object size. The user could call this after fitting the ensemble to reduce its size without changing its output.
Inclusion of n.minobsinnode in a GBM model gives an error:
model1 <- train(X, Y, method='gbm', trControl=myControl,tuneGrid=expand.grid(.n.trees=100, .interaction.depth=15, .shrinkage = 0.01, .n.minobsinnode = 10))
My query is: how do I pass n.minobsinnode to model1?
Thanks
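If it helps, a rough sketch: in caret versions whose gbm module includes n.minobsinnode, the grid columns are named without the leading dot, and all four parameters must be present:

myGrid <- expand.grid(n.trees = 100, interaction.depth = 15,
                      shrinkage = 0.01, n.minobsinnode = 10)
model1 <- train(X, Y, method = "gbm", trControl = myControl, tuneGrid = myGrid)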
Thanks for the code. I am really enjoying using it. I am seeing an issue pop up on only a subset of the data sets that I am training. Any ideas what I might look into to address this?
Error in foo(betas[[length(betas)]], newdata)[, 1] :
subscript out of bounds
The traceback looks like this.
11: predictionFunction(method, modelFit, tempX, custom = models[[i]]$control$custom$prediction)
10: extractPrediction(list(object), unkX = newdata, unkOnly = TRUE,
...)
9: predict.train(x, type = "raw", newdata = newdata, ...)
8: predict(x, type = "raw", newdata = newdata, ...) at helper_functions.R#137
7: FUN(X[[i]], ...)
6: pblapply(X, FUN, ...)
5: pbsapply(list_of_models, function(x) {
if (type == "Classification" & x$control$classProbs) {
predict(x, type = "prob", newdata = newdata, ...)[, 2]
}
else {
predict(x, type = "raw", newdata = newdata, ...)
}
}) at helper_functions.R#133
4: multiPredict(ensemble$models, type, ...) at caretEnsemble.R#75
3: predict.caretEnsemble(fit, newdata = data.x)
2: predict(fit, newdata = data.x) at deposits-models.R#130
1: trainThenPredict(by, data, data.id)
The base prediction is the same: make predictions from each model in the ensemble. The only difference is in one case we weight them and in the other case we run them through another caret model.
Maybe caretList needs its own class, and there should be a predict.caretList function, which returns a matrix. Both predict.caretEnsemble and predict.caretStack could call predict.caretList.
caretList would obviously return a caretList, and we'd need another helper function to turn arbitrary lists of caret models into a caretList. The models slots of caretEnsemble and caretStack would both be of class caretList.
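A rough sketch of that shared prediction layer (hypothetical method; it mirrors the logic visible in the traceback above):

predict.caretList <- function(object, newdata = NULL, ...) {
  # One column per model: class probabilities where available, raw otherwise
  preds <- sapply(object, function(m) {
    if (m$modelType == "Classification" && m$control$classProbs) {
      predict(m, newdata = newdata, type = "prob", ...)[, 2]
    } else {
      predict(m, newdata = newdata, type = "raw", ...)
    }
  })
  as.matrix(preds)
}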
Similar to safeOptAuc, stops ensembling if accuracy decreases. Consider making the safe functions the defaults, and renaming the old functions fastOpt*.
Then we could have a flag to caretEnsemble for safe=TRUE. If TRUE, we use the safe functions; if FALSE, we use the slightly faster non-safe functions.
If there's not a big difference between the safe vs non-safe speed, we get rid of the non-safe ones.
The new version of devtools looks pretty slick. It has functions for setting up testthat, putting data in the correct folders, and, even better, using Travis CI to automate testthat tests.
I'd really love to set up this project to use Travis for automated testing!
http://blog.rstudio.org/2014/10/02/devtools-1-6/
https://travis-ci.org
We should make a short package vignette showing the difference between non-ensembled train objects and caretEnsemble objects and briefly discussing the advantages of ensembles.
Maybe not for version 1.0, but 1.1 or so.
I see that the Dev branch has some more progress regarding multiclass classification ensemble stacking, but unfortunately it is not yet done.
Do you plan on implementing this, and/or could you point me in the right direction so I might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended).
Similar to caretEnsemble: pass plot through to plot.train on the ensemble's caret model.
The tests make some cool plots... I'd love to see a vignette that shows off some of the plot.caretEnsemble and plot.caretList plots.
@zachmayer this package is working in production for me in a couple of places. I was thinking it would be nice to finish up some of the current features and build unit tests to make sure the core feature -- ensembling of classification and regression caret objects -- stays stable and works. I'm a little nervous that some of the additions I have been making over in Dev cloud up the focus of the package more than they help.
With that in mind, I was thinking we could use this issue to brainstorm ideas for package extensions and prioritize them a bit based on needs. From my end I would say the top 4 things to improve are: caretEnsemble and caretList objects. Then some "nice to have" features would be:
What do you think?
While using caretEnsemble we come across two AUCs. One arises while computing the error, which is obtained by
error <- colAUC(as.matrix(temp) %*% weights, temp3)
(here as.matrix(temp) %*% weights gives us the combined predictions of all the models),
and the other is given by the predicted probabilities obtained from
return(list(preds = data.frame(pred = est, se = se.tmp), weight = conf))
which are then used to obtain the ROC, AUC, and accuracy.
The two AUCs differ. So which is the correct one? Which predictions should be used to obtain the accuracy?
Regards,
Amit
On Mac OS, doing a fresh install:
All methods in methodList NOT in tuneList get added with arguments = NULL.
All methods IN tuneList also get added with arguments = NULL.
So if you specify:
methodList='rf'
tuneList=list(rf=list(tuneLength=10), rpart=list(tuneLength=5))
3 models are fit:
rf with default train arguments
rf with tuneLength 10
rpart with tuneLength 10
This is post 1.0
If you ensemble a series of models using external packages (e.g. mda or glmnet), then clear your workspace and R session, load the caretEnsemble object, and attempt to use the predict method, it fails.
This is because the predict method for each individual model type is not necessarily in the namespace. predict.caretEnsemble should be able to identify and import the predict methods necessary to generate predictions for each model type being ensembled.
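A rough sketch of how that could work: each train object records its package dependencies in modelInfo$library, so the predict method could load those namespaces first (the helper name is hypothetical):

loadModelPackages <- function(list_of_models) {
  pkgs <- unique(unlist(lapply(list_of_models, function(m) m$modelInfo$library)))
  for (p in pkgs) {
    # Loading a namespace also registers that package's S3 predict methods
    requireNamespace(p, quietly = TRUE)
  }
}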
If the user decides the model needs some more iterations.
Maybe one of the graphs should be a "convergence" graph that shows the error decreasing over time? And if the error looks like it's still going down, the user could call update to run more iterations?
This will help us detect problems as they arise (e.g. next time there's another major update to caret).
When I run the following example:
library(caretEnsemble)
devtools::install_github("jknowles/EWStools") # to get the data
library(EWStools)
data(EWStestData)
ctrl <- trainControl(method = "repeatedcv",
repeats = 3, classProbs = TRUE, savePredictions = TRUE,
summaryFunction = twoClassSummary)
out <- buildModels(methodList = c("knn", "glm", "nb", "lda", "ctree"),
control = ctrl, x = modeldat$traindata$preds,
y = modeldat$traindata$class, tuneLength = 5)
# ensemble we will
out.ens <- caretEnsemble(out)
The model ensembles correctly and the object can produce predictions, but it is curious what summary reports:
> summary(out.ens)
The following models were ensembled: knn, glm, nb, ctree
They were weighted:
0.03 0.51 0.45 0.01
The resulting AUC is: 0.9857
The fit for each individual model on the AUC is:
method metric metricSD
1 knn 0.9402030 0.023904627
2 glm 0.9863390 0.008624078
3 nb NaN NA
4 ctree 0.8962804 0.036620901
It looks like the extractModRes function has an issue in how it selects the error parameters for the individual models.
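For comparison, a rough sketch of how the per-model metric could be pulled at each model's best tune (illustrative only, not the package's actual extractModRes):

extractBestRes <- function(list_of_models, metric = "ROC") {
  do.call(rbind, lapply(list_of_models, function(m) {
    # merge picks out the row of results matching the chosen tuning parameters
    best <- merge(m$results, m$bestTune)
    data.frame(method = m$method,
               metric = best[[metric]],
               metricSD = best[[paste0(metric, "SD")]])
  }))
}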
Currently, when training a nnet or a multinom or other forms of neural networks, the R console is filled with information about the individual iterations. Passing trace = FALSE to train when fitting a nnet suppresses this console clutter.
However, buildModels currently will not properly pass trace = FALSE through to train. This should be fixed!
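For reference, with plain caret the flag travels through train's ... to the fitting function, e.g.:

fit <- train(Species ~ ., data = iris, method = "nnet",
             trControl = trainControl(method = "cv", number = 5),
             trace = FALSE)  # silences nnet's per-iteration console output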
While trying to install caretEnsemble, I get the message:
install.packages('github')
Warning message:
package ‘github’ is not available (for R version 3.1.1)
What is the way to install caretEnsemble on Linux (Fedora 17)?
Here is the sessionInfo()
sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] caret_6.0-35 ggplot2_1.0.0 lattice_0.20-29 devtools_1.5
loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-5 brglm_0.5-9 car_2.0-20
[4] codetools_0.2-8 colorspace_1.2-4 digest_0.6.4
[7] evaluate_0.5.5 foreach_1.4.2 grid_3.1.1
[10] gtable_0.1.2 gtools_3.4.1 httr_0.4
[13] iterators_1.0.7 lme4_1.1-7 MASS_7.3-33
[16] Matrix_1.1-4 memoise_0.2.1 minqa_1.2.3
[19] munsell_0.4.2 nlme_3.1-117 nloptr_1.0.4
[22] nnet_7.3-8 parallel_3.1.1 plyr_1.8.1
[25] proto_0.3-10 Rcpp_0.11.2 RCurl_1.95-4.3
[28] reshape2_1.4 scales_0.2.4 splines_3.1.1
[31] stringr_0.6.2 tools_3.1.1 whisker_0.3-2
Thanks
First of all, let me thank you for the much-needed ensemble package you wrote.
I have run into an issue when attempting to train a large number of models, all for regression. Pre-selecting the models that work, I get 60 to 70 functional models, all listed in caret and all working individually. Considering this, I attempt to run
greedy <- caretEnsemble(all.models, iter=1000L)
and I get the following message:
Error: length(unique(indexes)) == 1 is not TRUE
From your code I traced the origin of the issue to
makePredObsMatrix(all.models)
Error: length(unique(indexes)) == 1 is not TRUE
length(unique(indexes)) == 2
In one example the lengths for unique(indexes) are 212 in [[1]] and 218 in [[2]], while the length of the observations must correspond to one of these as per the description.
Do you know if there is a way to correct this error from within, so as to include all models? This seems to be a model- or group-of-models-specific issue (maybe related to caret 6.0) as far as I can see.
Any help will be welcomed. Thank you.
I am using the following code, but it results in Error: length(unique(indexes)) == 1 is not TRUE. As per a previous post I have included the seeding part, but I am still getting the error. Please help. I am using R 3.0.3.
library(mlbench)
data(Sonar)
library(caret)
set.seed(123)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- sample.int(1000, 3)
seeds[[11]] <- sample.int(1000, 1)
inTraining <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTraining, ]
testing <- Sonar[-inTraining, ]
fitControl <- trainControl(method = "cv", number = 10,repeats = 1,classProbs=TRUE, savePredictions=TRUE, seeds=seeds)
PP <- c('center', 'scale')
set.seed(1)
gbmFit1 <- train(Class ~ ., data = training,method = "svmRadial",trControl = fitControl,verbose = FALSE)
gbmFit2 <- train(Class ~ ., data = training,method = "gbm",trControl = fitControl,verbose = FALSE)
all.models <- list( gbmFit1 , gbmFit2 )
greedy <- caretEnsemble(all.models, iter=1000L)
sort(greedy$weights, decreasing=TRUE)
greedy$error
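A rough guess at a fix: besides seeds, the two models also need to share the same resampling indexes, since caretEnsemble requires every model to have been fit on identical folds. Something like:

set.seed(123)
myFolds <- createMultiFolds(training$Class, k = 10, times = 1)
fitControl <- trainControl(method = "cv", number = 10, index = myFolds,
                           classProbs = TRUE, savePredictions = TRUE, seeds = seeds)
# Then refit gbmFit1 and gbmFit2 with this fitControl before calling caretEnsemble.
# Note each seeds[[i]] must also have at least as many integers as the largest
# tuning grid among the models, or caret will raise a "Bad seeds" error.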
Say we want 2 random forests in the ensemble, one with mtry=2 and one with mtry=5. Currently we can't have duplicate methods.
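One possible interface (purely a hypothetical sketch, not the current API): let each tuneList entry carry its own method, so named entries allow the same method to appear twice with different settings:

tuneList <- list(
  rf1 = list(method = "rf", tuneGrid = data.frame(mtry = 2)),
  rf2 = list(method = "rf", tuneGrid = data.frame(mtry = 5))
)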
It's possible we don't need the setSeeds. I think #50 will fix the incorrect resampling indexes error when we go to ensemble the models.
I think seeds only need to be set when models are fit in parallel, because if you start 10 R processes (one for each of 10 CV folds) they will tend to have the same random seed, and therefore all fit the same model with no randomness.
I think pre-setting the re-sampling indexes will make the samples used for each model truly random, and I think the default caret logic will make sure different seeds are used for different models, fit in parallel.
I'd like to hide as much code as possible (other than caretList, caretEnsemble, and caretStack, of course!) from the 1.0 release. I don't want people to start using internal functions and then get mad when we refactor or replace them.
I think CRAN likes having examples
Hi Zach,
caretEnsemble is proving to be very useful for my work and it is working fine. However, I would like caretEnsemble to be installed on a high-end machine that is not connected to the internet. So my query is:
If I do not have an internet connection on my machine (but do on a different machine), is there a standalone version (tarball) of caretEnsemble which can be installed without an internet connection? And how do I install it (for an R package, is it R CMD INSTALL package.tar.gz)?
Thanks
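A rough sketch of the offline route: build the tarball on the connected machine (e.g. with devtools::build() on a checkout of the repo), copy it over, and then, with caret and its dependencies already installed on the target machine (the filename below is illustrative):

install.packages("caretEnsemble_1.0.tar.gz", repos = NULL, type = "source")

which does the same thing as R CMD INSTALL caretEnsemble_1.0.tar.gz from the shell.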
Goes through the models and removes data that isn't needed, e.g. resamples, out-of-sample predictions, the original dataset, and anything else we can think of.
After making a caretEnsemble or caretStack, we'd run clean.caretList to shrink the final models as much as possible.
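A rough sketch of what clean.caretList might strip (hypothetical; keeps what predict.train needs):

clean.caretList <- function(models) {
  lapply(models, function(m) {
    m$trainingData <- NULL      # the original dataset
    m$resample <- NULL          # per-resample performance
    m$pred <- NULL              # out-of-sample predictions (only needed at ensemble time)
    m$control$index <- NULL     # resampling indexes
    m$control$indexOut <- NULL
    m
  })
}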
As each model is fit, print the name of the model.
Building models with identical resampling profiles may be unclear to users. Allowing the user to specify the methods to build train objects from, along with the control parameters, and generating the model list as a named class, e.g. a caretList object, would make the ensemble approach much more approachable to users. It could look something like:
methodList <- c("rf", "nnet", "knn", "lm")
control <- trainControl(...)
buildModels <- function(x, y, methodList, control, ...) {
  # myseeds <- build seeds using the control parameters (omitted here)
  modelList <- vector("list", length(methodList))
  names(modelList) <- methodList
  for (i in methodList) {
    modelList[[i]] <- train(x = x, y = y, method = i, trControl = control, ...)
  }
  return(modelList)
}
Not sure if this is 'out of scope' for this package, but I think it would make the ensembling process more accessible.
If buildModels is not passed a trainControl object with pre-defined indexes for use with all models, we should raise a warning and try to set the resampling indexes ourselves, based on trainControl$method.
If we can't set the indexes, we should raise an error.
This probably requires a new function, setIndexes.
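A rough sketch of setIndexes (hypothetical helper; covers only a few trainControl methods):

setIndexes <- function(ctrl, y) {
  if (!is.null(ctrl$index)) return(ctrl)
  warning("No pre-defined indexes; generating them from trainControl$method")
  ctrl$index <- switch(ctrl$method,
    cv         = createFolds(y, k = ctrl$number, returnTrain = TRUE),
    repeatedcv = createMultiFolds(y, k = ctrl$number, times = ctrl$repeats),
    boot       = createResample(y, times = ctrl$number),
    stop("Unable to set resampling indexes for method '", ctrl$method, "'"))
  ctrl
}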