
caretensemble's People

Contributors

eric-czech, jasoncec, jknowles, rlesca01, smsaladi, terrytangyuan, the-tourist-, topazand, washcycle, zachmayer

caretensemble's Issues

Classification using caretEnsemble

I am using the mlbench data example for classification given on the site. While building the models I get the following error for every model:
Error in train.default(X[train, ], Y[train], method = "mlpWeightDecay", :
final tuning parameters could not be determined
This message is for mlpWeightDecay, but the error "final tuning parameters could not be determined" appears for all of the algorithms. Can somebody help me with this?

trim.caretList, trim.caretEnsemble, and trim.caretStack methods

I'd like to see us add a function that prunes the train models stored in the caretEnsemble object. These individual train objects store a lot of data that takes up space and is not necessary for making predictions or running diagnostics. This is especially a problem when training models on a large number of observations (50k+).

This is an example of how much can be saved from a glm object and an approach we might want to take here: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r
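For reference, a rough sketch in the spirit of that post (not the package's code; exactly which components are safe to drop is an assumption and should be verified by comparing predict() output before and after trimming):

trimGlm <- function(fit) {
  # drop pieces of a glm that predict() does not need
  fit$y <- NULL
  fit$model <- NULL
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$weights <- NULL
  fit$prior.weights <- NULL
  fit$data <- NULL
  fit$qr$qr <- NULL
  # formulas and terms capture their enclosing environment, which can be huge
  environment(fit$terms) <- globalenv()
  environment(fit$formula) <- globalenv()
  fit
}

Comparing object.size(fit) with object.size(trimGlm(fit)) shows the savings.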

buildModels fails because of an error in seed dimensions from setSeeds

For some model methods, the approximation used by setSeeds fails to produce a seed object of the correct dimensions. Here is an MWE:

set.seed(442)
library(caret)
library(hda)
library(caretEnsemble)
train <- twoClassSim(n = 1000, intercept = -8, linearVars = 3, 
                     noiseVars = 10, corrVars = 4, corrValue = 0.6)
test <- twoClassSim(n = 1500, intercept = -7, linearVars = 3, 
                    noiseVars = 10, corrVars = 4, corrValue = 0.6)

fitControl <- trainControl(method='cv', number = 5, savePredictions = FALSE, 
                           classProbs=TRUE, summaryFunction = twoClassSummary)

out <- buildModels(methodList = c("hda", "multinom"), control = fitControl, 
                   x = train[, -23], 
                   y = train[ , "Class"], metric = "ROC",
                   tuneLength = 4, baseSeed = 1204)

Which produces this helpful output:

Error in train.default(x = x, y = y, method = i, trControl = ctrl, 
tuneLength = tl,  : 
  Bad seeds: the seed object should be a list of length 6 with 5 integer 
vectors of size 48 and the last list element having a single integer 

Which is absolutely correct: hda seems to expect seeds along its grid to follow the formula tuneLength * 3 * tuneLength, whereas the usual tuneLength^2 approximation works for most methods.

I should have this patched up rather quickly for hda, but I'm wondering if there is a more programmatic way to set seeds based on tuneLength using some caret functions?
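One possible direction, sketched here as a guess rather than a tested fix: derive the number of tuning combinations per method from caret's own model info and size the seed vectors from that, instead of approximating with tuneLength^2. The helper name makeSeeds is hypothetical; for a multi-method list you would take the maximum across methods.

library(caret)

makeSeeds <- function(method, x, y, tuneLength, nResamples, seed = 1204) {
  set.seed(seed)
  # ask caret how many rows the default tuning grid has for this method
  grid <- getModelInfo(method, regex = FALSE)[[method]]$grid(x, y, len = tuneLength)
  nTune <- nrow(grid)
  seeds <- vector(mode = "list", length = nResamples + 1)
  for (i in seq_len(nResamples)) seeds[[i]] <- sample.int(10000, nTune)
  seeds[[nResamples + 1]] <- sample.int(10000, 1)  # seed for the final model
  seeds
}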

Ensembling decreases accuracy

I am applying caretEnsemble to a large dataset and have noticed the curious case where ensembling sometimes produces an ensembled prediction with an AUC that is lower than the AUC of any of the component models.

I need to check that a) this is not a bug in the print or summary methods for caretEnsemble and b) we issue a warning when this is the case so that the user is notified.

For rMSE, use constrained least square for ensemble

After computing your weights with greedOptRMSE, you are normalizing them to sum to one, so you wind up getting a convex combination. You could optimize the same loss function with constraints directly by formulating the problem as a quadratic program.

This should work, but is untested:

qpOptRMSE <- function(x, y) {
  # Constrained least squares: minimize ||x %*% w - y||^2 subject to
  # sum(w) == 1 (equality, meq = 1) and w >= 0 (the remaining constraints).
  require(quadprog)
  D <- crossprod(x)                            # quadratic term
  d <- crossprod(x, y)                         # linear term
  A <- cbind(rep(1, ncol(x)), diag(ncol(x)))   # constraint matrix
  bvec <- c(1, rep(0, ncol(x)))                # constraint bounds
  solve.QP(Dmat = D, dvec = d, Amat = A, bvec = bvec, meq = 1)$solution
}

It's based on this (which is GPL-3 licensed).

Installation

Zachary,

I am trying to install caretEnsemble on Windows 7. Please tell me the exact procedure for installing it on Windows. I have downloaded caretEnsemble-master.zip, but after installing it from the zip file it is not seen in the library list, so I can't load the library. The same happens with the caretEnsemble Dev version. I would appreciate a quick reply on this.

Regards
Amit

Classification using caretEnsemble (follow-up)

The same "final tuning parameters could not be determined" error also occurs when using glmnet and gbm:
"Error in train.default(X[train, ], Y[train], method = "gbm", trControl = myControl, :
final tuning parameters could not be determined"

Choose S3 or S4 methods for caretEnsemble

Just a note to choose between PR #42 for S3 methods and PR #41 for S4 methods for the major caretStack and caretEnsemble methods in the package.

@zachmayer Just pull in one of the pull requests and you can reject the other, as well as PR #39, which is a subset of the other two.

Suppress unnecessary messages in predict.caretEnsemble

If the new dataset has no incomplete cases, I'd like to suppress the message("Predictions being made only for cases with complete data").

Similarly, if all models have available data, I'd like to suppress: message("Predictions being made only from models with available data")

I think it's a little confusing to print those messages in cases when they don't apply. I think we could fix this by adding a newdata=NULL argument and passing it to predict.train.
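A minimal sketch of the intended behaviour (the surrounding internals of predict.caretEnsemble are assumed here; models_with_data is a hypothetical flag):

if (any(!complete.cases(newdata))) {
  message("Predictions being made only for cases with complete data")
}
if (!all(models_with_data)) {
  message("Predictions being made only from models with available data")
}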

Substituting index=createMultiFolds by index= createTimeSlices produces an error when running the models

Thanks Zach for the latest updates

I would like to substitute in my control
index=createMultiFolds(Y[train], k=folds, times=repeats)

myControl = trainControl(method = "cv", number = folds, repeats = repeats, savePrediction = TRUE, classProbs = FALSE, returnResamp = "final",returnData = TRUE, allowParallel=TRUE, seeds = mseeds, index=createMultiFolds(Y[train], k=folds, times=repeats) )

by

index= createTimeSlices(Y[train], initialWindow= length(Y[train]) - sum(train==FALSE), horizon = 1, fixedWindow =FALSE))

or

myControl = trainControl(method = "cv", number = folds, repeats = repeats, savePrediction = TRUE, classProbs = FALSE, returnResamp = "final",returnData = TRUE, allowParallel=TRUE, seeds = mseeds, index= createTimeSlices(Y[train], initialWindow= length(Y[train]) - sum(train==FALSE), horizon = 1, fixedWindow =FALSE))

This works fine; however, when attempting to run my control on a model I get the following error:
model2 <- train(X[train,], Y[train], method='bagEarth',trControl=myControl,preProcess=PP)
Error in -unique(training) :
invalid argument to unary operator

Is there a way to apply caret's createTimeSlices to the ensemble, or maybe I am missing something?
Any help welcomed
Thank you
Barnaby
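One possible cause, offered as a guess rather than a confirmed diagnosis: createTimeSlices() returns a list with $train and $test components, while trainControl(index = ...) expects only the list of training-row indexes. A sketch of the control under that assumption:

slices <- createTimeSlices(Y[train],
                           initialWindow = length(Y[train]) - sum(train == FALSE),
                           horizon = 1, fixedWindow = FALSE)

myControl <- trainControl(method = "cv", number = folds,
                          savePredictions = TRUE, classProbs = FALSE,
                          returnResamp = "final", returnData = TRUE,
                          allowParallel = TRUE, seeds = mseeds,
                          index = slices$train, indexOut = slices$test)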

checkModels complains about multiclass problems using binary classification models

The code below constructs a number of (binary) classification models. Upon building an ensemble, an error is thrown about multiclass problems.

myControl = trainControl(
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  method = "repeatedcv",
  repeats = 10,
  number = 10
)

glmtrain = train(x,y, trControl = myControl, method='glm', metric='ROC')
gbmtrain = train(x,y, trControl = myControl, method='gbm', metric='ROC')
rftrain = train(x,y, trControl = myControl, method='rf', metric='ROC')
nbtrain = train(x,y, trControl = myControl, method='nb', metric='ROC')

  • with str(y) = Factor w/ 2 levels "YES","NO": 2 2 1 2 2 2 2 2 1 1 ...

nestedList <- list(glmtrain, gbmtrain, rftrain, nbtrain)

trainedEnsemble = caretEnsemble(nestedList, iter=1000)
Error in checkModels_extractTypes(list_of_models) :
Not yet implemented for multiclass problems

Add checks to extractBestPreds

Todo in the code:

Insert checks here: observeds are all equal, row indexes are equal, Resamples are equal

Probably a matter of writing 3 functions:
checkObserveds
checkRowIndexes
checkResamples

Each one should give an explanatory error message of what's wrong, which model(s) are the culprit, and why we can't make an ensemble in this situation.

This is pulled out of #3, which kind of grew into many separate bug reports.
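A minimal sketch of one of the three checks (the structure assumed here is that each model's $pred data frame holds out-of-fold obs, rowIndex, and Resample columns; checkRowIndexes and checkResamples would follow the same pattern on their respective columns):

checkObserveds <- function(list_of_models) {
  obs <- lapply(list_of_models, function(m) m$pred$obs[order(m$pred$rowIndex)])
  same <- vapply(obs, identical, logical(1), y = obs[[1]])
  if (!all(same)) {
    stop("Observed values differ for model(s) ",
         paste(which(!same), collapse = ", "),
         "; all models must be fit on the same outcome to be ensembled.")
  }
  invisible(TRUE)
}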

n.minobsinnode

Inclusion of n.minobsinnode in GBM model gives an error:
model1 <- train(X, Y, method='gbm', trControl=myControl,tuneGrid=expand.grid(.n.trees=100, .interaction.depth=15, .shrinkage = 0.01, .n.minobsinnode = 10))

My query is: how do I pass n.minobsinnode to model1?

Thanks
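A hedged answer sketch, assuming a caret version (6.0+) where tuning parameter names are not prefixed with a dot and gbm's grid must include n.minobsinnode:

gbmGrid <- expand.grid(n.trees = 100,
                       interaction.depth = 15,
                       shrinkage = 0.01,
                       n.minobsinnode = 10)
model1 <- train(X, Y, method = "gbm", trControl = myControl, tuneGrid = gbmGrid)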

Subscript Out of Bounds

Thanks for the code. I am really enjoying using it. I am seeing an issue pop up on only a subset of the data sets that I am training. Any ideas what I might look into to address this?

Error in foo(betas[[length(betas)]], newdata)[, 1] : 
  subscript out of bounds

The traceback looks like this.

11: predictionFunction(method, modelFit, tempX, custom = models[[i]]$control$custom$prediction)
10: extractPrediction(list(object), unkX = newdata, unkOnly = TRUE, 
    ...)
9: predict.train(x, type = "raw", newdata = newdata, ...)
8: predict(x, type = "raw", newdata = newdata, ...) at helper_functions.R#137
7: FUN(X[[i]], ...)
6: pblapply(X, FUN, ...)
5: pbsapply(list_of_models, function(x) {
   if (type == "Classification" & x$control$classProbs) {
       predict(x, type = "prob", newdata = newdata, ...)[, 2]
   }
   else {
       predict(x, type = "raw", newdata = newdata, ...)
   }
   }) at helper_functions.R#133
4: multiPredict(ensemble$models, type, ...) at caretEnsemble.R#75
3: predict.caretEnsemble(fit, newdata = data.x)
2: predict(fit, newdata = data.x) at deposits-models.R#130
1: trainThenPredict(by, data, data.id)

combine predict methods for caretEnsemble and caretStack?

The base prediction is the same: make predictions from each model in the ensemble. The only difference is in one case we weight them and in the other case we run them through another caret model.

Maybe caretList needs its own class, and there should be a predict.caretList function, which returns a matrix. Both predict.caretEnsemble and predict.caretStack could call predict.caretList.

caretList would obviously return a caretList, and we'd need another helper function to turn arbitrary lists of caret models into a caretList. The models slots of caretEnsemble and caretStack would both be of class caretList.
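A rough sketch of the proposed structure (method names come from this issue; internals such as object$weights and object$ens_model are assumptions, and only the regression case is shown):

predict.caretList <- function(object, newdata = NULL, ...) {
  preds <- lapply(object, predict, newdata = newdata, ...)
  do.call(cbind, preds)              # one column of predictions per model
}

predict.caretEnsemble <- function(object, newdata = NULL, ...) {
  P <- predict.caretList(object$models, newdata = newdata, ...)
  as.numeric(P %*% object$weights)   # weighted linear combination
}

predict.caretStack <- function(object, newdata = NULL, ...) {
  P <- predict.caretList(object$models, newdata = newdata, ...)
  predict(object$ens_model, newdata = as.data.frame(P), ...)  # meta-model
}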

safeOptRMSE

Similar to safeOptAuc, stops ensembling if accuracy decreases. Consider making the safe functions the defaults, and renaming the old functions fastOpt*.

Then we could have a flag to caretEnsemble for safe=TRUE. If TRUE, we use the safe functions, if FALSE we use the slightly faster non-safe functions.

If there's not a big difference in speed between the safe and non-safe versions, we can get rid of the non-safe ones.
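A hedged sketch of what a safe optimizer could look like: the same greedy forward selection the greedOpt* functions are described as using, but stopping as soon as adding another model would not improve the RMSE. Function and argument names are assumptions, not the package API.

safeOptRMSE <- function(x, y, iter = 100L) {
  N <- ncol(x)
  weights <- rep(0L, N)
  pred <- rep(0, length(y))
  best_rmse <- Inf
  for (i in seq_len(iter)) {
    # RMSE of the ensemble if each candidate model were added once more
    rmse <- vapply(seq_len(N), function(j) {
      sqrt(mean(((pred + x[, j]) / (sum(weights) + 1) - y)^2))
    }, numeric(1))
    if (min(rmse) >= best_rmse) break   # stop: no further improvement
    best <- which.min(rmse)
    weights[best] <- weights[best] + 1L
    pred <- pred + x[, best]
    best_rmse <- min(rmse)
  }
  weights / sum(weights)                # final convex weights
}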

Vignette

We should make a short package vignette showing the difference between non-ensembled train objects and caretEnsemble objects and briefly discussing the advantages of ensembles.

Maybe not for version 1.0, but 1.1 or so.

Multi-class classification greedy optimization

I see that the Dev branch has some more progress regarding multi-class classification ensemble stacking, but unfortunately it is not yet done.
Do you plan on implementing this, and/or could you point me in the right direction so I might be able to finish it? I don't seem to understand what the problem/holdup is (no offense intended).

Plots vignette?

The tests make some cool plots... I'd love to see a vignette that shows off some of the plot.caretEnsemble and plot.caretList plots.

Roadmap?

@zachmayer this package is working in production for me in a couple of places. I was thinking it would be nice to finish up some of the current features and build unit tests to make sure the core feature -- ensembling of classification and regression caret objects -- stays stable and works. I'm a little nervous that some of the additions I have been making over in Dev cloud up the focus of the package more than they help.

With that in mind I was thinking we could use this issue to brainstorm ideas for package extensions and prioritize them a bit based on needs. From my end I would say the top 4 things to improve are:

  • A unit test suite for all objects and their generics/methods
  • Unit tests for the optimizers, with some edge cases
  • Make sure the predict method works in edge cases and arrive at some way to calculate prediction errors
  • Convenient print, plot, and summary methods for caretEnsemble and caretList objects

Then some "nice to have" features would be:

  • Better variable importance calculations
  • Speed up prediction performance (parallelize it somehow)

What do you think?

AUC by caretEnsemble

While using caretEnsemble we come across 2 AUCs. One appears while computing the error, which is obtained using
error <- colAUC(as.matrix(temp) %*% weights, temp3) (here as.matrix(temp) %*% weights gives us the combined predictions of all the models)

and the other is given by the predicted probabilities obtained using the formula
return(list(preds = data.frame(pred = est, se = se.tmp), weight = conf))
which is then used to obtain the ROC, AUC and accuracy.
The 2 AUCs differ. So which is the correct one? Which predictions should be used to obtain the accuracy?

Regards
Amit

Installation failed due to error in Rd file

On Mac OS, doing a fresh install:

  • installing source package 'caretEnsemble' ...
    ** R
    ** data
    ** inst
    ** tests
    ** preparing package for lazy loading
    ** help
    Error : /private/var/folders/jk/21t5rvgd3hgbgvmxlxx26q8w0000gn/T/Rtmpatuz5C/devtools97a269953002/caretEnsemble-master/man/makePredObsMatrix.Rd: Sections \title, and \name must exist and be unique in Rd files
    ERROR: installing Rd objects failed for package 'caretEnsemble'
  • removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/caretEnsemble'
    Error: Command failed (1)

If both methodList and tuneList are specified, use the union of both sets

Every method in methodList gets added with arguments = NULL, whether or not it also appears in tuneList; every entry in tuneList gets added with its specified arguments.

So if you specify:
methodList='rf'
tuneList=list(rf=list(tuneLength=10), rpart=list(tuneLength=5))

3 models are fit:
rf with default train arguments
rf with tuneLength 10
rpart with tuneLength 5

This is post 1.0

Auto import for the predict method

If you ensemble a series of models using external packages (e.g. mda or glmnet) and then clear your workspace and R session and load the caretEnsemble object and attempt to use the predict method, it fails.

This is because the predict method for each individual model type is not necessarily in the namespace. The predict.caretEnsemble method should be able to identify and load the predict methods needed to generate predictions for each model type in the ensemble.
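One possible approach, sketched as an assumption about how this could work (the helper name is hypothetical; train objects record the packages they depend on in modelInfo$library):

loadModelNamespaces <- function(list_of_models) {
  libs <- unique(unlist(lapply(list_of_models, function(x) x$modelInfo$library)))
  for (lib in libs) {
    if (!requireNamespace(lib, quietly = TRUE)) {
      stop("Package '", lib, "' is needed to make predictions from this ensemble.")
    }
  }
  invisible(libs)
}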

update.caretEnsemble to run more iterations

If the user decides the model needs some more iterations.

Maybe one of the graphs should be a "convergence" graph that shows the error decreasing over time? And if the error looks like it's still going down, the user could call update to run more iterations?

Automate unit tests

This will help us detect problems as they arise (e.g. next time there's another major update to caret).

caretEnsemble returns NaN for metric

When I run the following example:

library(caretEnsemble)
devtools::install_github("jknowles/EWStools") # to get the data
library(EWStools)
data(EWStestData)


ctrl <- trainControl(method = "repeatedcv", 
                     repeats = 3, classProbs = TRUE, savePredictions = TRUE,
                     summaryFunction = twoClassSummary)


out <- buildModels(methodList = c("knn", "glm", "nb", "lda", "ctree"), 
                   control = ctrl, x = modeldat$traindata$preds, 
                   y = modeldat$traindata$class, tuneLength = 5)

# ensemble we will

out.ens <- caretEnsemble(out)

The model ensembles correctly and the object can produce predictions, but it is curious what summary reports:

> summary(out.ens)
The following models were ensembled: knn, glm, nb, ctree 
They were weighted: 
0.03 0.51 0.45 0.01
The resulting AUC is: 0.9857
The fit for each individual model on the AUC is: 
  method    metric    metricSD
1    knn 0.9402030 0.023904627
2    glm 0.9863390 0.008624078
3     nb       NaN          NA
4  ctree 0.8962804 0.036620901

It looks like the extractModRes function has an issue in how it selects the error parameters for the individual models.

Allow trace = FALSE in buildModels

Currently, when training a nnet or a multinom or other forms of neural networks, the R console is filled with information about the individual iterations. Passing trace = FALSE to train when fitting a nnet suppresses this console clutter.

However, buildModels currently does not pass trace = FALSE through to train. This should be fixed!
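For reference, the behaviour being asked for already works when calling train() directly, because train() forwards extra arguments to the underlying fitting function (here nnet(), which accepts trace); buildModels should pass it through the same way. A runnable example on the Sonar data:

library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
fit <- train(x = Sonar[, -61], y = Sonar$Class, method = "nnet",
             trControl = ctrl, tuneLength = 3, trace = FALSE)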

no github under R3.1.1

While trying to install caretEnsemble, I get the message:

install.packages('github')
Warning message:
package ‘github’ is not available (for R version 3.1.1)

What is the way to install caretEnsemble on Linux (Fedora 17)?
Here is the sessionInfo()

sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] caret_6.0-35 ggplot2_1.0.0 lattice_0.20-29 devtools_1.5

loaded via a namespace (and not attached):
[1] BradleyTerry2_1.0-5 brglm_0.5-9 car_2.0-20
[4] codetools_0.2-8 colorspace_1.2-4 digest_0.6.4
[7] evaluate_0.5.5 foreach_1.4.2 grid_3.1.1
[10] gtable_0.1.2 gtools_3.4.1 httr_0.4
[13] iterators_1.0.7 lme4_1.1-7 MASS_7.3-33
[16] Matrix_1.1-4 memoise_0.2.1 minqa_1.2.3
[19] munsell_0.4.2 nlme_3.1-117 nloptr_1.0.4
[22] nnet_7.3-8 parallel_3.1.1 plyr_1.8.1
[25] proto_0.3-10 Rcpp_0.11.2 RCurl_1.95-4.3
[28] reshape2_1.4 scales_0.2.4 splines_3.1.1
[31] stringr_0.6.2 tools_3.1.1 whisker_0.3-2

Thanks
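A likely fix, offered as a suggestion: 'github' is not a CRAN package; the development version of caretEnsemble is installed with devtools::install_github() instead (the repository path below assumes the package lives under the zachmayer account):

install.packages("devtools")
devtools::install_github("zachmayer/caretEnsemble")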

Error: length(unique(indexes)) == 1 is not TRUE for caret large ensemble of models

First of all, let me thank you for the much-needed ensemble package you wrote.

I have run into an issue when attempting to train a large number of models, all for regression. After pre-selecting the models that work, I get 60 to 70 functional models, all listed in caret and all working individually. Considering this, I attempt to run

greedy <- caretEnsemble(all.models, iter=1000L) 

and I get the following message:

Error: length(unique(indexes)) == 1 is not TRUE

Tracing through your code, the origin of the issue is:

makePredObsMatrix(all.models) 
Error: length(unique(indexes)) == 1 is not TRUE
length(unique(indexes)) == 2

In one example, unique(indexes)[[1]] has length 212 and unique(indexes)[[2]] has length 218, while the number of observations must correspond to one of these lengths as per the description.

Do you know if there is a way to correct this error from within, so as to include all models? This seems to be specific to a model or group of models (maybe related to caret 6.0) as far as I can see.

Any help will be welcomed, Thank you

Issue with caretEnsemble

I am using the following code, but it results in Error: length(unique(indexes)) == 1 is not TRUE. As per a previous post I have included the seeding part, but I am still getting the error. Please help. I am using R 3.0.3.

library(mlbench)
data(Sonar)
library(caret)
set.seed(123)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- sample.int(1000, 3)

For the last model:

seeds[[11]] <- sample.int(1000, 1)
inTraining <- createDataPartition(Sonar$Class, p = 0.75, list = FALSE)
training <- Sonar[inTraining, ]
testing <- Sonar[-inTraining, ]
fitControl <- trainControl(method = "cv", number = 10,repeats = 1,classProbs=TRUE, savePredictions=TRUE, seeds=seeds)

PP <- c('center', 'scale')
set.seed(1)

gbmFit1 <- train(Class ~ ., data = training,method = "svmRadial",trControl = fitControl,verbose = FALSE)

gbmFit2 <- train(Class ~ ., data = training,method = "gbm",trControl = fitControl,verbose = FALSE)
all.models <- list( gbmFit1 , gbmFit2 )

greedy <- caretEnsemble(all.models, iter=1000L)

sort(greedy$weights, decreasing=TRUE)

greedy$error
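A hedged suggestion (not from the original thread): the length(unique(indexes)) == 1 check fails when the component models are fit on different resampling folds. Pre-computing the folds and passing them through trainControl(index = ...) keeps both models on identical resamples:

set.seed(123)
myFolds <- createMultiFolds(training$Class, k = 10, times = 1)
fitControl <- trainControl(method = "cv", number = 10, repeats = 1,
                           classProbs = TRUE, savePredictions = TRUE,
                           index = myFolds, seeds = seeds)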

Allow repeat model methods

Say we want 2 random forests in the ensemble, one with mtry=2 and one with mtry=5. Currently we can't have duplicate methods.
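One possible shape for this, purely hypothetical (the rf1/rf2 names and the list layout are assumptions, not an existing API): keyed list entries would let the same method appear more than once with different tuning grids.

tuneList <- list(
  rf1 = list(method = "rf", tuneGrid = data.frame(mtry = 2)),
  rf2 = list(method = "rf", tuneGrid = data.frame(mtry = 5))
)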

investigate how caret sets seeds for parallel models

It's possible we don't need setSeeds at all. I think #50 will fix the incorrect-resampling-indexes error when we go to ensemble the models.

I think seeds only need to be set when models are fit in parallel, because if you start 10 R processes (one for each of 10 CV folds) they will tend to have the same random seed, and therefore all fit the same model with no randomness.

I think pre-setting the re-sampling indexes will make the samples used for each model truly random, and I think the default caret logic will make sure different seeds are used for different models, fit in parallel.

Cleanup API and remove unnecessary @export tags

I'd like to hide as much code as possible (other than caretList, caretEnsemble, and caretStack, of course!) from the 1.0 release. I don't want people to start using internal functions and then get mad when we refactor or replace them.

installing caretEnsemble without net connection

Hi Zach,
caretEnsemble is proving to be very useful for my work and it is working fine. However, I would like caretEnsemble to be installed on a high-end machine that is not connected to the internet. So my query is:

If I do not have an internet connection on my machine (but do on a different machine), is there a standalone version (tarball) of caretEnsemble that can be installed without an internet connection? And how do I install it (for an R package, is it R CMD INSTALL package.tar.gz)?
Thanks
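A hedged sketch of the usual workflow: download the source tarball on a machine with internet access (from CRAN or a GitHub release), copy it to the offline machine together with tarballs for its dependencies (caret and so on), then install from the local files. The file name below is illustrative.

install.packages("caretEnsemble_1.0.tar.gz", repos = NULL, type = "source")
# or, from a shell: R CMD INSTALL caretEnsemble_1.0.tar.gz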

Make a clean.caretList function

Goes through the models and removes data that isn't needed, e.g. resamples, out-of-sample predictions, the original dataset, and anything else we can think of.

After making a caretEnsemble or caretStack, we'd run clean.caretList to shrink the final models as much as possible.
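A rough sketch of what this could look like (the field names are assumptions about what a train object stores; what is actually safe to drop must be verified against predict() output):

clean.caretList <- function(modelList) {
  lapply(modelList, function(m) {
    m$trainingData <- NULL      # copy of the original dataset
    m$pred <- NULL              # out-of-sample predictions
    m$control$index <- NULL     # resampling indexes
    m$control$indexOut <- NULL
    m
  })
}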

Consider making caret model list a named object

Building models with identical resampling profiles may be unclear to users. Allowing the user to specify the methods to build train objects from, then the control parameters, and generating the model list as a named class (e.g. a caretList object) would make the ensemble approach much more approachable. It could look something like:

methodList <- c("rf", "nnet", "knn", "lm")
control <- trainControl(...)

buildModels <- function(methodList, control, x, y, ...) {
  # myseeds <- ...  # build seeds using the control parameters
  modelList <- setNames(vector("list", length(methodList)), methodList)
  for (i in methodList) {
    modelList[[i]] <- train(x = x, y = y, method = i, trControl = control, ...)
  }
  class(modelList) <- "caretList"
  return(modelList)
}

Not sure if this is 'out of scope' for this package, but I think it would make the ensembling process more accessible.

warning in buildModels if trainControl indexes are not set

If buildModels is not passed a trainControl object with pre-defined indexes for use with all models, we should raise a warning and try to set the resampling indexes ourselves, based on trainControl$method.

If we can't set the indexes, we should raise an error.

This probably requires a new function setIndexes.
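A hedged sketch of the proposed setIndexes helper (the name comes from this issue; the body is an assumption built on caret's createFolds/createMultiFolds/createResample functions):

setIndexes <- function(ctrl, y) {
  if (!is.null(ctrl$index)) return(ctrl)
  warning("trainControl has no pre-defined indexes; setting them from method = '",
          ctrl$method, "'")
  ctrl$index <- switch(ctrl$method,
    cv         = createFolds(y, k = ctrl$number, returnTrain = TRUE),
    repeatedcv = createMultiFolds(y, k = ctrl$number, times = ctrl$repeats),
    boot       = createResample(y, times = ctrl$number),
    stop("Cannot set resampling indexes for method '", ctrl$method, "'")
  )
  ctrl
}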
