
caret's Issues

Investigate allowing prediction function to return multiple values

The request:

Dear Max,

When running cross-validation, the trainControl function
allows one to specify whether predictions should be saved,
via the argument savePredictions. The result is a data frame
with the observed and predicted values of each
cross-validation run.

Some prediction functions (such as predict.lm, predict.ar,
predict.Arima, predict.glm, predict.loess, and so on) have
an argument called se.fit that controls whether the standard
error of each predicted value is returned. The current
structure of the predictionFunction code does not allow
setting se.fit = TRUE, so the standard error of each
predicted value cannot be accessed.

Is there an easy way of saving the standard errors? Overall,
I believe it would be worthwhile to save the standard errors
of prediction whenever a predict function supports it.

Best,

Alessandro Samuel-Rosa

PhD Candidate
Graduate School in Agronomy - Soil Science
Federal Rural University of Rio de Janeiro

Seropédica, Rio de Janeiro, Brazil

Guest Researcher
ISRIC - World Soil Information
Wageningen, the Netherlands

[email protected] | Phone 0031 06 4435 9563

Homepage: soil-scientist.net Skype: alessandrosamuel
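
For reference, a minimal sketch of the behavior the request describes, using predict.lm from base R (not caret code):

fit <- lm(mpg ~ wt, data = mtcars)
pred <- predict(fit, newdata = mtcars[1:3, ], se.fit = TRUE)
pred$fit     ## predicted values
pred$se.fit  ## standard error of each prediction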

Error " no points in data!" in knnImpute preProcess

Trying to use "knnImpute" to do k-nearest-neighbor missing value imputation:

object <- preProcess(newdata, method = "knnImpute")
newdata1 <- predict(object, newdata)

The error message is: Error in nn2(old[, cols, drop = FALSE], new[, cols, drop = FALSE], k = k) : no points in data!

The bug has been located in the preProcess() function.

Adapt train for survival models

Some things that would have to change:

  • train has a component called modelType that is currently either "Regression" or "Classification". Another option, "Survival", would need to be added, and a search should be made for any error traps or checks on this value.
  • An is.Surv check should be added to make sure the outcome is correct (see the sketch after this list)
  • Documentation will need to be added
  • Test cases for specific models
  • New models based on this change that should be considered: superpc (add this model type), randomForestSRC, bujar, randomSurvivalForest and probably others
  • The default performance metric would need to change (any suggestions?). These changes should get built into postResample too.
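
A hedged sketch of the outcome check mentioned above; the exact integration point inside train is an assumption:

library(survival)

## example survival outcome
y <- Surv(time = c(5, 10, 15), event = c(1, 0, 1))

## hypothetical branch inside train()'s input checks
if (is.Surv(y)) {
  modelType <- "Survival"  ## the proposed third modelType value
}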

Add Fuzzy Rule-based System via frbs

From email correspondence with Tarek Abdunabi, who provided the code below as a prototype workflow and a ton of information about the methods and possible tuning parameters.

##################################
## I. Regression Problem
## In this example, we use the gas furnace dataset, which
## contains two input variables and one output variable.
##################################

## Input data: Using the Gas Furnace dataset
## then split the data to be training and testing datasets
data(frbsData)
data.train <- frbsData$GasFurnance.dt[1 : 204, ]
data.tst <- frbsData$GasFurnance.dt[205 : 292, 1 : 2]
real.val <- matrix(frbsData$GasFurnance.dt[205 : 292, 3], ncol = 1)

## Define interval of data
range.data <- apply(data.train, 2, range)

## Set the method and its parameters,
## for example, we use Wang and Mendel's algorithm
method.type <- "WM"
control <- list(num.labels = 15, type.mf = "GAUSSIAN", type.defuz = "WAM", 
                type.tnorm = "MIN", type.snorm = "MAX", type.implication.func = "ZADEH",
                name="sim-0") 

## Learning step: Generate an FRBS model
object.reg <- frbs.learn(data.train, range.data, method.type, control)

## Predicting step: Predict for newdata
res.test <- predict(object.reg, data.tst)

## Display the FRBS model
summary(object.reg)

## Plot the membership functions
plotMF(object.reg)

##################################
## II. Classification Problem
## In this example, we use the iris dataset, which
## contains four input variables and one output variable.
##################################

## Input data: Using the Iris dataset
data(iris)
set.seed(2)

## Shuffle the data
## then split the data to be training and testing datasets
irisShuffled <- iris[sample(nrow(iris)),]
irisShuffled[,5] <- unclass(irisShuffled[,5])
tra.iris <- irisShuffled[1:105,]
tst.iris <- irisShuffled[106:nrow(irisShuffled),1:4]
real.iris <- matrix(irisShuffled[106:nrow(irisShuffled),5], ncol = 1)

## Define range of input data. Note that it is only for the input variables.
range.data.input <- apply(iris[, -ncol(iris)], 2, range)

## Set the method and its parameters. In this case we use FRBCS.W algorithm
method.type <- "FRBCS.W"
control <- list(num.labels = 7, type.mf = "GAUSSIAN", type.tnorm = "MIN", 
                type.snorm = "MAX", type.implication.func = "ZADEH")  

## Learning step: Generate fuzzy model
object.cls <- frbs.learn(tra.iris, range.data.input, method.type, control)

## Predicting step: Predict newdata
res.test <- predict(object.cls, tst.iris)

## Display the FRBS model
summary(object.cls)

## Plot the membership functions
plotMF(object.cls)

Error when training a poisson gbm model

Tuning a gbm model with the poisson distribution does not seem to work. The following code exits with an error:

> set.seed(1)
> n <- 500
> nvar <- 10
> b0 <- 1
> b1 <- 1
> x <- as.data.frame(matrix(rep(runif(n), nvar), ncol=nvar))
> mu <- with(x, exp(b0 + b1 * sin(pi * V1)))
> y <- rpois(n, lambda = mu)
> 
> tr <- train(x, y, method = "gbm", distribution = "poisson", verbose = FALSE)
Error in { : 
  task 1 failed - "arguments imply differing number of rows: 3, 0"
> 

I have narrowed the source of the error down to the result <- foreach(...) call in nominalTrainWorkflow().

Session Info

R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] gbm_2.1-06      survival_2.37-7 caret_6.0-35    ggplot2_0.9.3.1
[5] lattice_0.20-29

loaded via a namespace (and not attached):
 [1] BradleyTerry2_1.0-5 brglm_0.5-9         car_2.0-20         
 [4] codetools_0.2-8     colorspace_1.2-4    compiler_3.1.1     
 [7] digest_0.6.4        foreach_1.4.2       grid_3.1.1         
[10] gtable_0.1.2        gtools_3.4.1        iterators_1.0.7    
[13] lme4_1.1-6          MASS_7.3-33         Matrix_1.1-4       
[16] minqa_1.2.3         munsell_0.4.2       nlme_3.1-117       
[19] nnet_7.3-8          plyr_1.8.1          proto_0.3-10       
[22] Rcpp_0.11.1         RcppEigen_0.3.2.1.2 reshape2_1.4       
[25] scales_0.2.4        stringr_0.6.2       tcltk_3.1.1        
[28] tools_3.1.1

unnamed index lists fail

Dear Mr. Kuhn!

Recently, while using train() in the caret package, I noticed that the index argument of trainControl does not accept unnamed lists:

 > n = 100;
 > m = 10;
 > set.seed(1);
 > x = matrix(runif(n * m), nrow = n);
 > y = runif(n);
 > idx <- lapply(1:10, function(x) sample(1:n, 0.9 * n));
 > train(x = x, y = y, method = "glm", trControl = trainControl(index = idx));
 Error in { : task 1 failed - "argument is of length zero"
 > names(idx)
 NULL
 > names(idx) <- 1:10
 > train(x = x, y = y, method = "glm", trControl = trainControl(index = idx));
 Generalized Linear Model 

 100 samples
  10 predictors

 No pre-processing
 Resampling: Bootstrapped (10 reps) 

 Summary of sample sizes: 90, 90, 90, 90, 90, 90, ... 

 Resampling results

   RMSE   Rsquared  RMSE SD  Rsquared SD
   0.302  0.0923    0.0587   0.0825     

I know that index is usually populated with the output of createDataPartition or similar functions, which return lists with names like "Fold1", etc.
I don't know whether it is intended that unnamed lists are not accepted. If so, I kindly suggest mentioning it somewhere in the documentation, as it was far from obvious, at least to me, what the error related to.

Thank you,
Best regards,
Andrii Maksai

update documentation about `returnResamp`

In HTML and the man page.

The resamples class is designed to compare between-model resampling distributions, so I made the choice that only one configuration of tuning parameters should show up there. For some of the other plotting functions (like xyplot.train and a few others), I wanted to plot the within-model resamples versus the tuning parameters. That is why xyplot.train requires returnResamp = "all" while resamples needs returnResamp = "final".
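
To illustrate, a short sketch of the two settings described above (assuming the behavior stated there):

ctrl_all   <- trainControl(returnResamp = "all")    ## needed for xyplot.train
ctrl_final <- trainControl(returnResamp = "final")  ## expected by resamples()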

improve sbf score function documentation

Owen,

The documentation is unclear. The score function takes two vectors (x and y) that are an individual predictor and outcome, respectively. For example, anovaScores is a scoring function that can be used for classification:

set.seed(1)
dat <- twoClassSim(100)
anovaScores(dat$Linear01, dat$Class)
[1] 0.6580359

Internally, sbf uses apply to run this on each column of the predictor matrix, and those results are pumped into the filter function.
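
A minimal sketch of that internal behavior (an approximation, not caret's actual code), continuing the example above:

## score each predictor column separately; the resulting named vector
## is what gets passed to the filter function
scores <- apply(dat[, c("Linear01", "Linear02")], 2,
                function(x) anovaScores(x, dat$Class))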

So it looks to me like help(caretSBF) is the most accurate. I'll update the documentation on the web site.

Thanks,

Max

On Thu, Sep 25, 2014 at 3:22 PM, Owen Solberg [email protected] wrote:
Dear Dr. Kuhn,

First let me express my appreciation for your book and R package — both are rapidly becoming mainstays of my analysis.

I am wondering if there is an inconsistency in some of caret's documentation.

On one hand, it sounds like the Selection By Filter score function is supposed to take a matrix of predictors and return a named vector of results.

From the website (http://topepo.github.io/caret/featureselection.html#filter):

The score Function: This function takes as inputs the predictors and the outcome in objects called x and y, respectively. The output should be a named vector of scores where the names correspond to the column names of x.

From help(sbfControl)
The score function is used to return a vector of scores with names for each predictor (such as a p-value). Inputs are:

x the predictors for the training samples
y the current training outcomes

Other parts of the documentation say that the score function operates on a single feature at a time:

From help(caretSBF)

anovaScores fits a simple linear model between a single feature and the outcome, then the p-value for the whole model F-test is returned.

This is also the way the code actually works:

> caretSBF$score
function (x, y) 
{
    if (is.factor(y)) 
        anovaScores(x, y)
    else gamScores(x, y)
}

Owen Solberg
Senior Scientist, Bioinformatics
HealthTell, Inc.
T: 510.545.6936
E: [email protected]

add sampling option to train

In trainControl, think about having an option to do sampling after resampling and prior to pre-processing/model fitting, with options for upSample, downSample, SMOTE, and ROSE. Make offline code available (the same way models are stored) to avoid more package dependencies.
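
For context, a short sketch of the existing helpers such an option could wrap (upSample and downSample are already exported by caret; the trainControl hook itself is the proposal):

library(caret)

set.seed(1)
dat <- twoClassSim(200)
down <- downSample(x = dat[, -ncol(dat)], y = dat$Class)
table(down$Class)  ## classes balanced by dropping rows from the larger class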

in predict.train, subset on features

Right now, if extra features are present in newdata, predict.train can crash when there was pre-processing. After the formula handling, use predictors(object) to filter out the extra predictors. The result may be NULL, so do some checking first.
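
A hedged sketch of the proposed filtering step (placement inside predict.train is assumed; object and newdata refer to its arguments):

keep <- predictors(object)  ## may be NULL for some models
if (!is.null(keep))
  newdata <- newdata[, keep, drop = FALSE]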

update nearZeroVar

As suggested by Michael Benesty, this enhancement would allow the function to be used in parallel and with large data sets:

library(foreach)

test <- function(dataset) {
  ## score one column at a time; swap %do% for %dopar% (with a
  ## registered parallel backend) to process columns in parallel
  res <- foreach(name = colnames(dataset), .combine = rbind) %do% {
    r <- caret::nearZeroVar(dataset[[name]], saveMetrics = TRUE)
    r[, "column"] <- name
    r
  }
  ## put the column name first
  res[, c(5, 1, 2, 3, 4)]
}
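
For example, on an all-numeric data frame (using the sketch above):

head(test(mtcars))  ## per-column nearZeroVar metrics, column name first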

add an `extract` class

Via Benjamin Mack (and his code).

Have a function to extract the resamples from a train, rfe, or sbf model, with an option called optimal_only or something similar.
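
A hypothetical sketch of the requested extractor; the function name, the optimal_only argument, and the reliance on object$resample and object$bestTune are assumptions drawn from the issue text:

extract_resamples <- function(object, optimal_only = TRUE) {
  res <- object$resample
  ## for train objects fit with returnResamp = "all", keep only the
  ## rows that match the optimal tuning parameters
  if (optimal_only && inherits(object, "train")) {
    for (p in names(object$bestTune))
      res <- res[res[[p]] == object$bestTune[[p]], , drop = FALSE]
  }
  res
}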

knnImpute does not work with caret_6.0-34

Hi,

When I use knnImpute with caret_6.0-34, it returns an error.
Below is code to reproduce it; the same code works with caret_6.0-30 without any problem.

with caret_6.0-34:

> suppressMessages(library(caret))
> data(iris)
> 
> iris[1,1] <- NA
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           NA         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> 
> m = preProcess(iris[,-5], method="knnImpute")
> head(predict(m, iris[,-5]), 3)
Error in nn2(old[, cols, drop = FALSE], new[, cols, drop = FALSE], k = k) : 
  no points in data!
Calls: head ... predict -> predict.preProcess -> apply -> FUN -> nn2
Execution halted

with caret_6.0-30:

> suppressMessages(library(caret))
> data(iris)
> 
> iris[1,1] <- NA
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           NA         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> 
> m = preProcess(iris[,-5], method="knnImpute")
> head(predict(m, iris[,-5]), 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.7824364   1.0156020    -1.335752   -1.311052
2   -1.1444955  -0.1315388    -1.335752   -1.311052
3   -1.3858682   0.3273175    -1.392399   -1.311052

Move documentation to roxygen2?

This might be a good idea for keeping the docs and man pages in sync. I personally like roxygen2 a lot, as it removes some of the boring parts of writing R manual pages.
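
For anyone unfamiliar, a generic illustration of the roxygen2 style (not caret code); the man page is generated from comments that live next to the function:

#' Add two numbers
#'
#' @param a,b numeric values
#' @return the sum of \code{a} and \code{b}
#' @export
add_two <- function(a, b) a + b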

Finalize adaptive resampling options

To-do:

  • update resampling names
  • finalize documentation and man files
  • add HTML documentation
  • Update ggplot method to include size and add warnings about the number of resamples

Add a models man page

Older versions of the package listed each model and its tuning parameters in the train man file. A request was made to add the information back, but on a separate page. This page would be updated with each new version.

add case weights to summary function

The original request is from here. This is the request:

I'm using R's caret package to do some grid search and model evaluation. I have a custom evaluation metric that is a weighted average of absolute error. Weights are assigned at the observation level.

X <- c(1, 1, 2, 0, 1) # feature 1
w <- c(1, 2, 2, 1, 1) # weights
Y <- 1:5              # target, continuous

# assume I run a model using X as features and Y as target
# and get a vector of predictions

mymetric <- function(predictions, target, weights) {
  v <- sum(abs(target - predictions) * weights) / sum(weights)
  return(v)
}

Here an example is given of how to use summaryFunction to define a custom evaluation metric for caret's train(). To quote:

The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:

data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns that contain the class probabilities will be present in data; the names of these columns are the same as the class levels. lev is a character string that has the outcome factor levels taken from the training data. For regression, a value of NULL is passed into the function. model is a character string for the model being used (i.e., the value passed to the method argument of train).

I cannot quite figure out how to pass the observation weights to summaryFunction.
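
A hedged sketch of a custom summaryFunction following the documentation quoted above; data$weights below is the hypothetical column this issue requests, since caret does not currently pass observation weights to the function:

weightedMAE <- function(data, lev = NULL, model = NULL) {
  w <- data$weights  ## hypothetical: not currently provided by caret
  c(WMAE = sum(abs(data$obs - data$pred) * w) / sum(w))
}

ctrl <- trainControl(summaryFunction = weightedMAE)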

rfe saves predictions incorrectly

When class probabilities are not used with train models:

library(caret)
library(mlbench)

library(doMC)
registerDoMC(cores=7)

extra <- 3

set.seed(1)
training <- twoClassSim(200,  noiseVars = extra)

p <- ncol(training)

ctrl <- trainControl(method = "cv", savePredictions = TRUE)

ctrl2 <- ctrl
ctrl2$allowParallel = FALSE

set.seed(1)
svmRFE <- rfe(training[, -p], training$Class,
              sizes = c(2, 6),
              rfeControl = rfeControl(functions = caretFuncs, 
                                      method = "cv",
                                      verbose = TRUE,
                                      saveDetails = TRUE),
              ## pass options to train()
              method = "svmRadial",
              tuneLength = 3,
              trControl = ctrl2,
              preProc = c("center", "scale"))

You get:

> subset(svmRFE$pred, rowIndex == 1 & Variables == 2)
                   pred Class1 Class2    obs Variables Resample rowIndex
predictions.857  Class1     NA     NA Class1         2   Fold08        1
predictions.1067 Class1     NA     NA Class1         2   Fold08        1

With classProbs = TRUE it works fine.
