
caret's Issues

Investigate allowing prediction function to return multiple values

The request:

Dear Max,

When running cross-validation, the trainControl function
allows one to specify whether predictions should be saved,
via the argument savePredictions. The result is a data frame
with the observed and predicted values of each
cross-validation run.

Some prediction functions (such as predict.lm, predict.ar,
predict.Arima, predict.glm, predict.loess, and so on) have
an argument called se.fit that controls whether the standard
error of each predicted value is returned. The current
structure of the predictionFunction code does not allow
setting se.fit = TRUE, so the standard error of each
predicted value cannot be accessed.

Is there an easy way of saving the standard errors? Overall,
I believe it would be worthwhile to save the standard errors
of prediction whenever a predict function supports it.

Best,

Alessandro Samuel-Rosa

PhD Candidate
Graduate School in Agronomy - Soil Science
Federal Rural University of Rio de Janeiro

Seropédica, Rio de Janeiro, Brazil

Guest Researcher
ISRIC - World Soil Information
Wageningen, the Netherlands

[email protected] | Phone 0031 06 4435 9563

Homepage: soil-scientist.net Skype: alessandrosamuel
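
For reference, a minimal sketch of the behavior the request describes, using predict.lm from base R (not caret code):

fit <- lm(mpg ~ wt, data = mtcars)
pred <- predict(fit, newdata = mtcars[1:3, ], se.fit = TRUE)
pred$fit     ## predicted values
pred$se.fit  ## standard error of each prediction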

Error " no points in data!" in knnImpute preProcess

Trying to use "knnImpute" to do k-nearest-neighbor missing value imputation:

object <- preProcess(newdata, method = "knnImpute")
newdata1 <- predict(object, newdata)

The error message is: Error in nn2(old[, cols, drop = FALSE], new[, cols, drop = FALSE], k = k) : no points in data!

The bug has been located in the preProcess() function.

Adapt train for survival models

Some things that would have to change:

  • train has a component called modelType that is currently either "Regression" or "Classification". Another option, "Survival", would need to be added, and a search should be made for any error traps or checks on this value.
  • An is.Surv check should be added to make sure the outcome is correct (see the sketch after this list)
  • Documentation will need to be added
  • Test cases for specific models
  • New models based on this change that should be considered: superpc (add this model type), randomForestSRC, bujar, randomSurvivalForest and probably others
  • The default performance metric would need to change (any suggestions?). These changes should get built into postResample too.
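
A hedged sketch of the outcome check mentioned above; the exact integration point inside train is an assumption:

library(survival)

## example survival outcome
y <- Surv(time = c(5, 10, 15), event = c(1, 0, 1))

## hypothetical branch inside train()'s input checks
if (is.Surv(y)) {
  modelType <- "Survival"  ## the proposed third modelType value
}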

Add Fuzzy Rule-based System via frbs

From email correspondence with Tarek Abdunabi, who provided the code below as a prototype workflow and a ton of information about the methods and possible tuning parameters.

##################################
## I. Regression Problem
## In this example, we use the gas furnace dataset, which
## contains two input variables and one output variable.
##################################

## Input data: Using the Gas Furnace dataset
## then split the data to be training and testing datasets
data(frbsData)
data.train <- frbsData$GasFurnance.dt[1 : 204, ]
data.tst <- frbsData$GasFurnance.dt[205 : 292, 1 : 2]
real.val <- matrix(frbsData$GasFurnance.dt[205 : 292, 3], ncol = 1)

## Define interval of data
range.data <- apply(data.train, 2, range)

## Set the method and its parameters,
## for example, we use Wang and Mendel's algorithm
method.type <- "WM"
control <- list(num.labels = 15, type.mf = "GAUSSIAN", type.defuz = "WAM", 
                type.tnorm = "MIN", type.snorm = "MAX", type.implication.func = "ZADEH",
                name="sim-0") 

## Learning step: Generate an FRBS model
object.reg <- frbs.learn(data.train, range.data, method.type, control)

## Predicting step: Predict for newdata
res.test <- predict(object.reg, data.tst)

## Display the FRBS model
summary(object.reg)

## Plot the membership functions
plotMF(object.reg)

##################################
## II. Classification Problem
## In this example, we use the iris dataset, which
## contains four input variables and one output variable.
##################################

## Input data: Using the Iris dataset
data(iris)
set.seed(2)

## Shuffle the data
## then split the data to be training and testing datasets
irisShuffled <- iris[sample(nrow(iris)),]
irisShuffled[,5] <- unclass(irisShuffled[,5])
tra.iris <- irisShuffled[1:105,]
tst.iris <- irisShuffled[106:nrow(irisShuffled),1:4]
real.iris <- matrix(irisShuffled[106:nrow(irisShuffled),5], ncol = 1)

## Define range of input data. Note that it is only for the input variables.
range.data.input <- apply(iris[, -ncol(iris)], 2, range)

## Set the method and its parameters. In this case we use FRBCS.W algorithm
method.type <- "FRBCS.W"
control <- list(num.labels = 7, type.mf = "GAUSSIAN", type.tnorm = "MIN", 
                type.snorm = "MAX", type.implication.func = "ZADEH")  

## Learning step: Generate fuzzy model
object.cls <- frbs.learn(tra.iris, range.data.input, method.type, control)

## Predicting step: Predict newdata
res.test <- predict(object.cls, tst.iris)

## Display the FRBS model
summary(object.cls)

## Plot the membership functions
plotMF(object.cls)

Error when training a poisson gbm model

Tuning a gbm model with the poisson distribution does not seem to work. The following code exits with an error:

> set.seed(1)
> n <- 500
> nvar <- 10
> b0 <- 1
> b1 <- 1
> x <- as.data.frame(matrix(rep(runif(n), nvar), ncol=nvar))
> mu <- with(x, exp(b0 + b1 * sin(pi * V1)))
> y <- rpois(n, lambda = mu)
> 
> tr <- train(x, y, method = "gbm", distribution = "poisson", verbose = FALSE)
Error in { : 
  task 1 failed - "arguments imply differing number of rows: 3, 0"
> 

I have narrowed the source of the error down to the result <- foreach(...) call in nominalTrainWorkflow().

Session Info

R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] gbm_2.1-06      survival_2.37-7 caret_6.0-35    ggplot2_0.9.3.1
[5] lattice_0.20-29

loaded via a namespace (and not attached):
 [1] BradleyTerry2_1.0-5 brglm_0.5-9         car_2.0-20         
 [4] codetools_0.2-8     colorspace_1.2-4    compiler_3.1.1     
 [7] digest_0.6.4        foreach_1.4.2       grid_3.1.1         
[10] gtable_0.1.2        gtools_3.4.1        iterators_1.0.7    
[13] lme4_1.1-6          MASS_7.3-33         Matrix_1.1-4       
[16] minqa_1.2.3         munsell_0.4.2       nlme_3.1-117       
[19] nnet_7.3-8          plyr_1.8.1          proto_0.3-10       
[22] Rcpp_0.11.1         RcppEigen_0.3.2.1.2 reshape2_1.4       
[25] scales_0.2.4        stringr_0.6.2       tcltk_3.1.1        
[28] tools_3.1.1

unnamed index lists fail

Dear Mr. Kuhn!

Recently, while using train() in the caret package, I noticed that the index argument of trainControl does not accept unnamed lists:

 > n = 100;
 > m = 10;
 > set.seed(1);
 > x = matrix(runif(n * m), nrow = n);
 > y = runif(n);
 > idx <- lapply(1:10, function(x) sample(1:n, 0.9 * n));
 > train(x = x, y = y, method = "glm", trControl = trainControl(index = idx));
 Error in { : task 1 failed - "argument is of length zero"
 > names(idx)
 NULL
 > names(idx) <- 1:10
 > train(x = x, y = y, method = "glm", trControl = trainControl(index = idx));
 Generalized Linear Model 

 100 samples
  10 predictors

 No pre-processing
 Resampling: Bootstrapped (10 reps) 

 Summary of sample sizes: 90, 90, 90, 90, 90, 90, ... 

 Resampling results

   RMSE   Rsquared  RMSE SD  Rsquared SD
   0.302  0.0923    0.0587   0.0825     

I know that index is usually populated with the output of createDataPartition or similar functions, which return lists with names like "Fold1", etc.
I don't know whether it is intended that unnamed lists are not accepted. If so, I kindly suggest mentioning it somewhere in the documentation, as it was far from obvious, at least to me, what the error related to.

Thank you,
Best regards,
Andrii Maksai

update documentation about `returnResamp`

In HTML and the man page.

The resamples class is designed to compare between-model resampling distributions, so I made the choice that only one configuration of tuning parameters should show up there. For some of the other plotting functions (like xyplot.train and a few others), I wanted to plot the within-model resamples versus the tuning parameters. That is why xyplot.train requires returnResamp = "all" while resamples needs returnResamp = "final".
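
To illustrate, a short sketch of the two settings described above (assuming the behavior stated there):

ctrl_all   <- trainControl(returnResamp = "all")    ## needed for xyplot.train
ctrl_final <- trainControl(returnResamp = "final")  ## expected by resamples()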

improve sbf score function documentation

Owen,

The documentation is unclear. The score function takes two vectors (x and y) that are an individual predictor and outcome, respectively. For example, anovaScores is a scoring function that can be used for classification:

set.seed(1)
dat <- twoClassSim(100)
anovaScores(dat$Linear01, dat$Class)
[1] 0.6580359

Internally, sbf uses apply to run this on each column of the predictor matrix, and those results are pumped into the filter function.
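
A minimal sketch of that internal behavior (an approximation, not caret's actual code), continuing the example above:

## score each predictor column separately; the resulting named vector
## is what gets passed to the filter function
scores <- apply(dat[, c("Linear01", "Linear02")], 2,
                function(x) anovaScores(x, dat$Class))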

So it looks to me like help(caretSBF) is the most accurate. I'll update the documentation on the web site.

Thanks,

Max

On Thu, Sep 25, 2014 at 3:22 PM, Owen Solberg [email protected] wrote:
Dear Dr. Kuhn,

First let me express my appreciation for your book and R package — both are rapidly becoming mainstays of my analysis.

I am wondering if there is an inconsistency in some of caret's documentation.

On one hand, it sounds like the Selection By Filter score function is supposed to take a matrix of predictors and return a named vector of results.

From the website (http://topepo.github.io/caret/featureselection.html#filter):

The score Function: This function takes as inputs the predictors and the outcome in objects called x and y, respectively. The output should be a named vector of scores where the names correspond to the column names of x.

From help(sbfControl)
The score function is used to return a vector of scores with names for each predictor (such as a p-value). Inputs are:

x the predictors for the training samples
y the current training outcomes

Other parts of the documentation say that the score function operates on a single feature at a time:

From help(caretSBF)

anovaScores fits a simple linear model between a single feature and the outcome, then the p-value for the whole model F-test is returned.

This is also the way the code actually works:

> caretSBF$score
function (x, y) 
{
    if (is.factor(y)) 
        anovaScores(x, y)
    else gamScores(x, y)
}

Owen Solberg
Senior Scientist, Bioinformatics
HealthTell, Inc.
T: 510.545.6936
E: [email protected]

add sampling option to train

In trainControl, think about having an option to do sampling after resampling and prior to pre-processing/model fitting, with options for upSample, downSample, SMOTE, and ROSE. Make offline code available (the same way models are stored) to avoid more package dependencies.
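
For context, a short sketch of the existing helpers such an option could wrap (upSample and downSample are already exported by caret; the trainControl hook itself is the proposal):

library(caret)

set.seed(1)
dat <- twoClassSim(200)
down <- downSample(x = dat[, -ncol(dat)], y = dat$Class)
table(down$Class)  ## classes balanced by dropping rows from the larger class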

in predict.train, subset on features

Right now, if extra features are present in newdata, predict.train can crash when there was pre-processing. After the formula handling, use predictors(object) to filter out the extra predictors. The result may be NULL, so do some checking first.
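
A hedged sketch of the proposed filtering step (placement inside predict.train is assumed; object and newdata refer to its arguments):

keep <- predictors(object)  ## may be NULL for some models
if (!is.null(keep))
  newdata <- newdata[, keep, drop = FALSE]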

update nearZeroVar

As suggested by Michael Benesty, this enhancement would allow the function to be used in parallel and with large data sets:

library(foreach)

test <- function(dataset) {
  ## score one column at a time; swap %do% for %dopar% (with a
  ## registered parallel backend) to process columns in parallel
  res <- foreach(name = colnames(dataset), .combine = rbind) %do% {
    r <- caret::nearZeroVar(dataset[[name]], saveMetrics = TRUE)
    r[, "column"] <- name
    r
  }
  ## put the column name first
  res[, c(5, 1, 2, 3, 4)]
}
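
For example, on an all-numeric data frame (using the sketch above):

head(test(mtcars))  ## per-column nearZeroVar metrics, column name first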

add an `extract` class

Via Benjamin Mack (and his code).

Have a function to extract the resamples from a train, rfe, or sbf model, with an option called optimal_only or something similar.
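
A hypothetical sketch of the requested extractor; the function name, the optimal_only argument, and the reliance on object$resample and object$bestTune are assumptions drawn from the issue text:

extract_resamples <- function(object, optimal_only = TRUE) {
  res <- object$resample
  ## for train objects fit with returnResamp = "all", keep only the
  ## rows that match the optimal tuning parameters
  if (optimal_only && inherits(object, "train")) {
    for (p in names(object$bestTune))
      res <- res[res[[p]] == object$bestTune[[p]], , drop = FALSE]
  }
  res
}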

knnImpute does not work with caret_6.0-34

Hi,

When I use knnImpute with caret_6.0-34, it returns an error.
Below is code to reproduce it; the same code works with caret_6.0-30 without any problem.

with caret_6.0-34:

> suppressMessages(library(caret))
> data(iris)
> 
> iris[1,1] <- NA
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           NA         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> 
> m = preProcess(iris[,-5], method="knnImpute")
> head(predict(m, iris[,-5]), 3)
Error in nn2(old[, cols, drop = FALSE], new[, cols, drop = FALSE], k = k) : 
  no points in data!
Calls: head ... predict -> predict.preProcess -> apply -> FUN -> nn2
Execution halted

with caret_6.0-30:

> suppressMessages(library(caret))
> data(iris)
> 
> iris[1,1] <- NA
> head(iris, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           NA         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
> 
> m = preProcess(iris[,-5], method="knnImpute")
> head(predict(m, iris[,-5]), 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.7824364   1.0156020    -1.335752   -1.311052
2   -1.1444955  -0.1315388    -1.335752   -1.311052
3   -1.3858682   0.3273175    -1.392399   -1.311052

Move documentation to roxygen2?

This might be a good idea for keeping the docs and man pages in sync. I personally like roxygen2 a lot, as it removes some of the boring parts of writing R manual pages.
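
For anyone unfamiliar, a generic illustration of the roxygen2 style (not caret code); the man page is generated from comments that live next to the function:

#' Add two numbers
#'
#' @param a,b numeric values
#' @return the sum of \code{a} and \code{b}
#' @export
add_two <- function(a, b) a + b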

Finalize adaptive resampling options

To-do:

  • update resampling names
  • finalize documentation and man files
  • add HTML documentation
  • Update ggplot method to include size and add warnings about the number of resamples

Add a models man page

Older versions of the package listed each model and its tuning parameters in the train man file. A request was made to add the information back, but on a separate page. This page would be updated with each new version.

add case weights to summary function

The original request is from here. This is the request:

I'm using R's caret package to do some grid search and model evaluation. I have a custom evaluation metric that is a weighted average of absolute error. Weights are assigned at the observation level.

X <- c(1, 1, 2, 0, 1) # feature 1
w <- c(1, 2, 2, 1, 1) # weights
Y <- 1:5              # target, continuous

# assume I run a model using X as features and Y as target
# and get a vector of predictions

mymetric <- function(predictions, target, weights) {
  v <- sum(abs(target - predictions) * weights) / sum(weights)
  return(v)
}

Here an example is given of how to use summaryFunction to define a custom evaluation metric for caret's train(). To quote:

The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:

data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns that contain the class probabilities will be present in data; the names of these columns are the same as the class levels. lev is a character string that has the outcome factor levels taken from the training data. For regression, a value of NULL is passed into the function. model is a character string for the model being used (i.e., the value passed to the method argument of train).

I cannot quite figure out how to pass the observation weights to summaryFunction.
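
A hedged sketch of a custom summaryFunction following the documentation quoted above; data$weights below is the hypothetical column this issue requests, since caret does not currently pass observation weights to the function:

weightedMAE <- function(data, lev = NULL, model = NULL) {
  w <- data$weights  ## hypothetical: not currently provided by caret
  c(WMAE = sum(abs(data$obs - data$pred) * w) / sum(w))
}

ctrl <- trainControl(summaryFunction = weightedMAE)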

rfe saves predictions incorrectly

When class probabilities are not used with train models:

library(caret)
library(mlbench)

library(doMC)
registerDoMC(cores=7)

extra <- 3

set.seed(1)
training <- twoClassSim(200,  noiseVars = extra)

p <- ncol(training)

ctrl <- trainControl(method = "cv", savePredictions = TRUE)

ctrl2 <- ctrl
ctrl2$allowParallel = FALSE

set.seed(1)
svmRFE <- rfe(training[, -p], training$Class,
              sizes = c(2, 6),
              rfeControl = rfeControl(functions = caretFuncs, 
                                      method = "cv",
                                      verbose = TRUE,
                                      saveDetails = TRUE),
              ## pass options to train()
              method = "svmRadial",
              tuneLength = 3,
              trControl = ctrl2,
              preProc = c("center", "scale"))

You get:

> subset(svmRFE$pred, rowIndex == 1 & Variables == 2)
                   pred Class1 Class2    obs Variables Resample rowIndex
predictions.857  Class1     NA     NA Class1         2   Fold08        1
predictions.1067 Class1     NA     NA Class1         2   Fold08        1

With classProbs = TRUE it works fine.
