harrysouthworth / gbm
This project forked from gbm-developers/gbm3
Gradient boosted models
License: Other
I'm interested in getting predictions expressed as expected lifetimes, survival curves, etc. Could someone please explain what the output of predict.gbm is? Are these perhaps the (log) partial hazards?
Your help is much appreciated.
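For what it's worth, my understanding (not confirmed by the maintainers) is that for distribution="coxph" the predictions are on the log relative-hazard scale, with no baseline hazard included, so survival curves need a separate baseline estimate. A minimal sketch, where fit, newdat and best.iter are hypothetical placeholders:
lp <- predict(fit, newdata=newdat, n.trees=best.iter)  # f(x): log relative hazard
rel.hazard <- exp(lp)  # a relative hazard, not a survival probability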
Sent to my inbox. The first example in the help file for gbm fails on some runs and completes fine on others.
The issue is caused by one of the models in cv.models having all NaNs or Infs in its valid.error element. Those values come out of the .Call("gbm", ...) call looking like that in gbm.fit.R:
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1^1.5 + 2 * (X2^0.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6, # formula
data=data, # dataset
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
# +1: monotone increase,
# 0: no monotone restrictions
distribution="gaussian", # see the help for other choices
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
#0.001 to 0.1 usually work
interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
train.fraction = 0.5, # fraction of data for training,
# first train.fraction*N used for training
n.minobsinnode = 10, # minimum total weight needed in each node
cv.folds = 3, # do 3-fold cross-validation
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=FALSE, # don't print out progress
n.cores=1) # use only a single core (detecting #cores is error-prone, so avoided here)
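A quick way to see the failure (a hedged sketch; valid.error is an element of the fitted object) is to check for non-finite validation deviances after fitting:
any(!is.finite(gbm1$valid.error))    # TRUE on the failing runs
which(!is.finite(gbm1$valid.error))  # iterations where the deviance blew up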
Hi team,
I am just revisiting some old code that used the caret package as a wrapper to gbm. I noticed it was breaking, and it seems that we now need to specify "n.minobsinnode" (specifically, the tuning grid previously required just interaction.depth, n.trees, and shrinkage). Has there been an update to gbm that I missed? Or is this, perhaps, a change in caret?
Thanks for any help you can provide,
Adam
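For reference, a minimal sketch of a tuning grid that satisfies current caret (assuming a recent caret version; the exact column requirements may vary):
library(caret)
# All four columns are now required; older grids that omit n.minobsinnode fail.
grid <- expand.grid(n.trees=c(100, 500),
                    interaction.depth=c(1, 3),
                    shrinkage=0.1,
                    n.minobsinnode=10)
fit <- train(mpg ~ ., data=mtcars, method="gbm",
             tuneGrid=grid, verbose=FALSE)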
Would really appreciate the ability to use intervals in the Surv class: instead of passing Surv(time, censor indicator), it would be great to pass Surv(start time, end time, censor indicator). This would require rewriting some of the C++ code, because adding this flexibility implies that at a particular time t not all records are part of the risk set. In the current version, as long as the record is not censored it is in the risk set, but with a start time and end time we must have t >= start time and t < end time in order to be in the risk set.
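For reference, the survival package already supports this counting-process form, which is the interface requested here; a sketch using the cgd data bundled with survival:
library(survival)
# A record enters the risk set at tstart and leaves at tstop, so at any
# event time only a subset of records is at risk.
fit <- coxph(Surv(tstart, tstop, status) ~ treat, data=cgd)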
... and similarly for other distributions. Whilst this doesn't affect the functionality of the package, it is a bit inconvenient.
Could the requirement be lowered to factors with two levels, and implicit conversion done within gbm.fit() (or whatever function from within the package is being used)? Likewise, in the corresponding predict() method, could a factor be returned instead of 0's and 1's?
g <- gbm(y ~ x,
distribution="gaussian",
train.fraction=1,
bag.fraction=0.1,
data=data.frame(x=as.factor(1:100), y=rnorm(100)),
n.trees=1,
n.minobsinnode=1)
g$c.splits[[1]]
[1] 1 1 -1 1 1 1 1 -1 1 -1 1 1 -1 1 1 1 1 1 1 1 1 1
[23] 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[45] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[67] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[89] 1 1 1 1 1 1 1 1 1 1 1 1
In this example 5 levels are assigned to the left child and 95 are assigned to the right child. Of those at the right child, 90 will have had zero weight while training (since bag.fraction=0.1). It's an artificial example, but a node having zero weight for a particular level is very possible when training a real model with bag.fraction < 1. I'm wondering if this could be an issue, since the tree is (in expectation) over-predicting for zero-weight levels. Granted, later trees will attempt to correct for any over-prediction, but perhaps in these circumstances it would be more reasonable to use the prediction at the parent node, since there is no data to suggest whether the zero-weight levels should be assigned to the left or right child.
I'm experiencing a significant memory leak in 2.1-0.3 with the "laplace" distribution, similar to what is described here:
https://code.google.com/p/gradientboostedmodels/issues/detail?id=32
The file laplace.cpp was fixed at line 97 in 2.1-0.3 to address this. But although I'm running 2.1-0.3 (at least that's what library(gbm) says), I'm still getting a memory leak.
The training set has 280K rows with 11 columns, and the parameters are:
gbm.fit (trainingSet, outcomes, nTrain = 279870, distribution = "laplace", interaction.depth = 4, n.trees = 1000)
which uses 1.1 GB of working set and 4.7 GB of committed memory. Very roughly, adding 2000 more trees uses an additional 2 GB of working set and 4-5 GB of committed memory. But when the distribution is changed to "gaussian", the working set and committed memory stay around 550 MB, no matter how many trees.
I tried building the master branch, but gbm.fit() doesn't appear to work:
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0010 nan
2 nan nan 0.0010 nan
3 nan nan 0.0010 nan
(R version 3.0.2, gbm 2.1-0.3, Windows 7 64-bit.)
Hi,
First, thank you for your great work!
Running R Studio and GBM on my 600,000 x 10 data set with cross-validation = 5 takes close to 24 hours on my i5 8GB desktop. So I thought I would try Amazon AWS.
I started up an AWS Windows instance with 32 cores and 60 GB of RAM to run the same GBM job mentioned above, but it does not seem to be running much quicker. It seems to be using only 4 or 5 cores (which, I guess, is because each cross-validation fold is sent to one and only one core), while the other 27 sit there doing nothing. It is also using only 14 GB of the 60 GB of available memory.
How can I use the full capacity of my 32-core, 60 GB RAM server?
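If I understand gbm's parallelism correctly (an assumption on my part), it distributes work across CV folds only, so the number of busy cores is capped by cv.folds and the remaining cores stay idle:
d <- data.frame(y=rnorm(1000), x=rnorm(1000))
# With cv.folds=5, at most 5 worker cores are occupied regardless of
# n.cores; the final fit on the full data then runs on a single core.
fit <- gbm(y ~ x, data=d, distribution="gaussian",
           n.trees=1000, cv.folds=5, n.cores=5)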
"I have made some changes so this works when imported from other packages, as commented by their maintainers. In version 2.1.1 now on CRAN."
When building a gbm with high shrinkage, gbm.perf fails with an obscure error. Could the error be made clearer? In the example below the shrinkage is ridiculously high, but it happens on my real data with shrinkage=0.1.
Line 41 in 9c11f64
At the point of the line mentioned above, the x and y variables have been modified so that they contain the right data, i.e. the data to be used to fit the nth CV fold. The same thing happens to the offset variable, but not to the weights variable. Therefore the weights that are passed through are pretty much in a random order.
This is an easy fix AFAIK: you just need to add a line w <- w[i.train][i] and everyone is happy.
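For concreteness, a sketch of the proposed one-line fix (variable names follow the issue's description of the CV internals; the placement is hypothetical):
# Inside the code that prepares the nth CV fold, after x, y and offset
# have been subset and reordered in the same way:
w <- w[i.train][i]  # reorder the weights to match the fold's data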
Depending on the number of observations used for training and the setting of the shrinkage parameter, I encounter the following problem in gbm.fit():
> mdl <- gbm.fit(x=train.df[,1:(ncol(train.df)-1)],
+ y=train.df[,ncol(train.df)],
+ distribution = "multinomial",
+ .... [TRUNCATED]
Iter TrainDeviance ValidDeviance StepSize Improve
1 2.1972 nan 0.1000 0.6183
2 1.8424 nan 0.1000 0.3274
3 1.6543 nan 0.1000 0.2367
4 1.5193 nan 0.1000 0.1802
5 1.4185 nan 0.1000 0.1462
6 1.3352 nan 0.1000 0.1213
7 1.2657 nan 0.1000 0.0945
8 1.2104 nan 0.1000 0.0811
9 1.1629 nan 0.1000 0.0742
10 1.1200 nan 0.1000 0.0609
20 0.8972 nan 0.1000 0.0206
40 0.7478 nan 0.1000 0.0040
60 0.6810 nan 0.1000 0.0014
80 0.6490 nan 0.1000 nan
100 nan nan 0.1000 nan
120 nan nan 0.1000 nan
140 nan nan 0.1000 nan
150 nan nan 0.1000 nan
I have source code and data in a zip file that can reproduce this problem. How can I provide this?
The CoxPH C++ code that calculates deviance does not agree with the equation listed in the vignette. The function
double CCoxPH::Deviance
(
double *adT,
double *adDelta,
double *adOffset,
double *adWeight,
double *adF,
unsigned long cLength,
int cIdxOff
)
{
unsigned long i=0;
double dL = 0.0;
double dF = 0.0;
double dW = 0.0;
double dTotalAtRisk = 0.0;
dTotalAtRisk = 0.0;
for(i=cIdxOff; i<cLength+cIdxOff; i++)
{
dF = adF[i] + ((adOffset==NULL) ? 0.0 : adOffset[i]);
dTotalAtRisk += adWeight[i]*exp(dF);
if(adDelta[i]==1.0)
{
dL += adWeight[i]*(dF - log(dTotalAtRisk));
dW += adWeight[i];
}
}
return -2*dL/dW;
}
has two key values, dL and dW. dL is calculating
\sum_i w_i \delta_i (f(x_i) - \log R_i)
and dW is calculating
\sum_i w_i \delta_i
If we expand the formula for deviance in the vignette we get
-2 \sum_i w_i \delta_i [ f(x_i) - \log R_i + \log w_i ]
so the calculation for dL is valid, but dW seems to be wrong. We would want dW to be calculating
\sum_i w_i \delta_i \log w_i
and the final answer should combine the values dL and dW via
-2 * (dL + dW)
d <- data.frame(x=as.factor(1:20), y=1:20)
train <- d[1:10,]
test <- d[11:20,]
p <- rep(0, 10)
while(sum(abs(p)) == 0)
{
g <- gbm(y ~ x,
distribution="gaussian",
bag.fraction=1,
data=train,
n.trees=1,
shrinkage=1,
n.minobsinnode=1)
p <- predict(g, newdata=test, n.trees=1) - g$initF
}
pretty.gbm.tree(g, 1)
# SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
#0 0 0.0 1 2 3 62.5 10 0.0
#1 -1 -2.5 -1 -1 -1 0.0 5 -2.5
#2 -1 2.5 -1 -1 -1 0.0 5 2.5
#3 -1 0.0 -1 -1 -1 0.0 10 0.0
g$c.splits[[1]]
#[1] -1 -1 -1 -1 -1 1 1 1 1 1
print(p)
#[1] 0.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In this example the test data has 10 levels unseen during training. The predictions should all be 0.0 since (according to gbmentry.cpp) the intent is that new levels are treated as missing:
iCatSplitIndicator = INTEGER(
VECTOR_ELT(rCSplits,
(int)adSplitCode[iCurrentNode]))[(int)dX];
if(iCatSplitIndicator==-1)
{
iCurrentNode = aiLeftNode[iCurrentNode];
}
else if(iCatSplitIndicator==1)
{
iCurrentNode = aiRightNode[iCurrentNode];
}
else // categorical level not present in training
{
iCurrentNode = aiMissingNode[iCurrentNode];
}
The problem is that INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) is of length equal to the number of levels in the training data, and yet the program retrieves values at positions 11, 12, ..., 20, as these are the values of (int)dX in the test data. This can easily be verified by adding a printf. Surprisingly, there is no segfault. The values at the addresses immediately following INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) are in general not equal to -1 or 1, so in general the program correctly uses the missing node. However, if by random chance there is a -1 or 1, the record will be scored at the left or right child respectively.
This is more illustrative:
|-> shouldn't be accessing here
|
[-1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 3245, 1, 64, 2342, 93348, -34857, 82, -8634, 9, 239]
^by chance this is a 1, so level 12 goes to
the right child instead of the missing node
Dear Harry,
I am using the ‘dismo’ package to conduct boosted regression trees (BRT) for both binary and count data. The dismo package uses ‘gbm’ package for the implementation of BRT. I would like to incorporate two offset terms in the model, as well as being able to make predictions.
For the count data I am using a Poisson model. Based on a previous post (https://stat.ethz.ch/pipermail/r-help/2010-September/253647.html), I implemented the following code:
library(gbm)
library(dismo)
offset <- log(data$off1) + log(data$off2)  # equivalent to log(data$off1*data$off2)
m.pois <- gbm.step(data=data, gbm.x=7:8, gbm.y=4, offset=offset, family="poisson", tree.complexity=1, learning.rate=0.001, bag.fraction=0.7, n.folds=10)
link <- predict.gbm(m.pois, data, n.trees=n.trees, type="link")
link.offset <- link + offset
pred <- exp(link.offset)
My question is how to implement the same for a binomial model. I have tried looking in different forums and documentation without success. The only clue I have is the following document: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/inst/doc/gbm.pdf?revision=18&root=gbm&pathrev=22
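By analogy with the Poisson recipe above (my assumption, not documented behaviour), the bernoulli link is the logit, so one would add the offset on the link scale and back-transform with the inverse logit. Here m.bin stands for a hypothetical gbm.step fit with family="bernoulli":
link <- predict.gbm(m.bin, data, n.trees=n.trees, type="link")
pred <- plogis(link + offset)  # plogis() is the inverse logit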
Any advice and/or additional references on this issue would be more than welcome.
Thank you in advance,
Hi,
I was wondering if there are any plans to extend the survival models to competing-risks (Fine and Gray) models.
Thanks.
set.seed(20150312)
wh <- gbm(Species ~ ., data=iris, cv.folds=3)
ri1 <- relative.influence(wh)
set.seed(20150312)
wh <- gbm(Species ~ ., data=iris, cv.folds=3)
ri2 <- relative.influence(wh)
ri1
ri2
The non-reproducible results are apparently due to this line (277 in tree.cpp):
std::random_shuffle(colNumbers.begin(), colNumbers.end());
With many thanks to Paul Metcalfe.
Can someone shed some light on how gbm chooses candidate splits for continuous variables? I've understood previously that continuous variables are discretized to make candidate splits.
I've been reading through the C++, and narrowed my way to CNodeSearch::IncorporateObs. I've included what looks like the relevant bit of the code below (commenting out a few things):
// Evaluate the current split
// the newest observation is still in the right child
dCurrentSplitValue = 0.5*(dLastXValue + dX);
if((dLastXValue != dX) &&
(cCurrentLeftN >= cMinObsInNode) &&
(cCurrentRightN >= cMinObsInNode) &&
((lMonotone==0) ||
(lMonotone*(dCurrentRightSumZ*dCurrentLeftTotalW -
dCurrentLeftSumZ*dCurrentRightTotalW) > 0)))
{
dCurrentImprovement = // compute improvement
if(dCurrentImprovement > dBestImprovement)
{
// update the splitting variable, split value, improvement;
// and N, sumZ, and TotalW for left and right nodes.
}
}
I understand that each predictor is ordered, and I think dX is a value of a predictor for an individual, and dLastXValue is the lowest value of the predictor.
From the code, it looks like the candidate split is just the average of the current observation (dX) and the last. This seems to indicate that a candidate split is generated for every observation, and that there will be n-1 candidate splits for each predictor. Is this correct?
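If that reading is right, the midpoint rule is easy to illustrate in R (my interpretation of the C++, not taken from the documentation):
x <- sort(unique(c(0.2, 0.5, 0.5, 0.9, 1.4)))  # distinct ordered values
candidates <- (head(x, -1) + tail(x, -1)) / 2  # 0.35 0.70 1.15
# one candidate per pair of adjacent distinct values, i.e. up to n-1
# candidate splits per predictor when all n values are distinct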
Good morning.
I am a PhD student at the University of São Paulo in Brazil.
I am running the biomod2 algorithms in parallel using the "snowfall" package, but the execution of the GBM algorithm is failing.
This code:
model.sp <- try(gbm(formula = makeFormula(colnames(Data)[1],head(Data) [,expl_var_names,drop=FALSE], 'simple',0),
data = Data[calibLines,,drop=FALSE],
distribution = Options@GBM$distribution,
var.monotone = rep(0, length = ncol(Data)-2), # -2 because of removing of sp and weights
weights = Yweights,
interaction.depth = Options@GBM$interaction.depth,
n.minobsinnode = Options@GBM$n.minobsinnode,
shrinkage = Options@GBM$shrinkage,
bag.fraction = Options@GBM$bag.fraction,
train.fraction = Options@GBM$train.fraction,
n.trees = Options@GBM$n.trees,
verbose = Options@GBM$verbose,
#class.stratify.cv = Options@GBM$class.stratify.cv,
cv.folds = Options@GBM$cv.folds))#,
#n.cores=1)) ## to prevent from parallel issues
The model.sp object contains this message:
<simpleError in x[i]: object of type 'closure' is not subsettable
Error in gbmCluster(n.cores) : could not find function "detectCores"\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in gbmCluster(n.cores): could not find function "detectCores">
I don't know what is wrong. Can you help me?
Thank you
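A possible workaround (an assumption on my part): the error suggests detectCores() from the parallel package is not visible on the snowfall workers, so loading parallel on every worker, or forcing a single core, may avoid the gbmCluster() failure:
library(snowfall)
sfLibrary(parallel)  # make detectCores() available on every worker
# or sidestep cluster detection inside the worker call: gbm(..., n.cores=1)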
By mistake I specified cv.folds=1 and had a heck of a time debugging the following error:
Error in (function (formula = formula(data), distribution = "bernoulli", : object 'p' not found
Please create an error message for when the user specifies cv.folds=1.
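A minimal sketch of the requested guard (hypothetical wording and placement, early in gbm() before the CV machinery runs):
if (cv.folds == 1)
  stop("cv.folds must be 0 (no cross-validation) or at least 2; ",
       "cv.folds = 1 performs no cross-validation")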
Due to what appears to be a bug in the quantile() function from the stats package, quantile() doesn't seem to be a reliable way of determining whether a numeric vector is non-varying.
Here's the bug I filed against the quantile() function:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15746
...and here's where it gets used in gbm:
https://github.com/harrysouthworth/gbm/blob/master/R/gbm.fit.R#L90-L100
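Until quantile() is fixed, a hedged alternative check for a non-varying vector might look like this:
# TRUE when the vector has at most one distinct non-NA value; avoids
# quantile() entirely.
is_constant <- function(x) length(unique(x[!is.na(x)])) <= 1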
Sent to me by email.
I am testing 'gbm' on some new data. Using gbm v2.1-05, R 3.0.3 (using the GUI), Mac OS X 10.9.2. I've attached a .rds file with the test data. My response variable is a factor consisting of >40 levels. The predictors are a mix of categorical and numeric/integer. I get an error (actually, R crashes) using 'gbm.more'. I can replicate it with the following code:
require(gbm)
Loading required package: gbm
Loading required package: survival
Loading required package: splines
Loading required package: lattice
Loading required package: parallel
Loaded gbm 2.1-05
dset = readRDS("test.rds") # Attached .rds file
q = gbm(puma00~year+loc+hincp+age+race+educ+hstat+rent+mortgage,
distribution="multinomial",
data=dset,
interaction.depth=4,
shrinkage = 0.01,
n.minobsinnode=max(30,ceiling(nrow(dset)*0.001)),
n.cores=1,
n.trees=100)
q
gbm(formula = puma00 ~ year + loc + hincp + age + race + educ +
hstat + rent + mortgage, distribution = "multinomial", data = dset,
n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with multinomial loss function.
100 iterations were performed.
There were 9 predictors of which 9 had non-zero influence.
Error in apply(x$cv.fitted, 1, function(x, labels) { :
dim(X) must have a positive length
q2 = gbm.more(q, n.new.trees=100)
*** caught segfault ***
address 0x1c74a8000, cause 'memory not mapped'
Traceback:
1: .Call("gbm", Y = as.double(y), Offset = as.double(offset), X = as.double(x), X.order = as.integer(x.order), weights = as.double(w), Misc = as.double(Misc), cRows = as.integer(cRows), cCols = as.integer(cCols), var.type = as.integer(object$var.type), var.monotone = as.integer(object$var.monotone), distribution = as.character(distribution.call.name), n.trees = as.integer(n.new.trees), interaction.depth = as.integer(object$interaction.depth), n.minobsinnode = as.integer(object$n.minobsinnode), n.classes = as.integer(object$num.classes), shrinkage = as.double(object$shrinkage), bag.fraction = as.double(object$bag.fraction), train.fraction = as.integer(nTrain), fit.old = as.double(object$fit), n.cat.splits.old = as.integer(length(object$c.splits)), n.trees.old = as.integer(object$n.trees), verbose = as.integer(verbose), PACKAGE = "gbm")
2: gbm.more(q, n.new.trees = 100)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
q = gbm(hincp~year+loc+age+race+educ+hstat+rent+mortgage,
distribution="gaussian",
data=dset,
interaction.depth=4,
shrinkage = 0.01,
n.minobsinnode=max(30,ceiling(nrow(dset)*0.001)),
n.cores=1,
n.trees=100)
q
gbm(formula = hincp ~ year + loc + age + race + educ + hstat +
rent + mortgage, distribution = "gaussian", data = dset,
n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with gaussian loss function.
100 iterations were performed.
There were 8 predictors of which 0 had non-zero influence.
Summary of cross-validation residuals:
0% 25% 50% 75% 100%
NA NA NA NA NA
Cross-validation pseudo R-squared: 1
summary(q)
Error in plot.window(xlim, ylim, log = log, ...) :
need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
q2 = gbm.more(q, n.new.trees=100) # No error
Any ideas? Maybe a pointer issue when distribution="multinomial"?
Many thanks,
Kevin
For example:
df <- data.frame(
x = runif(100),
y = runif(100),
z = sample(0:1, 100, replace = TRUE)
)
trained_gbm <- gbm.fit(df[, c("x", "y")], df$z)
trained_gbm
Hi there
Not sure if this is a gbm-only issue, but the gbm library failed to load and I got this error.
Error : .onAttach failed in attachNamespace() for 'gbm', details:
call: formatDL(nm, txt, indent = max(nchar(nm, "w")) + 3)
error: incorrect values of 'indent' and 'width'
Luckily, someone else who had the problem had already solved it. It turned out to be because my RStudio console window was too small.
See here for details.
https://stackoverflow.com/questions/29151985/unable-to-attache-the-library-gbm-to-the-r-project