harrysouthworth / gbm
This project forked from gbm-developers/gbm3
Gradient boosted models
License: Other
I'm interested in getting predictions expressed as expected lifetimes, survival curves, etc. Could someone please explain what the output of predict.gbm is? Are these perhaps the (log) partial hazards?
Your help is much appreciated.
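For what it's worth, my understanding (not confirmed by the maintainers) is that for distribution="coxph" the predictions are on the log relative-hazard scale, with no baseline hazard included, so survival curves need a separate baseline estimate. A minimal sketch, where fit, newdat and best.iter are hypothetical placeholders:
lp <- predict(fit, newdata=newdat, n.trees=best.iter)  # f(x): log relative hazard
rel.hazard <- exp(lp)  # a relative hazard, not a survival probability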
Sent to my inbox. The first example in the help file for gbm fails on some runs and completes fine on others.
The issue is caused by one of the models in cv.models having all NaNs or Infs in its valid.error element. Those values come out of the .Call("gbm", ...) call looking like that in gbm.fit.R:
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1^1.5 + 2 * (X2^0.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6, # formula
data=data, # dataset
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
# +1: monotone increase,
# 0: no monotone restrictions
distribution="gaussian", # see the help for other choices
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
#0.001 to 0.1 usually work
interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
train.fraction = 0.5, # fraction of data for training,
# first train.fraction*N used for training
n.minobsinnode = 10, # minimum total weight needed in each node
cv.folds = 3, # do 3-fold cross-validation
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=FALSE, # don't print out progress
n.cores=1) # use only a single core (detecting #cores is error-prone, so avoided here)
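A quick way to see the failure (a hedged sketch; valid.error is an element of the fitted object) is to check for non-finite validation deviances after fitting:
any(!is.finite(gbm1$valid.error))    # TRUE on the failing runs
which(!is.finite(gbm1$valid.error))  # iterations where the deviance blew up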
Hi team,
I am just revisiting some old code that used the caret package as a wrapper to gbm. I noticed it was breaking, and it seems that we now need to specify "n.minobsinnode" (specifically, the tuning grid previously required just interaction.depth, n.trees, and shrinkage). Has there been an update to gbm that I missed? Or is this, perhaps, a change in caret?
Thanks for any help you can provide,
Adam
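For reference, a minimal sketch of a tuning grid that satisfies current caret (assuming a recent caret version; the exact column requirements may vary):
library(caret)
# All four columns are now required; older grids that omit n.minobsinnode fail.
grid <- expand.grid(n.trees=c(100, 500),
                    interaction.depth=c(1, 3),
                    shrinkage=0.1,
                    n.minobsinnode=10)
fit <- train(mpg ~ ., data=mtcars, method="gbm",
             tuneGrid=grid, verbose=FALSE)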
Would really appreciate the ability to use intervals in the Surv class: instead of passing Surv(time, censor indicator), it would be great to pass Surv(start time, end time, censor indicator). This would require rewriting some of the C++ code, because adding this flexibility implies that at a particular time t not all records are part of the risk set. In the current version, as long as the record is not censored it is in the risk set, but with a start time and end time we must have t >= start time and t < end time in order to be in the risk set.
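For reference, the survival package already supports this counting-process form, which is the interface requested here; a sketch using the cgd data bundled with survival:
library(survival)
# A record enters the risk set at tstart and leaves at tstop, so at any
# event time only a subset of records is at risk.
fit <- coxph(Surv(tstart, tstop, status) ~ treat, data=cgd)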
... and similarly for other distributions. Whilst this doesn't affect the functionality of the package, it is a bit inconvenient.
Could the requirement be lowered to factors with two levels, and implicit conversion done within gbm.fit() (or whatever function from within the package is being used)? Likewise, in the corresponding predict() method, could a factor be returned instead of 0's and 1's?
g <- gbm(y ~ x,
distribution="gaussian",
train.fraction=1,
bag.fraction=0.1,
data=data.frame(x=as.factor(1:100), y=rnorm(100)),
n.trees=1,
n.minobsinnode=1)
g$c.splits[[1]]
[1] 1 1 -1 1 1 1 1 -1 1 -1 1 1 -1 1 1 1 1 1 1 1 1 1
[23] 1 -1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[45] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[67] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[89] 1 1 1 1 1 1 1 1 1 1 1 1
In this example 5 levels are assigned to the left child and 95 are assigned to the right child. Of those at the right child, 90 will have had zero weight while training (since bag.fraction=0.1). It's an artificial example, but a node having zero weight for a particular level is very possible when training a real model with bag.fraction < 1. I'm wondering if this could be an issue, since the tree is (in expectation) over-predicting for zero-weight levels. Granted, later trees will attempt to correct for any over-prediction, but perhaps in these circumstances it would be more reasonable to use the prediction at the parent node, since there is no data to suggest whether the zero-weight levels should be assigned to the left or right child.
I'm experiencing a significant memory leak in 2.1-0.3 with the "laplace" distribution, similar to what is described here:
https://code.google.com/p/gradientboostedmodels/issues/detail?id=32
The file laplace.cpp was fixed at line 97 in 2.1-0.3 to address this. But although I'm running 2.1-0.3 (at least that's what library(gbm) says), I'm still getting a memory leak.
The training set has 280K rows with 11 columns, and the parameters are:
gbm.fit (trainingSet, outcomes, nTrain = 279870, distribution = "laplace", interaction.depth = 4, n.trees = 1000)
which uses 1.1 GB of working set and 4.7 GB of committed memory. Very roughly, adding 2000 more trees uses an additional 2 GB of working set and 4-5 GB of committed memory. But when the distribution is changed to "gaussian", the working set and committed memory stay around 550 MB, no matter how many trees.
I tried building the master branch, but gbm.fit() doesn't appear to work:
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0010 nan
2 nan nan 0.0010 nan
3 nan nan 0.0010 nan
(R version 3.0.2, gbm 2.1-0.3, Windows 7 64-bit.)
Hi,
First, thank you for your great work!
Running R Studio and GBM on my 600,000 x 10 data set with cross-validation = 5 takes close to 24 hours on my i5 8GB desktop. So I thought I would try Amazon AWS.
I started up an AWS Windows instance with 32 cores and 60 GB of RAM to run the same GBM job mentioned above, but it does not seem to be running much quicker. It seems to be using only 4 or 5 cores (which, I guess, is because each cross-validation fold is sent to one and only one core), while the other 27 sit there doing nothing. It is also using only 14 GB of the 60 GB of available memory.
How can I use the full capacity of my 32-core, 60 GB RAM server?
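If I understand gbm's parallelism correctly (an assumption on my part), it distributes work across CV folds only, so the number of busy cores is capped by cv.folds and the remaining cores stay idle:
d <- data.frame(y=rnorm(1000), x=rnorm(1000))
# With cv.folds=5, at most 5 worker cores are occupied regardless of
# n.cores; the final fit on the full data then runs on a single core.
fit <- gbm(y ~ x, data=d, distribution="gaussian",
           n.trees=1000, cv.folds=5, n.cores=5)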
"I have made some changes so this works when imported from other packages, as commented by their maintainers. In version 2.1.1 now on CRAN."
When building a gbm with high shrinkage, gbm.perf fails with an obscure error. Could the error be made clearer? In the example below the shrinkage is ridiculously high, but it happens on my real data with shrinkage=0.1.
Line 41 in 9c11f64
At the point of the line mentioned above, the x and y variables have been modified so that they contain the right data, i.e. the data to be used to fit the nth CV fold. The same thing happens to the offset variable, but not to the weights variable. Therefore the weights that are passed through are pretty much in a random order.
This is an easy fix AFAIK: you just need to add a line w <- w[i.train][i] and everyone is happy.
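For concreteness, a sketch of the proposed one-line fix (variable names follow the issue's description of the CV internals; the placement is hypothetical):
# Inside the code that prepares the nth CV fold, after x, y and offset
# have been subset and reordered in the same way:
w <- w[i.train][i]  # reorder the weights to match the fold's data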
Depending on the number of observations used for training and the setting of the shrinkage parameter, I encounter the following problem in gbm.fit():
> mdl <- gbm.fit(x=train.df[,1:(ncol(train.df)-1)],
+ y=train.df[,ncol(train.df)],
+ distribution = "multinomial",
+ .... [TRUNCATED]
Iter TrainDeviance ValidDeviance StepSize Improve
1 2.1972 nan 0.1000 0.6183
2 1.8424 nan 0.1000 0.3274
3 1.6543 nan 0.1000 0.2367
4 1.5193 nan 0.1000 0.1802
5 1.4185 nan 0.1000 0.1462
6 1.3352 nan 0.1000 0.1213
7 1.2657 nan 0.1000 0.0945
8 1.2104 nan 0.1000 0.0811
9 1.1629 nan 0.1000 0.0742
10 1.1200 nan 0.1000 0.0609
20 0.8972 nan 0.1000 0.0206
40 0.7478 nan 0.1000 0.0040
60 0.6810 nan 0.1000 0.0014
80 0.6490 nan 0.1000 nan
100 nan nan 0.1000 nan
120 nan nan 0.1000 nan
140 nan nan 0.1000 nan
150 nan nan 0.1000 nan
I have source code and data in a zip file that can reproduce this problem. How can I provide this?
The CoxPH C++ code that calculates deviance does not agree with the equation listed in the vignette. The function
double CCoxPH::Deviance
(
double *adT,
double *adDelta,
double *adOffset,
double *adWeight,
double *adF,
unsigned long cLength,
int cIdxOff
)
{
unsigned long i=0;
double dL = 0.0;
double dF = 0.0;
double dW = 0.0;
double dTotalAtRisk = 0.0;
dTotalAtRisk = 0.0;
for(i=cIdxOff; i<cLength+cIdxOff; i++)
{
dF = adF[i] + ((adOffset==NULL) ? 0.0 : adOffset[i]);
dTotalAtRisk += adWeight[i]*exp(dF);
if(adDelta[i]==1.0)
{
dL += adWeight[i]*(dF - log(dTotalAtRisk));
dW += adWeight[i];
}
}
return -2*dL/dW;
}
has two key values, dL and dW. dL is calculating
\sum_i w_i \delta_i (f(x_i) - \log R_i)
and dW is calculating
\sum_i w_i \delta_i
If we expand the formula for deviance in the vignette we get
-2 \sum_i w_i \delta_i [ f(x_i) - \log R_i + \log w_i ]
so the calculation for dL is valid, but dW seems to be wrong. We would want dW to be calculating
\sum_i w_i \delta_i \log w_i
and the final answer should combine the values dL and dW via
-2 * (dL + dW)
d <- data.frame(x=as.factor(1:20), y=1:20)
train <- d[1:10,]
test <- d[11:20,]
p <- rep(0, 10)
while(sum(abs(p)) == 0)
{
g <- gbm(y ~ x,
distribution="gaussian",
bag.fraction=1,
data=train,
n.trees=1,
shrinkage=1,
n.minobsinnode=1)
p <- predict(g, newdata=test, n.trees=1) - g$initF
}
pretty.gbm.tree(g, 1)
# SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
#0 0 0.0 1 2 3 62.5 10 0.0
#1 -1 -2.5 -1 -1 -1 0.0 5 -2.5
#2 -1 2.5 -1 -1 -1 0.0 5 2.5
#3 -1 0.0 -1 -1 -1 0.0 10 0.0
g$c.splits[[1]]
#[1] -1 -1 -1 -1 -1 1 1 1 1 1
print(p)
#[1] 0.0 2.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In this example the test data has 10 levels unseen during training. The predictions should all be 0.0 since (according to gbmentry.cpp) the intent is that new levels are treated as missing:
iCatSplitIndicator = INTEGER(
VECTOR_ELT(rCSplits,
(int)adSplitCode[iCurrentNode]))[(int)dX];
if(iCatSplitIndicator==-1)
{
iCurrentNode = aiLeftNode[iCurrentNode];
}
else if(iCatSplitIndicator==1)
{
iCurrentNode = aiRightNode[iCurrentNode];
}
else // categorical level not present in training
{
iCurrentNode = aiMissingNode[iCurrentNode];
}
The problem is that INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) is of length equal to the number of levels in the training data, and yet the program retrieves values at positions 11, 12, ..., 20, as these are the values of (int)dX in the test data. This can easily be verified by adding a printf. Surprisingly, there is no segfault. The values at the addresses immediately following INTEGER(VECTOR_ELT(rCSplits, (int)adSplitCode[iCurrentNode])) are in general not equal to -1 or 1, so in general the program correctly uses the missing node. However, if by random chance there is a -1 or 1, the record will be scored at the left or right child respectively.
This is more illustrative:
|-> shouldn't be accessing here
|
[-1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 3245, 1, 64, 2342, 93348, -34857, 82, -8634, 9, 239]
^by chance this is a 1, so level 12 goes to
the right child instead of the missing node
Dear Harry,
I am using the ‘dismo’ package to conduct boosted regression trees (BRT) for both binary and count data. The dismo package uses ‘gbm’ package for the implementation of BRT. I would like to incorporate two offset terms in the model, as well as being able to make predictions.
For the count data I am using a Poisson model. Based on a previous post (https://stat.ethz.ch/pipermail/r-help/2010-September/253647.html), I implemented the following code:
library(gbm)
library(dismo)
offset <- log(data$off1) + log(data$off2)  # equivalent to log(data$off1*data$off2)
m.pois <- gbm.step(data=data, gbm.x=7:8, gbm.y=4, offset=offset, family="poisson", tree.complexity=1, learning.rate=0.001, bag.fraction=0.7, n.folds=10)
link <- predict.gbm(m.pois, data, n.trees=n.trees, type="link")
link.offset <- link + offset
pred <- exp(link.offset)
My question is how to implement the same for a binomial model. I have tried looking in different forums and documentation without success. The only clue I have is the following document: https://r-forge.r-project.org/scm/viewvc.php/*checkout*/pkg/inst/doc/gbm.pdf?revision=18&root=gbm&pathrev=22
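By analogy with the Poisson recipe above (my assumption, not documented behaviour), the bernoulli link is the logit, so one would add the offset on the link scale and back-transform with the inverse logit. Here m.bin stands for a hypothetical gbm.step fit with family="bernoulli":
link <- predict.gbm(m.bin, data, n.trees=n.trees, type="link")
pred <- plogis(link + offset)  # plogis() is the inverse logit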
Any advice and/or additional references on this issue would be more than welcome.
Thank you in advance,
Hi,
I was wondering if there are any plans to extend the survival models to competing-risks (Fine and Gray) models.
Thanks.
set.seed(20150312)
wh <- gbm(Species ~ ., data=iris, cv.folds=3)
ri1 <- relative.influence(wh)
set.seed(20150312)
wh <- gbm(Species ~ ., data=iris, cv.folds=3)
ri2 <- relative.influence(wh)
ri1
ri2
The non-reproducible results are apparently due to this line (277 in tree.cpp):
std::random_shuffle(colNumbers.begin(), colNumbers.end());
With many thanks to Paul Metcalfe.
Can someone shed some light on how gbm chooses candidate splits for continuous variables? I've understood previously that continuous variables are discretized to make candidate splits.
I've been reading through the C++, and narrowed my way to CNodeSearch::IncorporateObs. I've included what looks like the relevant bit of the code below (commenting out a few things):
// Evaluate the current split
// the newest observation is still in the right child
dCurrentSplitValue = 0.5*(dLastXValue + dX);
if((dLastXValue != dX) &&
(cCurrentLeftN >= cMinObsInNode) &&
(cCurrentRightN >= cMinObsInNode) &&
((lMonotone==0) ||
(lMonotone*(dCurrentRightSumZ*dCurrentLeftTotalW -
dCurrentLeftSumZ*dCurrentRightTotalW) > 0)))
{
dCurrentImprovement = // compute improvement
if(dCurrentImprovement > dBestImprovement)
{
// update the splitting variable, split value, improvement;
// and N, sumZ, and TotalW for left and right nodes.
}
}
I understand that each predictor is ordered, and I think dX is a value of a predictor for an individual, and dLastXValue is the lowest value of the predictor.
From the code, it looks like the candidate split is just the average of the current observation (dX) and the last. This seems to indicate that a candidate split is generated for every observation, and that there will be n-1 candidate splits for each predictor. Is this correct?
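If that reading is right, the midpoint rule is easy to illustrate in R (my interpretation of the C++, not taken from the documentation):
x <- sort(unique(c(0.2, 0.5, 0.5, 0.9, 1.4)))  # distinct ordered values
candidates <- (head(x, -1) + tail(x, -1)) / 2  # 0.35 0.70 1.15
# one candidate per pair of adjacent distinct values, i.e. up to n-1
# candidate splits per predictor when all n values are distinct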
Good morning.
I am a PhD student at the University of São Paulo in Brazil.
I am running the biomod2 algorithms in parallel using the "snowfall" package, but the execution of the GBM algorithm is failing.
This code:
model.sp <- try(gbm(formula = makeFormula(colnames(Data)[1],head(Data) [,expl_var_names,drop=FALSE], 'simple',0),
data = Data[calibLines,,drop=FALSE],
distribution = Options@GBM$distribution,
var.monotone = rep(0, length = ncol(Data)-2), # -2 because of removing of sp and weights
weights = Yweights,
interaction.depth = Options@GBM$interaction.depth,
n.minobsinnode = Options@GBM$n.minobsinnode,
shrinkage = Options@GBM$shrinkage,
bag.fraction = Options@GBM$bag.fraction,
train.fraction = Options@GBM$train.fraction,
n.trees = Options@GBM$n.trees,
verbose = Options@GBM$verbose,
#class.stratify.cv = Options@GBM$class.stratify.cv,
cv.folds = Options@GBM$cv.folds))#,
#n.cores=1)) ## to prevent from parallel issues
The model.sp object contains this message:
<simpleError in x[i]: object of type 'closure' is not subsettable
Error in gbmCluster(n.cores) : could not find function "detectCores"\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in gbmCluster(n.cores): could not find function "detectCores">
I don't know what is wrong. Can you help me?
Thank you
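A possible workaround (an assumption on my part): the error suggests detectCores() from the parallel package is not visible on the snowfall workers, so loading parallel on every worker, or forcing a single core, may avoid the gbmCluster() failure:
library(snowfall)
sfLibrary(parallel)  # make detectCores() available on every worker
# or sidestep cluster detection inside the worker call: gbm(..., n.cores=1)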
By mistake I specified cv.folds=1 and had a heck of a time debugging the following error:
Error in (function (formula = formula(data), distribution = "bernoulli", : object 'p' not found
Please create an error message for when the user specifies cv.folds=1.
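A minimal sketch of the requested guard (hypothetical wording and placement, early in gbm() before the CV machinery runs):
if (cv.folds == 1)
  stop("cv.folds must be 0 (no cross-validation) or at least 2; ",
       "cv.folds = 1 performs no cross-validation")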
Due to what appears to be a bug in the quantile() function from the stats package, quantile() doesn't seem to be a reliable way of determining whether a numeric vector is non-varying.
Here's the bug I filed against the quantile() function:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=15746
...and here's where it gets used in gbm:
https://github.com/harrysouthworth/gbm/blob/master/R/gbm.fit.R#L90-L100
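Until quantile() is fixed, a hedged alternative check for a non-varying vector might look like this:
# TRUE when the vector has at most one distinct non-NA value; avoids
# quantile() entirely.
is_constant <- function(x) length(unique(x[!is.na(x)])) <= 1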
Sent to me by email.
I am testing 'gbm' on some new data. Using gbm v2.1-05, R 3.0.3 (using the GUI), Mac OS X 10.9.2. I've attached a .rds file with the test data. My response variable is a factor consisting of >40 levels. The predictors are a mix of categorical and numeric/integer. I get an error (actually, R crashes) using 'gbm.more'. I can replicate it with the following code:
require(gbm)
Loading required package: gbm
Loading required package: survival
Loading required package: splines
Loading required package: lattice
Loading required package: parallel
Loaded gbm 2.1-05
dset = readRDS("test.rds") # Attached .rds file
q = gbm(puma00~year+loc+hincp+age+race+educ+hstat+rent+mortgage,
distribution="multinomial",
data=dset,
interaction.depth=4,
shrinkage = 0.01,
n.minobsinnode=max(30,ceiling(nrow(dset)*0.001)),
n.cores=1,
n.trees=100)
q
gbm(formula = puma00 ~ year + loc + hincp + age + race + educ +
hstat + rent + mortgage, distribution = "multinomial", data = dset,
n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with multinomial loss function.
100 iterations were performed.
There were 9 predictors of which 9 had non-zero influence.
Error in apply(x$cv.fitted, 1, function(x, labels) { :
dim(X) must have a positive length
q2 = gbm.more(q, n.new.trees=100)
*** caught segfault ***
address 0x1c74a8000, cause 'memory not mapped'
Traceback:
1: .Call("gbm", Y = as.double(y), Offset = as.double(offset), X = as.double(x), X.order = as.integer(x.order), weights = as.double(w), Misc = as.double(Misc), cRows = as.integer(cRows), cCols = as.integer(cCols), var.type = as.integer(object$var.type), var.monotone = as.integer(object$var.monotone), distribution = as.character(distribution.call.name), n.trees = as.integer(n.new.trees), interaction.depth = as.integer(object$interaction.depth), n.minobsinnode = as.integer(object$n.minobsinnode), n.classes = as.integer(object$num.classes), shrinkage = as.double(object$shrinkage), bag.fraction = as.double(object$bag.fraction), train.fraction = as.integer(nTrain), fit.old = as.double(object$fit), n.cat.splits.old = as.integer(length(object$c.splits)), n.trees.old = as.integer(object$n.trees), verbose = as.integer(verbose), PACKAGE = "gbm")
2: gbm.more(q, n.new.trees = 100)
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
q = gbm(hincp~year+loc+age+race+educ+hstat+rent+mortgage,
distribution="gaussian",
data=dset,
interaction.depth=4,
shrinkage = 0.01,
n.minobsinnode=max(30,ceiling(nrow(dset)*0.001)),
n.cores=1,
n.trees=100)
q
gbm(formula = hincp ~ year + loc + age + race + educ + hstat +
rent + mortgage, distribution = "gaussian", data = dset,
n.trees = 100, interaction.depth = 4, n.minobsinnode = max(30,
ceiling(nrow(dset) * 0.001)), shrinkage = 0.01, n.cores = 1)
A gradient boosted model with gaussian loss function.
100 iterations were performed.
There were 8 predictors of which 0 had non-zero influence.
Summary of cross-validation residuals:
0% 25% 50% 75% 100%
NA NA NA NA NA
Cross-validation pseudo R-squared: 1
summary(q)
Error in plot.window(xlim, ylim, log = log, ...) :
need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
q2 = gbm.more(q, n.new.trees=100) # No error
Any ideas? Maybe a pointer issue when distribution="multinomial"?
Many thanks,
Kevin
For example:
df <- data.frame(
x = runif(100),
y = runif(100),
z = sample(0:1, 100, replace = TRUE)
)
trained_gbm <- gbm.fit(df[, c("x", "y")], df$z)
trained_gbm
Hi there
Not sure if this is a gbm-only issue, but the gbm library failed to load and I got this error.
Error : .onAttach failed in attachNamespace() for 'gbm', details:
call: formatDL(nm, txt, indent = max(nchar(nm, "w")) + 3)
error: incorrect values of 'indent' and 'width'
Luckily, someone else who had the problem had already solved it. It turned out to be because my RStudio console window was too small.
See here for details.
https://stackoverflow.com/questions/29151985/unable-to-attache-the-library-gbm-to-the-r-project