
bartmachine's People

Contributors

brooksambrose, jbleich89, kapelner, moserware, olivroy, rdiaz02


bartmachine's Issues

negative predicted probabilities

Hi, thank you for developing such a fantastic package. I find bartMachine very convenient to use. One question I have about y_hat_train in the bartMachine outcome object and the calc_credible_intervals function is that both of them can give negative predicted probabilities when I use a binary outcome as the response variable. Presumably, given that the default model for binary outcomes in bartMachine is probit, shouldn't the $\hat{f}(x)$ values be positive probabilities?
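A hedged way to check the scale: if the draws are latent probit values f(x) rather than probabilities, pushing them through the normal CDF should recover values in [0, 1]. Here bm and X stand in for your fitted model and data:

post <- bart_machine_get_posterior(bm, new_data = X)$y_hat_posterior_samples
range(post)           # negative values suggest the latent probit scale
probs <- pnorm(post)  # the normal CDF maps probit draws to probabilities
range(probs)          # now guaranteed to lie in [0, 1]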

Multimodal posterior interpretation

Hello, I don't know if I should raise an issue, since this is more of a theoretical problem.

I am using BART for causal prediction with a dichotomous outcome, comparing observation-level predictions after setting a specific predictor to various counterfactual values. I noticed that the posterior distribution of the average prediction (across all observed individuals) is sometimes multimodal.
I guess this may be due to the natural discreteness of a tree-based distribution, where a different choice near the root of a tree may cause a totally different structure in the rest of it.

How should I interpret these multimodal posteriors? That there are many possible effects and the model cannot decide among them? That there is a possible interaction with other variables? Or might it be a by-product of forcing a predictor value on an individual for which that value is unlikely given the other covariates?
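For concreteness, a minimal sketch of how such a posterior of the average counterfactual prediction might be computed and inspected; bm, X, and the predictor name treat are illustrative placeholders:

X_cf <- X
X_cf$treat <- 1  # force the predictor to a counterfactual value
post <- bart_machine_get_posterior(bm, new_data = X_cf)$y_hat_posterior_samples
avg_pred <- colMeans(post)  # one average prediction per posterior draw
plot(density(avg_pred))     # multimodality shows up in this density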

Error with installation from source

Hello,

I guess there is a typo in the checkmate package's name:

ERROR: dependency ‘chekmate’ is not available for package ‘bartMachine’

Using bartMachine inside a package

Hello!

I am trying to use bartMachine as part of a larger package, but I'm struggling with Java memory management.

When used inside a package, I never run library(bartMachine), and even if the java.parameters option is set, the information is never passed to the underlying Java virtual machine.

Trying to emulate the loading of the package, I run:

libname <- list.files(R.home(), 'library', full.names = T)
rJava::.jpackage('bartMachine', lib.loc = libname)
rJava::.jpackage('bartMachineJARs', lib.loc = libname)

after setting the memory option. Nevertheless, running rJava::.jcall(rJava::.jnew("java/lang/Runtime"), "J", "maxMemory") / 1e9 afterwards shows me the default amount of memory.

To give more info, this is my workflow:

  • First, check whether java.parameters is set; if not, ask the user to choose an amount of memory to use.
  • Then there's the .jpackage() code as above, but I'm not sure it's doing anything.
  • If not done already, the user chooses the number of cores to use.
  • The BART model is run.
  • I extract the posterior from it using bart_machine_get_posterior().
  • I extract the variable importance using get_var_props_over_chain(), which is where I get the memory problem:
Error in `.jcall(bart_machine$java_bart_machine, "[D", "getAttributeProps", 
    type)`: java.lang.OutOfMemoryError: Java heap space

Is it possible to use bartMachine inside a package without loading and attaching it?
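One hedged possibility, since the JVM heap can only be sized before rJava initializes it: set the option in your package's .onLoad() before anything touches rJava. A sketch, not a tested recipe (the 8g figure is an assumption):

# In the importing package's zzz.R; runs before the JVM exists,
# assuming nothing else in the session has initialized rJava yet.
.onLoad <- function(libname, pkgname) {
  if (is.null(getOption("java.parameters"))) {
    options(java.parameters = "-Xmx8g")
  }
  # Loading bartMachine's namespace afterwards starts the JVM with
  # the parameters set above.
  requireNamespace("bartMachine", quietly = TRUE)
}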

How to use printTreeIllustations

I would like to plot some sample trees in the BART model. Is "printTreeIllustations" used for visualizing trees?

I uncommented ".jcall(java_bart_machine, "V", "printTreeIllustations")" in "build_bart_machine". However, when I ran it, I got an error: method printTreeIllustations with signature ()V not found.

How can I fix it? Or is there another function for visualizing sample trees?

Thanks!

Min

This bartMachine object was loaded from an R image but was not serialized. Please build bartMachine using the option "serialize = TRUE" next time.

Hi,

This bartMachine object was loaded from an R image but was not serialized.
  Please build bartMachine using the option "serialize = TRUE" next time.

I've seen the error message being reported elsewhere, but the solution has been to ensure the bartMachine version used is the same version the model was built with. For version 1.2.6, this does not remove the error message.

Thank you!

Java memory using problem

I ran a BART model with 11,000 samples and 20 features (half of them categorical). My Mac has 8GB of RAM. At first, I set the memory to 5000 MB via the function set_bart_machine_memory(5000).

I can then fit a model through the bartMachine function once. If I try to run another model, R returns an error like this:

Exception in thread "pool-10-thread-1" Exception in thread "pool-10-thread-3"
java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-10-thread-2" java.lang.OutOfMemoryError: Java heap space
Exception in thread "pool-10-thread-4" java.lang.OutOfMemoryError: Java heap space
Error in .jcall(bart_machine$java_bart_machine, "Z", "isDestroyed") :
java.lang.OutOfMemoryError: Java heap space

I think that having two bartMachine objects in memory may not be a good idea, so I killed the first model with destroy_bart_machine(), and the second model then ran fine.

The main problem is with bartMachineCV(). There are about 20 models to fit by default, and a memory error like the one above hits me while R is running the BART model with the second set of parameters (that is: bartMachine CV try: k: 2 nu, q: 3, 0.9 m: 200).

Does the bartMachineCV() function run all 20 or more models, keep all of them in memory, and then pick the one with the best RMSE? That would be a problem for computers with limited memory.

If bartMachineCV() could instead finish the first model, save its RMSE, destroy the first BART object in memory, then run the second model, and so on until all the CV models are finished, it would end up with the 20 RMSE values, could pick the best one, and then refit only that best model. It would take a little more time but save a lot of memory. Is that a good idea? Or is there some way to run bartMachineCV() on an 8GB RAM computer?

Thanks.
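A minimal sketch of the sequential strategy proposed above, assuming a small illustrative grid over k and num_trees and using k_fold_cv() to score one combination at a time, so only one set of models lives on the Java heap at once:

grid <- expand.grid(k = c(2, 3, 5), num_trees = c(50, 200))
rmses <- numeric(nrow(grid))
for (g in seq_len(nrow(grid))) {
  oos <- k_fold_cv(X, y, k_folds = 5,
                   k = grid$k[g], num_trees = grid$num_trees[g],
                   verbose = FALSE)
  rmses[g] <- oos$rmse
  gc()  # nudge R (and the JVM references it holds) to free the fold models
}
best <- grid[which.min(rmses), ]
final_model <- bartMachine(X, y, k = best$k, num_trees = best$num_trees)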

Out of Memory error

Hi, I'm interested in using bartMachine to build a BART model to explore interactions with the interaction_investigator function. My dataset is ~200,000 rows and 49 variables. However, even on a subset of this dataset (~20,000 rows and 9 columns), running bartMachine(predictors, y, num_trees = 20) gives an error:

Error in .jarray(model_matrix_training_data, dispatch = TRUE) : java.lang.OutOfMemoryError: Java heap space

Is bartMachine able to deal with a dataset of this size or is it intended for small datasets? If it can deal with this, are there any tips/workarounds for avoiding this error?

Thanks for your help!
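The usual hedged first step is to size the Java heap before the package loads, since the JVM starts when bartMachine is attached; the 10g figure below is an assumption to adapt to your machine:

options(java.parameters = "-Xmx10g")  # must run before library(bartMachine)
library(bartMachine)
bm <- bartMachine(predictors, y, num_trees = 20)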

Classification probabilities reversed

The package is reversing either the binary classifications or the class probabilities. Whether I'm using the package through mlr (installed from master today via devtools) or bartMachine itself from CRAN, here's what happens when I run the example from the vignette on R 3.2.3:

> data("Pima.te", package = "MASS")
> X <- data.frame(Pima.te[, -8])
> y <- Pima.te[, 8]
> bartMachine(X, y)
bartMachine initializing with 50 trees...
bartMachine vars checked...
bartMachine java init...
bartMachine factors created...
bartMachine before preprocess...
bartMachine after preprocess... 8 total features...
bartMachine sigsq estimated...
bartMachine training data finalized...
Now building bartMachine for classification ...
evaluating in sample data...
Iteration 100/1250
Iteration 200/1250
Iteration 300/1250
Iteration 400/1250
Iteration 500/1250
Iteration 600/1250
Iteration 700/1250
Iteration 800/1250
Iteration 900/1250
Iteration 1000/1250
Iteration 1100/1250
Iteration 1200/1250
done building BART in 2.11 sec 

burning and aggregating chains from all threads... done
done
bartMachine v1.2.2 for classification

training data n = 332 and p = 7 
built in 2.4 secs on 1 core, 50 trees, 250 burn-in and 1000 post. samples

confusion matrix:

           predicted No predicted Yes model errors
actual No        13.000       210.000        0.942
actual Yes       71.000        38.000        0.651
use errors        0.845         0.847        0.846

Obviously not a great classification there...

I looked at #10, which seems like a similar problem, but that should be fixed in 1.2.2, right? Is this a new bug?

Memory Allocation for bartMachine

Greetings,

I have spent a lot of time researching how to fix the memory issue in bartMachine, but nothing that I have tried has worked.

Interestingly enough, it worked on the first try last night. Then I saved my script and logged off. Today I went back in, and nothing is working. I have started a new session with no packages loaded and run the "java.parameters" code BEFORE library(bartMachine) a few dozen times; nothing works. During some iterations it works, but upon repeating the exact same steps that made it work, it fails again.

I have tried every suggestion here https://stackoverflow.com/questions/34624002/r-error-java-lang-outofmemoryerror-java-heap-space and here #5. Is there something I am missing? My machine has 64GB; I have been allocating up to 50GB, but nothing seems to work.

Thanks,
Lou

Work with individual trees from bartMachine object

Thank you for the excellent package.

I would like to work with the individual trees from a bartMachine object. Is this possible?

To clarify, let me explain my reason for wanting to do so. It is based on a result in Stefan Wager & Susan Athey (JASA, 2018; link: https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1319839). They show that a conditional mean prediction from a random forest can be interpreted as a kernel-weighted average: kernel weights are assigned to each observation to generate a prediction at point X=x in the covariate space, and these weights equal the share of trees for which a given observation is placed into a leaf that would include X=x.

If I have the individual trees, then I can construct such kernel weights. I would just need the ability to input a covariate value X=x and know, for each tree, what observations fall into the leaf that would generate the prediction for X=x.

Doing so is useful for various reasons (not to get too deep into the weeds, but I am interested in using this for a kernel-based implementation of "trimming bounds" from a paper by David S. Lee (2008, REStat)).
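To make the request concrete, here is a sketch of the kernel weights described above; leaf_id(tree, x) is a hypothetical helper returning the leaf that x falls into (bartMachine does not currently export per-tree leaf membership), so this is R-shaped pseudocode rather than working code:

# Weight of training point i at x: the share of trees in which
# observation i lands in the same leaf that x would fall into.
kernel_weights <- function(trees, X_train, x) {
  w <- numeric(nrow(X_train))
  for (tr in trees) {
    target <- leaf_id(tr, x)  # hypothetical helper
    same <- vapply(seq_len(nrow(X_train)),
                   function(i) leaf_id(tr, X_train[i, ]) == target,  # hypothetical
                   logical(1))
    w <- w + same
  }
  w / length(trees)
}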

Spatial structure in the (regression) residuals

First of all, thank you for creating this package, the documentation is very clear and easy to follow. I apologize in advance for the long post, but I'd like to make my problem as clear as possible.

I have a response variable which is the nighttime light luminosity (called ntl in the data.frame I am using). As you can see in the image below, it has some bright spots (highlighted in red) which are areas with high brightness (outliers in the data.frame).

[image: map of the ntl response, with the bright spots highlighted in red]

Also, here is the histogram of the response. It's clear that the distribution is right-skewed (and possibly there is bimodality?).
[image: histogram of the response]

Purpose of my analysis
My goal is to predict ntl from a coarse spatial resolution down to a finer one. This means I need to preserve the bright spots (the outliers) in the predicted image. Because at a later stage I will downscale the regression residuals using area-to-point kriging, I need them to be random (no spatial structure).

Analysis
Following the approach found in your paper, I created this code:

options(java.parameters = "-Xmx5g")
library("bartMachine")
set_bart_machine_num_cores(3)

# set working directory
wd <- "path/"

# Projected reference system (in order to convert the residuals into a raster image)
provoliko <- "EPSG:24313"

# original df
df <- read.csv(paste0(wd, 'block.data.csv'))

# extract the x and y columns (coordinates) from the df
crds <- df[, 1:2]

# here I keep only the necessary columns for my analysis
keep <- c("ntl", "pop", "agbh", "nir", "ebbi", "ndbi", "road", "pan", "nbai", 
          "tirs")
df <- df[keep]

x <- df[, 2:10]
y <- df[, 1]

bart_machine <- bartMachine(x, y)
bart_machine

The output of the default bartMachine model is:

> bart_machine
bartMachine v1.3.4.1 for regression

training data size: n = 5658 and p = 9 
built in 11.7 secs on 1 core, 50 trees, 250 burn-in and 1000 post. samples

sigsq est for y beforehand: 76.35 
avg sigsq estimate after burn-in: 40.8314 

in-sample statistics:
 L1 = 21551.53 
 L2 = 208688.47 
 rmse = 6.07 
 Pseudo-Rsq = 0.83
p-val for shapiro-wilk test of normality of residuals: 0 
p-val for zero-mean noise: 0.98127

Using the function plot_convergence_diagnostics(bart_machine), the result is:
[image: convergence diagnostics plots]

Again, for a "good" model I would like to see a symmetric scatter of points around the horizontal line at zero, indicating random deviations of the predictions from the observed values, but the plot above shows that this isn't the case.

Moreover, a map of the residuals is shown below. In red, I highlighted the areas where I believe the model didn't perform well (you can compare these areas to those in the first image above).
[image: map of the model residuals]
As you can see, there is clearly a spatial structure (i.e., the residuals do not show a random pattern).

My question is: is there a way to tell BART to treat the outliers (i.e., the bright spots in the study area) as "more important" when modelling the NTL? What are your recommendations?

Because the csv I'm using has several thousand rows, I can share it via a link, from here. Just so you know, running a model with default parameters takes less than 30 seconds on my laptop (8GB of RAM, 4-core Intel(R) Core(TM) i5-4210U CPU @ 1.70GHz).

Lastly, I tried a model computed by the bartMachineCV function (as well as variable selection), but the results are no better.

Are multiclass problems coming soon?

This is an enhancement request - not an Issue.

I was wondering if you were still planning on implementing multi-class models. The following quote is from Kapelner A, Bleich J. bartMachine: Machine Learning with Bayesian Additive Regression Trees. J. Stat. Soft. [Internet]. 2016 Apr. 4;70(4):1-40.

"2.3. BART for classification
BART can easily be modified to handle classification problems for categorical response variables. In Chipman et al. (2010), only binary outcomes were explored but recent work has extended BART to the multiclass problem (Kindo, Wang, and Peña 2013). Our implementation handles binary classification and we plan to implement multiclass outcomes in a future release.
"

(emphasis mine)

I currently have to switch to an alternative package, with substantially different formatting requirements, output, etc. It would be wonderful to have that capacity in bartMachine.

"seed = " doesn't seem to work with k_fold_cv

Hello,

Thanks for the amazingly great package! I'm loving it, and able to do lots of useful things.

I seem to have found a minor bug related to creating reproducible results.

When I run k_fold_cv several times with the same seed, I get different results. I assumed seed = would work the same way it does in build_bart_machine.

Example below (ignore the poor classification results - those are unrelated to the issue)

oos_stats <- k_fold_cv(xx, ww, k_folds = 5, seed = 82002)
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...

oos_stats$confusion_matrix
           predicted 1 predicted 0 model errors
actual 1         6.000     129.000        0.956
actual 0         8.000    1915.000        0.004
use errors       0.571       0.063        0.067

Then run again....

oos_stats <- k_fold_cv(xx, ww, k_folds = 5, seed = 82002)
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...
.predicting probabilities where "1" is considered the target level...

oos_stats$confusion_matrix
           predicted 1 predicted 0 model errors
actual 1           4.0     131.000        0.970
actual 0           6.0    1917.000        0.003
use errors         0.6       0.064        0.067
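A hedged note in case it helps triage: the seed argument is documented to give reproducible bartMachine fits only on a single core, and the fold assignment in k_fold_cv presumably comes from R's own RNG, so a fully pinned-down run would look something like:

set_bart_machine_num_cores(1)  # seeds are only honored on one core
set.seed(82002)                # pins the fold split drawn by R's RNG
oos_stats <- k_fold_cv(xx, ww, k_folds = 5, seed = 82002)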

Segmentation fault installing bartMachineJARs in R 3.4.0

I'm getting a segmentation fault while installing bartMachineJARs in R 3.4.0:

> install.packages("bartMachineJARs", repos="https://cran.r-project.org")
Installing package into ‘/home/aorth/R/x86_64-pc-linux-gnu-library/3.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.r-project.org/src/contrib/bartMachineJARs_1.0.tar.gz'
Content type 'application/x-gzip' length 3213066 bytes (3.1 MB)
==================================================
downloaded 3.1 MB

* installing *source* package ‘bartMachineJARs’ ...
** package ‘bartMachineJARs’ successfully unpacked and MD5 sums checked
** R
** inst
** preparing package for lazy loading
** help
No man pages found in package  ‘bartMachineJARs’
*** installing help indices
** building package indices
** testing if installed package can be loaded
sh: line 1:  5502 Segmentation fault      '/export/apps/R/3.4.0/lib64/R/bin/R' --no-save --slave 2>&1 < '/tmp/Rtmpdjw7VM/file157a539d6b34'
ERROR: loading failed
* removing ‘/home/aorth/R/x86_64-pc-linux-gnu-library/3.4/bartMachineJARs’

The downloaded source packages are in
        ‘/tmp/RtmpYLaVlB/downloaded_packages’
Warning message:
In install.packages("bartMachineJARs", repos = "https://cran.r-project.org") :
  installation of package ‘bartMachineJARs’ had non-zero exit status

rJava and other dependencies are already installed. The environment is CentOS 6 with Java OpenJDK 1.7.0_121.

Note: I just succeeded in installing bartMachine in R 3.3.3, but I'm posting this issue to track the problem with R 3.4.0.

partial dependency on non-quantile scale

Is there a way to get pd_plot() to plot predictor variables on the original scale, i.e., similar to the 'x_quantile' parameter option in ICEbox, rather than only at specified quantiles?

Additionally, can the data frame with the credible interval values be extracted from pd_plot, in order to plot with other plotting packages? The list returned by the pd_plot function doesn't contain the CI values.

Thank you for the help!
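As a hedged workaround in the meantime, a partial-dependence curve on the original scale, with credible bounds, can be assembled by hand from the posterior; bm, X, and the predictor name x1 are placeholders:

# Evaluate on an evenly spaced grid of the raw predictor, averaging
# over the observed values of the other covariates.
grid <- seq(min(X$x1), max(X$x1), length.out = 20)
pd <- t(sapply(grid, function(v) {
  Xg <- X
  Xg$x1 <- v
  post <- bart_machine_get_posterior(bm, new_data = Xg)$y_hat_posterior_samples
  draws <- colMeans(post)  # average over observations, per posterior draw
  c(mean = mean(draws), quantile(draws, c(0.025, 0.975)))
}))
matplot(grid, pd, type = "l", lty = c(1, 2, 2), col = 1,
        xlab = "x1 (original scale)", ylab = "partial effect")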

interaction_investigator()

Hello @kapelner ,

I'm replicating the results from your paper, but I am getting different results for the interaction effects in section 4.11. I think it's because of this loop, specifically the j<=i part.

for (i in 1:bart_machine$p) {
  for (j in 1:bart_machine$p) {
    if (j <= i) {
      avg_counts[iter] = interaction_counts_avg[i, j]
      sd_counts[iter] = interaction_counts_sd[i, j]
      names(avg_counts)[iter] = paste(rownames(interaction_counts_avg)[i], "x",
                                      rownames(interaction_counts_avg)[j])
      iter = iter + 1
    }
  }
}


Could you have a look?

DAG as a way to visualise decisions of the trained BART tree

First of all, thank you for this implementation; it works really well in R and is easy to use as well as efficient.

There are a lot of implemented functions that allow the user to measure the performance of a BART model. Other metrics are normally implemented by the user (e.g., correlation or R2). What about explaining the model? I don't know whether there is an implemented way to draw the final decision tree produced by the BART machine.

For example, I am looking for a way to generate a DAG graph from the decisions made by the final BART tree. This way we could calculate the importance/contribution of each feature to the final model performance, possibly implementing SHAP values as well.

In short, is there a way to draw the BART tree as a DAG or any other way to measure feature contribution?
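For the feature-contribution half of the question, bartMachine does ship split-count-based importance tools; a quick hedged pointer, with bm as a placeholder for a fitted model:

vi <- investigate_var_importance(bm, num_replicates_for_avg = 5)  # inclusion proportions
interaction_investigator(bm, num_replicates_for_avg = 5)          # pairwise interaction counts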

seed?

It looks like there's no way to set bartMachine's random seed from R?

(I don't mean to intrude or anything, so let me know if Github issues aren't the best way to communicate this sort of thing to you.)

install hangs on checking Java runtime

Hi, the installation hangs on

checking whether Java run-time works... yes
checking whether -Xrs is supported...

I can confirm that I have a JDK installed, and I have done the R CMD javareconf step. I installed rJava through apt-get install r-cran-rjava. Do you have any tips on what to look for while debugging?

hBART in main branch?

Hello,

I was wondering whether the heteroskedastic BART (hBART) features are, or will be, available in the main branch. The hBART branch seems very outdated at this point, and I'm not sure whether the features of that branch will work with the more modern features of the bartMachine package (such as visualization and variable selection).

Best,
Jacob

partial plot of binary response

Hi Adam,

For the partial plot of binary responses, is it possible to change the y-axis from probits to probabilities? I find it more useful to display probabilities directly, rather than probits, when I show the partial plot to people.

Best,
Jason
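Until that exists, a hedged workaround: since bart_machine_get_posterior() reports classification draws as probabilities, a partial-dependence curve can be rebuilt directly on the probability scale; bm, X, and x1 are placeholders:

# Probability-scale partial dependence at quantiles of x1
qs <- quantile(X$x1, seq(0.05, 0.95, by = 0.10))
p_hat <- sapply(qs, function(v) {
  Xg <- X
  Xg$x1 <- v
  mean(bart_machine_get_posterior(bm, new_data = Xg)$y_hat_posterior_samples)
})
plot(qs, p_hat, type = "b", xlab = "x1", ylab = "P(Y = 1)")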

bart_machine_get_posterior() won't work with bartMachineArr() lists restored from disk.

Hello,

I'm generating an ensemble model with bartMachineArr() to produce a more robust posterior predictive distribution, and I need to save the model for later use.
When I restore the array, though, only the first model works with bart_machine_get_posterior(); for the others I get:

 Error in check_serialization(bart_machine) : 
  This bartMachine object was loaded from an R image but was not serialized.
  Please build bartMachine using the option "serialize = TRUE" next time.

I guess the serialize argument of bartMachine doesn't get passed to the other models, or some connection is lost.

Here's the dummy code to produce the model:

n_models <- 5

model <- bartMachine(X = X, y = y, serialize = TRUE, ...)

if (n_models > 1) {
	model <- bartMachineArr(model, R = n_models)
} else {
	model <- list(model)
}

readr::write_rds(model, 'model.rds', compress = 'gz')

And to produce averaged predictive posteriors:

pred_post <- bart_machine_get_posterior(model[[1]], new_data = data)$y_hat_posterior_samples

if (n_models > 1) {
	for (i in 2:n_models) {
		pred_post <- pred_post + bart_machine_get_posterior(model[[i]], new_data = data)$y_hat_posterior_samples
	}
}

pred_post <- pred_post / n_models
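If the guess above is right and bartMachineArr() drops the serialize flag, a hedged workaround is to build the replicates manually so each one is serialized; the per-model seed is illustrative:

# Each replicate is an independent chain; serialize = TRUE lets each
# one survive a save/restore cycle.
model <- lapply(seq_len(n_models), function(i) {
  bartMachine(X = X, y = y, serialize = TRUE, seed = i)
})
readr::write_rds(model, 'model.rds', compress = 'gz')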

My session info:

R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] it_IT.UTF-8/it_IT.UTF-8/it_IT.UTF-8/C/it_IT.UTF-8/it_IT.UTF-8

attached base packages:
[1] grid      parallel  stats     graphics  grDevices utils     datasets  methods  
[9] base     

other attached packages:
 [1] knitr_1.31          tidytrees_0.2.2     pROC_1.17.0.1       magrittr_2.0.1     
 [5] pbmcapply_1.5.0     pbapply_1.4-3       readr_1.4.0         glue_1.4.2         
 [9] bartMachine_1.2.6   missForest_1.4      itertools_0.1-3     iterators_1.0.13   
[13] foreach_1.5.1       randomForest_4.6-14 bartMachineJARs_1.1 rJava_0.9-13       
[17] bayestestR_0.8.2    partykit_1.2-11     mvtnorm_1.1-1       libcoin_1.0-7      
[21] readxl_1.3.1        stringr_1.4.0       dplyr_1.0.6        

loaded via a namespace (and not attached):
 [1] pkgload_1.1.0      splines_4.0.5      Formula_1.2-4      assertthat_0.2.1  
 [5] pander_0.6.3       cellranger_1.1.0   yaml_2.2.1         remotes_2.2.0     
 [9] sessioninfo_1.1.1  pillar_1.6.0       backports_1.2.1    lattice_0.20-41   
[13] digest_0.6.27      pryr_0.1.4         checkmate_2.0.0    htmltools_0.5.1.1 
[17] Matrix_1.3-2       plyr_1.8.6         pkgconfig_2.0.3    devtools_2.3.2    
[21] magick_2.6.0       purrr_0.3.4        processx_3.5.2     tibble_3.1.1      
[25] generics_0.1.0     usethis_2.0.0      ellipsis_0.3.2     cachem_1.0.4      
[29] withr_2.4.1        cli_2.5.0          survival_3.2-10    crayon_1.4.1      
[33] memoise_2.0.0      evaluate_0.14      ps_1.6.0           fs_1.5.0          
[37] fansi_0.4.2        pkgbuild_1.2.0     rapportools_1.0    tools_4.0.5       
[41] prettyunits_1.1.1  hms_1.0.0          matrixStats_0.58.0 lifecycle_1.0.0   
[45] callr_3.5.1        compiler_4.0.5     inum_1.0-2         tinytex_0.30      
[49] rlang_0.4.11       base64enc_0.1-3    rmarkdown_2.7      testthat_3.0.1    
[53] codetools_0.2-18   DBI_1.1.1          R6_2.5.0           lubridate_1.7.10  
[57] fastmap_1.1.0      utf8_1.2.1         rprojroot_2.0.2    insight_0.12.0    
[61] desc_1.3.0         stringi_1.6.1      Rcpp_1.0.6         vctrs_0.3.8       
[65] rpart_4.1-15       tidyselect_1.1.1   xfun_0.22

interaction constraints

I was wondering if it would be possible to add interaction constraints to bartMachine. These have recently been added to xgboost (link). Having interaction constraints would be very handy for fitting models where some variables are held out from the rest of the response surface, e.g., y ~ f(x1) + f(x2, x3, x4, x5, ...).

Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space

When I tried to run bartMachine on a dataset with 350,000 observations and 13 features, I got the following message:

building BART with mem-cache speedup...
Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space
done building BART in 1.513 sec

burning and aggregating chains from all threads...
at gnu.trove.list.array.TDoubleArrayList.<init>(TDoubleArrayList.java:91)
at gnu.trove.list.array.TDoubleArrayList.<init>(TDoubleArrayList.java:79)
at bartMachine.bartMachineTreeNode.propagateDataByChangedRule(Unknown Source)
at bartMachine.bartMachine_g_mh.doMHGrowAndCalcLnR(Unknown Source)
at bartMachine.bartMachine_g_mh.metroHastingsPosteriorTreeSpaceIteration(Unknown Source)
done
at bartMachine.bartMachine_e_gibbs_base.SampleTree(Unknown Source)
at bartMachine.bartMachine_e_gibbs_base.DoOneGibbsSample(Unknown Source)
at bartMachine.bartMachine_e_gibbs_base.DoGibbsSampling(Unknown Source)
at bartMachine.bartMachine_e_gibbs_base.Build(Unknown Source)
at bartMachine.bartMachineRegressionMultThread$1.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:695)
evaluating in sample data...
Error in .jarray(model_matrix_training_data, dispatch = TRUE) :
java.lang.OutOfMemoryError: Java heap space

Any suggestions on how to solve this problem?
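Beyond giving the JVM more heap before loading the package, the trace shows the mem-cache speedup is active; disabling it is a hedged suggestion that trades speed for a much smaller memory footprint:

options(java.parameters = "-Xmx10g")  # the figure is an assumption; must precede library(bartMachine)
library(bartMachine)
bm <- bartMachine(X, y, mem_cache_for_speed = FALSE)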

Exporting CIs

Hi there. First, thanks a lot for this great package :)

Second, I was wondering if there's a way to save/access the confidence bounds created by the pd_plot command? I would like to customise the output in ggplot before exporting, but pd_plot only saves quantiles and posterior means, not the confidence bounds shown in the graph.

Any suggestions would be very welcome. Many thanks in advance!
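A hedged pointer while you wait: pd_plot() returns its computed pieces invisibly, so capturing the return value may be enough; the element name below is an assumption about the internals (and "x1" is a placeholder predictor), so inspect str() first:

pd <- pd_plot(bart_machine, "x1")
str(pd)  # see what your version actually returns
# If per-Gibbs-sample averages are present, the plotted bounds can be
# recomputed as posterior quantiles at each x-quantile:
ci <- apply(pd$bart_avg_predictions_by_quantile_by_gibbs, 2,
            quantile, probs = c(0.025, 0.975))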

Multinomial responses

bartMachine is fantastic and has much nicer features than the other BART implementations available in R. The one thing it's missing is the capacity to handle multinomial outcomes. Is that on the roadmap?

Trailing posterior samples of sigma are zero

The vector of posterior samples of the error variance returned by get_sigsqs contains many zeroes. Minimal example:

library(bartMachine)
n = 50
X = data.frame(cippa=rnorm(n), lippa=rnorm(n), pasqualino=rnorm(n))
y = rnorm(n)
bm = bartMachine(X, y)
ss = get_sigsqs(bm, plot_hist=T)
print(ss)

Output:

Loading required package: rJava
Loading required package: bartMachineJARs
Loading required package: randomForest
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Loading required package: missForest
Welcome to bartMachine v1.3.2! You have 0.54GB memory available.

If you run out of memory, restart R, and use e.g.
'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
'library(bartMachine)'.

bartMachine initializing with 50 trees...
bartMachine vars checked...
bartMachine java init...
bartMachine factors created...
bartMachine before preprocess...
bartMachine after preprocess... 3 total features...
bartMachine sigsq estimated...
bartMachine training data finalized...
Now building bartMachine for regression...
evaluating in sample data...done
   [1] 0.6738567 0.6823280 0.3963788 0.5420501 0.5526897 0.4949011 0.6561506
   [8] 0.4309214 0.6297360 0.6574286 0.8740465 0.6049023 0.7190190 0.6186070
  [15] 0.4147947 0.5096590 0.3854179 0.3236273 0.6343288 0.7104224 0.5492379
  [22] 0.5403672 0.7041673 0.5576870 0.6404842 0.4569708 0.7032812 0.6057967
  [29] 0.4068528 0.6529362 0.4497582 0.5405523 0.5146371 0.5606146 0.7231270
  [36] 0.4251352 0.5017961 0.4039385 0.4185928 0.5516639 0.5349040 0.6444614
  [43] 0.4168752 0.4755697 0.4472841 0.6164776 0.5950383 0.6032409 0.4814202
  [50] 0.3817277 0.4124045 0.6495234 0.6515946 0.6564860 0.5243605 0.4625689
  [57] 0.5676971 0.5400568 0.3962279 0.3540788 0.4758006 0.4983467 0.4925431
  [64] 0.5626715 0.9368755 0.7203849 0.4467512 0.5900052 0.2964828 0.6399533
  [71] 0.4715198 0.7480451 0.5043543 0.5348309 0.4169205 0.3890896 0.5225142
  [78] 0.5893633 0.6248662 0.4411586 0.5316090 0.5906821 0.6600916 0.6358257
  [85] 0.5430974 0.4191855 0.4234045 0.6401893 0.6160910 0.6149260 0.5239037
  [92] 0.7346262 0.7927698 0.8559771 0.8564749 0.6399547 0.6617545 0.4862764
  [99] 0.5218394 0.5361528 0.4645625 0.4862391 0.4249653 0.4966983 0.5722455
 [106] 0.5756632 0.5889968 0.7468448 0.7112704 0.4752701 0.4422910 0.6224502
 [113] 0.7478145 0.6917788 0.6593739 0.5041079 0.5702368 0.4908382 0.5388601
 [120] 0.6565747 0.7446141 0.4281959 0.8973551 0.5082524 0.6022598 0.6682022
 [127] 0.6210314 0.6441824 0.4827757 0.7639993 0.4104385 0.8480981 0.7278081
 [134] 0.6674551 0.7050705 0.5499230 0.7574979 0.6489151 0.7373134 0.5471537
 [141] 0.5827605 0.5526380 0.5107312 0.4410340 0.4361805 0.3881677 0.6540108
 [148] 0.4434175 0.5201778 0.7684820 0.6036935 0.7783705 0.8112201 0.5085767
 [155] 0.4166957 0.5891744 0.8272326 0.8059974 0.6039739 0.4926725 0.5685766
 [162] 0.4819520 0.4345115 0.7241730 0.5001127 0.6093101 0.8074775 0.6211340
 [169] 0.7598558 0.6495594 0.5982428 0.6298588 0.7029029 0.5206628 0.6280212
 [176] 0.5671791 0.4642438 0.9423288 0.6641100 0.5236050 0.4615422 0.5714215
 [183] 0.6319731 0.5353613 0.4966538 0.5876032 0.6829575 0.5461618 0.3516722
 [190] 0.4463553 0.4113644 0.7175616 0.7268501 0.9897334 0.5659359 0.5467450
 [197] 0.3853242 0.4799703 0.4543558 0.3864065 0.3867739 0.4059116 0.4904520
 [204] 0.4990398 0.5829876 0.6681405 0.5245365 0.4816886 0.7247148 0.4489095
 [211] 0.4673745 0.5346889 0.5267316 0.5896845 0.7151791 0.4212330 0.6294356
 [218] 0.7690222 0.6634902 0.6094897 0.5036922 0.5318404 0.4286724 0.4636125
 [225] 0.3526284 0.4528986 0.3979473 0.6440758 0.4455897 0.4236689 0.5220958
 [232] 0.5161978 0.6882160 0.5583662 0.6369836 0.4804866 0.5249673 0.3036010
 [239] 0.3186854 0.3680794 0.3473360 0.3445205 0.3993789 0.6134708 0.5605042
 [246] 0.4203016 0.5262919 0.7055306 0.4151140 0.3950877 0.4945300 0.3403064
 [253] 0.6336636 0.5760601 0.5937295 0.7816578 0.5743999 0.5130912 0.3211351
 [260] 0.4333243 0.4532728 0.8836527 0.5542467 0.5189086 0.3853733 0.5863797
 [267] 0.6456744 0.5490664 0.6830404 0.5591154 0.6707948 0.7266904 0.8191645
 [274] 0.6519134 0.4418197 0.6754014 0.6498065 0.6172073 0.6850096 0.7220211
 [281] 0.4933638 0.3720437 0.5880862 0.4299151 0.5928170 0.6562778 0.8242513
 [288] 0.4299219 0.4942766 0.5546288 0.4497238 0.5144395 0.8645564 0.5358512
 [295] 0.6815332 0.4826089 0.5361426 0.6610768 0.4361343 0.7495521 0.8360831
 [302] 0.6435257 0.4058136 0.3989772 0.7589361 0.6532333 0.6084130 0.6920568
 [309] 0.4990992 0.7679044 0.6186806 0.7343711 0.7986024 0.5366567 0.4798369
 [316] 0.6301960 0.4961707 0.5888135 0.4025134 0.5232452 0.6341495 0.5325702
 [323] 0.5920682 0.4763510 0.7678790 0.3672965 0.8628438 0.5431980 0.5667032
 [330] 0.4591356 0.5990987 0.5662614 0.4698281 0.6696803 0.5874177 0.5867314
 [337] 0.5061874 0.4649711 0.6824143 0.4890839 0.5424590 0.3392641 0.3994007
 [344] 0.3851744 0.5515101 0.5159149 0.4030643 0.6579546 0.5439394 0.4285955
 [351] 0.5869903 0.4381423 0.8446627 0.3845370 0.5609957 0.4207567 0.6653356
 [358] 0.4841911 0.4964988 0.4404760 0.5421151 0.3389921 0.7477354 0.9128842
 [365] 0.6247109 0.4614823 0.6427378 0.4976938 0.7045662 0.4602396 0.4469913
 [372] 0.5559259 0.5829157 0.4815411 0.4331116 0.6426673 0.0000000 0.0000000
 [379] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
 ... (entries [379] through [1000] are all 0.0000000; the repeated lines are omitted here) ...
building BART with mem-cache speedup...
Iteration 100/1250
Iteration 200/1250
Iteration 300/1250
Iteration 400/1250
Iteration 500/1250
Iteration 600/1250
Iteration 700/1250
Iteration 800/1250
Iteration 900/1250
Iteration 1000/1250
Iteration 1100/1250
Iteration 1200/1250
done building BART in 0.217 sec 

burning and aggregating chains from all threads... done


Versions:

  • macOS 13.1, M1 processor
  • R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid" Platform: aarch64-apple-darwin20 (64-bit)
  • openjdk-19.0.1_macos-aarch64_bin
  • bartMachine 1.3.2 installed from CRAN

bartMachine in python?

Hello,
This is quite an interesting package, with some important and fast results. Is there an implementation in Python, or an efficient alternative in Python?

Thank you in advance :)

Usage in Matlab

First of all, I would like to thank you for your continued support of this software.

I would like to use this package in MATLAB for a regression problem. Since MATLAB has an embedded Java runtime, I thought this would be easy to accomplish, since bartMachine is implemented in Java. However, it did not go as planned. For testing, I used a deterministic scalar function of two variables, which bartMachine should approximate. The prediction does not approximate the desired function at all; instead, the predicted points on the domain seem to follow a parabolic function. If you could give me some pointers on how to fix this issue, I would be highly grateful.

The following code was used as a test in Matlab R2022a:

% load bartMachine java dependencies
bartJARs = dir('path-to-bartMachine-folder/**/inst/java/*.jar');
for ii = 1:length(bartJARs)
    javaaddpath(strcat(bartJARs(ii).folder, '\', bartJARs(ii).name));
end
rng(12341234);
[X, Y] = ndgrid(-10:0.5:10,-10:0.5:10);
f = @(x,y) x.^3 + x.*y + y.^3;
Z = f(X,Y);
mesh(X,Y,Z)
Feature = [X(:),Y(:)];
Label = Z(:);
Combined = [Feature,Label];
bart = bartMachine.bartMachineRegressionMultThread;
alist = java.util.ArrayList;
for ii = 1:size(Combined,1)
    alist.add(Combined(ii,:));
end
bart.setData(alist);
bart.Build();
Label_pred = zeros(size(Label));
for ii = 1:size(Label_pred,1)
    Label_pred(ii) = bart.Evaluate(Feature(ii,:));
end
figure;
scatter3(Feature(:,1), Feature(:,2), Label_pred, '.');

Survival outcome

Hi there,

Is there any way to analyze survival outcomes using the package currently?

Thanks!

Thinning?

Have you thought about implementing thinning of the iterations? I'm thinking this would be a parameter that lets you specify that only every nth iteration (e.g., n = 10) is saved after burn-in.

I've only played with your implementation a little bit, but in a toy example, 10k post-burn-in iterations got me an effective sample size of around 250. I'd be almost as well off (and my memory would be much better off) keeping every 10th iteration.

(P.S. Thanks for doing this. I'm really excited about bartMachine!)
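Until thinning is built in, a hedged sketch of doing it after the fact on the posterior draws (bm and X are placeholders); note this only reduces downstream memory, not the memory used during sampling:

post <- bart_machine_get_posterior(bm, new_data = X)$y_hat_posterior_samples
keep <- seq(1, ncol(post), by = 10)  # every 10th post-burn-in draw
post_thinned <- post[, keep]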

R Fatal Error on Load bartMachineJARs

I've been using bartMachine for a while now, but I recently tried to install it in a fresh install of R, and in the R GUI, when I try to load the bartMachine package (or even just the bartMachineJARs package), I get an immediate fatal error with no output. I'm using macOS 10.12.6. Here is my sessionInfo output:

Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0    yaml_2.1.19   

The last error in my log file is below, but I don't think it relates to this because the timing is off (it's currently 13:41 and the last log file entry is labeled today at 17:25):

09 Jul 2018 17:25:55 [rsession-david] ERROR Unexpected exception: boost: mutex lock failed in pthread_mutex_lock: Invalid argument; LOGGED FROM: void rstudio::session::ClientEventService::run() /Users/vagrant/workspace/IDE/macos/src/cpp/session/SessionClientEventService.cpp:351

Now, the really strange thing is that when I open a terminal and start R in the terminal, it will load bartMachine just fine.

MacBook-Pro:~ david$ R

R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Previously saved workspace restored]

> library(bartMachine
+ )
Loading required package: rJava
Loading required package: bartMachineJARs
Loading required package: car
Loading required package: carData
Loading required package: randomForest
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Loading required package: missForest
Loading required package: foreach
Loading required package: itertools
Loading required package: iterators
Welcome to bartMachine v1.2.3! You have 0.54GB memory available.

If you run out of memory, restart R, and use e.g.
'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
'library(bartMachine)'.

I tried installing the previous version of both bartMachine and bartMachineJARs, both to no avail. In the interest of completeness, here's what happens when I run javareconf (which I had previously done as root).

MacBook-Pro:~ david$ R CMD javareconf
Java interpreter : /usr/bin/java
Java version     : 10.0.1
Java home path   : /Library/Java/JavaVirtualMachines/jdk-10.0.1.jdk/Contents/Home
Java compiler    : /usr/bin/javac
Java headers gen.: /usr/bin/javah
Java archive tool: /usr/bin/jar

trying to compile and link a JNI program 
detected JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/darwin
detected JNI linker flags : -L$(JAVA_HOME)/lib/server -ljvm
clang -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I/Library/Java/JavaVirtualMachines/jdk-10.0.1.jdk/Contents/Home/include -I/Library/Java/JavaVirtualMachines/jdk-10.0.1.jdk/Contents/Home/include/darwin  -I/usr/local/include   -fPIC  -Wall -g -O2  -c conftest.c -o conftest.o
clang -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o conftest.so conftest.o -L/Library/Java/JavaVirtualMachines/jdk-10.0.1.jdk/Contents/Home/lib/server -ljvm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation


JAVA_HOME        : /Library/Java/JavaVirtualMachines/jdk-10.0.1.jdk/Contents/Home
Java library path: $(JAVA_HOME)/lib/server
JNI cpp flags    : -I$(JAVA_HOME)/include -I$(JAVA_HOME)/include/darwin
JNI linker flags : -L$(JAVA_HOME)/lib/server -ljvm
Updating Java configuration in /Library/Frameworks/R.framework/Resources
Done.

This isn't a deal breaker, but it would be nice to have it working in the GUI or RStudio.

Probabilities predicted for which class?

When I predict probabilities, I'm getting probabilities for the opposite class of what I'm expecting. Example:

library(bartMachine)
data(Sonar, package = "mlbench")
model = bartMachine(Sonar[-61], Sonar$Class)
classes = predict(model, new_data = Sonar[-61], type = "class")
probs = predict(model, new_data = Sonar[-61], type = "prob")
levels(Sonar$Class)

I'm getting something like

[1] R R R R R R R M R R R R ...
Levels: M R

for classes and for probs

[1] 0.61762749 0.63869063 0.51221708 ...

So the probabilities are for the "R" class, which is the second class in the level set. I would expect probabilities for the first class.

Changing the level set before giving the data to bartMachine doesn't seem to make a difference.
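A hedged way to verify empirically which level the returned probabilities track, using the same Sonar fit:

probs <- predict(model, new_data = Sonar[-61], type = "prob")
# Correlation with each level's indicator reveals which class the
# probabilities refer to.
cor(probs, as.numeric(Sonar$Class == "M"))
cor(probs, as.numeric(Sonar$Class == "R"))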

Memory issue happened when using bartMachine with foreach

Hi, I met a memory issue when I ran bartMachine in parallel using the function 'foreach'. Sample code looks like this:

options(java.parameters = '-Xmx5g')
library(bartMachine)
bart = bartMachine(X, Y)
result = foreach(i = 1:n, .combine = c, .packages = c('bartMachine')) %dopar% {
  newX = …
  predict(bart, newX)
}

The execution hangs after entering the foreach loop, with an error message saying 'java.lang.OutOfMemoryError: Java heap space', no matter how much memory I set at the beginning.

No error is returned when I remove predict(bart, newX) from the foreach loop. No error is returned when I change the foreach loop to a regular for loop. I feel like there is some conflict between bartMachine and foreach. Can you help me with this? Thank you!
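A hedged guess at the mechanism: with %dopar% each worker process starts its own JVM, which never sees the java.parameters set in the master session, and the model's Java pointer does not transfer between processes unless it was built with serialize = TRUE. A sketch of setting up the workers accordingly (the cluster size and heap figure are illustrative):

library(parallel)
library(doParallel)

cl <- makeCluster(4)
clusterEvalQ(cl, {
  options(java.parameters = '-Xmx5g')  # before bartMachine loads on the worker
  library(bartMachine)
})
registerDoParallel(cl)
bart <- bartMachine(X, Y, serialize = TRUE)  # so the model survives the transfer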

Problem with plot_y_vs_yhat

Hello,

I was trying to reproduce the example in the package vignette, but when I use the function plot_y_vs_yhat with the argument prediction_intervals = TRUE, I receive the following error:

Error in credible_intervals[, 1] : incorrect number of dimensions 

I also tried with another set of data, and the same thing happens.

Error code when using bartMachine

I have been attempting to create a PS model using bartMachine. My code is as follows:

new_ps_model <- bartMachine(X = data %>%
                              dplyr::select(colnames(bart_ps_model$X)),
                            y = data[, "drugclass"] %>% unlist(),
                            num_trees = 50,
                            use_missing_data = TRUE,
                            num_burn_in = 10,
                            num_iterations_after_burn_in = 10,
                            serialize = FALSE)
When this function runs, I get the following error:

Error in if (xi > xj) 1L else -1L : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In xtfrm.data.frame(x) : cannot xtfrm data frames
2: In Ops.factor(xi, xj) : '>' not meaningful for factors

Not sure exactly where to go with this - I went through my data, and while there are some missing values in the columns of X, there are no missing values in y. I would appreciate any recommendations for next steps.
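A hedged debugging step, since the messages point at factor comparisons rather than missing data: check what actually reaches the y argument (with tibbles, data[, "drugclass"] stays a one-column data frame, which unlist() turns into a named vector):

y_check <- unlist(data[, "drugclass"])
str(y_check)    # should be a plain numeric vector or a two-level factor
anyNA(y_check)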

Negative probabilities in posterior for classification problem

I am running a dichotomous classification problem using bartMachine.

When I check the posterior (example below) using bart_machine_get_posterior I see some negative values.

[image: posterior samples from bart_machine_get_posterior showing negative values]

According to the help, the units are probabilities (e.g., not probits).

"y_hat_posterior_samples The full set of posterior samples of size num_iterations_after_burn_in for each observation. For regression, the estimates have the same units as the response. For classification, the estimates are probabilities."

Thus, negative values should be impossible. Any thoughts on what's happening here? It's not an isolated case.
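A hedged sanity check quantifying the issue on a fit of your own, with bm and X as placeholders:

post <- bart_machine_get_posterior(bm, new_data = X)$y_hat_posterior_samples
range(post)     # documented to be probabilities, so should lie in [0, 1]
mean(post < 0)  # fraction of draws that are negative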

Problem with missing data

Running the code directly from the vignette, I get the following error when attempting to fit the model with missing covariates.

Error in .jcall(java_bart_machine, "V", "addTrainingDataRow", as.character(model_matrix_training_data[i,  : 
java.lang.NullPointerException

Specifically, I ran

library("bartMachine")
options(java.parameters="-Xmx1000m")
set_bart_machine_num_cores(4)
y <- automobile$log_price
X <- automobile; X$log_price <- NULL
bart_machine <- bartMachine(X=X, y=y, use_missing_data = TRUE,
                             use_missing_data_dummies_as_covars = TRUE)

This is particularly confusing because I ran this code on old versions of the package as well (and got the same error), so I'm unsure whether this is a problem with the package or a problem with my install. For reference, this issue also appears here.

v1.2.5.1 cannot read v1.2.5.0 object

I use bartMachine together with caret. As recommended, I always set the option serialize to TRUE in train. However, I noticed that bartMachine v1.2.5.1 is not able to use a model created by bartMachine v1.2.5.0. I receive the following error message:

Error in check_serialization(object) :
This bartMachine object was loaded from an R image but was not serialized.
Please build bartMachine using the option "serialize = TRUE" next time.

Edit
This is reproducible with the example from ?bartMachine

library(bartMachine)  ### Here version 1.2.5.0 is mandatory

## Generate Friedman data
set.seed(11)
n  = 200 
p = 5
X = data.frame(matrix(runif(n * p), ncol = p))
y = 10 * sin(pi* X[ ,1] * X[,2]) +20 * (X[,3] -.5)^2 + 10 * X[ ,4] + 5 * X[,5] + rnorm(n)

## Build BART regression model
bart_machine = bartMachine(X, y, serialize = TRUE)
summary(bart_machine)

bartMachine v1.2.3 for regression

training data n = 200 and p = 5 
built in 0.8 secs on 1 core, 50 trees, 250 burn-in and 1000 post. samples

sigsq est for y beforehand: 7.281 
avg sigsq estimate after burn-in: 0.76294 

in-sample statistics:
 L1 = 104.28 
 L2 = 92.58 
 rmse = 0.68 
 Pseudo-Rsq = 0.9801
p-val for shapiro-wilk test of normality of residuals: 0.0258 
p-val for zero-mean noise: 0.41082 


## Save the bartMachine object
saveRDS(bart_machine, "~/bartMachine_version1.2.5.0.rda")

Now we upgrade bartMachine to version 1.2.5.1

bart_machine <- readRDS("~/bartMachine_version1.2.5.0.rda")
summary(bart_machine)

bartMachine v1.2.5.1 for regression

training data n = 200 and p = 5
built in 0.8 secs on 1 core, 50 trees, 250 burn-in and 1000 post. samples
Error in .jcall(bart_machine$java_bart_machine, "[D", "getGibbsSamplesSigsqs") :
RcallMethod: attempt to call a method of a NULL object.

y_hat = predict(bart_machine, X)

Error in check_serialization(object) :
This bartMachine object was loaded from an R image but was not serialized.
Please build bartMachine using the option "serialize = TRUE" next time.

structure in the response variable

Hi,

This is probably a very naive question, but I have multiple datasets from different sites that might have different properties. My response variable and my explanatory variables are the same in all datasets. I wonder how I can take the site information into account.
If I split by site, I might lose some sites, as some have less information than others.

Thanks!

Nico
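One common approach, sketched here under the assumption that the datasets can be stacked with a site identifier: include the site as a categorical covariate, so BART can learn site-specific structure without splitting the data (column names are illustrative):

df_all <- do.call(rbind, list_of_site_dfs)
X <- df_all[, predictor_names]
X$site <- factor(df_all$site)  # bartMachine dummifies factors internally
bm <- bartMachine(X, df_all$response)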

bartMachine doesn't look in its local environment for X and y

Sorry - I might not be using the terminology correctly. Maybe this is an R issue (I'm a user, not a programming expert). However, this behavior differs from any other function I have used in R (base or packages).

The issue is that when I call bartMachine(X, y) from inside a function, it pulls X and y from the global environment - not the values of X and y that I passed it.

Here is a reproducible example:

Basically, the first time through, I make sure x and y are not present in the global environment. I then check that x and y are present inside my function (i.e., I didn't mess up passing them). But when bartMachine(x, y) is called, it says it can't find x.

Then I define x and y in the global environment, make the same call, and voila.

library(tidyverse)
library(bartMachine)

data(mtcars)
vlist <- c("cyl", "disp")

testmod <- function(d, t, v){
  x <- d[v]
  y <- d[[t]]

  cat("Is x here? Yes - here is x[1,1] ", x[1,1])
  cat("\n")
  cat("Is y here? Yes, here is y[1] ", y[1])
  cat("\n")

  mod <- bartMachine(X = x, y = y)

  return(mod)
}

rm(x, y)
#> Warning in rm(x, y): object 'x' not found
#> Warning in rm(x, y): object 'y' not found

modwt <- testmod(mtcars, "wt", vlist)
#> Is x here? Yes - here is x[1,1] 6
#> Is y here? Yes, here is y[1] 2.62
#> bartMachine initializing with 50 trees...
#> Error in (function (X = NULL, y = NULL, Xy = NULL, num_trees = 50, num_burn_in = 250, : object 'x' not found

x <- mtcars[c("cyl", "disp")]
y <- mtcars[["wt"]]

modwt <- testmod(mtcars, "wt", vlist)
#> Is x here? Yes - here is x[1,1] 6
#> Is y here? Yes, here is y[1] 2.62
#> bartMachine initializing with 50 trees...
#> bartMachine vars checked...
#> bartMachine java init...
#> bartMachine factors created...
#> bartMachine before preprocess...
#> bartMachine after preprocess... 2 total features...
#> bartMachine sigsq estimated...
#> bartMachine training data finalized...
#> Now building bartMachine for regression...
#> evaluating in sample data...done
Created on 2023-02-27 with reprex v2.0.2

Predictions on a large dataset are very slow

Hello,

First of all, many compliments for bartMachine, a really nice implementation of a wonderful algorithm.

I am using BART for potential-outcomes causal inference, which is based on comparing Yhat at the observation level, predicted after assigning specific values to a variable X for all observations while keeping the other covariates Z fixed at their original values. (https://nyuscholars.nyu.edu/en/publications/bayesian-nonparametric-modeling-for-causal-inference)

The problem is that my dataset is very large [27358 x 224], and predictions made with bart_machine_get_posterior simply take forever. Since I'll need to do this for 224 variables, with multiple evaluated values each, the analysis would take days.

Reading through the issues, I saw that the Array version of BART would fix memory issues, but would it also fix speed ones? Or is there any setting in bartMachine that would make predictions faster? Considering that my problem is prediction time rather than estimation time (fitting my model takes ~30 mins), is there a way to shift the balance toward faster prediction?

The only alternative solution I could think of is to build the model on 2/3 of the dataset and estimate the variable effects on the other third.

Here are the arguments I use for the model:

bartMachine(X = X, y = Y,
            verbose = T,
            num_trees = 200,
            num_iterations_after_burn_in = 5000,
            run_in_sample = F,
            mem_cache_for_speed = F, # Otherwise it crashes
            use_missing_data = T, serialize = save)
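One hedged lever, since prediction cost scales with the number of kept posterior draws: num_iterations_after_burn_in = 5000 makes every bart_machine_get_posterior call return a 27358 x 5000 matrix, so a smaller number of kept draws shrinks both time and memory roughly proportionally:

bartMachine(X = X, y = Y,
            verbose = TRUE,
            num_trees = 200,
            num_iterations_after_burn_in = 1000,  # an assumption: enough for stable ITE quantiles
            run_in_sample = FALSE,
            mem_cache_for_speed = FALSE,
            use_missing_data = TRUE, serialize = save)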

And this is the code I use to estimate the Individual Treatment Effect (maybe some speedup is possible here too):

compute_BART_ITE <- function(bart.mod, data = NULL, vars = NULL, quants = c(.1, .3, .5, .7, .9)) {

	if (is.null(vars)) vars <- bart.mod$X %>% colnames()

	data <- if (is.null(data)) bart.mod$X else data %>% select(any_of(vars))

	lapply(vars, function(V) {
		print(glue("{which(V %in% vars)}/{length(vars)}: {V}"))

		if (n_distinct(data[[V]]) > 5 & is.numeric(data[[V]])) {
			pred.val <- quantile(data[[V]], quants, na.rm = T) %>% sort %>% signif(3)
		} else pred.val <- unique(data[[V]]) %>% sort

		data[[V]] <- pred.val[1]
		tictoc::tic('Computed reference matrix')

		ref.matrix <- bart_machine_get_posterior(bart.mod, new_data = data)$y_hat_posterior_samples

		tictoc::toc()

		pblapply(pred.val[-1], function(val) {
			data[[V]] <- val

			log(bart_machine_get_posterior(bart.mod, new_data = data)$y_hat_posterior_samples) - log(ref.matrix)
		}) %>% magrittr::set_names(paste(pred.val[-1], 'vs', pred.val[1]))

	}) %>% magrittr::set_names(vars)
}

bartMachineCV is too verbose even with verbose = FALSE: it drops the verbose argument and includes cat() calls

bartMachineCV is very verbose, even with verbose = FALSE. I know some messages come directly from Java, but others come from R and, arguably, should not be produced with verbose = FALSE. Two main issues:

  1. At the end of build_bart_machine_cv, in the call bart_machine_cv = build_bart_machine(X, y, ...), the verbose argument is not passed on (passing the ... does not do it here), so build_bart_machine runs with its default verbose = TRUE. (How to reproduce: launch bartMachineCV and set a break point right before that call.)

In a debugging session, the value of verbose is FALSE right before that call, yet running it produces verbose output. If we instead make the call explicitly passing verbose = verbose, it honors the argument. [screenshots of the debugging session omitted]

  2. A second problem is that build_bart_machine_cv has many cat() calls that are not wrapped in the if (verbose) construct present in, say, build_bart_machine itself. (I actually wonder whether using cat, instead of message, is best practice; but that is a different issue.)

Suppressing output during model fitting

Is it possible to suppress output messages completely when running bartMachine? I tried verbose = F and even redirecting the output to a temporary file, to no avail.

set.seed(11)
n  = 200 
p = 5
X = data.frame(matrix(runif(n * p), ncol = p))
y = 10 * sin(pi* X[ ,1] * X[,2]) +20 * (X[,3] -.5)^2 + 10 * X[ ,4] + 5 * X[,5] + rnorm(n)

sink(file = tempfile())
bartMachine(X, y, verbose = F)
sink()
