randomPlantedForest

randomPlantedForest implements “Random Planted Forest”, a directly interpretable tree ensemble (arxiv).

Installation

You can install the development version of randomPlantedForest from GitHub with

# install.packages("remotes")
remotes::install_github("PlantedML/randomPlantedForest")

or from r-universe with

install.packages("randomPlantedForest", repos = "https://plantedml.r-universe.dev")

Example

Model fitting uses a familiar interface:

library(randomPlantedForest)

mtcars$cyl <- factor(mtcars$cyl)
rpfit <- rpf(mpg ~ cyl + wt + hp, data = mtcars, ntrees = 25, max_interaction = 2)
rpfit
#> -- Regression Random Planted Forest --
#> 
#> Formula: mpg ~ cyl + wt + hp 
#> Fit using 3 predictors and 2-degree interactions.
#> Forest is _not_ purified!
#> 
#> Called with parameters:
#> 
#>             loss: L2
#>           ntrees: 25
#>  max_interaction: 2
#>           splits: 30
#>        split_try: 10
#>            t_try: 0.4
#>            delta: 0
#>          epsilon: 0.1
#>    deterministic: FALSE
#>         nthreads: 1
#>           purify: FALSE
#>               cv: FALSE

predict(rpfit, new_data = mtcars) |>
  cbind(mpg = mtcars$mpg) |>
  head()
#>      .pred  mpg
#> 1 20.81459 21.0
#> 2 20.72354 21.0
#> 3 26.04526 22.8
#> 4 21.26845 21.4
#> 5 18.45921 18.7
#> 6 19.54406 18.1

Prediction components can be accessed via predict_components, including the intercept, main effects, and interactions up to a specified degree. The returned object also contains the original data as x, which is required for visualization. The glex package can be used as well: glex(rpfit) yields the same result.

components <- predict_components(rpfit, new_data = mtcars) 

str(components)
#> List of 3
#>  $ m        :Classes 'data.table' and 'data.frame':  32 obs. of  6 variables:
#>   ..$ cyl   : num [1:32] 0.445 0.445 0.863 0.445 -1.274 ...
#>   ..$ wt    : num [1:32] -0.0615 -0.1421 2.3182 -0.0155 -0.3116 ...
#>   ..$ hp    : num [1:32] 0.162 0.162 2.021 0.162 -0.941 ...
#>   ..$ cyl:wt: num [1:32] 0.00389 0.00389 0.69586 0.17156 0.4615 ...
#>   ..$ cyl:hp: num [1:32] 0.1453 0.1453 -0.0511 0.1453 0.1179 ...
#>   ..$ hp:wt : num [1:32] -0.1264 -0.1367 -0.0487 0.1138 0.1596 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  $ intercept: num 20.2
#>  $ x        :Classes 'data.table' and 'data.frame':  32 obs. of  3 variables:
#>   ..$ cyl: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
#>   ..$ wt : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
#>   ..$ hp : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
#>   ..- attr(*, ".internal.selfref")=<externalptr> 
#>  - attr(*, "class")= chr [1:3] "glex" "rpf_components" "list"

Various visualization options are available via glex, e.g. for main and second-order interaction effects:

# install glex if not available:
if (!requireNamespace("glex")) remotes::install_github("PlantedML/glex")
#> Loading required namespace: glex
library(glex)
library(ggplot2)
library(patchwork) # For plot arrangement

p1 <- autoplot(components, "wt")
p2 <- autoplot(components, "hp")
p3 <- autoplot(components, "cyl")
p4 <- autoplot(components, c("wt", "hp"))

(p1 + p2) / (p3 + p4) +
  plot_annotation(
    title = "Selected effects for mtcars",
    caption = "(It's a tiny dataset but it has to fit in a README, okay?)"
  )

See the Bikesharing decomposition article for more examples.

randomplantedforest's People

Contributors

feedelamort, jemus42, josephtmeyer, jyliuu, mhiabu, mnwright

randomplantedforest's Issues

Informative print method

Currently, printing an rpf object only prints the list components, which is probably not what a user is looking for. Something more informative, with relevant information about the forest and related stats, would be nice, for example like in osrf.

Implementing the method itself is simple enough, but I'm not sure about getting the associated stats, like OOB error.
I should also probably start storing the call in the object as other fitters do, and either try to get parameters out of $fit$get_parameters() or also store them explicitly for easier retrieval 🤔

Given the nature of the method, it would probably be useful to show info like the max_interaction param prominently since it's relevant to model complexity.

Handling of missing values (NA)

Two possible options:

  1. "We don't do NA, sorry": (Current behavior)

Missings in input data would cause an error or would be dropped (non-silently, to be safe(r)) via na.omit or similar.
Could use an na_rm argument in rpf() and predict.rpf() for that purpose.

  2. Handle NAs on the C++ level in whatever tree-ish way is suitable.

See also the Rcpp for everyone chapter on missings.

This is not a pressing issue for now since the implementation can be built and benchmarked under the assumption of complete data, but once we start considering a CRAN release we should at least have an opinion on the matter, I guess.
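
As a rough illustration of what a non-silent drop could look like if it were done on the C++ side, the sketch below keeps only complete rows and reports how many were removed, so the caller can warn instead of dropping silently. It assumes missing values arrive as NaN in a row-major std::vector<std::vector<double>>; the struct and function names are hypothetical and not part of the package.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: the complete rows plus the number of dropped rows.
struct CompleteCases {
    std::vector<std::vector<double>> rows;
    std::size_t n_dropped;
};

// Keep only rows without NaN entries (assumption: missing values are encoded as NaN).
CompleteCases drop_incomplete_rows(const std::vector<std::vector<double>>& x) {
    CompleteCases out{{}, 0};
    for (const auto& row : x) {
        bool complete = true;
        for (double v : row) {
            if (std::isnan(v)) {
                complete = false;
                break;
            }
        }
        if (complete) {
            out.rows.push_back(row);
        } else {
            ++out.n_dropped;  // counted so the caller can emit a warning
        }
    }
    return out;
}
```

The returned count is what makes the drop non-silent: a wrapper can compare it to 0 and warn.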

Fix logit loss

std::vector<std::vector<double>> W = *split.W, W_new = *split.W;
for(int p=0; p<value_size; ++p){
  for(auto individual: split.I_s){
    W[individual][p] = exp(W[individual][p]);
    W_new[individual][p] = exp(W_new[individual][p] + log(M_s[p] / M_sp) - W_s_mean[p]);

If W and W_new are the same, why then treat them separately below?

split.min_sum += (*split.Y)[individual][p] * log(W[individual][p] / (1 + std::accumulate(W[individual].begin(), W[individual].end(), 0.0)) ); // ~ R_old
split.min_sum -= (*split.Y)[individual][p] * log(W_new[individual][p] / (1 + std::accumulate(W_new[individual].begin(), W_new[individual].end(), 0.0)) ); // ~ R_new

W and W_new are the same, so nothing happens?
And won't this give log(0) in some cases?
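
On the log(0) question: exp() underflows to 0 for strongly negative scores, so taking the log of the ratio can return -Inf. A common remedy is to stay on the log scale and use the log-sum-exp trick. The sketch below assumes w holds the raw (not yet exponentiated) scores of one individual and that the reference class has an implicit score of 0, matching the "1 +" in the denominators above; the function name is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Computes log( exp(w[p]) / (1 + sum_k exp(w[k])) ) without forming exp(w[p]) on its own.
double log_prob_stable(const std::vector<double>& w, std::size_t p) {
    assert(p < w.size());
    // The implicit reference class has score 0, so 0 takes part in the maximum.
    double m = 0.0;
    for (double v : w) m = std::max(m, v);
    // Shifted sum: every term is at most 1 and at least one term equals 1,
    // so the sum lies in [1, w.size() + 1] and its log is finite.
    double sum = std::exp(-m);  // reference class: exp(0 - m)
    for (double v : w) sum += std::exp(v - m);
    return w[p] - m - std::log(sum);
}
```

For moderate scores this agrees with the direct formula; for a score like -800 it returns a finite value where the direct formula yields log(0).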

Split up randomPlantedForest.cpp file into multiple files

Currently the resulting binary is too large and R CMD check complains about it. CRAN will definitely have questions and will either need a strong justification or reject the package until the size is reduced.

IIRC the main issue is that Rcpp modules introduce some complexity in that regard, but maybe it could still be possible to at least separate regression and classification RPF.
If not, we'll have to do some pruning to remove all unneeded / discarded code from the file to get the binary size down.

Add purify method R wrapper

C++ module exports purify and new_purify, the latter is WIP.

Wrapping purify should be doable; afterwards a plot method can be built on top.

Plot tree families similar to illustrations in paper

Would be nice both from an illustrative point of view to explain the method, and to get an idea of the data partitioning in low dimensional cases where there's not too much going on.

I think the $forest object should contain all the info needed for that but I don't know how to put it in a useful structure yet, and how to visualize it then.
Ideally something that could support interactivity, but a png would suffice I guess.

Use R RNG

R CMD check (see devtools::check()) warns about non-portable code and output:

> checking compiled code ... NOTE
  File ‘randomPlantedForest/libs/randomPlantedForest.so’:
    Found ‘_ZSt4cout’, possibly from ‘std::cout’ (C++)
      Object: ‘randomPlantedForest.o’
    Found ‘rand’, possibly from ‘rand’ (C)
      Object: ‘randomPlantedForest.o’
  
  Compiled code should not call entry points which might terminate R nor
  write to stdout/stderr instead of to the console, nor use Fortran I/O
  nor system RNGs.
  
  See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.

For messages, maybe Rcout + Rcerr is the way to go?

Regarding the use of rand, I'm not sure what the best practice is. The manual referenced in the warning is found here and the section about random number generation is found here - I think the main point is that the RNG state should be seeded from R, so that reproducibility can be ensured on the R-level.
I'm sure Marvin can clarify / provide examples from ranger.
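
One way to drop rand() while keeping runs reproducible from R is to route all randomness through a single standard engine that is seeded with a value passed in by the caller. The sketch below assumes the seed is drawn on the R side (so that set.seed() controls it) and handed to C++ as an integer; the class name is hypothetical and this is not necessarily how the package or any other package does it.

```cpp
#include <cstdint>
#include <random>

// Hypothetical wrapper: all randomness flows through one engine with a caller-supplied seed.
class SeededRng {
public:
    explicit SeededRng(std::uint64_t seed) : engine_(seed) {}

    // Raw 64-bit draw from the Mersenne Twister engine.
    std::uint64_t next() { return engine_(); }

    // Uniform double in [0, 1), built from the top 53 bits of a raw draw.
    double next_unit() {
        return static_cast<double>(next() >> 11) * (1.0 / 9007199254740992.0);  // 2^53
    }

private:
    std::mt19937_64 engine_;
};
```

Two instances created with the same seed produce the same sequence, which is the property needed for reproducibility on the R level.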

Multiclass classification

  • Implementation in C++
  • Ensure fitting works (expand tests)
  • Exponential loss: input must be 1/-1 rather than 1/0
  • For predict():
    • Decide prediction behavior for type in numeric/link
    • Adjust predict_rpf_prob(), likely only switch for length(outcome_levels) needed?
    • Current behavior for class is expected to work unmodified for multiclass
  • Adjust documentation

Plot method

I assume we'd need to tidy up the output from $fit$get_model() in some way to be able to visualize it informatively.
Also depending on the status of purify (need more info on that when the time comes).

So far I've briefly played around with the output to see what I can make of it, and while I at least recognize values, I can't "think them" into a tree structure yet.
We might also be in for some fiddly bits regarding factor variables and associated levels, but we should have all the relevant information stored in $blueprint$ptypes$predictors for the original order and $factor_levels, which.. also contains the original levels? Okay we might have to check again how we do that.

library(randomPlantedForest)
train <-  tibble::tibble(
  x1 = rnorm(100, 30, 2),
  x2 = rnorm(100, -50, 2),
  x3 = factor(rbinom(100, 2, 1/2), labels = LETTERS[1:3]),
  y = 1.5 * x1 + -3 * x2 + 2 * as.numeric(x3) + rnorm(100)
)

set.seed(23)
rpfit <- rpf(y ~ ., data = train, max_interaction = 1, ntrees = 2, splits = 5)
mod <- rpfit$fit$get_model()

tree1 <- mod[[1]]
tree1$variables[[2]]
#> [1] 1
tree1$values[[2]]
#> [[1]]
#> [1] -8.203462
#> 
#> [[2]]
#> [1] 2.854149
#> 
#> [[3]]
#> [1] -0.6127117
tree1$intervals[[2]]
#> [[1]]
#>          [,1]      [,2] [,3]
#> [1,] 24.25602 -54.52766    1
#> [2,] 27.85953 -46.34593    3
#> 
#> [[2]]
#>          [,1]      [,2] [,3]
#> [1,] 31.66638 -54.52766    1
#> [2,] 36.46406 -46.34593    3
#> 
#> [[3]]
#>          [,1]      [,2] [,3]
#> [1,] 27.85953 -54.52766    1
#> [2,] 31.66638 -46.34593    3

Created on 2022-08-31 with reprex v2.0.2
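
One way to "think" the output above into a structure is to treat every leaf as a hyperrectangle with a value attached: the columns of each intervals matrix appear to be the predictors, the two rows the lower and upper bounds, and the matching entry in values the leaf's contribution. The sketch below encodes the three leaves printed above and looks up the leaf containing a point. It assumes closed intervals, with the first matching leaf winning on shared boundaries and a contribution of 0 outside all leaves; these conventions and all type and function names are assumptions, not the package's actual behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical leaf: one (lower, upper) pair per predictor plus the leaf's value.
struct Leaf {
    std::vector<std::pair<double, double>> intervals;
    double value;
};

// True if x lies inside the leaf's hyperrectangle (closed bounds, an assumption).
bool contains(const Leaf& leaf, const std::vector<double>& x) {
    assert(x.size() == leaf.intervals.size());
    for (std::size_t j = 0; j < x.size(); ++j) {
        if (x[j] < leaf.intervals[j].first || x[j] > leaf.intervals[j].second) return false;
    }
    return true;
}

// Value of the first leaf containing x, or 0 if no leaf contains it.
double evaluate_tree(const std::vector<Leaf>& leaves, const std::vector<double>& x) {
    for (const Leaf& leaf : leaves) {
        if (contains(leaf, x)) return leaf.value;
    }
    return 0.0;
}

// The three leaves from tree1$intervals[[2]] and tree1$values[[2]] above.
// Only x1 is split; x2 and x3 (factor coded 1..3) span their full range.
std::vector<Leaf> example_tree() {
    const std::pair<double, double> x2{-54.52766, -46.34593};
    const std::pair<double, double> x3{1.0, 3.0};
    std::vector<Leaf> leaves;
    leaves.push_back(Leaf{{{24.25602, 27.85953}, x2, x3}, -8.203462});
    leaves.push_back(Leaf{{{31.66638, 36.46406}, x2, x3}, 2.854149});
    leaves.push_back(Leaf{{{27.85953, 31.66638}, x2, x3}, -0.6127117});
    return leaves;
}
```

Read this way, the tree for variable 1 is a step function in x1, which would be straightforward to draw as a piecewise-constant line for main effects or as rectangles for two-way interactions.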

Cross-platform reproducibility

I assumed the following result would be static across platforms due to the fixed seed:

library(randomPlantedForest)
train <-  mtcars[1:20, ]
test <-  mtcars[21:32, ]

set.seed(23)
rpfit <- rpf(mpg ~., data = train, max_interaction = 3)
pred <- predict(rpfit, test)

mse <- rpfit$fit$MSE(as.matrix(pred$.pred), as.matrix(test$mpg))
mse
#> [1] 17.19055

Created on 2022-08-30 with reprex v2.0.2

...but according to GitHub Actions, that is not the case on any of the non-macOS platforms. Now I'm wondering whether macOS is being weird, whether the RNG seeding is not going the way we expect, or whether there's an overlooked random component in the process.
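
One candidate worth ruling out (an assumption, not a diagnosis): if the C++ code draws through standard-library distributions such as std::uniform_int_distribution, results can differ between standard library implementations even with an identical seed, because the C++ standard fixes the output of engines like std::mt19937 but leaves the distributions' algorithms to the implementation. The sketch below draws a bounded integer from raw engine output only; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Uniform integer in [0, n) using only raw std::mt19937 output and rejection sampling,
// so the result does not depend on a library-specific distribution algorithm.
std::uint32_t portable_below(std::mt19937& engine, std::uint32_t n) {
    assert(n > 0);
    const std::uint64_t range = std::uint64_t{1} << 32;  // mt19937 yields values in [0, 2^32)
    const std::uint64_t limit = range - (range % n);     // largest multiple of n within the range
    std::uint64_t x;
    do {
        x = engine();
    } while (x >= limit);  // reject the biased tail
    return static_cast<std::uint32_t>(x % n);
}
```

If swapping such a helper in for the library distributions makes the platforms agree, the seeding itself was fine and the difference came from the distribution layer.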

Probability prediction for multiclass with L1/L2 loss

Since L1 and L2 losses are not guaranteed to return valid probability predictions, i.e. predictions can exceed the [0, 1] interval, we have to figure out how to handle that in the multiclass setting.

In the binary case, our previous approach of truncating class probabilities to [0, 1] was sufficient, since we only make predictions p for the positive class and define the prediction for the negative class as 1 - p, ensuring a sum of 1.

In the multiclass setting, this truncation still applies, so individual class predictions for k classes don't fall outside the plausible range, but their sum is no longer guaranteed to be 1.

What do we do?

Instead of truncating, we could divide predicted class probabilities by the sum of the predictions, normalizing them to sum to 1.
However, this would skew the individual class predictions in one way or another, and if we implement something like this on the R side then the C++ implementation and the R wrapper would yield different results - so I'd argue this needs to be addressed at a low level.
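
For reference, a low-level version of the renormalisation idea could look roughly like this. The sketch truncates first (L1/L2 predictions can also be negative) and then divides by the sum. It assumes one prediction vector per observation with one entry per class, and falls back to uniform probabilities when every entry truncates to 0; that fallback and the function name are assumptions.

```cpp
#include <algorithm>
#include <vector>

// Truncate each class prediction to [0, 1], then rescale so the entries sum to 1.
std::vector<double> truncate_and_normalize(std::vector<double> p) {
    if (p.empty()) return p;
    double sum = 0.0;
    for (double& v : p) {
        v = std::min(1.0, std::max(0.0, v));
        sum += v;
    }
    for (double& v : p) {
        // Assumed fallback: uniform probabilities if nothing survives truncation.
        v = (sum > 0.0) ? v / sum : 1.0 / static_cast<double>(p.size());
    }
    return p;
}
```

Applied to a row like (0.954, 0.0715, 0), this shrinks both non-zero entries by the same factor, which is exactly the skew discussed above.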

Example cases for testing/reference:

library(randomPlantedForest)
xdf <- na.omit(palmerpenguins::penguins)
rpfit <- randomPlantedForest::rpf(species ~ ., data = xdf, loss = "L1")
pred <- predict(rpfit, xdf, type = "prob")
pred$sum <- rowSums(pred)

# Implausible cases are returned as-is:
pred[pred$sum > 1, ]
#> # A tibble: 205 × 4
#>    .pred_Adelie .pred_Chinstrap .pred_Gentoo   sum
#>           <dbl>           <dbl>        <dbl> <dbl>
#>  1        1            0.00970        0       1.01
#>  2        1            0.00321        0       1.00
#>  3        0.954        0.0715         0       1.03
#>  4        0.996        0.0283         0       1.02
#>  5        0.995        0.0419         0       1.04
#>  6        1            0.00931        0       1.01
#>  7        0.964        0.0455         0       1.01
#>  8        0.976        0.0604         0       1.04
#>  9        1            0.000516       0       1.00
#> 10        0.574        0.396          0.0502  1.02
#> # … with 195 more rows

And with mlr3, we don't get any predictions at all (I'd argue predict.rpf() probably should also at least warn):

library(mlr3)
library(mlr3extralearners) # for rpf learner
# Penguins, dropping missing obs
test_task <- tsk("penguins")
ids <- complete.cases(test_task$data())
test_task$filter(which(ids))

lrn_rpf <- lrn("classif.rpf", loss = "L1", predict_type = "prob")
lrn_rpf$train(test_task)

# mlr3 correctly refuses class probabilities whose sum exceeds 1:
pred <- lrn_rpf$predict(test_task)
#> Error: Probabilities for observation 1 do sum up to 1.012296 != 1

# So we don't get any predictions at all:
pred
#> Error in eval(expr, envir, enclos): object 'pred' not found

Created on 2022-11-08 with reprex v2.0.2

More flexible parallelisation

Currently parallelisation is a boolean, defaulting (I think) to k - 1 threads for k available hardware threads:

if(parallelize){
int n_threads = std::thread::hardware_concurrency() - 1;

Ideally we would switch to an integer argument nthreads or something along those lines, which could default (on the R side) to options("Ncpus"), and 1 for the mlr3 learner.
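
A sketch of how an integer argument could be resolved on the C++ side. Note that std::thread::hardware_concurrency() returns an unsigned value that may be 0 when the count cannot be determined, so subtracting 1 as in the excerpt above can wrap around. The function name and the clamping policy are assumptions.

```cpp
#include <algorithm>
#include <thread>

// Resolve a user-supplied thread count into a usable value.
// `available` is a parameter so the policy can be checked without depending on the machine.
unsigned resolve_nthreads(int requested,
                          unsigned available = std::thread::hardware_concurrency()) {
    if (requested < 1) return 1u;            // invalid or unset request: run single-threaded
    const unsigned req = static_cast<unsigned>(requested);
    if (available == 0u) return req;         // hardware count unknown: trust the caller
    return std::min(req, available);         // never use more threads than the hardware reports
}
```

With this in place the R side only needs to pass an integer, and a default of 1 for the mlr3 learner falls out naturally.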

Classif loss functions: Check which exponential is correct

Check back with @JosephTMeyer about exponential loss.

Affects lines

Other cases where loss is checked expect it to be in the set c("logit", "logit_2", "exponential", "exponential_2"), so there's no immediate need to remove unneeded loss functions there.

Build errors on macOS

Unfortunately I don't understand most of what's going on here.
I thought maybe I could at least reproduce these errors on Linux by switching to clang++, but on Ubuntu 20.04 with clang++ 10.0 it works just as well as with g++.
On macOS I have clang++ 13.0, with g++ being basically an alias for clang++ from what I understand.

I'm also running an arm64 installation of R for whatever that's worth - point being I can't tell how to fix these issues or how to help debug/narrow it down :/
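
The log below repeatedly ends in the note "'this' argument has type 'const setComp', but method is not marked const", which suggests that libc++ calls the map comparator through a const reference. Marking operator() as const is the likely fix. The sketch below shows the pattern; the comparator body (order by size, then lexicographically) and the int payload are placeholders, since the real body of setComp is not shown in the log.

```cpp
#include <map>
#include <set>

struct setComp {
    // The trailing `const` lets the map invoke the comparator through a const reference.
    bool operator()(const std::set<int>& a, const std::set<int>& b) const {
        // Placeholder ordering (assumption): smaller sets first, ties broken lexicographically.
        if (a.size() != b.size()) return a.size() < b.size();
        return a < b;
    }
};

// Placeholder for the real TreeFamily type, which maps to std::shared_ptr<DecisionTree>.
using TreeFamilyKeys = std::map<std::set<int>, int, setComp>;

// find() on a const map is one of the calls that requires a const comparator.
bool has_key(const TreeFamilyKeys& m, const std::set<int>& key) {
    return m.find(key) != m.end();
}
```

The -Wrange-loop-construct warnings in the same log are a separate matter and do not stop the build; as the compiler notes say, they go away when the loop variables are declared as const references.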

★ devtools::install()
✓  checking for file ‘/Users/Lukas/repos/github/PlantedML/randomPlantedForest/DESCRIPTION’ ...
─  preparing ‘randomPlantedForest’:
✓  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
   Removed empty directory ‘randomPlantedForest/tests/testthat’
   Removed empty directory ‘randomPlantedForest/tests’
─  building ‘randomPlantedForest_0.0.0.9000.tar.gz’
   
Running /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL \
  /var/folders/n1/p1hxy7856nndrd0njv0lxzgw0000gn/T//RtmpvJDgra/randomPlantedForest_0.0.0.9000.tar.gz --install-tests 
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library’
* installing *source* package ‘randomPlantedForest’ ...
** using staged installation
** libs
ccache clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I'/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/RcppParallel/include' -I/opt/homebrew/include -Xclang -fopenmp   -fPIC  -falign-functions=64 -Wall -g -O2  -c RcppExports.cpp -o RcppExports.o
ccache clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG  -I'/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/RcppParallel/include' -I/opt/homebrew/include -Xclang -fopenmp   -fPIC  -falign-functions=64 -Wall -g -O2  -c randomPlantedForest.cpp -o randomPlantedForest.o
randomPlantedForest.cpp:1529:24: warning: loop variable 'interval' creates a copy from type 'const std::pair<double, double>' [-Wrange-loop-construct]
        for(const auto interval: leaf.intervals){
                       ^
randomPlantedForest.cpp:1529:13: note: use reference type 'const std::pair<double, double> &' to prevent copying
        for(const auto interval: leaf.intervals){
            ^~~~~~~~~~~~~~~~~~~~
                       &
randomPlantedForest.cpp:1526:22: warning: loop variable 'leaf' creates a copy from type 'const Leaf' [-Wrange-loop-construct]
      for(const auto leaf: tree.second->leaves){
                     ^
randomPlantedForest.cpp:1526:11: note: use reference type 'const Leaf &' to prevent copying
      for(const auto leaf: tree.second->leaves){
          ^~~~~~~~~~~~~~~~
                     &
randomPlantedForest.cpp:1522:20: warning: loop variable 'tree' creates a copy from type 'const std::pair<const std::set<int>, std::shared_ptr<DecisionTree>>' [-Wrange-loop-construct]
    for(const auto tree: family){
                   ^
randomPlantedForest.cpp:1522:9: note: use reference type 'const std::pair<const std::set<int>, std::shared_ptr<DecisionTree>> &' to prevent copying
    for(const auto tree: family){
        ^~~~~~~~~~~~~~~~
                   &
randomPlantedForest.cpp:1520:18: warning: loop variable 'family' creates a copy from type 'const std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>' [-Wrange-loop-construct]
  for(const auto family: tree_families){
                 ^
randomPlantedForest.cpp:1520:7: note: use reference type 'const std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp> &' to prevent copying
  for(const auto family: tree_families){
      ^~~~~~~~~~~~~~~~~~
                 &
In file included from randomPlantedForest.cpp:7:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:518:17: error: no matching function for call to object of type 'const setComp'
        {return static_cast<const _Compare&>(*this)(__x.__get_value().first, __y);}
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:2537:14: note: in instantiation of member function 'std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>::operator()' requested here
        if (!value_comp()(__root->__value_, __v))
             ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:2466:20: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__lower_bound<std::set<int>>' requested here
    iterator __p = __lower_bound(__v, __root(), __end_node());
                   ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1380:68: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::find<std::set<int>>' requested here
    iterator find(const key_type& __k)             {return __tree_.find(__k);}
                                                                   ^
randomPlantedForest.cpp:314:20: note: in instantiation of member function 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::find' requested here
    if(tree_family.find(split_dims) != tree_family.end()) return tree_family[split_dims];
                   ^
randomPlantedForest.cpp:102:8: note: candidate function not viable: 'this' argument has type 'const setComp', but method is not marked const
  bool operator()(const std::set<int> &a, const std::set<int> &b){
       ^
In file included from randomPlantedForest.cpp:7:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:521:17: error: no matching function for call to object of type 'const setComp'
        {return static_cast<const _Compare&>(*this)(__x, __y.__get_value().first);}
                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:2467:26: note: in instantiation of member function 'std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>::operator()' requested here
    if (__p != end() && !value_comp()(__v, *__p))
                         ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1380:68: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::find<std::set<int>>' requested here
    iterator find(const key_type& __k)             {return __tree_.find(__k);}
                                                                   ^
randomPlantedForest.cpp:314:20: note: in instantiation of member function 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::find' requested here
    if(tree_family.find(split_dims) != tree_family.end()) return tree_family[split_dims];
                   ^
randomPlantedForest.cpp:102:8: note: candidate function not viable: 'this' argument has type 'const setComp', but method is not marked const
  bool operator()(const std::set<int> &a, const std::set<int> &b){
       ^
In file included from randomPlantedForest.cpp:6:
In file included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/set:429:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:1975:17: error: no matching function for call to object of type 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::value_compare' (aka 'std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>')
            if (value_comp()(__v, __nd->__value_))
                ^~~~~~~~~~~~
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:2091:36: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__find_equal<std::set<int>>' requested here
    __node_base_pointer& __child = __find_equal(__parent, __k);
                                   ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1521:20: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__emplace_unique_key_args<std::set<int>, const std::piecewise_construct_t &, std::tuple<const std::set<int> &>, std::tuple<>>' requested here
    return __tree_.__emplace_unique_key_args(__k,
                   ^
randomPlantedForest.cpp:314:77: note: in instantiation of member function 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::operator[]' requested here
    if(tree_family.find(split_dims) != tree_family.end()) return tree_family[split_dims];
                                                                            ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:514:10: note: candidate function not viable: no known conversion from 'const std::set<int>' to 'const std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>' for 1st argument
    bool operator()(const _CP& __x, const _CP& __y) const
         ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1091:30: warning: the specified comparator type does not provide a viable const call operator [-Wuser-defined-warnings]
        static_assert(sizeof(__diagnose_non_const_comparator<_Key, _Compare>()), "");
                             ^
randomPlantedForest.cpp:396:16: note: in instantiation of member function 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::~map' requested here
    TreeFamily m;
               ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:970:5: note: from 'diagnose_if' attribute on '__diagnose_non_const_comparator<std::set<int>, setComp>':
    _LIBCPP_DIAGNOSE_WARNING(!__invokable<_Compare const&, _Tp const&, _Tp const&>::value,
    ^                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__config:1385:21: note: expanded from macro '_LIBCPP_DIAGNOSE_WARNING'
     __attribute__((diagnose_if(__VA_ARGS__, "warning")))
                    ^           ~~~~~~~~~~~
In file included from randomPlantedForest.cpp:6:
In file included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/set:429:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:2021:28: error: no matching function for call to object of type 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::value_compare' (aka 'std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>')
    if (__hint == end() || value_comp()(__v, *__hint))  // check before
                           ^~~~~~~~~~~~
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:2112:36: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__find_equal<std::set<int>>' requested here
    __node_base_pointer& __child = __find_equal(__p, __parent, __dummy, __k);
                                   ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:1255:16: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__emplace_hint_unique_key_args<std::set<int>, const std::pair<const std::set<int>, std::shared_ptr<DecisionTree>> &>' requested here
        return __emplace_hint_unique_key_args(__p, _NodeTypes::__get_key(__v), __v).first;
               ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1180:29: note: in instantiation of member function 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__insert_unique' requested here
            {return __tree_.__insert_unique(__p.__i_, __v);}
                            ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1201:17: note: in instantiation of member function 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::insert' requested here
                insert(__e.__i_, *__f);
                ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1009:13: note: in instantiation of function template specialization 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::insert<std::__map_const_iterator<std::__tree_const_iterator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__tree_node<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, void *> *, long>>>' requested here
            insert(__m.begin(), __m.end());
            ^
randomPlantedForest.cpp:403:13: note: in instantiation of member function 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::map' requested here
    getKeys(m);
            ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:514:10: note: candidate function not viable: no known conversion from 'const std::set<int>' to 'const std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>' for 1st argument
    bool operator()(const _CP& __x, const _CP& __y) const
         ^
In file included from randomPlantedForest.cpp:6:
In file included from /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/set:429:
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:1896:17: error: no matching function for call to object of type 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::value_compare' (aka 'std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>')
            if (value_comp()(__v, __nd->__value_))
                ^~~~~~~~~~~~
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:2226:36: note: in instantiation of member function 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__find_leaf_high' requested here
    __node_base_pointer& __child = __find_leaf_high(__parent, _NodeTypes::__get_key(__nd->__value_));
                                   ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:1661:13: note: in instantiation of member function 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__node_insert_multi' requested here
            __node_insert_multi(__cache.__get());
            ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__tree:1617:9: note: in instantiation of function template specialization 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::__assign_multi<std::__tree_const_iterator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__tree_node<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, void *> *, long>>' requested here
        __assign_multi(__t.begin(), __t.end());
        ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:1016:21: note: in instantiation of member function 'std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::operator=' requested here
            __tree_ = __m.__tree_;
                    ^
randomPlantedForest.cpp:912:20: note: in instantiation of member function 'std::map<std::set<int>, std::shared_ptr<DecisionTree>, setComp>::operator=' requested here
  tree_families[n] = curr_family;
                   ^
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/map:514:10: note: candidate function not viable: no known conversion from 'const std::__tree<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, std::__map_value_compare<std::set<int>, std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>, setComp, true>, std::allocator<std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>>>::key_type' (aka 'const std::set<int>') to 'const std::__value_type<std::set<int>, std::shared_ptr<DecisionTree>>' for 1st argument
    bool operator()(const _CP& __x, const _CP& __y) const
         ^
In file included from randomPlantedForest.cpp:13:
In file included from /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include/Rcpp.h:46:
/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include/Rcpp/XPtr.h:31:5: warning: delete called on non-final 'RandomPlantedForest' that has virtual functions but non-virtual destructor [-Wdelete-non-abstract-non-virtual-dtor]
    delete obj;
    ^
/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include/Rcpp/XPtr.h:54:26: note: in instantiation of function template specialization 'Rcpp::standard_delete_finalizer<RandomPlantedForest>' requested here
    void Finalizer(T*) = standard_delete_finalizer<T>,
                         ^
/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include/Rcpp/XPtr.h:31:5: warning: delete called on non-final 'ClassificationRPF' that has virtual functions but non-virtual destructor [-Wdelete-non-abstract-non-virtual-dtor]
    delete obj;
    ^
/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include/Rcpp/XPtr.h:54:26: note: in instantiation of function template specialization 'Rcpp::standard_delete_finalizer<ClassificationRPF>' requested here
    void Finalizer(T*) = standard_delete_finalizer<T>,
                         ^
7 warnings and 5 errors generated.
make: *** [randomPlantedForest.o] Error 1
ERROR: compilation failed for package ‘randomPlantedForest’
* removing ‘/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/randomPlantedForest’
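
Two things in the log above can be illustrated with a minimal sketch. The `-Wdelete-non-abstract-non-virtual-dtor` warnings go away once the base class declares a virtual destructor. The `value_compare` errors are of a kind that libc++ commonly reports when a map comparator's call operator is not `const`-qualified; whether that is the actual cause for `setComp` cannot be confirmed from the log alone, so treat that part as an assumption. All types below (`DecisionTree`, `SetComp`, `Forest`, `ClassificationForest`, `copy_family`) are hypothetical stand-ins, not the package's real definitions.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>

// Hypothetical stand-in for the package's tree type.
struct DecisionTree {
  int id;
};

// Comparator with a const-qualified call operator, so it can be invoked
// through a const reference as the standard library requires.
// The ordering (size first, then lexicographic) is an illustrative choice.
struct SetComp {
  bool operator()(const std::set<int>& a, const std::set<int>& b) const {
    if (a.size() != b.size()) return a.size() < b.size();
    return a < b;
  }
};

using TreeFamily =
    std::map<std::set<int>, std::shared_ptr<DecisionTree>, SetComp>;

// Base class with a virtual destructor: deleting a derived object through
// a base pointer is then well defined and the compiler warning disappears.
struct Forest {
  virtual ~Forest() = default;
  virtual int kind() const { return 0; }
};

struct ClassificationForest : Forest {
  int kind() const override { return 1; }
};

// Copy-assigns one family to another, the same kind of operation as
// `tree_families[n] = curr_family` in the log.
TreeFamily copy_family(const TreeFamily& src) {
  TreeFamily dst;
  dst = src;
  assert(dst.size() == src.size());
  return dst;
}
```

If the comparator in the package already has a `const` call operator, the errors have a different cause and only the destructor part of this sketch applies.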

Merge Bug-Fixes branch

See #27

  • Fixes issues with wonky results
  • Diff is somewhat broken, need to figure out how to fix it at some point
  • Everything else (speedup tweaks) should build on that branch

Forest serialization and deserialization

It needs to be possible to safely store an rpf for practical applications where refitting ad-hoc is not feasible, i.e. the following needs to run without error:

library(randomPlantedForest)

# Normal use
rpfit <- rpf(mpg ~ wt + cyl, data = mtcars)
rpfit
#> -- Regression Random Planted Forest --
#> 
#> Formula: mpg ~ wt + cyl 
#> Fit using 2 predictors and main effects only.
#> Forest is _not_ purified!
#> 
#> Called with parameters:
#> 
#>             loss: L2
#>           ntrees: 50
#>  max_interaction: 1
#>           splits: 30
#>        split_try: 10
#>            t_try: 0.4
#>            delta: 0
#>          epsilon: 0.1
#>    deterministic: FALSE
#>         nthreads: 1
#>           purify: FALSE
#>               cv: FALSE
predict(rpfit, mtcars)
#> # A tibble: 32 × 1
#>    .pred
#>    <dbl>
#>  1  20.4
#>  2  20.3
#>  3  26.1
#>  4  20.8
#>  5  16.5
#>  6  18.6
#>  7  15.1
#>  8  23.8
#>  9  23.1
#> 10  19.0
#> # ℹ 22 more rows

# Serialize and cleanup
temp_loc <- tempfile()
saveRDS(rpfit, file = temp_loc)
rm(rpfit)

# Attempt to restore: Not working
rpfit <- readRDS(temp_loc)
rpfit
#> -- Regression Random Planted Forest --
#> 
#> Formula: mpg ~ wt + cyl 
#> Fit using 2 predictors and main effects only.
#> Error in .External(structure(list(name = "CppMethod__invoke_notvoid", : NULL value passed as symbol address
predict(rpfit, mtcars)
#> Error in .External(structure(list(name = "CppMethod__invoke_notvoid", : NULL value passed as symbol address

Created on 2023-11-21 with reprex v2.0.2

As of now, I don't know how the internals work here, whether it's even possible to hook into saveRDS, or whether we need to write a separate serialization mechanism.

It's not a high priority now, but in the long run this needs to be possible.
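
The `NULL value passed as symbol address` error is consistent with the fitted object holding a handle to C++ state that `saveRDS` cannot capture. One possible direction, sketched below under strong simplifying assumptions, is to export the C++ state as plain numeric data that can live in the R object and to rebuild the C++ object from it after `readRDS`. The `Leaf` struct and the `flatten` / `rebuild` functions are hypothetical and much simpler than the package's real tree structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified leaf: one interval per dimension plus a value.
struct Leaf {
  std::vector<double> lower;
  std::vector<double> upper;
  double value;
};

// Flatten leaves into a plain numeric vector with the layout
// [n_leaves, n_dims, (lower..., upper..., value) per leaf].
// A vector like this could be stored in an ordinary R object.
std::vector<double> flatten(const std::vector<Leaf>& leaves,
                            std::size_t n_dims) {
  std::vector<double> out;
  out.reserve(2 + leaves.size() * (2 * n_dims + 1));
  out.push_back(static_cast<double>(leaves.size()));
  out.push_back(static_cast<double>(n_dims));
  for (const Leaf& leaf : leaves) {
    assert(leaf.lower.size() == n_dims && leaf.upper.size() == n_dims);
    out.insert(out.end(), leaf.lower.begin(), leaf.lower.end());
    out.insert(out.end(), leaf.upper.begin(), leaf.upper.end());
    out.push_back(leaf.value);
  }
  return out;
}

// Rebuild the leaves from the flat representation produced by flatten().
std::vector<Leaf> rebuild(const std::vector<double>& flat) {
  assert(flat.size() >= 2);
  const std::size_t n_leaves = static_cast<std::size_t>(flat[0]);
  const std::size_t n_dims = static_cast<std::size_t>(flat[1]);
  assert(flat.size() == 2 + n_leaves * (2 * n_dims + 1));

  std::vector<Leaf> leaves;
  leaves.reserve(n_leaves);
  std::size_t pos = 2;
  for (std::size_t i = 0; i < n_leaves; ++i) {
    Leaf leaf;
    leaf.lower.assign(flat.begin() + pos, flat.begin() + pos + n_dims);
    pos += n_dims;
    leaf.upper.assign(flat.begin() + pos, flat.begin() + pos + n_dims);
    pos += n_dims;
    leaf.value = flat[pos++];
    leaves.push_back(leaf);
  }
  return leaves;
}
```

With something along these lines, the R side would only need to call the rebuild step when it finds that the C++ handle is no longer valid, so plain `saveRDS` / `readRDS` could keep working without a custom file format.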
