
rforestry's Introduction


Rforestry: Random Forests, Linear Trees, and Gradient Boosting for Inference and Interpretability

Sören Künzel, Theo Saarinen, Simon Walter, Edward Liu, Sam Antonyan, Allen Tang, Jasjeet Sekhon

Introduction

Rforestry is a fast implementation of Honest Random Forests, Gradient Boosting, and Linear Random Forests, with an emphasis on inference and interpretability.

How to install

  1. The GFortran compiler must be up to date. GFortran binaries can be found here.
  2. The devtools package has to be installed. You can install it using install.packages("devtools").
  3. The package contains compiled code, and you must have a development environment to install the development version. You can use devtools::has_devel() to check whether you do. If you do not have a development environment, Windows users should download and install Rtools, and macOS users should download and install Xcode.
  4. The latest development version can then be installed using devtools::install_github("forestry-labs/Rforestry"). Windows users will need to skip 64-bit compilation using devtools::install_github("forestry-labs/Rforestry", INSTALL_opts = c('--no-multiarch')) due to an outstanding gcc issue. These steps are collected in the sketch below.
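
The steps above can be run end to end as follows (a minimal sketch; the commands are exactly those listed above):

# Install devtools if it is not already available
install.packages("devtools")

# Check for a working development environment (Rtools on Windows, Xcode on macOS)
devtools::has_devel()

# Install the development version of Rforestry from GitHub
devtools::install_github("forestry-labs/Rforestry")

# On Windows, skip 64-bit compilation due to the outstanding gcc issue
devtools::install_github("forestry-labs/Rforestry",
                         INSTALL_opts = c('--no-multiarch'))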

Documentation

For the Python package, see the documentation here and install from PyPI here. For the R package, see the documentation here and install from CRAN here. For the source code for both packages, see the GitHub repository here.

Usage

library(Rforestry)

set.seed(292315)
test_idx <- sample(nrow(iris), 3)
x_train <- iris[-test_idx, -1]
y_train <- iris[-test_idx, 1]
x_test <- iris[test_idx, -1]

rf <- forestry(x = x_train, y = y_train, nthread = 2)

predict(rf, x_test)

Monotonic Constraints

The parameter monotonicConstraints strictly enforces monotonicity of partition averages when evaluating potential splits on the indicated features. This parameter can be used to specify both monotone increasing and monotone decreasing constraints.

library(Rforestry)

set.seed(49)
x <- rnorm(150)+5
y <- .15*x + .5*sin(3*x)
data_train <- data.frame(x1 = x, x2 = rnorm(150)+5, y = y + rnorm(150, sd = .4))

monotone_rf <- forestry(x = data_train[,-3],
                        y = data_train$y,
                        monotonicConstraints = c(1,1),
                        nodesizeStrictSpl = 5,
                        nthread = 1,
                        ntree = 25)
                        
predict(monotone_rf, newdata = data_train[,-3])

OOB Predictions

We can return predictions for the training data set using only the trees in which each observation was out-of-bag (OOB). Note that when there are few trees, or when a high proportion of the observations is sampled for each tree, some observations may not be out-of-bag for any tree. The predictions for these observations are returned as NaN.

library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds -  iris[,1])^2)
getOOB(rf)

If OOB predictions are going to be used, it is advised that one use OOB honesty during training (OOBhonest = TRUE). In this version of honesty, the OOB observations for each tree are used as the honest (averaging) set. OOB honesty also changes how predictions are constructed. When predicting for observations that are out-of-sample (using predict(..., aggregation = "average")), all the trees in the forest are used to construct predictions. When predicting for an observation that was in-sample (using predict(..., aggregation = "oob")), only the trees in which that observation was not in the averaging set are used to construct the prediction for that observation. aggregation = "oob" (out-of-bag) ensures that the outcome value for an observation is never used to construct a prediction for that observation, even when it is in-sample. This property does not hold in standard honesty, which relies on an asymptotic subsampling argument. OOB honesty, when used in combination with aggregation = "oob" at the prediction stage, cannot overfit IID data at either the training or prediction stage. The outputs of such models are also more stable and more easily interpretable; one can observe this by querying the model with interpretation tools such as ALEs, PDPs, LIME, etc.

library(Rforestry)

# Train a forest
rf <- forestry(x = iris[,-1],
               y = iris[,1],
               nthread = 2,
               ntree = 500,
               OOBhonest=TRUE)

# Get the OOB predictions for the training set
oob_preds <- predict(rf, aggregation = "oob")

# This should be equal to the OOB error
mean((oob_preds -  iris[,1])^2)
getOOB(rf)
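
The two aggregation modes described above can also be compared directly. Below is a minimal sketch that reuses the same setup: aggregation = "average" for held-out rows (all trees are used) and aggregation = "oob" for the training rows (only trees whose averaging set excludes each observation are used).

library(Rforestry)

set.seed(12)
test_idx <- sample(nrow(iris), 5)

rf <- forestry(x = iris[-test_idx, -1],
               y = iris[-test_idx, 1],
               ntree = 500,
               OOBhonest = TRUE)

# Out-of-sample rows: every tree in the forest is used
predict(rf, iris[test_idx, -1], aggregation = "average")

# In-sample rows: only trees for which the observation was not
# in the averaging set contribute to its prediction
predict(rf, aggregation = "oob")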

Saving + Loading a model

To save and load a trained model, the package includes two functions, saveForestry and loadForestry. The following code shows how to use them to save and load a forestry model.

library(Rforestry)

# Train a forest
forest <- forestry(x = iris[,-1],
                   y = iris[,1],
                   nthread = 2,
                   ntree = 500,
                   OOBhonest=TRUE)
               
# Get predictions before saving the forest
y_pred_before <- predict(forest, iris[,-1])

# Save the forest
saveForestry(forest, filename = file.path("forest.Rda"))

# Delete the forest
rm(forest)

# Load the forest
forest_after <- loadForestry(file.path("forest.Rda"))

# Predict after loading the forest
y_pred_after <- predict(forest_after, iris[,-1])

Ridge Random Forest

A fast implementation of random forests using ridge-penalized splitting and ridge regression for predictions. To use this version of random forests, set the linear option to TRUE.

library(Rforestry)

set.seed(49)
n <- c(100)
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a,b,c)
forest <- forestry(x, y, linear = TRUE, nthread = 2)
predict(forest, x)

rforestry's People

Contributors

actang, colleenchan, edwardwliu, ilia-shutov, jasjeetsekhon, linanqiu, petrovicboban, rocassius, rowancassius, sam20190380, sidc12321, sidc321, soerenkuenzel, theo-s, zuhdydufjdj


rforestry's Issues

Compatibility with sklearn estimators

We're interested in compatibility with sklearn estimators (see instructions) for integration with other libraries like Ray. The goal would be to pass the check_estimator() check. A few specific items needed to meet the requirements: class __init__ parameters should be explicit, and the newdata argument should be renamed to x and used consistently by subsequent methods.

Output an Rforestry tree in `treelite` format so that inference is blazing fast

treelite is a format for serializing trees for prediction only. It takes xgboost, lightgbm, and sklearn trees out of the box. It basically copies the internal structure of a tree, converts it into blazingly fast C code and makes predicting through the final structure extremely fast. There's also a CUDA treelite wrapper that converts any treelite model into one that works on GPUs.

One can also construct a treelite model from a custom trained model. All one needs are the following details for each tree: split points, feature at split, numerical vs categorical, and leaf nodes.

If we can get Rforestry to dump its internal structure into a JSON or something like that, I can work with that to convert it into a treelite tree. That'd give us most prediction performance perks + sklearn perks.
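
As a rough illustration of the per-tree information such a dump would need (split points, feature at split, numerical vs categorical, and leaf values), here is a sketch in R using jsonlite. The field names are hypothetical, not an existing Rforestry or treelite schema.

library(jsonlite)

# Hypothetical structure for a single dumped tree; field names are illustrative only
tree_dump <- list(
  nodes = list(
    list(id = 0, feature = "Sepal.Width", threshold = 3.1,
         type = "numerical", left = 1, right = 2),
    list(id = 1, leaf_value = 5.02),  # leaf node
    list(id = 2, leaf_value = 6.21)   # leaf node
  )
)

cat(toJSON(tree_dump, auto_unbox = TRUE, pretty = TRUE))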

@JasjeetSekhon @theo-s

Plotting with linear = TRUE sometimes fails

When plotting a model with linFeats = TRUE, I get the error: Error in elnet(x, is.sparse, ix, yx, y, weights, offset, type.gaussian, ...): y is constant; gaussian glmnet fails at the standardization step

Here is code that produces it, largely taken from an example in ?forestry.

n <- c(100)
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a,b,c)

forest <- forestry(
  x,
  y,
  ntree = 10,
  replace = TRUE,
  maxDepth = 100,
  nodesizeStrictSpl = 1,
  nodesizeStrictAvg = 1,
  nthread = 2,
  linear = TRUE
)

plot(forest)

Getting weights variables

Can you fix the function definition here

    def _get_weights_variables(self, weights: np.ndarray) -> np.ndarray:
        weights_variables = [i for i in range(weights.size) if weights[i] > max(weights) * 0.001]
        if len(weights_variables) < self.mtry:
            raise ValueError("mtry is too large. Given the feature weights, can't select that many features.")

        weights_variables = np.array(weights_variables, dtype=np.ulonglong)
        return weights_variables

for cases where weights is None (which is the default value in fit() for feature_weights and deep_feature_weights)?
Also, the function argument weights should be typed Optional[np.ndarray].

@edwardwliu

[Runtime] getVI

It appears that when P and N are both large, getVI has an abnormally long runtime.
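
A minimal sketch for reproducing the slowdown, assuming a large simulated design (exact timings will vary by machine):

library(Rforestry)

set.seed(1)
n <- 2000   # large N
p <- 200    # large P
x <- matrix(rnorm(n * p), ncol = p)
y <- rnorm(n)

rf <- forestry(x = x, y = y, ntree = 100, nthread = 2)

# Time the variable importance computation
system.time(getVI(rf))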

Evaluate adding a simple implementation of data adaptive feature weights

The idea is:
(1) train a forest with a smallish number of trees
(2) see how often each feature is split on in this forest
(3) train a new forest with feature weights in proportion to the probabilities each feature was split on in the original forest
(4) combine the trees from (1) and (3) into a single forest to produce a final prediction.

This idea arose from a conversation with Jas.
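
A rough sketch of steps (1)-(3) in R is below. The helper count_splits() is hypothetical: Rforestry would need to expose per-feature split counts for step (2), and a utility for combining two forests (step (4)) is not sketched here.

library(Rforestry)

set.seed(1)
x <- data.frame(matrix(rnorm(500 * 10), ncol = 10))
y <- x[, 1] + 0.5 * x[, 2] + rnorm(500)

# (1) Train a forest with a smallish number of trees
small_rf <- forestry(x = x, y = y, ntree = 50, nthread = 1)

# (2) Count how often each feature is split on.
# count_splits() is a hypothetical helper, not an existing Rforestry function.
split_counts <- count_splits(small_rf)

# (3) Train a new forest with feature weights proportional to the split frequencies
big_rf <- forestry(x = x, y = y, ntree = 500, nthread = 1,
                   featureWeights = split_counts / sum(split_counts))

# (4) The final step would combine the trees from (1) and (3) into a single forest.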

OOBpreds

The order of the arguments for getOOBpreds should be changed to:

getOOBpreds(object, feature.new = NULL, noWarning)

from:

getOOBpreds(object, noWarning, feature.new = NULL)

In common usage people are going to call it with:

getOOBpreds(objects, new.features)

and run into trouble.

Export model to JSON string

Support exporting trained models to JSON. The preferred spec should be compatible with Treelite, see documentation. Since serialization can be relatively slow and needs to be supported in both R and Python, I would consider exposing a C++ method for the JSON export. Otherwise, if we change any underlying model representations, we would need to update the wrappers separately. cc @theo-s.

Leaf nodes should not recalculate `predictedMean`

Hey @theo-s, if I'm understanding the code correctly, predictedMean below is getting recalculated for leaf nodes on every prediction call, right?

Rforestry/src/RFNode.cpp

Lines 174 to 226 in 67f4c3f

if (is_leaf()) {
  if (linear) {
    // Use ridgePredict (fit linear model on leaf avging obs + evaluate it)
    ridgePredict(outputPrediction,
                 outputCoefficients,
                 updateIndex,
                 xNew,
                 trainingData,
                 lambda);
  } else {
    double predictedMean;
    // Calculate the mean of the current node
    if (getAverageCount() == 0 && !getTrinary()) {
      predictedMean = std::numeric_limits<double>::quiet_NaN();
    } else if (getTrinary()) {
      predictedMean = 0;
    } else {
      predictedMean = (*trainingData).partitionMean(getAveragingIndex());
    }
    // Give all updateIndex the mean of the node as prediction values
    for (
      std::vector<size_t>::iterator it = (*updateIndex).begin();
      it != (*updateIndex).end();
      ++it
    ) {
      if (getTrinary()) {
        // In this case, we test the feature value and use the correct
        // pseudo outcome by weight
        std::vector<bool> signs;
        for (size_t i = 0; i < (*trainingData->getSymmetricIndices()).size(); i++) {
          if ((*xNew)[(*trainingData->getSymmetricIndices())[i]][*it] > 0) {
            signs.push_back(true);
          } else {
            signs.push_back(false);
          }
        }
        // Pull the correct pseudo outcome
        size_t weight_idx = bin_to_idx(signs);
        outputPrediction[*it] = (this->getWeights())[weight_idx];
      } else {
        outputPrediction[*it] = predictedMean;
      }
    }
  }

Since getAveragingIndex() always returns the same indices and (*trainingData) doesn't change, predictedMean should just be calculated once for each leaf node during training and then accessed at prediction time.

Implement Custom Sampling in Python

While implementing saving + loading in Python, I realized that we don't have customAveragingSample, customSplittingSample, and customExcludedSample implemented in the Python wrapper. It would be useful to have these, but I will not implement them in the saving + loading PR.

`seed` same in different objects

Let's have

from random_forestry import RandomForest

rf_1 = RandomForest()
rf_2 = RandomForest()

Since the default value for seed is a random number, we expect:

assert rf_1.seed != rf_2.seed

but that's not what's happening.

Although rf_1 and rf_2 are really different objects:

assert id(rf_1) != id(rf_2)

their attributes that have a default value set are not:

assert id(rf_1.seed) == id(rf_2.seed)

Python does copy-on-write here, so if the value is changed later, a new seed object is created:

rf_1.seed = 5
assert id(rf_1.seed) != id(rf_2.seed)

The implication is that if you create different RandomForest objects without explicitly setting seed, the seed will be random, but all objects will share the same value.
Is this acceptable to you? @edwardwliu @theo-s

forestry function not found

Trying to test the package. Despite a proper download and install, I keep getting the [could not find function "forestry"] error message. Thanks!

Floating point imprecision

When passing a data frame with 0 in training, the resulting processed_x occasionally replaces 0 with 1.387e-17. This is likely due to floating point behavior either in the C++ logic or the wrapper APIs. This has also led to flakiness in some exact tests.

Training data column with only NAs

Currently, forestry errors out cryptically with a colSd error when a column in the training data contains only NAs. It should either make that error more informative or simply drop the column from the training data.
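
A minimal sketch reproducing the error, assuming a small simulated data frame with one all-NA column:

library(Rforestry)

set.seed(1)
x <- data.frame(x1 = rnorm(100),
                x2 = rep(NA_real_, 100))  # column containing only NAs
y <- rnorm(100)

# Currently fails with a cryptic colSd error rather than a clear message
rf <- forestry(x = x, y = y, ntree = 50, nthread = 1)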

segmentation faults when splitratio = 0.5

When tuning a random forest using the caret package, segmentation faults occur often when splitratio = 0.5.

For a reproducible example, see: https://github.com/forestry-labs/GI-Collaboration/blob/main/code/3-Modeling/3.2-runPipeline.R

A segfault will occur with the parameters that are currently set:

dataset = "ehrbase"
estimator = "randomforest"
predictor = estimator
seed = 101
force = 0
embedding_dict = NULL
vi = 0

When running in a bash script, the error message would look something like:

  • Fold7.Rep1: mtry=12, nodesizeStrictSpl=2, ntree=500, splitratio=0.5521
  • Fold7.Rep1: mtry=12, nodesizeStrictSpl=4, ntree=500, splitratio=0.5000
    mtry nodesizeStrictSpl ntree splitratio
    4 12 4 500 0.5

*** caught segfault ***
address (nil), cause 'memory not mapped'

*** caught segfault ***
address 0x8e0, cause 'memory not mapped'

*** caught segfault ***
address (nil), cause 'memory not mapped'

*** caught segfault ***
address (nil), cause 'unknown'

*** caught segfault ***

*** caught segfault ***
address (nil), cause 'memory not mapped'

*** caught segfault ***
address (nil), cause 'memory not mapped'

*** caught segfault ***
address 0x448, cause 'memory not mapped'

*** caught segfault ***
address (nil), cause 'memory not mapped'

*** caught segfault ***
address (nil), cause 'memory not mapped'

*** caught segfault ***
address 0x88f, cause 'memory not mapped'

*** caught segfault ***
address (nil), cause 'memory not mapped'
Error: no more error handlers available (recursive errors?); invoking 'abort' restart

*** caught segfault ***
address (nil), cause 'memory not mapped'

*** caught segfault ***
address (nil), cause 'unknown'
An irrecoverable exception occurred. R is aborting now ...
An irrecoverable exception occurred. R is aborting now ...

feature.new

Are we sure that this is a good name for the argument in predict? For example, how about newx as in glmnet?

[Runtime] relinkCPP_ptr

In some cases, it appears that the runtime of relinkCPP_ptr is scaling nonlinearly in the number of trees in the forest.

Add Exact Predict Flag

Add a flag to the R predict wrapper in order to aggregate tree predictions in a standardized way.
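
A sketch of what the proposed usage might look like. The exact argument is the proposal being discussed here, mirroring the exact option shown in the Python example further down this page; it is not guaranteed to match the final interface.

library(Rforestry)

rf <- forestry(x = iris[, -1], y = iris[, 1], ntree = 100, nthread = 2)

# Proposed: aggregate the tree predictions in a fixed, standardized order
predict(rf, iris[, -1], exact = TRUE)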

R crashes when linFeats is a singleton

Here's an example. Note that this code will cause R to crash.

n <- c(100)
a <- rnorm(n)
b <- rnorm(n)
c <- rnorm(n)
y <- 4*a + 5.5*b - .78*c
x <- data.frame(a,b,c)

forest <- forestry(
  x,
  y,
  ntree = 10,
  replace = TRUE,
  maxDepth = 100,
  nodesizeStrictSpl = 10,
  nodesizeStrictAvg = 10,
  nthread = 2,
  linear = TRUE,
  linFeats = 1
)

predict(forest, aggregation = "oob")

relinkCPP_prt cannot be parallelized in a straightforward way

Running relinkCPP_prt in parallel fails. I think it is caused by this issue:
https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html

Here is an example:

library(Rforestry)
library(parallel)
n <- 100
p <- 2
x <- matrix(rnorm(n * p), ncol = p)
y <- rnorm(n)

forest1 <- make_savable(forestry(x, y))
forest2 <- make_savable(forestry(x, y))

saveRDS(forest1, file = "forest1.RDS")
saveRDS(forest2, file = "forest2.RDS")

forest1 <- readRDS("forest1.RDS")
forest2 <- readRDS("forest2.RDS")

# Reconstruction fails here:
relinked <- mclapply(list(forest1, forest2), relinkCPP_prt, mc.cores = 2)
relinked[[1]]@forest


# Checking that it works serially:
relinkCPP_prt(forest1)@forest

Determinism for correctedPredict

We should add a seed parameter to correctedPredict to pass on to the internal forestry calls so that corrected predictions can be deterministic.
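
A sketch of what the proposed interface might look like. The seed argument does not exist yet, and the other argument names are assumptions for illustration.

library(Rforestry)

rf <- forestry(x = iris[, -1], y = iris[, 1], ntree = 500, OOBhonest = TRUE)

# Proposed: pass a seed through to the internal forestry calls so that
# repeated calls return identical corrected predictions
p1 <- correctedPredict(rf, newdata = iris[, -1], seed = 101)
p2 <- correctedPredict(rf, newdata = iris[, -1], seed = 101)
all.equal(p1, p2)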

forestry sometimes yields error in C++ code when undefined variable is passed as argument

It would be good to change this so that we check for the error in R code and print a clearer error message. Small reproducible example:

n <- 10
p <- 10
# Generate Data
x <- matrix(runif(n*p,min=-2,max=2), ncol=p)
y <- x[,1]**3
colnames(x) <- c("V1","V2","V3","V4","V5","V6","V7","V8","V9","V10")
# Regress
rf <- forestry(x=x,
               y=y,
               ntree=500,
               seed=seed_i,
               OOBhonest = TRUE,
               scale = FALSE,
               monotonicConstraints = c(1,1,-1,rep(0,7)),
               monotoneAvg = TRUE,
               symmetric = TRUE
)

Output:

<simpleError in rcpp_cppBuildInterface(processed_x, y, categoricalFeatureCols_cpp,     linFeats, nObservations, numColumns, ntree, replace, sampsize,     mtry, splitratio, OOBhonest, doubleBootstrap, nodesizeSpl,     nodesizeAvg, nodesizeStrictSpl, nodesizeStrictAvg, minSplitGain,     maxDepth, interactionDepth, seed, nthread, verbose, middleSplit,     maxObs, featureWeights, featureWeightsVariables, deepFeatureWeights,     deepFeatureWeightsVariables, observationWeights, monotonicConstraints,     groupVector, minTreesPerGroup, monotoneAvg, hasNas, linear,     symmetric, overfitPenalty, doubleTree, TRUE, rcppDataFrame): object 'seed_i' not found>

This is a bit of a weird edge case because we're passing a variable name seed_i that does not exist in the calling environment.
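
One way to surface a clearer message, sketched here rather than taken from the package, is to force evaluation of each user-supplied argument in R before any C++ call:

# Hypothetical helper (not the package's actual code): force an argument inside
# forestry() so an undefined variable like `seed_i` fails with a readable error
# instead of surfacing deep inside rcpp_cppBuildInterface().
force_arg <- function(value, name) {
  tryCatch(
    force(value),
    error = function(e) {
      stop("forestry: argument `", name, "` could not be evaluated: ",
           conditionMessage(e), call. = FALSE)
    }
  )
}

# e.g. near the top of forestry():
#   seed <- force_arg(seed, "seed")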

[Feature requests] Precision refinements

  • Sometimes a potential split appears to break monotonicity, but if there were no precision issues it would not. To avoid this, consider implementing a threshold that permits splits that appear to break monotonicity, provided they do not break it by more than can plausibly be explained by the accumulation of imprecision. This should be a settable flag in the R call to forestry in case we want guaranteed monotonicity. It will only come into play when only poor (in terms of MSE reduction) splits are available, but below the poor split there may be better splits if better features are in the subsample of the features, etc. A rough sketch of this tolerance check appears after this list.
  • Scale and center the data before any computation is done. This is done in other packages, e.g. glmnet, as it improves precision. It should only make a difference for the outcome, as the forest depends only on the ranks of the features. If it does change the ranks of the features, that is probably a negative. Before doing this we need to make sure it doesn't affect any of the other options, e.g. enforcement of monotonicity, but I think it should be fine.
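
A rough sketch of the tolerance idea from the first bullet, written as plain R pseudocode rather than the package's C++ splitting routine:

# Hypothetical check for a monotone increasing constraint:
# left_mean / right_mean are the partition averages of a candidate split.
violates_monotonicity <- function(left_mean, right_mean,
                                  increasing = TRUE, tol = 1e-10) {
  gap <- if (increasing) left_mean - right_mean else right_mean - left_mean
  # Only flag violations larger than what accumulated floating point
  # imprecision could plausibly explain
  gap > tol
}

violates_monotonicity(1.0000000000001, 1.0)  # within tolerance -> FALSE
violates_monotonicity(1.5, 1.0)              # genuine violation -> TRUE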

Python Predictions not being rescaled when forest is trained with scale = True

We may want to remove scaling entirely, but currently the predictions are not being rescaled (for any aggregation option) when the forest is trained with scale = True.

A minimum reproducible example is as follows:

import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import mean_squared_error
from random_forestry import RandomForest

data = load_iris()
X = pd.DataFrame(data["data"], columns=data["feature_names"])
y = data["target"]

# Create a RandomForest object
fr = RandomForest(ntree=100, max_depth=5, seed=1, oob_honest=True, scale=True)

fr.fit(X.iloc[:, 1:], X.iloc[:, 0])

print("Aggregation = average")
print(np.sqrt(mean_squared_error(X.iloc[:, 0],fr.predict(X.iloc[:, 1:], aggregation="average", exact = True))))
# Predictions are not on the scale of 1st column of iris, should be scaled + centered
print(fr.predict(X.iloc[:, 1:], aggregation="average", exact = True)[[0,1,50,51,100,101]])
