Comments (7)
Unfortunately this isn't currently possible. It seems very useful and doesn't look too difficult to add though -- hopefully we'll be able to get to it in the next week or so!
from grf.
Great! I imagine it could improve the performance a lot too
from grf.
I was looking at the code again and I think the only time the input NumericMatrix object is used is in convert_data:

    Data* RcppUtilities::convert_data(Rcpp::NumericMatrix input_data,
                                      const std::vector<std::string>& variable_names) {
      size_t num_rows = input_data.nrow();
      size_t num_cols = input_data.ncol();
      Data* data = new Data(input_data.begin(), variable_names, num_rows, num_cols);
      data->sort();
      return data;
    }

Do you just need a convert_data that takes in a sparse matrix instead? I think the sparse matrix types will still have an iterator like .begin(). If you think it is a simple change I can even help with a PR, although I don't know exactly how else the class Data will use the iterator.
from grf.
Unfortunately the Data constructor takes in an array with one item per element in the matrix, not an iterator. So I think we'll need a new subtype of Data that works based on a sparse matrix (similar to this file from ranger, which is what grf is originally based on: https://github.com/imbs-hl/ranger/blob/master/src/DataSparse.h).
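To make the idea of a sparse-backed data type concrete, here is a minimal sketch of element lookup over a compressed sparse column (CSC) layout, which is how R's dgCMatrix stores its i, p, and x slots. The class name SparseColumnData and its methods are hypothetical and written only for illustration; this is not grf's Data interface or ranger's DataSparse.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch, not grf's actual API: a read-only matrix stored in
// compressed sparse column (CSC) form. Only non-zero entries are kept.
class SparseColumnData {
public:
  SparseColumnData(std::vector<int> row_indices,      // 0-based row of each stored value (slot "i")
                   std::vector<int> column_pointers,  // start offset of each column, length ncol + 1 (slot "p")
                   std::vector<double> values,        // stored values (slot "x")
                   std::size_t num_rows)
      : row_indices_(std::move(row_indices)),
        column_pointers_(std::move(column_pointers)),
        values_(std::move(values)),
        num_rows_(num_rows) {
    assert(!column_pointers_.empty());
    assert(row_indices_.size() == values_.size());
  }

  std::size_t num_rows() const { return num_rows_; }
  std::size_t num_cols() const { return column_pointers_.size() - 1; }
  std::size_t num_stored() const { return values_.size(); }

  // Returns the entry at (row, col). Entries that are not stored are zero.
  double get(std::size_t row, std::size_t col) const {
    assert(row < num_rows_ && col < num_cols());
    // The stored entries of column `col` occupy offsets [begin, end).
    const int begin = column_pointers_[col];
    const int end = column_pointers_[col + 1];
    for (int k = begin; k < end; ++k) {
      if (static_cast<std::size_t>(row_indices_[k]) == row) {
        return values_[k];
      }
    }
    return 0.0;
  }

private:
  std::vector<int> row_indices_;
  std::vector<int> column_pointers_;
  std::vector<double> values_;
  std::size_t num_rows_;
};
```

The linear scan keeps the sketch short. Since row indices are sorted within each column, a binary search would be the natural refinement. Storage grows with the number of non-zero entries rather than with rows times columns, which is where the memory savings for one-hot encoded features would come from.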
If the sparse matrix is primarily to handle the one-hot encoding, you could try an alternate approach to handling categorical variables suggested in ESL: represent the categories as integers 1..n, with the categories sorted by their mean outcome. For this to be true to the recommendation, we should perform a new ordering at every split (and likely take the mean of gradients and not outcomes), but this may work pretty well in the short term.
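A rough sketch of that encoding, computed once up front rather than at every split, and using raw outcomes rather than gradients. The function names are hypothetical and not part of grf. The sketch assumes categories are already labeled 0..K-1, and it gives categories with no observations a mean of 0, which is an arbitrary choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of grf: for category labels 0..K-1, returns
// rank[c] in 1..K, where categories are ordered by their mean outcome.
std::vector<int> rank_categories_by_mean_outcome(
    const std::vector<int>& categories,
    const std::vector<double>& outcomes,
    std::size_t num_categories) {
  assert(categories.size() == outcomes.size());

  // Accumulate the outcome sum and observation count for each category.
  std::vector<double> sums(num_categories, 0.0);
  std::vector<std::size_t> counts(num_categories, 0);
  for (std::size_t i = 0; i < categories.size(); ++i) {
    const std::size_t c = static_cast<std::size_t>(categories[i]);
    assert(c < num_categories);
    sums[c] += outcomes[i];
    ++counts[c];
  }

  // Mean outcome per category. Assumption: empty categories get mean 0.
  std::vector<double> means(num_categories, 0.0);
  for (std::size_t c = 0; c < num_categories; ++c) {
    if (counts[c] > 0) {
      means[c] = sums[c] / static_cast<double>(counts[c]);
    }
  }

  // Order category labels by mean outcome; ties keep their label order.
  std::vector<std::size_t> order(num_categories);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&means](std::size_t a, std::size_t b) {
                     return means[a] < means[b];
                   });

  // Invert the ordering so each label maps to its 1-based rank.
  std::vector<int> rank(num_categories, 0);
  for (std::size_t r = 0; r < order.size(); ++r) {
    rank[order[r]] = static_cast<int>(r) + 1;
  }
  return rank;
}

// Replaces each observation's category label with that category's rank,
// giving a single ordered numeric column in place of K one-hot columns.
std::vector<int> encode_categories(const std::vector<int>& categories,
                                   const std::vector<int>& rank) {
  std::vector<int> encoded;
  encoded.reserve(categories.size());
  for (int c : categories) {
    encoded.push_back(rank[static_cast<std::size_t>(c)]);
  }
  return encoded;
}
```

The encoded column can then be passed as an ordinary numeric feature, so a threshold split on it separates low-mean categories from high-mean ones.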
from grf.
Closing, as we've now added support for passing in a sparse matrix of type dgCMatrix.
This should help cut down on memory usage, but we still have a lot of work to do to improve speed when there is a large number of features. I also wanted to note that we now set a much more reasonable default for mtry, the number of variables to consider in each split (#121). So it's worth upgrading to release v0.9.4, or setting mtry explicitly.
from grf.
Hi, I am wondering if allowing sparse matrices as inputs for the grf functions is still being implemented. I am using a moderately large dataset (~5G) and realized that causal_forest takes a lot of memory. Being able to use a sparse matrix could help reduce memory consumption and speed up processing. Any information will be highly appreciated! Thanks a lot!
from grf.