
Comments (7)

jtibshirani commented on August 27, 2024

Unfortunately this isn't currently possible. It seems very useful and doesn't look too difficult to add though -- hopefully we'll be able to get to it in the next week or so!

from grf.

jeffwong commented on August 27, 2024

Great! I imagine it could improve performance a lot too.


jeffwong commented on August 27, 2024

I was looking at the code again, and I think the only place the input NumericMatrix object is used is in convert_data:

Data* RcppUtilities::convert_data(Rcpp::NumericMatrix input_data,
                                  const std::vector<std::string>& variable_names) {
  size_t num_rows = input_data.nrow();
  size_t num_cols = input_data.ncol();

  Data* data = new Data(input_data.begin(), variable_names, num_rows, num_cols);
  data->sort();
  return data;
}

Do you just need a convert_data that takes in a sparse matrix instead? I think the sparse matrix types still expose an iterator through .begin(). If you think it is a simple change, I can help with a PR, although I don't know exactly how else the Data class will use the iterator.


jtibshirani commented on August 27, 2024

Unfortunately, the Data constructor takes in an array with one item per element in the matrix, not an iterator. So I think we'll need a new subtype of Data that works on a sparse matrix (similar to this file from ranger, on which grf was originally based: https://github.com/imbs-hl/ranger/blob/master/src/DataSparse.h).
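To make the idea of a sparse-backed data class concrete, here is a minimal sketch of element lookup over compressed sparse column (CSC) storage, the layout that R's dgCMatrix uses (slots i, p, and x). The class name SparseColumnData and its interface are hypothetical; this is not the grf Data class or ranger's DataSparse.h, and it assumes row indices are sorted within each column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration only: not grf's or ranger's actual class.
// Holds a matrix in compressed sparse column (CSC) form:
//   row_indices  - row of each stored value (like the dgCMatrix slot i)
//   col_pointers - start offset of each column, plus one end marker (slot p)
//   values       - the stored nonzero values (slot x)
class SparseColumnData {
public:
  SparseColumnData(std::size_t num_rows, std::size_t num_cols,
                   std::vector<std::size_t> row_indices,
                   std::vector<std::size_t> col_pointers,
                   std::vector<double> values)
      : num_rows_(num_rows), num_cols_(num_cols),
        row_indices_(std::move(row_indices)),
        col_pointers_(std::move(col_pointers)),
        values_(std::move(values)) {
    assert(col_pointers_.size() == num_cols_ + 1);
    assert(row_indices_.size() == values_.size());
  }

  // Returns the entry at (row, col). Entries that are not stored are zero.
  double get(std::size_t row, std::size_t col) const {
    assert(row < num_rows_ && col < num_cols_);
    // Assumption: row indices are sorted within a column, so a binary
    // search over the column's slice finds the entry if it exists.
    auto first = row_indices_.begin() + col_pointers_[col];
    auto last = row_indices_.begin() + col_pointers_[col + 1];
    auto it = std::lower_bound(first, last, row);
    if (it != last && *it == row) {
      return values_[it - row_indices_.begin()];
    }
    return 0.0;
  }

  std::size_t num_rows() const { return num_rows_; }
  std::size_t num_cols() const { return num_cols_; }

private:
  std::size_t num_rows_;
  std::size_t num_cols_;
  std::vector<std::size_t> row_indices_;
  std::vector<std::size_t> col_pointers_;
  std::vector<double> values_;
};
```

Memory use here scales with the number of stored nonzeros rather than with rows times columns, at the cost of a binary search per lookup instead of a direct array index.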

If the sparse matrix is primarily there to handle one-hot encoding, you could try an alternate approach to categorical variables suggested in ESL (The Elements of Statistical Learning): represent the categories as 1, ..., n, with the categories sorted by their mean outcome. To be true to the recommendation, we should perform a new ordering at every split (and likely take the mean of gradients rather than outcomes), but a single up-front ordering may work pretty well in the short term.
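As a rough sketch of that encoding, the function below computes the mean outcome per category once, up front, and maps each category to its rank from 1 to n. The function name and signature are hypothetical and not part of grf. It uses outcome means, not gradient means, and does not re-order at each split, so it corresponds to the short-term variant only. Categories with no observations are placed last, which is an arbitrary choice made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical helper, not a grf function.
// categories[i] is the 0-based category of observation i,
// outcomes[i] is its outcome. Returns rank[c] in 1..num_categories,
// where rank 1 is the category with the smallest mean outcome.
std::vector<std::size_t> rank_categories_by_mean_outcome(
    const std::vector<std::size_t>& categories,
    const std::vector<double>& outcomes,
    std::size_t num_categories) {
  assert(categories.size() == outcomes.size());

  // Accumulate per-category sums and counts.
  std::vector<double> sums(num_categories, 0.0);
  std::vector<std::size_t> counts(num_categories, 0);
  for (std::size_t i = 0; i < categories.size(); ++i) {
    assert(categories[i] < num_categories);
    sums[categories[i]] += outcomes[i];
    ++counts[categories[i]];
  }

  // Empty categories get an infinite mean so that they sort last.
  std::vector<double> means(num_categories);
  for (std::size_t c = 0; c < num_categories; ++c) {
    means[c] = counts[c] > 0
                   ? sums[c] / static_cast<double>(counts[c])
                   : std::numeric_limits<double>::infinity();
  }

  // Sort category ids by mean; stable_sort keeps ties in id order.
  std::vector<std::size_t> order(num_categories);
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(),
                   [&means](std::size_t a, std::size_t b) {
                     return means[a] < means[b];
                   });

  // Invert the ordering: position in the sorted list becomes the rank.
  std::vector<std::size_t> rank(num_categories);
  for (std::size_t pos = 0; pos < num_categories; ++pos) {
    rank[order[pos]] = pos + 1;
  }
  return rank;
}
```

Replacing each observation's category with rank[category] gives a single numeric column that a tree can split on with ordinary threshold splits, in place of n one-hot columns.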


jtibshirani commented on August 27, 2024

Closing, as we've now added support for passing in a sparse matrix of type dgCMatrix.

This should help cut down on memory usage, but we still have a lot of work to do to improve speed when there is a large number of features. I also wanted to note that we now set a much more reasonable default for mtry, the number of parameters to consider in each split (#121). So it's worth upgrading to release v0.9.4, or setting mtry explicitly.


laixx214 commented on August 27, 2024

Hi, I am wondering whether support for sparse matrices as inputs to the grf functions is still being implemented. I am using a moderately large dataset (~5G) and realized that causal_forest takes a lot of memory. Being able to use a sparse matrix could help reduce memory consumption and speed up processing. Any information will be highly appreciated! Thanks a lot!


erikcs commented on August 27, 2024

Hi @laixx214, see #939

