
Comments (10)

m-clark avatar m-clark commented on June 9, 2024 1

Thanks, and glad you enjoyed it!

I haven't played with Julia's MixedModels recently, but I do follow its development and know that it is still actively developed (by one of the primary authors of R's lme4). I'd be surprised if it showed a notable gain, and if it did, I'd wonder why the improvement wasn't also applied to lme4 given the development entanglement, but that impression is based on very dated knowledge at this point. From my recollection of lme4 issues/development, some of the problem parts are difficult to parallelize, and beyond that, very little of lme4 is even written in R to begin with, so aside from parallelization it's not going to get much faster than the underlying C code and the computational tricks employed there.

For Python I'm mostly aware of the statsmodels implementation (my former boss contributed a lot to it, but now seems to be playing with Julia 😄). I'm not aware of any speed advantage of the statsmodels implementations over R; in previous testing (also long ago now), it actually seemed a bit slower.

Other alternatives would be something like doing a GLM in pytorch or similar. The main difference is that the fixed effects are penalized by default as well, though you could fiddle with that. The larger issue would be doing anything with the model after the fact, since you would lack all the easy diagnostic and model-exploration tools, but that too could be overcome, and may be worth it if the data is extremely large.
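The penalization point can be sketched without any ML library (hypothetical toy data; not the pytorch route itself, just the same idea): with only group intercepts, ridge-penalizing the group coefficients yields exactly the shrunken group means a random intercept would produce, with the penalty weight playing the role of the variance ratio.

```python
from collections import defaultdict

# Toy data: (group, y) pairs; group "b" has only one observation
data = [("a", 1.0), ("a", 2.0), ("a", 3.0), ("a", 2.0), ("b", 10.0)]

# Ridge-penalized group intercepts (no other covariates):
# minimizing sum (y - b_g)^2 + lam * sum b_g^2 gives
# b_g = sum_g / (n_g + lam), i.e. shrinkage toward 0,
# the BLUP form for a random intercept with lam = sigma^2 / sigma_g^2
lam = 1.0
stats = defaultdict(lambda: [0, 0.0])
for g, y in data:
    stats[g][0] += 1
    stats[g][1] += y

b = {g: s / (n + lam) for g, (n, s) in stats.items()}
print(b)  # {'a': 1.6, 'b': 5.0} -- the small group is shrunk hardest
```

Setting `lam = 0` recovers plain per-group means (fully "fixed" group effects), which is the default-penalization behavior the comment above alludes to, just in reverse.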

from mixed-models-with-r.

m-clark avatar m-clark commented on June 9, 2024 1

Definitely keep me posted, sounds great!


AdrianAntico avatar AdrianAntico commented on June 9, 2024

Thanks for the responses! Given the data size limitations I assume sampling is a natural next step. Do you have any advice or resources about optimal sampling strategies for these types of models? For example, would it be better to sample a list of levels from the random effects and then make sure to include full history of them, or run some sort of stratified sampling over the random effect levels?


m-clark avatar m-clark commented on June 9, 2024

I don't have any real intuition there, though in the past I think I have sampled so as to include all levels, since problems may arise when levels are missing, depending on the tool used, though that's not necessarily a good reason. I'd suggest consulting someone better versed in survey-sampling approaches for an alternative view.
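One way to make the "keep every level" idea concrete (purely illustrative, stdlib only, hypothetical field names): stratify on the random-effect level and sample within each stratum, so no level is dropped and tools that fail on unseen levels stay happy.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, min_per_level=1, seed=0):
    """Sample ~frac of rows within each random-effect level,
    keeping at least min_per_level rows so no level disappears."""
    rng = random.Random(seed)
    by_level = defaultdict(list)
    for row in rows:
        by_level[row[key]].append(row)
    out = []
    for level_rows in by_level.values():
        k = max(min_per_level, int(len(level_rows) * frac))
        out.extend(rng.sample(level_rows, min(k, len(level_rows))))
    return out

rows = [{"id": i, "grp": i % 7} for i in range(1000)]
sub = stratified_sample(rows, key="grp", frac=0.1)
print(len({r["grp"] for r in sub}))  # 7 -- all levels retained
```

The alternative the question raises (sample levels, then keep their full history) is the same loop with `frac=1.0` applied to a random subset of `by_level`'s keys; which is preferable likely depends on whether the variance components or the within-level trajectories matter more.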

But I came across this yesterday regarding Julia's MixedModels, which sounds great, so maybe try it first! I'm not entirely convinced it's faster than lme4, and I'd be concerned about one tool converging where the other doesn't, but it sounds like Julia worked well in their very large setting (millions of observations with at least two random effects).



AdrianAntico avatar AdrianAntico commented on June 9, 2024

@m-clark I figured out a method for generating predictions for a couple of model structures so far. The biggest data I was able to use, given a 256GB memory limit, was a 500M-record data set with two random effects: one with 50M levels and the other with 5M levels. It took 3.9 minutes to run and return the predictions. Do you think there would be some interest in that?
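This is not the method described above, but it illustrates why prediction at this scale can be cheap in principle: once fixed-effect coefficients and per-level offsets are estimated, each prediction is a few hash lookups plus a dot product, so the work scales linearly in rows regardless of how many levels exist (all names and numbers below are hypothetical).

```python
# Hypothetical fitted quantities: fixed effects plus per-level offsets
fixed = {"intercept": 1.5, "x": 0.8}
re_store = {"store_17": 0.3, "store_42": -0.2}  # 50M entries in practice
re_item = {"item_9": 0.1}                       # 5M entries in practice

def predict(x, store, item):
    # Unseen levels fall back to 0.0, i.e. the population-level prediction
    return (fixed["intercept"] + fixed["x"] * x
            + re_store.get(store, 0.0) + re_item.get(item, 0.0))

print(round(predict(2.0, "store_17", "item_9"), 2))  # 3.5
print(round(predict(2.0, "store_99", "item_9"), 2))  # 3.2 (unseen store)
```

At 500M rows the per-row cost is constant, so the dominant expense is estimating the offsets, not scoring with them.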


AdrianAntico avatar AdrianAntico commented on June 9, 2024

@m-clark I went ahead and submitted an issue to the lme4 GitHub repo and tagged Bolker. I'll keep you posted about any responses, if you're interested.


AdrianAntico avatar AdrianAntico commented on June 9, 2024

Here's a link to the issue. I shared the idea with some supporting materials.

lme4/lme4#696


bbolker avatar bbolker commented on June 9, 2024

I am interested in issues of performance on large data sets (despite my lack of interest in incorporating the particular proposal in the lme4 package). I would be happy to work on developing a list of references/pointers to go in the GLMM FAQ if that is an appropriate venue, or elsewhere ...

From the questions above it seems you are most interested in fast fitting/inference on in-memory data, rather than recipes that will handle out-of-memory data (map/reduce etc.), typically with a corresponding cost in time ...
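For contrast with the in-memory case, a toy map/reduce sketch (stdlib only, hypothetical data): each chunk contributes per-level sufficient statistics (count, sum, sum of squares), and merging those gives level means and variances without ever holding the full data in memory, in the spirit of the moment-based methods referenced below.

```python
from collections import defaultdict

def map_chunk(chunk):
    """Map step: per-level (n, sum, sum_sq) for one chunk of (level, y) pairs."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])
    for level, y in chunk:
        s = stats[level]
        s[0] += 1
        s[1] += y
        s[2] += y * y
    return stats

def reduce_stats(a, b):
    """Reduce step: merge two partial-statistics dicts into a."""
    for level, (n, s, ss) in b.items():
        t = a.setdefault(level, [0, 0.0, 0.0])
        t[0] += n; t[1] += s; t[2] += ss
    return a

# Two "chunks" stand in for pieces of an out-of-memory data set
chunks = [[("a", 1.0), ("b", 2.0)], [("a", 3.0), ("b", 4.0)]]
total = {}
for c in chunks:
    total = reduce_stats(total, map_chunk(c))
means = {lvl: s / n for lvl, (n, s, ss) in total.items()}
print(means)  # {'a': 2.0, 'b': 3.0}
```

The time cost the comment mentions shows up as multiple passes and merge overhead; the payoff is a memory footprint proportional to the number of levels rather than the number of rows.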

Some possibly interesting references:


Gao, Katelyn. “Scalable Estimation and Inference for Massive Linear Mixed Models with Crossed Random Effects.” PhD Thesis, Stanford University, 2017. https://statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdf.

Gao, Katelyn, and Art Owen. “Efficient Moment Calculations for Variance Components in Large Unbalanced Crossed Random Effects Models.” Electronic Journal of Statistics 11, no. 1 (2017): 1235–96. https://doi.org/10.1214/17-EJS1236.

Gao, Katelyn, and Art B. Owen. “Estimation and Inference for Very Large Linear Mixed Effects Models.” Statistica Sinica, 2020. https://doi.org/10.5705/ss.202018.0029.

Stitch Fix Technology. “Diamond: Python Solver for Mixed-Effects Models.” 2017. https://github.com/stitchfix/diamond.

Sweetser, Tim, and Aaron Bradley. “Diamond Part II.” Stitch Fix Technology: Multithreaded, August 7, 2017. https://multithreaded.stitchfix.com/blog/2017/08/07/diamond2/.


AdrianAntico avatar AdrianAntico commented on June 9, 2024

My thinking is that fast inference could open doors for mixed-effects frameworks that don't exist today. Consider the "Follow the Regularized Leader" (FTRL) modeling framework that Google and others use. I would think mixed-effects frameworks could be beneficial to systems like these, and considering how much revenue is generated in ad tech, I would expect some research dollars behind exploring these methods (assuming that hasn't already happened).

The distributed computing aspect sounds interesting. I would think something like that would be needed for a model such as predicting the effects of COVID vaccinations across a population.

Anyway, thanks for the references. I'm going to dig into them and see what kind of performance improvements I can squeeze out.

