Comments (10)
Thanks, and glad you enjoyed it!
I haven't played with Julia's MixedModels recently, but I do follow its development and know it is still actively developed (by one of the primary authors of R's lme4). I'd be surprised if it had a notable speed gain, and if it did, given the development entanglement I'd wonder why the same improvements weren't also applied to lme4, but that's based on very dated knowledge at this point. From my recollection of lme4 issues/development, some of the hard parts of the problem are difficult to parallelize, and beyond that, very little of lme4 is written in R to begin with, so aside from parallelization it's not going to get much faster than the underlying compiled code and the computational tricks employed there.
For Python I'm mostly aware of the statsmodels implementation (my former boss contributed a lot to it, but now seems to be playing with Julia 😄). I'm not aware of any speed advantage of statsmodels over R; in fact, in previous testing (also long ago now) it seemed a bit slower.
Another alternative would be something like fitting the model as a penalized GLM in PyTorch or similar. The main difference is that the fixed effects are penalized by default as well, though you could fiddle with that. The larger issue would be doing anything with the model after the fact, since you'd lack all the easy diagnostic and model-exploration tools, but that too could be overcome, and it may be worth it if the data is extremely large.
from mixed-models-with-r.
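To make the penalized-GLM idea above concrete: a random-intercept model is equivalent to ridge regression in which only the dummy-coded group columns are penalized (with penalty σ²ₑ/σ²ᵤ), while the fixed effects are left unpenalized. A minimal numpy sketch on simulated data, no PyTorch required; all names and values here are illustrative, not from any of the packages discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one fixed slope, 20 group-level random intercepts
n_groups, n_per = 20, 30
g = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=g.size)
u = rng.normal(scale=0.8, size=n_groups)   # true random intercepts
y = 1.0 + 0.5 * x + u[g] + rng.normal(scale=0.5, size=g.size)

# Design: fixed effects [intercept, x] plus one-hot group dummies
X = np.column_stack([np.ones(g.size), x])
Z = np.zeros((g.size, n_groups))
Z[np.arange(g.size), g] = 1.0
A = np.hstack([X, Z])

# Penalize ONLY the random-effect columns; lambda = sigma2_e / sigma2_u
# in the classical mixed-model correspondence. Fixed effects unpenalized.
lam = (0.5 / 0.8) ** 2
D = np.diag(np.r_[np.zeros(X.shape[1]), np.full(n_groups, lam)])
beta = np.linalg.solve(A.T @ A + D, A.T @ y)

fixed, ranef = beta[:2], beta[2:]
print("fixed effects:", fixed.round(2))
```

The `ranef` values here coincide with the BLUPs a mixed-model fit would produce when the variance components are known; in practice you'd have to estimate or tune the penalty.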
Definitely keep me posted, sounds great!
Thanks for the responses! Given the data size limitations I assume sampling is a natural next step. Do you have any advice or resources about optimal sampling strategies for these types of models? For example, would it be better to sample a list of levels from the random effects and then make sure to include full history of them, or run some sort of stratified sampling over the random effect levels?
I don't have any real intuition there, though in the past I think I have sampled so as to include all levels, since problems may arise with some tools when not all levels are present; that's not necessarily a good reason, though. I'd suggest consulting someone more well versed in survey-sampling approaches for an alternative view.
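For what it's worth, the two strategies raised in the question are easy to prototype. A toy plain-Python sketch on hypothetical data, contrasting sampling whole levels (keeping each sampled level's full history) with stratified sampling of rows within every level:

```python
import random

random.seed(1)

# Toy records: (group_level, row_id) pairs; group sizes vary
records = [(g, i) for g in range(100) for i in range(random.randint(5, 50))]

# Strategy A: sample a subset of levels, keep their full history
levels = sorted({g for g, _ in records})
kept_levels = set(random.sample(levels, k=30))
sample_a = [r for r in records if r[0] in kept_levels]

# Strategy B: stratified -- keep a fraction of rows within every level
by_level = {}
for r in records:
    by_level.setdefault(r[0], []).append(r)
sample_b = []
for g, rows in by_level.items():
    k = max(1, int(0.3 * len(rows)))
    sample_b.extend(random.sample(rows, k))

print(len({g for g, _ in sample_a}), "levels in A;",
      len({g for g, _ in sample_b}), "levels in B")
```

Strategy A shrinks the random-effect dimension (fewer levels to estimate) at the cost of generalizing from a subset of levels; Strategy B keeps every level present, which sidesteps the tooling issue mentioned above, but thins the per-level history.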
But I came across this yesterday regarding Julia's MixedModels, which sounds great, so maybe try it first! I'm not entirely convinced it's faster than lme4, and I'd be concerned about cases where one tool converges and the other doesn't, but it sounds like Julia worked well in their very large situation (millions of observations with at least two random effects).
@m-clark I figured out a method for generating predictions for a couple of model structures thus far. The largest data set I was able to use, given a 256GB memory limit, was 500m records with two random effects: one with 50m levels and the second with 5m levels. It took 3.9 minutes to run and return the predictions. Do you think there would be some interest in that?
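I don't know the specifics of that method, but for anyone following along, this is why prediction can scale so well: once the fixed coefficients and the random-effect tables are estimated, a prediction under a model with two crossed random intercepts is just vectorized index lookups, with cost independent of the number of levels. A minimal numpy sketch with made-up coefficients:

```python
import numpy as np

# Hypothetical estimated parameters (made up for illustration)
fixed_intercept, fixed_slope = 1.2, 0.4
u1 = np.array([0.3, -0.1, 0.2])    # random intercepts, factor 1 (3 levels)
u2 = np.array([-0.2, 0.5])         # random intercepts, factor 2 (2 levels)

# New data: covariate plus integer-coded levels for each factor
x  = np.array([0.0, 1.0, 2.0, 3.0])
g1 = np.array([0, 1, 2, 0])
g2 = np.array([0, 1, 0, 1])

# Prediction = fixed part + indexed random effects: O(n) array lookups,
# regardless of whether the factors have 3 levels or 50 million
y_hat = fixed_intercept + fixed_slope * x + u1[g1] + u2[g2]
print(y_hat)  # [1.3 2.  2.  3.2]
```

The hard part is estimating `u1` and `u2` in the first place; applying them afterward is cheap even at the 500m-record scale described above.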
@m-clark I went ahead and submitted an issue to the lme4 github repo and tagged Bolker. I'll keep you posted about any responses, if you're interested.
Here's a link to the issue. I shared the idea with some supporting materials.
I am interested in issues of performance on large data sets (despite my lack of interest in incorporating the particular proposal into the lme4 package). I would be happy to work on developing a list of references/pointers to go in the GLMM FAQ, if that is an appropriate venue, or elsewhere ...
From the questions above it seems you are most interested in fast fitting/inference on in-memory data, rather than recipes that will handle out-of-memory data (map/reduce etc.), typically with a corresponding cost in time ...
Some possibly interesting references:
Gao, Katelyn. “Scalable Estimation and Inference for Massive Linear Mixed Models with Crossed Random Effects.” PhD Thesis, Stanford University, 2017. https://statweb.stanford.edu/~owen/students/KatelynGaoThesis.pdf.
Gao, Katelyn, and Art Owen. “Efficient Moment Calculations for Variance Components in Large Unbalanced Crossed Random Effects Models.” Electronic Journal of Statistics 11, no. 1 (2017): 1235–96. https://doi.org/10.1214/17-EJS1236.
Gao, Katelyn, and Art B. Owen. “Estimation and Inference for Very Large Linear Mixed Effects Models.” Statistica Sinica, 2020. https://doi.org/10.5705/ss.202018.0029.
“Diamond: Python Solver for Mixed-Effects Models.” Stitch Fix Technology, September 20, 2017. https://github.com/stitchfix/diamond.
Sweetser, Tim, and Aaron Bradley. “Diamond Part II.” Stitch Fix Technology: Multithreaded, August 7, 2017. https://multithreaded.stitchfix.com/blog/2017/08/07/diamond2/.
My thinking is that fast inference could open doors for mixed-effects frameworks that don't exist today. Consider the "Follow the Regularized Leader" (FTRL) modeling framework that Google and others use. I would think mixed-effects approaches could be beneficial to systems like these, and considering how much revenue is generated in the world of ad tech, I would expect some research dollars behind exploring these methods (assuming there aren't already).
The distributed computing aspect sounds interesting. I would think something like that would be needed to run a model such as predicting the effects of covid vaccinations on people.
Anyway, thanks for the references. I'm going to dig into them and see what kind of performance improvements I can squeeze out of them.