Comments (13)
Alright, now everything works the same in @develop as well. The last commit addresses this issue (I checked your model manually @andymilne). I am also thinking about allowing the user to pass a custom exploration function for the search; I'll see if there's a nice way to build the function internally as much as possible.
from projpred.
The plots indicate that there must be something fishy in the projections. The output you get from `cv_varsel` indicates that the reference model may be ill behaved, given that some log likelihoods are infinite. The very large standard deviations indicate that the projections are a bit too vague. What is your reference model?
The reference model is my own (not sure if that's what you are asking) -- it is described in my post, but I attach the full summary for clarity:
Model Info:
     function:     stan_glm
     family:       binomial [logit]
     formula:      cbind(tap, n_pulses - tap) ~ cue * (tap_l_1 + presentation_sc +
                      repetition_sc + N_sc + K_sc + mean_IOI_sc + evenness_sc +
                      all_ent_sc + step_ent_sc + SQ_sc + CQ_sc + balance_sc + is_high_prime +
                      reg_exp_non_zero_sc + bal_wt_sc + height_sc + deja_vu2_sc +
                      APM_sc + edge_sc)
     algorithm:    sampling
     sample:       4000 (posterior sample size)
     priors:       see help('prior_summary')
     observations: 48282
     predictors:   40
It fits relatively quickly, with good `n_eff`s and `Rhat`s, and the `bayes_R2` is 0.92. The `pp_check` is a bit off, but that is because I am not including random effects in the ref model, and ideally the family would be beta-binomial (the fit to the data when using both of those is very good).
Yeah, I was referring to the reference model's posterior.
Given that it does not include group effects, the default search method is L1 search, so you can try `varsel(fit, method = "forward")` and see whether its output differs (forward search is usually a bit more accurate, though more computationally demanding). Another thing to check is the log likelihood of the reference model: extract the refmodel object with `ref <- get_refmodel(fit)` and look at `ref$loglik`, which, judging from `cv_varsel`'s output, likely contains some `Inf` or `NaN` values. If that's the case, you can try narrower priors on your reference model (horseshoe?) and look that way. In my experience with some binomial models, if the model is too sure of some observations, that can become problematic for the projections (to see this, inspecting `ref$mu`, the mean prediction from your reference model, is usually helpful). You can also take a look at `vsel$search_path$sub_fits`, where `vsel` is the output of `varsel` or `cv_varsel`.
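To make the infinite-log-likelihood point concrete, here is a self-contained base-R illustration with toy numbers of my own (no projpred or the actual model involved): once a fitted success probability saturates at exactly 0 or 1, any observation that disagrees with it gets a log likelihood of `-Inf`.

```r
# Toy illustration (base R only): a binomial log likelihood becomes -Inf
# as soon as the fitted probability is numerically degenerate while the
# observed count disagrees with it. This is the kind of over-confidence
# that infinite values in ref$loglik typically signal.
ll_ok  <- dbinom(5, size = 10, prob = 0.999, log = TRUE)  # very confident, still finite
ll_bad <- dbinom(5, size = 10, prob = 1.0,   log = TRUE)  # saturated: -Inf
c(ll_ok, ll_bad)
```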
I'm sorry my answer is a bit exploratory, I'm trying to give you detailed feedback based on what I've encountered when working with these kinds of models. I hope this helps in narrowing the issue.
Thanks -- I will get back to you when I get a chance to try these.
If the varsel plot is correct, it indicates that the reference model is really bad (worse than a model without any variables). So I would definitely check that first: estimate the performance of 1) the reference model and 2) the null model (no variables), both on the training data and using cross-validation (either LOO or k-fold), to understand what's going on with the reference model. This check is separate from projpred.
If it shows that the reference model is actually good (at least better than the null model), then we can suspect that something is wrong with projpred, but based on that varsel plot alone one cannot really say much.
Juho -- the ref model is definitely good (see my comments to Alejandro above) and here's the loo comparison with the null (intercept only) model:
Model comparison based on LOO-CV:
                              elpd_diff  se_diff
    mdl_pluse_tap_ref_rstan         0.0      0.0
    mdl_pluse_tap_null_rstan  -128755.3    891.1
I am happy to make the rstanarm reference model available if useful.
Alejandro -- `method = "forward"` produces essentially identical results.
There are `-Inf`s in `ref$loglik` but no `NaN`s; I don't understand what that means or implies. I am refitting the model with a horseshoe prior, but that is going very slowly (it's been running through the night and is only at 200 iterations; with `student_t(3, 0, 1)` priors, the model completes in about 15 minutes). I don't understand the meaning of `ref$mu` and `vsel$search_path$sub_fits`, so I don't know what to look for there.
I believe the issue lies in the projections, due to the singular (multicollinear) variables. In a nutshell, I believe this happens because the inner method implemented in develop tries to achieve maximum likelihood estimates as close as possible to the true ones without adding regularization, which in cases like this helps stabilize the estimates. To test this hypothesis, you can run the same model with master's `varsel` method and see if the output is different. If those projections behave correctly, it would point in this direction.
Regarding the reference model, it seems that the posterior would be quite vague (large uncertainty in the parameters due to the singularities), and therefore the horseshoe prior should help the projections by removing some collinearities. While the reference model itself can still behave properly, the projections suffer much more, so I would expect the resulting reference model to behave better (it's still a hypothesis, though).
I'm going to play around with this, and it would be helpful to have this model if possible; I will also try to simulate very correlated data myself so I have something to test. Thanks for the feedback!
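As a self-contained sketch of the collinearity hypothesis (simulated data of my own, nothing from the actual model): with nearly identical predictors, the unregularized ML estimates are individually unstable, and the design matrix's condition number flags the problem.

```r
# Assumption: this toy setup mimics the singularities in the real data.
# x2 is almost identical to x1, so only the sum of their coefficients is
# well determined; the individual ML estimates are unstable.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 1e-3)           # nearly collinear with x1
y  <- rbinom(n, 1, plogis(x1))           # outcome driven by x1 only
fit_ml <- glm(y ~ x1 + x2, family = binomial)
coef(fit_ml)                              # unstable, offsetting estimates
kappa(model.matrix(fit_ml))               # very large condition number
```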
Yes, the results from the non-dev version of projpred look much more reasonable! I will email you a link to the model.
An unrelated question: given that, for this model, any given term and its interaction must both be included, is there a way to enforce this in projpred, or a principled workaround such as iteratively changing the penalty values in the call to `varsel` or `cv_varsel`?
With regard to interactions of the population terms, the most common scenario would be where the inclusion of any interaction with a given term requires all lower-order interactions of that term, and the term itself, also to be included. E.g., if the ref model has `X1*X2*X3`, including `X1:X2:X3` requires `X1:X2`, `X1:X3`, `X2:X3`, `X1`, `X2`, and `X3` also to be included. This is required to ensure the predictors' coefficients are invariant to shifts in the predictors' values.
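For reference, base R's formula machinery encodes exactly this hierarchy: `terms()` expands a crossed formula into main effects and all lower-order interactions.

```r
# terms() expands X1*X2*X3 into the full hierarchy, ordered by
# interaction depth: main effects, two-way, then three-way terms.
attr(terms(~ X1 * X2 * X3), "term.labels")
# "X1" "X2" "X3" "X1:X2" "X1:X3" "X2:X3" "X1:X2:X3"
```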
However, in some scenarios, additional inclusion "implications" can be important for interpretive purposes. For example, the ref model discussed here takes the form `X1*(X2 + X3 + X4 + ...)`, and my requirement would be that if any of the terms inside the bracket are included, the interaction with `X1` should also be included. E.g., if `X2` is included, `X1:X2` should also be included. I am not sure how such custom implications could best be specified; perhaps the user could add, as an argument, a list of pairs of variables meaning that if the first variable is included by `varsel`/`cv_varsel`, the second must also be included. E.g., `c("X2, X1:X2", "X3, X1:X3", "X4, X1:X4")`.
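A minimal sketch of how such implication pairs could be enforced on a selected term set (plain R; the helper name and pair format are hypothetical, not part of projpred):

```r
# Hypothetical helper: repeatedly add the second element of each pair
# whenever the first is already selected, until the set is closed under
# the implications.
enforce_implications <- function(selected, implications) {
  repeat {
    added <- FALSE
    for (imp in implications) {
      if (imp[1] %in% selected && !(imp[2] %in% selected)) {
        selected <- c(selected, imp[2])
        added <- TRUE
      }
    }
    if (!added) return(selected)
  }
}

implications <- list(c("X2", "X1:X2"), c("X3", "X1:X3"), c("X4", "X1:X4"))
sel <- enforce_implications(c("X1", "X3"), implications)
sel  # "X1" "X3" "X1:X3"
```

A search step could call such a closure after each candidate expansion, so that every implied interaction enters together with its trigger term.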
That's an interesting proposal. Currently what we do is, at any given point, expand the current projection in every possible direction that makes sense. I'm thinking such a function could be passed in by the user. I will also try to think about more user-friendly ways of providing the required information, so that we can do the expansion internally while asking only minimal input from the user.
Thanks for posting those cv-results. Looks sensible.

> Yes, the results from the non-dev version of projpred look much more reasonable!

Ok, if this problem happens only with the dev version, then Alejandro can help more. This should be a good case for debugging since, as far as I understand, one should get the same result with both versions (as it's just a simple glm, no hierarchy or anything).