Comments (11)
Can you show what varsel_plot looks like for the differences in elpd? So try something like:

vs <- varsel(...)
varsel_plot(vs, deltas = T)

If the error bars are so small that you cannot see them properly, you can zoom in to see where they start overlapping zero, e.g. varsel_plot(vs, deltas = T) + coord_cartesian(ylim = c(-3, 3)).
from projpred.
They all fall well below the line, which seems odd given that the model at the end with 12 predictors is the reference model.
I think this raises two questions. First, why is the 12-predictor model's ELPD so miscalibrated relative to the baseline? (This was run with the default settings for varsel, and the same results are obtained when setting ns to the number of samples in the model, 8000.) Second, if such miscalibrations are common, would it be safer for the default, when using suggest_size (and possibly elsewhere), to be baseline = "best" instead of "ref"?
The zoomed difference plot seems about what I expected, and it shows suggest_size is actually doing what it should based on these numbers.

What's going on here is that the differences between the models at sizes 6-12 are very small (something like 5 units in elpd, so per data point 5/n ≈ 0.001), but due to the large number of observations the uncertainties are also very small. The reason why the projected model performance at 12 variables does not match that of the reference model is most likely the clustered projection (the default used by the package), which causes a small inaccuracy when projecting the reference model onto itself. Combined with the default rule used by suggest_size, that can unfortunately occasionally cause behaviour like this.

You could confirm this by running the selection and plots again with the argument nspred set exactly to, or close to, the number of posterior draws you have in the reference model (so something like varsel(..., nspred = 4000)). That should get rid of the projection inaccuracies.

I would not recommend baseline = "best" as the default since it has its own problems, but since this problem has come up earlier as well, I might consider changing the package default to nspred = 1000 or something. That will make the computation much slower for other than Gaussian models, especially in cv_varsel, but that's the price of more accurate computation. In any case, I would advise not trusting suggest_size blindly, since it is just a heuristic rule no matter what. It always makes sense to study the results manually.
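For readers landing here later, the manual check recommended above can be sketched roughly as follows. This assumes projpred's pre-2.0 interface (varsel_plot, a baseline argument to suggest_size), and `fit` is a placeholder for an existing rstanarm or brms reference model, not a real object from this thread:

```r
library(projpred)
library(ggplot2)  # for coord_cartesian()

# `fit` is a hypothetical reference model, e.g. from rstanarm::stan_glm().
vs <- varsel(fit, method = "forward")

# Heuristic size suggestion: by default, submodels are compared against the
# reference model ("ref"); baseline = "best" compares against the best
# submodel found instead.
suggest_size(vs)
suggest_size(vs, baseline = "best")

# Manual inspection: look for where the elpd differences level off near zero.
varsel_plot(vs, stats = "elpd", deltas = TRUE) +
  coord_cartesian(ylim = c(-3, 3))
```

Whatever suggest_size returns, the deltas plot is the thing to trust: the smallest size whose elpd difference (with its error bar) overlaps zero is usually the sensible choice.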
The same calibration issue is there even when setting nspred to the number of samples in the model (which is 8000). Running the following code, I get the plot below:

sim_vs <- varsel(
  mdl_sim_ref,
  method = "forward",
  npred = 8000,
  ns = 8000)
sim_vs$vind
varsel_plot(sim_vs, deltas = TRUE) +
  coord_cartesian(ylim = c(-24, 1))

I can make the brms reference model available, if helpful.
The argument should be nspred instead of npred, if the code snippet you pasted is what you ran (a small typo). The argument ns affects only the selection phase, and it doesn't seem to be a problem here.
With nspred = 8000 they now line up perfectly! And it looks pretty good when using nspred = 1000, so maybe the latter is a safer choice for the default (though I do not know what the speed trade-offs are, so I cannot judge). Perhaps more important is that the user understands that when nspred is less than the number of samples, there may be inaccuracies in the suggest_size value (it's not something I expected, and I don't recall seeing any advice on this in the help or the vignette). Perhaps the easiest (best?) solution is to make clear in the help files and other documentation that when nspred < num samples, the baseline may differ from the reference model projection enough to make suggest_size inaccurate, and that manual inspection may be useful, or setting baseline to "best", or whatever other advice might be appropriate.
There may be different reasons for the projections' performance not lining up perfectly with the reference model, particularly when running varsel (as opposed to cv_varsel, which should be a bit more robust with respect to the reference model's performance as well). In this case it just seems that the posterior is not very concentrated, and therefore projecting a small number of draws results in this slight deviation.

To address this, we have added a warning to varsel when the projections' performance does not match the reference model, to help users going forward (on the develop branch for now). Usually the current default is enough to ensure good behaviour, and the speed can really be a problem for complex models (we are working on adding support for multilevel models now, and in that case even 100 draws for the predictions is already very expensive). In those cases the user might be better off reworking the reference model and ensuring a more concentrated posterior. In any case, we will try to make the documentation clearer as well!
Sounds good -- and it will be great to have support for multilevel models!
Thanks! You can already take a look at the develop branch, although it's still under development.
Thanks for the feedback @andymilne , I agree we could have an example of this and could think more carefully about what the default input arguments should be. What I do know is that a default of nspred = 4000 or so will be awfully slow with CV and a non-Gaussian model, especially with many features.

I prefer the fast computation combined with studying the results manually. For example, in this case one would easily see that the appropriate model size is about 4-5 variables. I don't think we can ever make the automatic rule in suggest_size so good that it works well for every single problem.