Comments (8)

bbayukari commented on July 21, 2024

Thanks for your code, I have run the experiment with different ICs.

First of all, for this question:

"I thought I could perhaps simulate abess using MBIC by specifying "aic" as tune.type but using ic.scale = (1/2)log(n(p^2)/16) so that 2 (the penalty factor implied by AIC multiplied by ic.scale would return the penalty implied by MBIC, but that didn't seem to work. "

This idea is almost correct, but ic.scale is not used when tune.type is "aic", because we treat the ic.scale of "aic" as a constant that nobody should need to modify. Instead, you can specify "bic" as tune.type and use ic.scale = log(n * p^2 / 16) / log(n) to implement MBIC.
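A minimal sketch of that workaround in R (assuming the `abess()` interface with the `tune.type` and `ic.scale` arguments discussed in this thread; `x`, `y`, and the value of `p` are placeholders for your data and the original problem dimension):

```r
library(abess)

# mBIC penalty: log(n * p^2 / 16) * edf; BIC penalty: log(n) * edf.
# ic.scale multiplies the BIC penalty, so their ratio recovers mBIC.
n <- nrow(x)
p <- 1e6  # dimension of the *original* problem, not the pre-screened one
fit <- abess(x, y,
             tune.type = "bic",
             ic.scale  = log(n * p^2 / 16) / log(n))
```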

The results below report the number of true-positive variables and the support size the algorithm selected as best (TP / best.size), under five ICs and the subsetted/original dimension settings.

| IC   | TP / best.size (p = 76) | TP / best.size (p = 1,000,000) |
| ---- | ----------------------- | ------------------------------ |
| aic  | 41/76 | 41/76 |
| bic  | 41/76 | 41/76 |
| gic  | 41/76 | 34/34 |
| ebic | 38/43 | 33/33 |
| mbic | 41/76 | 33/33 |

bbayukari commented on July 21, 2024

Let's take '41/76' as an example to explain the result. Its corresponding confusion matrix is:

| estimated \ true | FALSE  | TRUE |
| ---------------- | ------ | ---- |
| FALSE            | 999915 | 9    |
| TRUE             | 35     | 41   |

'76' means the algorithm judged that the best support size is 76. The fact that best.size equals the total number of candidate variables implies that the IC penalty is too small to select only the correct variables.

The results above imply that using the true dimension (p = 1e6) increases the penalty and thereby improves the precision, except for AIC and BIC.

> Maybe slightly counterintuitive that ic.scale works on all ICs except AIC - would it not be more logical to apply it to all? E.g., if ic.scale were set to 1/2 with tune.type = "aic", the penalisation would be half as strong as implied by AIC?

Thanks for your suggestion; we will align the behavior of AIC with the other ICs soon.

Mamba413 commented on July 21, 2024

@tomwenseleers thanks for this question. We are working on it now. I am not quite sure which MBIC you mentioned; can you provide a reference?

tomwenseleers commented on July 21, 2024

Many thanks for that! The specific version of mBIC I was using is the one cited in Frommlet & Nuel (2016), and I was using their choice of c = 4, which is actually a hyperparameter:
https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0148620&type=printable
But it's a bit confusing, since there are several versions of mBIC available:
https://onlinelibrary.wiley.com/doi/epdf/10.1002/qre.936
https://link.springer.com/chapter/10.1007/978-3-642-29210-1_39

And there is still a whole zoo of other ICs that have been suggested, e.g. (I hope I am getting these formulae right):

```r
hq   <- min2LL + c * log(log(n)) * edf      # Hannan-Quinn information criterion (c is a hyperparameter)
ric  <- min2LL + 2 * log(p) * edf           # risk inflation criterion
mric <- min2LL + 2 * sum(log(p / (1:edf)))  # modified risk inflation criterion
cic  <- min2LL + 4 * sum(log(p / (1:edf)))  # covariance inflation criterion
bicg <- min2LL + log(n) * edf + 2 * g * lchoose(p, round(edf))  # g = 1 suggested as default
# (https://www.proquest.com/openview/918b8b1efc7e0a0aa4d565ed54fa37dd/1?cbl=18750&diss=y&pq-origsite=gscholar)
bicq <- min2LL + log(n) * edf - 2 * edf * log(q / (1 - q))
# (see Xu, C. and McLeod, A.I. (2009). Bayesian information criterion with Bernoulli prior;
# https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=cb9a4547704f6a401116e04263e0445feab10cba)
```

For aic, bic, gic, ebic & mbic I was using

```r
aic  <- min2LL + 2 * edf                     # Akaike information criterion
# aicc <- ifelse(n > edf + 1, min2LL + 2 * edf * n / (n - edf - 1), NA)  # small-sample AIC
bic  <- min2LL + log(n) * edf                # Bayesian information criterion
gic  <- min2LL + log(p) * log(log(n)) * edf  # generalized information criterion; GIC = SIC in https://www.pnas.org/doi/10.1073/pnas.2014241117
ebic <- min2LL + (log(n) + 2 * (1 - log(n) / (2 * log(p))) * log(p)) * edf
# extended BIC: Chen, J. and Chen, Z. (2008). Extended Bayesian information criterion for model
# selection with large model space. Biometrika 95, 759-771; https://arxiv.org/abs/1107.2502
# (note the original still has an additional tuning parameter)
mbic <- min2LL + log(n * p^2 / 16) * edf     # see Frommlet & Nuel (2016)
```
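For concreteness, `min2LL` and `edf` in the formulae above would come from a fitted model; a minimal sketch in base R (`fit` and `dat` are hypothetical placeholders):

```r
fit    <- lm(y ~ ., data = dat)
min2LL <- -2 * as.numeric(logLik(fit))  # -2 * maximised log-likelihood
edf    <- attr(logLik(fit), "df")       # number of estimated parameters (incl. sigma for lm)
n      <- nobs(fit)
```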

Several versions of gic have also been suggested, though, so gic as a name is in fact a little ambiguous.

No idea, though, which ones are now generally recommended to achieve either optimal predictive performance or optimal variable-selection consistency in the n > p or p > n setting... For n > p, I myself use AIC when I am interested in optimal predictive performance (optimising AIC is asymptotically equivalent to minimising the leave-one-out cross-validation error; Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B 39, 44-47) and BIC for optimal variable-selection consistency (Shao, J. (1997). An asymptotic theory for linear model selection. Statistica Sinica 7, 221-242). For p > n, I was using mBIC for variable selection (with c = 4, following Frommlet & Nuel), but I am not sure what is best to get optimal predictive performance when p >> n. Which ICs do you find generally perform best for either purpose? Some of the ICs also have a hyperparameter related to the actual number of variables that you think have nonzero coefficients, which also makes sense.

Maybe an easy way to support any of these ICs would be to allow some argument ic.factor, which the user could set as a function of the true p and n of the original problem and any other hyperparameters. Passing 2, log(n), or log(n * p^2 / 16) would then correspond to AIC, BIC, or mBIC, etc. That could be used instead of the ic.scale argument, which I find a little ambiguous in terms of what exactly it does. This would address both the support of alternative ICs and the need to pass the correct penalization in case the variables have already been subsetted via another method.
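Under that proposal, usage might look as follows (`ic.factor` is the hypothetical argument being suggested here, not an existing abess option):

```r
# Hypothetical interface: pass the IC penalty per degree of freedom directly.
fit_aic  <- abess(x, y, ic.factor = 2)                  # AIC
fit_bic  <- abess(x, y, ic.factor = log(n))             # BIC
fit_mbic <- abess(x, y, ic.factor = log(n * p^2 / 16))  # mBIC (Frommlet & Nuel)
```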

Note that the leave-one-out cross-validation error can also be calculated analytically from the residuals and the diagonal of the hat matrix, without having to actually carry out any cross-validation (see also https://www.efavdb.com/leave-one-out-cross-validation); a sketch follows below. This would always be an alternative to the AIC, as AIC is only an asymptotic approximation of the LOOCV error. I suppose it would also work for generalized linear models if one works on the adjusted z scale of the GLM and uses the working observation weights.
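A minimal sketch of that analytic shortcut for a linear model (`dat` is a hypothetical data frame with response `y`):

```r
# Analytic leave-one-out CV via the PRESS identity:
# e_loo_i = e_i / (1 - h_ii), where h_ii is the i-th hat-matrix diagonal.
fit       <- lm(y ~ ., data = dat)
h         <- hatvalues(fit)            # diagonal of the hat matrix
loo_resid <- residuals(fit) / (1 - h)  # exact leave-one-out residuals
loocv_mse <- mean(loo_resid^2)         # identical to refitting the model n times
```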

tomwenseleers commented on July 21, 2024

Ha, many thanks - great! So what are the numbers after the slash? The number of true positives is the number before the slash, but what is the second? Isn't 76 always the set considered by the algorithm, given that that's the number of MCP-preselected variables? Or are some variables kicked out from the very start by the algorithm, with that number depending on the penalization?

So would you say GIC works best based on this? But what were the numbers of false positives? I imagine that would be far too high with AIC, and maybe also with GIC.

Maybe slightly counterintuitive that ic.scale works on all ICs except AIC - would it not be more logical to apply it to all? E.g., if ic.scale were set to 1/2 with tune.type = "aic", the penalisation would be half as strong as implied by AIC?

tomwenseleers commented on July 21, 2024

Ha, OK, that makes sense! So AIC and BIC always provide insufficient penalisation in the high-dimensional case, even when working with the original problem size, while GIC here seems best, giving 34 true positives and 0 false positives when working with the original problem size. That's cool! It amazes me that you can pick out 34 true positives from a set of 1 million candidate variables with zero false positives, even when the effect size of many of the selected variables is relatively modest in terms of Cohen's d. If you use "gic" on the full dataset without MCP preselection, I noticed abess selects 33 true positives (0 false positives), so it seems both gic and ebic work here. That runs in 56 s on my laptop, but only if you specify support.size = c(1:76), as in the sketch below; c(1:(n-1)) would be much, much slower. And if you specify the max support size of the MCP fit, one might of course just as well use the MCP selection as the initial active set, or subset to those variables.
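A sketch of the run described above (`x` and `y` stand for the full design matrix and response; the `family` choice is an assumption):

```r
library(abess)

# Restrict the candidate support sizes to the 76 MCP-preselected count
# instead of the default grid, which makes the full-p run tractable.
fit <- abess(x, y,
             family = "gaussian",
             tune.type = "gic",
             support.size = 1:76)
```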

bbayukari commented on July 21, 2024

I'm sorry that the experimental results were not clear enough; they may have caused a misunderstanding.

| IC   | TP / best.size (p = 76) | TP / best.size (p = 1,000,000) |
| ---- | ----------------------- | ------------------------------ |
| aic  | 41/76 | 41/76 |
| bic  | 41/76 | 41/76 |
| gic  | 41/76 | 34/34 |
| ebic | 38/43 | 33/33 |
| mbic | 41/76 | 33/33 |

All the experiments used MCP for pre-screening, so abess only selected from the 76 pre-screened variables under the different ICs. The `p` in the header row of the table refers to the total number of variables `p` plugged into the IC, i.e., whether the information criteria were calculated with respect to the subsetted problem size (p = 76) or with respect to the original problem size (p = 1,000,000).

So AIC and BIC are not able to provide sufficient penalties in this case. EBIC is the best one without correcting the value of p; after correcting it, EBIC, GIC, and MBIC give better and similar results (a sketch of the correction follows below).
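As a concrete illustration of that correction (a sketch, assuming `ic.scale` simply multiplies the IC penalty, as in the BIC-to-mBIC trick earlier in this thread; `x_screened`, `p_full`, and `p_sub` are hypothetical names):

```r
# GIC penalty: log(p) * log(log(n)) * edf. After MCP pre-screening,
# rescale so the penalty uses the original dimension, not the screened one.
p_full <- 1e6  # dimension of the original problem
p_sub  <- 76   # number of MCP-preselected variables passed to abess
fit <- abess(x_screened, y,
             tune.type = "gic",
             ic.scale  = log(p_full) / log(p_sub))
```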

This implies that the IC needs to be corrected after pre-screening. May I ask if there is anything else that needs to be discussed?

tomwenseleers commented on July 21, 2024

No thanks, that's clear! I'll close this then!
