projpred's People

Contributors

alejandrocatalina, avehtari, fweber144, hsbadr, jgabry, jpiironen, jtimonen, mcol, paasim, paul-buerkner, visionresearchblog, yannmclatchie

projpred's Issues

Unable to convert projected model to matrix in develop

Hi,

I've been using the develop branch of projpred with some success, but have hit a snag trying to plot the projected posteriors for my own logistic regression model, which has interactions and random effects. I'm unsure whether this is a bug or something wrong with my model.

The command as.matrix(proj) fails when applied to my own model (fit; see below):

priorh <- set_prior("horseshoe(scale_global = tau0, scale_slab = 1)", class = "b")
cfull_formula_fixed <- as.formula(y ~ v1_cts * f1 + v2_cts * f2 + f1:f2 + f3 + v3_cts + v4_cts)
fit <- brm(cfull_formula_fixed, data = data, family = "bernoulli", iter = 5000,
           prior = priorh, sample_prior = TRUE,
           control = list(adapt_delta = 0.999, max_treedepth = 15),
           save_all_pars = FALSE, refresh = 0)
#   Data: data (Number of observations: 186) 

vs <- varsel(fit, method='forward') # fails without method='forward' (using method='L1')
proj <- project(vs, nterms = 3, ns = 500)
mcmc_areas(as.matrix(proj)) +  coord_cartesian(xlim = c(-2, 2))

Error:

Error in dimnames(x) <- dn : 
  length of 'dimnames' [2] not equal to array extent

I've been unable to diagnose the problem by applying as.matrix.projection directly, since the line:

res <- t(do.call(cbind, lapply(proj$sub_fit, as.matrix.lm)))

returns

Error in t.default(do.call(cbind, lapply(proj$sub_fit, as.matrix.lm))) : 
  argument is not a matrix

since proj$sub_fit is a list.

Apologies for not providing a reproducible example; I can upload the projected (proj) or fitted model (fit) if necessary. The logistic regression example at the bottom of this post (adapted from the vignettes) works as expected, so this must be an issue with my model. I suspect it may be because my model contains multi-level factors and interactions?

Have you seen similar behaviour elsewhere? Any suggestions of how to diagnose this issue would be very much appreciated. Many thanks in advance for your help and the great package.

This works:

library(projpred)
library(brms)
library(tidyr)
library(dplyr)
library(ggplot2)
library(bayesplot)

# Binomial example
data('df_binom', package = 'projpred')
split_structure <- break_up_matrix_term(y ~ x, data = df_binom)
df_binom <- split_structure$data
formula <- split_structure$formula
d <- df_binom
n <- nrow(df_binom) # 100
D <- ncol(df_binom[, -1]) # 20
p0 <- 5 # prior guess for the number of relevant variables
tau0 <- p0/(D-p0) * 1/sqrt(n) # scale for tau 

priorh <- set_prior("horseshoe(scale_global = tau0, scale_slab = 1)", class = "b")
fit2 <- brm(formula, data = df_binom, family = bernoulli(), iter = 500, chains = 2,
            prior = priorh, sample_prior = TRUE,
            control = list(adapt_delta = 0.999, max_treedepth = 15),
            save_all_pars = TRUE, refresh = 0)
vs2 <- cv_varsel(fit2)
plot(vs2, stats = c('elpd', 'rmse'))
proj2 <- project(vs2, nterms = 3, ns = 500)
mcmc_areas(as.matrix(proj2)) +  coord_cartesian(xlim = c(-2, 2))

Random effects capturing whole variance

I'm keen to know if there's been any progress in tweaking projpred to deal with cases where the random effects capture most of the variance?

See here for an example: #57 (comment)

Model without random effects:

[figure: varsel_without_random]

Same data/model, but with random effects:

[figure: varsel_with_random]

projpred is working very well for me without random effects, but I'm very keen to include random effects models too. If there are any work-arounds you can suggest, or if you need example data to diagnose/test the issue, please let me know.

Thanks very much.

Multilevel models not working in develop version

Very likely I am doing something wrong – I installed the development version with devtools::install_github('stan-dev/projpred', build_vignettes = TRUE, force = TRUE).

I have tried a multilevel model fitted in brms and also in rstanarm. In both cases, when running varsel, I get an error saying that multilevel models are not supported:

Error in get_refmodel.stanreg(object, ...) : stan_lmer and stan_glmer are not yet supported.

and

Error in get_refmodel.brmsfit(object, ...) : 
  Multilevel or other special terms are not yet supported.
In addition: Warning message:
Method 'parse_bf' is deprecated. Please use 'brmsterms' instead. 

rngtools package now depends on R >= 3.6.0

Unfortunately, as of the latest release of rngtools (which is listed in Imports for projpred) it now depends on R >= 3.6.0 (see https://cran.r-project.org/web/packages/rngtools/index.html), but projpred is listed as depending only on R >= 3.1.2.

projpred/DESCRIPTION

Lines 17 to 18 in b2a8f92

Depends:
R (>= 3.1.2)

Because projpred Imports rngtools

rngtools (>= 1.2.4)

this means we should either require R >= 3.6.0 for projpred too or drop the dependency on rngtools. If possible we should do the latter because there's really no reason for projpred itself to require the latest R version.

Is using rngtools definitely necessary? I think we may be able to do everything without it.

Does not build with g++-6

When I try to install glmproj, it fails with g++-6 (but succeeds with clang++-4.0):

> devtools::install_github("paasim/glmproj", build_vignettes = TRUE, force = TRUE)
Downloading GitHub repo paasim/glmproj@master
from URL https://api.github.com/repos/paasim/glmproj/zipball/master
Installing glmproj
'/usr/lib/R/bin/R' --no-site-file --no-environ --no-save --no-restore --quiet CMD build  \
  '/tmp/RtmpPy8LlW/devtools79d22188b42f/paasim-glmproj-59866fe' --no-resave-data --no-manual 

* checking for file ‘/tmp/RtmpPy8LlW/devtools79d22188b42f/paasim-glmproj-59866fe/DESCRIPTION’ ... OK
* preparing ‘glmproj’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* installing the package to build vignettes
      -----------------------------------
* installing *source* package ‘glmproj’ ...
** libs
g++-6 -I/usr/share/R/include -DNDEBUG   -I"/usr/local/lib/R/site-library/Rcpp/include" -I"/usr/lib/R/site-library/RcppArmadillo/include"   -fpic  -O3 -march=native -mtune=native -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wno-ignored-attributes -Wno-deprecated-declarations -Wno-unused-variable -Wno-unused-local-typedefs  -c RcppExports.cpp -o RcppExports.o
g++-6 -I/usr/share/R/include -DNDEBUG   -I"/usr/local/lib/R/site-library/Rcpp/include" -I"/usr/lib/R/site-library/RcppArmadillo/include"   -fpic  -O3 -march=native -mtune=native -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2 -Wno-ignored-attributes -Wno-deprecated-declarations -Wno-unused-variable -Wno-unused-local-typedefs  -c elnetfun.cpp -o elnetfun.o
elnetfun.cpp: In function ‘Rcpp::List glm_ridge_c(arma::mat, Rcpp::Function, double, bool, double, int, int)’:
elnetfun.cpp:328:27: error: ‘isnan’ was not declared in this scope
             if (isnan(loss))
                           ^
elnetfun.cpp:328:27: note: suggested alternative:
In file included from /usr/local/lib/R/site-library/Rcpp/include/Rcpp/platform/compiler.h:100:0,
                 from /usr/local/lib/R/site-library/Rcpp/include/Rcpp/r/headers.h:48,
                 from /usr/local/lib/R/site-library/Rcpp/include/RcppCommon.h:29,
                 from /usr/lib/R/site-library/RcppArmadillo/include/RcppArmadilloForward.h:26,
                 from /usr/lib/R/site-library/RcppArmadillo/include/RcppArmadillo.h:31,
                 from elnetfun.cpp:2:
/usr/include/c++/6/cmath:662:5: note:   ‘std::isnan’
     isnan(_Tp __x)
     ^~~~~
/usr/lib/R/etc/Makeconf:141: recipe for target 'elnetfun.o' failed
make: *** [elnetfun.o] Error 1
ERROR: compilation failed for package ‘glmproj’
* removing ‘/tmp/Rtmpqmqydp/Rinst43c13cea3905/glmproj’
      -----------------------------------
ERROR: package installation failed
Error: Command failed (1)

Error when using kfold validation with brms model in projpred

Hi,

I have been using the projpred package with brms and attempting to use kfold validation and it keeps throwing up the following error:
Error in dnorm(y, mu, dis, log = T) : Non-numeric argument to mathematical function

I thought it might be linked to my data, so I fitted the same model on the df_gaussian data with both stan_glm and brms: kfold works fine with the stan_glm model but not with the brms model.

data('df_gaussian', package = 'projpred')

stan_mod <- stan_glm(y ~ 1 + x, data = df_gaussian, family = gaussian(), chains = 2)

brms_mod <- brm(y ~ 1 + x, data = df_gaussian, family = gaussian(), chains = 2)

stan_cv <- cv_varsel(stan_mod, cv_method = "kfold", K = 2)
brms_cv <- cv_varsel(brms_mod, cv_method = "kfold", K = 2)
Apologies if I've missed an obvious difference between the packages, but I couldn't understand why it should work with stan_glm but not with brms.

Arran

Session info
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)
rstanarm_2.18.2
brms_2.10.0

Problems installing

Hi there,

Installation fails on my machine:

> devtools::install_github('stan-dev/projpred')
Downloading GitHub repo stan-dev/projpred@master
from URL https://api.github.com/repos/stan-dev/projpred/zipball/master
Installing projpred
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file  \
  --no-environ --no-save --no-restore --quiet CMD INSTALL  \
  '/private/var/folders/rx/_lrjsntx4072yg4c_k2m0hv53n_g27/T/Rtmpa00Ejm/devtools1c625067892a/stan-dev-projpred-4f042db'  \
  --library='/Library/Frameworks/R.framework/Versions/3.3/Resources/library'  \
  --install-tests 

* installing *source* package ‘projpred’ ...
** libs
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.3/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.3/Resources/library/RcppArmadillo/include"   -fPIC  -Wall -mtune=core2 -g -O2  -c RcppExports.cpp -o RcppExports.o
clang++ -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG  -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Library/Frameworks/R.framework/Versions/3.3/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.3/Resources/library/RcppArmadillo/include"   -fPIC  -Wall -mtune=core2 -g -O2  -c glmfun.cpp -o glmfun.o
clang++ -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o projpred.so RcppExports.o glmfun.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2 -lgfortran -lquadmath -lm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
ld: warning: directory not found for option '-L/usr/local/lib/gcc/x86_64-apple-darwin13.0.0/4.8.2'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [projpred.so] Error 1
ERROR: compilation failed for package ‘projpred’
* removing ‘/Library/Frameworks/R.framework/Versions/3.3/Resources/library/projpred’
Error: Command failed (1)

Maybe an issue with my version of Mac OS, or a missing dependency somewhere?

Cheers!

Computing LOOs in parallel

Since computing LOOs is slow, it would be useful to be able to do it in parallel, perhaps through the future package.
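A rough shape this could take, assuming the future framework via the future.apply package (loo_contribution_i below is a hypothetical stand-in for the real per-observation work, not projpred code):

```r
# Sketch only: parallelize a per-observation computation with future.apply.
# plan() picks the backend without changing the calling code.
library(future.apply)
plan(multisession, workers = 2)
loo_contribution_i <- function(i) i^2  # placeholder workload
loos <- future_sapply(seq_len(8), loo_contribution_i)
plan(sequential)
```

The appeal of future here is that users choose the backend (plan(multisession), plan(multicore), a cluster, ...) while the package code stays backend-agnostic.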

Remove redundancies in the structures returned

Right now the structures we return contain some redundant or unneeded data. Fix that and make them leaner. For instance, vs$spath should be vs$search_path and contain more sensible information, etc.

Thinning or subsampling posterior draws

The current behavior of .get_refdist() is to "thin" the posterior draws in case of is.null(nclusters) && !is.null(ndraws) (i.e. also for nclusters == NCOL(refmodel$mu)). As suggested in .get_refdist()'s comments, it might be safer to subsample the posterior draws in that case. Thinning might also make sense in certain circumstances, so another option would be to offer both (thinning as well as subsampling) and let the user decide. Personally, I would say the default should be subsampling, though.

In case of thinning, the .get_refdist() code could be simplified: The case

projpred/R/misc.R

Lines 292 to 301 in 0cabd73

} else {
  # use all the draws from the reference model
  predvar <- sapply(seq_len(S), function(j) {
    family$predvar(refmodel$mu[, j, drop = FALSE], refmodel$dis[j])
  })
  p_ref <- list(
    mu = refmodel$mu, var = predvar, dis = refmodel$dis,
    weights = refmodel$wsample, cl = c(1:S)
  )
}
could be included in the else if (!is.null(ndraws)) {[...]} case above it. To do so, setting ndraws to S if is.null(ndraws) (and using else { instead of else if (!is.null(ndraws)) {) should suffice.

In case of subsampling, would it make sense to have a separate case for ndraws == S so that the order of the posterior draws can be preserved in that case? I can't think of a situation where this would be necessary, but perhaps you can.

(Mentioned here and here.)
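The two candidate behaviors can be illustrated with a small helper (hypothetical; not the actual .get_refdist() code):

```r
# "Thinning" keeps equally spaced draws and preserves their original order;
# "subsampling" takes a random subset, so the order is not preserved.
draw_indices <- function(S, ndraws, type = c("subsample", "thin")) {
  type <- match.arg(type)
  if (type == "thin") {
    round(seq(1, S, length.out = ndraws))  # deterministic
  } else {
    sample.int(S, ndraws)                  # random
  }
}
draw_indices(4000, 4, type = "thin")  # 1 1334 2667 4000
```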

Vignette example not working in brms in 2.0.0

I'm having trouble getting projpred varsel to work with brms with projpred 2.0.0, despite having success with 1.0.*

This example from the GLMM vignette works with rstanarm, but not brms:

data("VerbAgg", package = "lme4")

VerbAgg_subsample <- VerbAgg %>%
  tidyr::as_tibble() %>%
  dplyr::filter(id %in% sample(id, 50)) %>%
  dplyr::mutate(r2num = as.integer(r2) - 1) # binomial family needs numeric target

## simple bernoulli model
formula_va <- r2num ~ btype + situ + mode + (btype + situ + mode | id)

# Works:
fit_va_stan <- stan_glmer(
  formula = formula_va,
  data = VerbAgg_subsample,
  family = binomial("logit"),
  seed = 1234,
  chains = 2
)
vs_va <- varsel(fit_va_stan)

# Doesn't work:
fit_va_stan <- brm(
  formula = formula_va,
  data = VerbAgg_subsample,
  family = binomial("logit"),
  seed = 1234,
  chains = 2
)
vs_va <- varsel(fit_va_stan)

Error in init_refmodel(object, data, y, formula, family, ref_predfun,  : 
  argument "y" is missing, with no default

Have there been any modifications which would explain why projpred can cope with rstanarm model structure, but not brms? Any suggestions would be gratefully received!

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.7

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rstanarm_2.19.2      modelbased_0.3.0     emmeans_1.5.0        xtable_1.8-4         ggResidpanel_0.3.0   hrbrthemes_0.8.0     extrafont_0.17       Cairo_1.5-12.2       projpred_2.0.0       kableExtra_1.1.0     loo_2.3.1           
[12] mice_3.8.0           effectsize_0.3.2     tidybayes_2.0.3.9000 corrgram_1.13        DT_0.15              duskyAnalysis_1.27   lubridate_1.7.9      summarytools_0.9.6   coefplot_1.2.6       effects_4.1-4        carData_3.0-2       
[23] nlme_3.1-137         lmerTest_3.1-0       lme4_1.1-23          Matrix_1.2-14        rstan_2.19.3         StanHeaders_2.21.0-5 see_0.5.2.1          RColorBrewer_1.1-2   readxl_1.3.1         lemon_0.4.4          ggeffects_0.15.1    
[34] bayestestR_0.7.2.1   brms_2.13.5          Rcpp_1.0.5           forcats_0.5.0        stringr_1.4.0        dplyr_1.0.2          purrr_0.3.4          readr_1.4.0          tidyr_1.1.2          tibble_3.0.4         ggplot2_3.3.2       
[45] tidyverse_1.3.0      bayesplot_1.7.1  

Arguments ndraws and nclusters in project()

There seems to be a mismatch between the documentation and the behavior of arguments ndraws and nclusters in project(): In the reprex below, .get_refdist() clusters the posterior draws (into 400 clusters) since inside of project(), nclusters <- ndraws is used in that case. In contrast, the documentation suggests that if---as per default---ndraws is specified and nclusters not, no clustering is performed. (The current solution in that case---namely to "thin" the posterior draws and not to subsample them---is another issue which should be discussed separately.)

# Source: https://github.com/stan-dev/projpred/blob/master/README.md

library(projpred)
library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))

# Gaussian and Binomial examples from the glmnet-package
data('df_gaussian', package = 'projpred')

mydat <- cbind("y" = df_gaussian$y, as.data.frame(df_gaussian$x))
# fit the full model with a sparsifying prior
fit <- stan_glm(y ~ V1 + V2 + V3 + V4 + V5,
                family = gaussian(),
                data = mydat,
                prior = hs(df = 1, global_scale=0.01), seed = 1140350788)

myproj <- project(fit,
                  solution_terms = c("V3", "V2"))

Deal with d_test structure

Before merging into master we should improve the treatment of d_test when supplied by the user. What fields should it contain, how to pass it around, what to do when d_test is NULL?

Broken reference example

The custom reference model initialization reference at
https://mc-stan.org/projpred/reference/init_refmodel.html
shows a lot of errors:

fit <- stan_glm(y~x, family=gaussian(), data=data.frame(x=I(x),y=y))
#> Error in stan_glm(y ~ x, family = gaussian(), data = data.frame(x = I(x),     y = y)): could not find function "stan_glm"
draws <- as.matrix(fit)
#> Error in as.matrix(fit): object 'fit' not found

Vignette gives errors after the latest changes in argument validation

After merging PR #32, the vignette now throws an error when making predictions with proj_linpred or proj_predict. For example, executing the cell

pred <- proj_linpred(vs, xnew=df_gaussian$x, ynew=df_gaussian$y, nv = 6, integrated = TRUE)

gives error

Error in proj_helper(object = object, xnew = xnew, offsetnew = offsetnew, : xnew must be a data.frame or a matrix. See ?proj-pred.

The error is raised because df_gaussian$x has class AsIs. I guess we could get rid of this by converting it to a matrix, but I thought I'd ask whether you @mcol have thoughts on this, since you implemented the argument validation.

projpred website

The projpred website is now live at mc-stan.org/projpred. Here is how to update the website:

  1. Install pkgdown with devtools::install_github("r-lib/pkgdown")
  2. Switch to the gh-pages branch of the projpred repo
  3. Pull in the latest changes from master into gh-pages
  4. Run pkgdown::build_site(). This will create all the documentation files and vignettes and other page content and will dump the contents into a folder called docs in the main package directory (i.e., in projpred/docs/)
  5. Move all the contents of docs out of docs and into the main package directory instead
  6. Push to GitHub

@jpiironen and @avehtari I'll leave this in your hands now, but let me know if you have any questions or run into trouble making changes to the site!

Incorrect column names in as.matrix.projection()

It seems like the column names of the matrix returned by as.matrix.projection() are not correct (at least not always). Here is a reprex: First run

# Original source: https://github.com/stan-dev/projpred/blob/master/README.md
library(projpred)
library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))
data("df_gaussian", package = "projpred")
df_gaussian <- df_gaussian[1:40, ]
mydat <- cbind("y" = df_gaussian$y, as.data.frame(df_gaussian$x))
mydat$mygroup <- gl(n = 8, k = floor(nrow(mydat) / 8),
                    labels = paste0("gr", seq_len(8)))
set.seed(457211)
mydat$noise <- rnorm(nrow(mydat))
mygroup_icpts_truth <- rnorm(nlevels(mydat$mygroup), sd = 6)
mygroup_V1_truth <- rnorm(nlevels(mydat$mygroup), sd = 6)
mydat$y <- mydat$y + mygroup_icpts_truth[as.numeric(mydat$mygroup)]
mydat$y <- mydat$y + mygroup_V1_truth[as.numeric(mydat$mygroup)] * mydat$V1

myfit <- stan_glmer(y ~ V1 + V2 + V3 + V4 + V5 + noise + (1 + V1 | mygroup),
                    data = mydat,
                    seed = 1140350788)

cvs <- cv_varsel(myfit)
myproj <- project(cvs,
                  nterms = length(solution_terms(cvs)),
                  ndraws = ceiling(0.05 * nrow(as.matrix(myfit))))
debugonce(projpred:::as.matrix.projection)
myproj_mat <- as.matrix(myproj)

and then debug the last call up to (but excluding) line

colnames(res) <- c("Intercept", x$solution_terms)

If you then compare x$solution_terms and colnames(res), then you should see that the ordering of the terms in x$solution_terms doesn't match the ordering in colnames(res):

# x$solution_terms:
# [1] "(1 | mygroup)"  "V1"             "(V1 | mygroup)" "V3"             "V5"             "noise"          "V4"            
# [8] "V2"  
# colnames(res):
# [1] "(Intercept)"         "V1"                  "V3"                  "V5"                  "noise"              
# [6] "V4"                  "V2"                  "mygroup.(Intercept)" "mygroup.V1" 

Therefore, colnames(res) <- c("Intercept", x$solution_terms) causes incorrect column names for res. The issue seems to be fixed (at least as far as I have tested it) by replacing line

colnames(res) <- c("Intercept", x$solution_terms)

by

colnames(res) <- gsub("\\(Intercept\\)", "Intercept", colnames(res))

I'll create a pull request for that.

Session info (note that I'm using projpred's CRAN version, but the issue probably also occurs for projpred's master and develop GitHub branches since the line mentioned above hasn't changed in those GitHub branches):

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[...]

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rstanarm_2.21.2 Rcpp_1.0.6      projpred_2.0.2 

loaded via a namespace (and not attached):
 [1] nlme_3.1-152            matrixStats_0.58.0      xts_0.12.1              threejs_0.3.3           splines2_0.4.1         
 [6] rstan_2.26.0.9000       numDeriv_2016.8-1.1     tools_4.0.3             R6_2.5.0                DT_0.17                
[11] DBI_1.1.1               mgcv_1.8-33             colorspace_2.0-0        tidyselect_1.1.0        gridExtra_2.3          
[16] prettyunits_1.1.1       processx_3.4.5          curl_4.3                compiler_4.0.3          cli_2.3.0              
[21] shinyjs_2.0.0           colourpicker_1.1.0      scales_1.1.1            dygraphs_1.1.1.6        ggridges_0.5.3         
[26] callr_3.5.1             stringr_1.4.0           digest_0.6.27           StanHeaders_2.26.0.9000 minqa_1.2.4            
[31] rmarkdown_2.6           base64enc_0.1-3         pkgconfig_2.0.3         htmltools_0.5.1.1       lme4_1.1-26            
[36] fastmap_1.1.0           htmlwidgets_1.5.3       rlang_0.4.10            shiny_1.6.0             generics_0.1.0         
[41] zoo_1.8-8               jsonlite_1.7.2          crosstalk_1.1.1         gtools_3.8.2            dplyr_1.0.4            
[46] inline_0.3.17           magrittr_2.0.1          loo_2.4.1.9000          bayesplot_1.8.0         Matrix_1.3-2           
[51] munsell_0.5.0           lifecycle_0.2.0         stringi_1.5.3           MASS_7.3-53             pkgbuild_1.2.0         
[56] plyr_1.8.6              grid_4.0.3              parallel_4.0.3          promises_1.1.1          crayon_1.4.0           
[61] miniUI_0.1.1.1          lattice_0.20-41         splines_4.0.3           knitr_1.31              ps_1.5.0               
[66] pillar_1.4.7            igraph_1.2.6            boot_1.3-25             markdown_1.1            shinystan_2.5.0        
[71] reshape2_1.4.4          codetools_0.2-18        stats4_4.0.3            rstantools_2.1.1        glue_1.4.2             
[76] evaluate_0.14           V8_3.4.0                RcppParallel_5.0.2      vctrs_0.3.6             nloptr_1.2.2.2         
[81] httpuv_1.5.5            gtable_0.3.0            purrr_0.3.4             assertthat_0.2.1        ggplot2_3.3.3          
[86] xfun_0.20               mime_0.9                xtable_1.8-4            later_1.1.0.1           rsconnect_0.8.16       
[91] survival_3.2-7          tibble_3.0.6            shinythemes_1.2.0       optimx_2020-4.2         gamm4_0.2-6            
[96] statmod_1.4.35          ellipsis_0.3.1  

Make varsel and related functions S3 generic

I am planning to feature projpred in brms and wanted to drop some ideas on how to make projpred easier to use with other packages. Right now, it provides built-in support for stanreg objects and for objects set up via init_refmodel. I can and will build a wrapper around init_refmodel (called varsel_brms, say), but users won't be able to call varsel and related methods directly with brmsfit objects, since these methods are not S3 generic. This is suboptimal, I think.

Therefore, I propose making varsel and related methods (basically all functions that take a fitted model object) S3 generic, with the default method simply dispatching to the method for refmodel objects. This would make it far easier to feature projpred in other packages.
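A minimal sketch of what this could look like (assumed names throughout; the default method funnels everything through a hypothetical get_refmodel() generic):

```r
# Generic plus default method that dispatches via a reference-model object,
# so any package can hook in by providing a get_refmodel() method.
varsel <- function(object, ...) UseMethod("varsel")
varsel.default <- function(object, ...) varsel(get_refmodel(object), ...)
varsel.refmodel <- function(object, ...) object$formula  # selection would go here

get_refmodel <- function(object, ...) UseMethod("get_refmodel")
# toy fit class standing in for e.g. a brmsfit:
get_refmodel.myfit <- function(object, ...) {
  structure(list(formula = object$f), class = "refmodel")
}

fit <- structure(list(f = y ~ x1 + x2), class = "myfit")
varsel(fit)  # dispatches through get_refmodel() to varsel.refmodel()
```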

Discrete families with finite support (and possibly other families)

It would be great to have more outcome families supported by projpred. For discrete outcome families with finite support (e.g. the cumulative family in ordinal regression), the projection can be performed via a pseudo-dataset approach:

For these families, equation (12) from

Piironen J, Paasiniemi M, Vehtari A (2020). Projective inference in high-dimensional problems: Prediction and feature selection. Electronic Journal of Statistics, 14(1), 2155–2197. https://doi.org/10.1214/20-EJS1711

simplifies to

[formula 1: equation image not reproduced]

with supp(\tilde{y}) denoting the (discrete and finite) support of the outcome distribution (i.e. the set of possible values for the outcome y).

This simplification corresponds to a weighted maximum-likelihood problem with weights

[formula 2: equation image not reproduced]

and a pseudo (or artificial) dataset constructed as follows:

  1. Take the original dataset. Denote by n the number of rows in this original dataset.
  2. Repeat each row another (|supp(\tilde{y})| - 1) times. The resulting dataset then has n * |supp(\tilde{y})| rows.
  3. Replace the outcome variable (column "response", say) by all possible values for y (i.e. each value from supp(\tilde{y})) such that each row i \in {1, ..., n} from the original dataset occurs together with each possible value for y (in column "response") exactly once.

The weight for the pseudo-dataset row coming from row i \in {1, ..., n} in the original dataset and having outcome value y is then a_i^*(y) as defined above.
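Steps 1–3 above can be sketched as follows (illustrative helper with assumed names, not projpred code); the per-row weights a_i^*(y) would then be attached to the result:

```r
# Expand an n-row dataset to n * |supp(y)| rows, pairing every original row
# with every possible outcome value (steps 2 and 3 above).
make_pseudo_data <- function(dat, support, response = "y") {
  n <- nrow(dat)
  pseudo <- dat[rep(seq_len(n), each = length(support)), , drop = FALSE]
  pseudo[[response]] <- rep(support, times = n)
  rownames(pseudo) <- NULL
  pseudo
}

d <- data.frame(x = c(0.5, -1.2), y = c(1L, 3L))
make_pseudo_data(d, support = 1:3)  # 2 * 3 = 6 rows; y cycles through 1:3
```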

Thus, it should be possible to use existing routines for solving this weighted maximum-likelihood problem, like MASS::polr() for the cumulative family (in case of no group-level or smoothing terms). PSIS-LOO-CV also naturally fits into the simplification above (simply introducing the importance weights when averaging over the posterior draws).

Of course, the approach is primarily useful for outcome families (discrete and finite support) which do not constitute an exponential family (since exponential families may be treated the usual way, see Piironen et al., 2020).

EDITS:

  • Minor wording improvements in the text above.
  • The approach might also be used as an approximation for discrete families with infinite support if support areas with near-zero probability are ignored.
  • In principle, a similar pseudo-data approach could also be used for any outcome family (including continuous ones and also discrete ones with infinite support) with one of the following two approaches:
    • Draw repeatedly (say T times) from the outcome distribution given the predictor vector x_i (with i \in {1, ..., n}) and a draw \theta_*^s (with s \in I_c) from the reference model's posterior. Within cluster I_c, this would give a pseudo dataset with |I_c| * T * n rows. That gets big very quickly, so the practical feasibility of this Monte-Carlo-pseudo-data approach might be limited. But perhaps one could split up the |I_c| * T draws for y_i (with i \in {1, ..., n}) into G groups (the groups being as homogeneous as possible within them and as heterogeneous as possible between them) and then use an existing weighted maximum-likelihood routine for the condensed pseudo dataset with n * G rows. The grouping could be performed e.g. by clustering or by using quantiles.
    • Discretize the outcome distribution (preferably using quantiles) given the predictor vector x_i (with i \in {1, ..., n}) and a draw \theta_*^s (with s \in I_c) from the reference model's posterior. Within cluster I_c, this would give a pseudo dataset with |I_c| * G * n rows (with G denoting the number of groups in the discretization of the outcome distribution) for which an existing weighted maximum-likelihood routine could be used (with weights given by the probability masses corresponding to the groups). The drawback is that |I_c| * G * n gets big very quickly, too.

Strange results with binomial model in dev version

I am using the dev version of projpred and getting strange results using varsel and cv_varsel on a binomial model. There are 48282 observations with 19 predictors, all of which interact with a binary factor (so there are 40 predictors in the design matrix). There are some very high multicollinearities between the predictors, which makes interpretation tricky, hence my wish to obtain a smaller set. Running varsel(fit, nterms_max = 30), I get the following plots; I do not understand why ELPD should be decreasing rather than increasing as the number of predictors increases:

plot(pulse_tap_vs, 
     deltas = FALSE,
     stats = "elpd",
     baseline = "best")

[figure: Rplot01]

plot(pulse_tap_vs, 
     deltas = TRUE,
     stats = "elpd",
     baseline = "best") + 
  coord_cartesian(ylim = c(-300,10))

[figure: Rplot02]

I have also tried cv_varsel(fit), but get

Error in validate_ll(log_ratios) : All input values must be finite.
In addition: Warning message:
In cv_varsel.refmodel(refmodel, ...) : K provided, but cv_method is LOO.

length of penalty for factors

Hi,
when there is a factor as a predictor in a brms model and when a penalty vector is supplied to cv_varsel(), the variable selection seems to have issues with the penalty vector; this behavior started after the latest brms update (see session info below).

    # from the tutorial
    library(projpred)
    library(brms)
    data('df_gaussian', package = 'projpred')
    split_structure <- break_up_matrix_term(y ~ x, data = df_gaussian)
    df_gaussian <- split_structure$data
    formula <- split_structure$formula
    d <- df_gaussian
    n <- nrow(df_gaussian) # 100
    D <- ncol(df_gaussian[, -1]) # 20
    p0 <- 5 # prior guess for the number of relevant variables
    tau0 <- p0/(D-p0) * 1/sqrt(n) # scale for tau (notice that stan_glm will automatically scale this by sigma)

    # Make a factor out of one variable <--------------------
    df_gaussian$x20 <- as.factor(sample(0:1, nrow(df_gaussian), rep=T))

    fit <- brm(formula, family=gaussian(), data=df_gaussian,
               prior=prior(horseshoe(scale_global = tau0, scale_slab = 1), class=b),
               seed=1, chains=2, iter=500)
    # make penalty
    betas <- grep("^b", parnames(fit), value=TRUE)[-1]
    penalty <- sample(0:1, length(betas), rep = T)
  
    cvs <- cv_varsel(fit, method = "L1", penalty = penalty)
    # >    
    # [1] "Computing LOOs..."  
    # Error in glm_elnet(d_train$x, mu, family, lambda_min_ratio = opt$lambda_min_ratio,  :      
    # Incorrect length of penalty vector (should be 21). 

    # making the penalty one longer
    cvs <- cv_varsel(fit, method = "L1", penalty = c(penalty,1))
    # >
    # warning: solve(): system seems singular; attempting approx solution
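The length mismatch comes from the factor being expanded into dummy columns in the design matrix. A penalty vector of the right length can be built from model.matrix(); the small data frame below is just for illustration:

```r
# Sketch: a two-level factor contributes one dummy column, so the
# penalty must have one entry per design-matrix column (intercept
# excluded), not one per term in the formula.
d <- data.frame(y = rnorm(8), x1 = rnorm(8),
                x2 = factor(rep(c("a", "b"), 4)))
X <- model.matrix(y ~ x1 + x2, data = d)
X <- X[, colnames(X) != "(Intercept)", drop = FALSE]
ncol(X)                    # 2: x1 plus the dummy x2b
penalty <- rep(1, ncol(X))
```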

Session Info

R version 4.0.2 (2020-06-22)                                                             
Platform: x86_64-w64-mingw32/x64 (64-bit)                                                
Running under: Windows 10 x64 (build 18363)                                              
                                                                                         
Matrix products: default                                                                 
                                                                                         
locale:                                                                                  
[1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252                 
[3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                                     
[5] LC_TIME=German_Switzerland.1252                                                      
                                                                                         
attached base packages:                                                                  
[1] stats     graphics  grDevices utils     datasets  methods   base                     
                                                                                         
other attached packages:                                                                 
[1] brms_2.14.4    Rcpp_1.0.5     projpred_2.0.2

Add a default print method for vsel/cvsel objects

I find it a bit annoying that to get a quick look at the variable selection results stored in a variable vs, I need to type either vs$vind (which was not obvious when I first started) or varsel_stats(vs). We could add a simple S3 print method so that simply typing vs produces the default output of varsel_stats(vs).
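A minimal sketch of such a method; varsel_stats() is stubbed here so the example is self-contained, and the vsel object is a mock:

```r
# Stub standing in for projpred's varsel_stats(), for illustration only.
varsel_stats <- function(x) x$stats

# Proposed S3 print method: auto-printing a vsel object shows the
# selection statistics instead of the raw list structure.
print.vsel <- function(x, ...) {
  print(varsel_stats(x), ...)
  invisible(x)
}

vs <- structure(list(stats = data.frame(size = 0:2,
                                        elpd = c(-50, -40, -38))),
                class = "vsel")
print(vs)  # typing `vs` at the console would now do the same
```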

suggest_size() value seems too large

Having run varsel(mdl_sim_ref, method = "forward"), I find the number given by suggest_size() unintuitively large: it suggests 12, but here is the varsel plot:

Rplot01

Looking at the plot and the CIs, 3 or 4 would seem, by eye, to be more reasonable choices. Or maybe I am misunderstanding something.

The reference model is run over 4975 observations and has 12 predictors with some high multicollinearities; it is a simple linear (Gaussian) model, there are no divergences or other warnings, and the ESS are all above 3000.

Add AUC as option to varsel_stats()

Currently only the "acc"/"pctcorr" option is specific to a binary outcome, and it's not great. Despite its limitations, the AUC is definitely better. I'm preparing a PR for this, which will introduce the "auc" stat.
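For reference, the rank-based (Mann-Whitney) estimate of the AUC is straightforward; this sketch computes the statistic such an option would presumably report:

```r
# Rank-based AUC for a binary outcome: the probability that a randomly
# chosen positive gets a higher predicted score than a random negative.
auc <- function(y, prob) {
  r <- rank(prob)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
auc(c(0, 0, 1, 1), c(0.1, 0.2, 0.6, 0.8))   # 1 (perfect ranking)
```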

Uncertainty in statistics relative to the reference model(?)

If I compare the plot of the absolute values with the plot of the relative values, something is off with the uncertainty calculations. It seems the uncertainty is calculated relative to the mean of the reference model's statistic, instead of being based on the observation-wise differences, which would make the delta statistic collapse onto the baseline for the reference model. I haven't looked at the code yet, but if it is calculated this way it will bias the length of the uncertainty intervals. Although I might have misunderstood what the reference model is.

Compare the two plots here:
image

image
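To illustrate the point with toy numbers (made up for illustration, not projpred output): if the delta and its SE are computed from pointwise differences, the reference model compared against itself collapses exactly onto the baseline:

```r
# Pointwise elpd contributions for a submodel and the reference model.
elpd_ref <- c(-1.2, -0.8, -2.1, -0.5)
elpd_sub <- c(-1.5, -0.9, -2.0, -1.1)

d <- elpd_sub - elpd_ref                     # observation-wise differences
c(delta = sum(d), se = sd(d) * sqrt(length(d)))

# Reference vs itself: all differences are zero, so the interval
# collapses to a point at the baseline.
d0 <- elpd_ref - elpd_ref
sd(d0) * sqrt(length(d0))  # 0
```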

Fix approach to search_L1

This method involves a lot of old code that still does things very manually, so we have to find a better way to abstract the method. For the moment we'll at least extract the response safely by using model.matrix.
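As a sketch of that safer extraction (standard stats-package machinery, not the actual projpred internals): model.frame() and model.response() recover the response and design matrix from a formula without manual parsing:

```r
# Recover y and X from a formula via the standard model-frame tools.
d <- data.frame(y = rnorm(6), x1 = rnorm(6), x2 = rnorm(6))
mf <- model.frame(y ~ x1 + x2, data = d)
y <- model.response(mf)
X <- model.matrix(attr(mf, "terms"), mf)
dim(X)  # 6 x 3: intercept plus two predictors
```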

Inappropriate warning messages

Using projpred 2.0.0, if I run

  pulse_tap_cvs <-
    cv_varsel(
      mdl_pulse_tap_ref_rstan,
      # nterms_max = 39,
      nloo = 2000)

I get the following warnings, all of which seem inappropriate. The third is particularly confusing because it tells me to use cv_varsel when I already am, and it leaves me wondering whether I can still trust the results and what the implications are of the projected model's performance being misleading:

Warning messages:
1: Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead. 
2: In cv_varsel.refmodel(refmodel, ...) :
  K provided, but cv_method is LOO.
3: In varsel.refmodel(refmodel, method = method, ndraws = ndraws, nclusters = nclusters,  :
  The performance of the projected model seems to be misleading, we recommend checking the reference model as well as running `varsel` with a larger `ndraws_pred` or `cv_varsel` for a more robust estimation.

Suggestion: Change-in-estimate projection

In epidemiology we are concerned with the selection of confounders. One way to select confounders for an exposure is to fit a full model (reference model) with all potential confounders and then remove potential confounders in such a way that the effect estimate for the exposure(s) stays as similar to the estimate from the full model as possible, while keeping the model as small as possible. Typically only the point estimate is considered, which throws away information.

I wonder whether the projection could be adjusted to work towards this goal, instead of selecting the minimal submodel that keeps the same predictive performance. Maybe something to look into?
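For context, the classic (frequentist, point-estimate-based) change-in-estimate procedure looks roughly like this; the simulated data and the implicit 10% threshold are illustrative assumptions, not projpred code:

```r
# Toy change-in-estimate check: z1 is a true confounder (it affects both
# the exposure and the outcome), z2 is pure noise. Dropping z2 barely
# moves the exposure estimate; dropping z1 moves it a lot.
set.seed(1)
n <- 200
z1 <- rnorm(n)
z2 <- rnorm(n)
exposure <- 0.5 * z1 + rnorm(n)
y <- 1 + exposure + z1 + rnorm(n)

b_full <- coef(lm(y ~ exposure + z1 + z2))["exposure"]
changes <- sapply(c("z1", "z2"), function(conf) {
  keep <- setdiff(c("exposure", "z1", "z2"), conf)
  b_red <- coef(lm(reformulate(keep, response = "y")))["exposure"]
  abs(b_red - b_full) / abs(b_full)
})
round(changes, 3)  # dropping z2 changes little; dropping z1 changes a lot
```

A projection-based analogue would presumably replace the point-estimate comparison with a divergence between the submodel's and reference model's posteriors for the exposure effect.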

Error when projecting with non-NULL `solution_terms`.

Applying project() with non-NULL argument solution_terms leads to an error. I'll create a PR for fixing this. Below is a reproducible example.

# Based on <https://github.com/stan-dev/projpred/blob/master/README.md>.

library(projpred)
library(rstanarm)
options(mc.cores = parallel::detectCores(logical = FALSE))

data("df_gaussian", package = "projpred")
df_gaussian <- df_gaussian[1:40, ]
mydat <- cbind("y" = df_gaussian$y, as.data.frame(df_gaussian$x))
mydat$mygroup <- gl(n = 8, k = floor(nrow(mydat) / 8), labels = paste0("gr", seq_len(8)))
set.seed(457211)
mydat$noise <- rnorm(nrow(mydat))
mygroup_icpts_truth <- rnorm(nlevels(mydat$mygroup), sd = 6)
mygroup_V1_truth <- rnorm(nlevels(mydat$mygroup), sd = 6)
mydat$y <- mydat$y + mygroup_icpts_truth[as.numeric(mydat$mygroup)]
mydat$y <- mydat$y + mygroup_V1_truth[as.numeric(mydat$mygroup)] * mydat$V1

myfit <- stan_glmer(y ~ V1 + V2 + V3 + V4 + V5 + noise + (1 + V1 | mygroup),
                    data = mydat,
                    seed = 1140350788)

cvs <- cv_varsel(myfit)
solution_terms(cvs)

C_proj <- project(myfit,
                  solution_terms = solution_terms(cvs)[1:3],
                  ndraws = ceiling(0.05 * nrow(as.matrix(myfit))))

The last call throws an error:

Error in project(myfit, solution_terms = solution_terms(cvs)[1:3], ndraws = ceiling(0.05 *  :
  solution_terms contains an index larger than the number ofvariables in the model.

Implementation of the Gamma distribution

Hi,

I am interested in using projpred with a Gamma distributed dependent variable for a current research project.

I have seen (as far as I can judge) from the source code that the implementation of the Gamma distribution is at an advanced stage, and I was wondering whether a full implementation is planned in the near future.

I think my knowledge is unfortunately not deep enough to attempt it myself.

Incorrectly computed AUC for cv_varsel

For details, I refer to

https://discourse.mc-stan.org/t/auc-0-perfect-prediction/20927
and less importantly to:
https://discourse.mc-stan.org/t/projpred-interactions-without-marginal-terms/14893/12

Below is the AUC plot of the same fit, with varsel and with cv_varsel. The scaling in the varsel version is correct; cv_varsel is incorrectly scaled. No information (0 variables) should give 0.5 (the prior, though ROC analysis normally does not refer to priors) in standard usage, even though in the mentioned thread one of the developers claims otherwise:

"Our implementation corresponds to AUC between 0 and 1, where a value of 0 corresponds to 100% incorrect predictions and 1 to 100% correct predictions.

My wild guess: auc_correct = auc / 2 + 0.5

varsel

cv_varsel

As a minor point, I think at least for cv the value for 0 should also be simulated and have error bars, but that's a matter of taste.

Implement survival models

We need to implement the projection for survival models at some point. The actual implementation needs some thinking and is likely to depend on how the survival models are/will be handled in rstanarm (issue stan-dev/rstanarm#69) and in package survival. But this is mainly related to how this appears to the user. Regarding the projection itself, there should not be too many difficulties provided the likelihood is log-concave.

Update the tests to the new interface

Some tests already work under the new interface, but I expect more to fail. Go into the details and update the tests (some existing tests may no longer be sensible).

Implement negative binomial

Hi, I work in ecology where we often analyse count data. While in theory the default distribution for count data is the Poisson, in practice the data are more often than not overdispersed relative to the Poisson distribution. In that case, we default to GLMs with a negative binomial likelihood. I am not sure how difficult the negative binomial likelihood would be to implement in projpred, but I can see it being a useful addition for many users. To give an example, I am currently working on a project with around 2500 count responses and 350 potential predictors, of which I expect maybe 5-10% to be associated with my response. In this scenario, fitting a reference model with horseshoe priors and doing variable selection with projpred seems to be the right thing to do. However, since my response is overdispersed, a negative binomial likelihood seems more appropriate than a Poisson. Thanks for your consideration.
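To illustrate the overdispersion described above (simulated data, using the mu/size parameterisation of rnbinom, i.e. NB2):

```r
# For Poisson data, Var(y) = E(y); ecological counts typically show
# Var(y) >> E(y). Under NB2, Var(y) = mu + mu^2 / size.
set.seed(1)
counts <- rnbinom(500, mu = 4, size = 0.8)
c(mean = mean(counts), var = var(counts))  # variance far exceeds the mean
4 + 4^2 / 0.8  # theoretical NB variance here: 24, versus a mean of 4
```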

Problems when installing projpred on Windows 10

Hi all,

as discussed with Aki via email, I get the following error when trying to install projpred on Windows 10 with R 3.3.3 or 3.4.0 and g++ 4.6.3:

* installing *source* package 'projpred' ...
** libs

*** arch - i386
c:/Rtools/mingw_32/bin/g++  -I"C:/PROGRA~1/R/R-34~1.0/include" -DNDEBUG  -I"C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include" -I"C:/Users/paulb/Documents/R/win-library/3.4/RcppArmadillo/include"   -I"d:/Compiler/gcc-4.9.3/local330/include"     -O2 -Wall  -mtune=core2 -c RcppExports.cpp -o RcppExports.o
c:/Rtools/mingw_32/bin/g++  -I"C:/PROGRA~1/R/R-34~1.0/include" -DNDEBUG  -I"C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include" -I"C:/Users/paulb/Documents/R/win-library/3.4/RcppArmadillo/include"   -I"d:/Compiler/gcc-4.9.3/local330/include"     -O2 -Wall  -mtune=core2 -c elnetfun.cpp -o elnetfun.o
elnetfun.cpp: In function 'Rcpp::List glm_elnet_c(arma::mat, Rcpp::Function, arma::vec, double, bool, double, int, size_t, bool, int)':
elnetfun.cpp:170:20: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
     for (int j=0; j<D; j++)
                    ^
c:/Rtools/mingw_32/bin/g++ -shared -s -static-libgcc -o projpred.dll tmp.def RcppExports.o elnetfun.o -LC:/PROGRA~1/R/R-34~1.0/bin/i386 -lRlapack -LC:/PROGRA~1/R/R-34~1.0/bin/i386 -lRblas -lgfortran -lm -lquadmath -Ld:/Compiler/gcc-4.9.3/local330/lib/i386 -Ld:/Compiler/gcc-4.9.3/local330/lib -LC:/PROGRA~1/R/R-34~1.0/bin/i386 -lR
installing to C:/Users/paulb/Documents/R/win-library/3.4/projpred/libs/i386

*** arch - x64
c:/Rtools/mingw_64/bin/g++  -I"C:/PROGRA~1/R/R-34~1.0/include" -DNDEBUG  -I"C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include" -I"C:/Users/paulb/Documents/R/win-library/3.4/RcppArmadillo/include"   -I"d:/Compiler/gcc-4.9.3/local330/include"     -O2 -Wall  -mtune=core2 -c RcppExports.cpp -o RcppExports.o
In file included from C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include/Rcpp/as.h:25:0,
                 from C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include/RcppCommon.h:160,
                 from C:/Users/paulb/Documents/R/win-library/3.4/RcppArmadillo/include/RcppArmadilloForward.h:26,
                 from C:/Users/paulb/Documents/R/win-library/3.4/RcppArmadillo/include/RcppArmadillo.h:31,
                 from RcppExports.cpp:4:
C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include/Rcpp/internal/Exporter.h: In instantiation of 'Rcpp::traits::Exporter<T>::Exporter(SEXP) [with T = long long unsigned int; SEXP = SEXPREC*]':
C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include/Rcpp/as.h:79:51:   required from 'T Rcpp::internal::as(SEXP, Rcpp::traits::r_type_generic_tag) [with T = long long unsigned int; SEXP = SEXPREC*]'
C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include/Rcpp/as.h:144:82:   required from 'T Rcpp::as(SEXP) [with T = long long unsigned int; SEXP = SEXPREC*]'
C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include/Rcpp/InputParameter.h:34:45:   required from 'Rcpp::InputParameter<T>::operator T() [with T = long long unsigned int]'
RcppExports.cpp:25:144:   required from here
C:/Users/paulb/Documents/R/win-library/3.4/Rcpp/include/Rcpp/internal/Exporter.h:31:31: error: invalid conversion from 'SEXP' to 'long long unsigned int' [-fpermissive]
       Exporter( SEXP x ) : t(x){}
                               ^
make: *** [RcppExports.o] Error 1
Warning: execution of command 'make -f "Makevars.win" -f "C:/PROGRA~1/R/R-34~1.0/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-34~1.0/share/make/winshlib.mk" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="projpred.dll" WIN=64 TCLBIN=64 OBJECTS="RcppExports.o elnetfun.o"' had status 2
ERROR: compilation failed for package 'projpred'
* removing 'C:/Users/paulb/Documents/R/win-library/3.4/projpred'
Error: Command failed (1)

The session info is as follows:

R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] httr_1.2.1      compiler_3.4.0  R6_2.2.0        tools_3.4.0     withr_1.0.2    
 [6] curl_2.5        memoise_1.1.0   knitr_1.15.1    git2r_0.18.0    digest_0.6.12  
[11] devtools_1.12.0

Does not work with `rstanarm::stan_lm`

The varsel function does not throw an error but does not work correctly when given an object produced by stan_lm in the rstanarm package. The reason seems to be that in this part of the .extract_vars function:

dis = unname(e[["aux"]]) %ORifNULL% rep(NA, nrow(e$beta))[perm_inv]

dis becomes a vector of NA because the auxiliary parameter is called "sigma" rather than "aux".

Installation of projpred

Error message:
Quitting from lines 121-122 (quickstart.Rmd)
Error: processing vignette 'quickstart.Rmd' failed with diagnostics:
glm_ridge error: Deviance became NaN, the problem is probably ill-behaved. Try adding more regularization.
Execution halted

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.12.0

loaded via a namespace (and not attached):
[1] httr_1.2.1 R6_2.2.0
[3] tools_3.3.1 RcppArmadillo_0.7.800.2.0
[5] withr_1.0.2 curl_2.2
[7] Rcpp_0.12.10 memoise_1.0.0
[9] knitr_1.15.1 git2r_0.15.0
[11] digest_0.6.10

NAs should be ignored if present in variables not used in the model

When fitting a model that uses a subset of variables from a dataframe, I would expect that missing values that are present in any of the columns not referenced by the model should be ignored.

library(rstanarm)
set.seed(1234)
df <- data.frame(matrix(runif(21*500), ncol=21))
colnames(df)[21] <- "y"

## missing value in the 20th column
df$X20[99] <- NA

## model only uses the first 10 columns
model <- reformulate(colnames(df)[1:10], response="y")
fit <- stan_glm(model, data=df, iter=500, seed=1)
vs <- varsel(fit)

## Error: NAs are not allowed in 'newdata'.
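Until this is fixed, a workaround is to subset the data frame to the variables the formula actually uses before fitting, so NAs in unused columns cannot interfere:

```r
# Keep only the response and the predictors referenced by the formula.
df <- data.frame(matrix(runif(21 * 50), ncol = 21))
colnames(df)[21] <- "y"
df$X20[9] <- NA                          # NA in a column the model ignores
model <- reformulate(colnames(df)[1:10], response = "y")
df_used <- df[, all.vars(model)]         # y and X1..X10 only, no NAs
anyNA(df_used)  # FALSE
# fit <- rstanarm::stan_glm(model, data = df_used); vs <- varsel(fit)
```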

Implement student-t likelihood

The projection for the student-t likelihood would be a nice addition to the package. The student-t is not a log-concave likelihood, and thus the implementation using the same framework as for other likelihoods is likely to cause some problems. However, there is another idea of doing the projection similarly as for the Gaussian case by expressing the t-distribution as a scale mixture of Gaussians, and then using the Gaussian projection with the observations weighted by their precisions. This is not the same as measuring the KL-divergence between the t-distributions but should be easy to implement and might give reasonable results.
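A rough sketch of that idea in the scale-mixture form. The EM-style precision weights w_i = (nu + 1) / (nu + r_i^2) are an assumed choice, and as noted above this approximates rather than equals the KL projection between t distributions:

```r
# Iteratively reweighted Gaussian fits mimicking a student-t fit via the
# scale-mixture-of-Gaussians representation: observations with large
# scaled residuals r get downweighted.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + 0.5 * rt(n, df = 3)   # heavy-tailed noise
nu <- 3

fit <- lm(y ~ x)
for (it in 1:20) {
  r <- residuals(fit) / summary(fit)$sigma
  w <- (nu + 1) / (nu + r^2)           # precision weights
  fit <- lm(y ~ x, weights = w)
}
coef(fit)  # close to the true (1, 2) despite the outliers
```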

Incorrect formula for deviance of Gaussian family?

Shouldn't this line:

-2 * weights * (-0.5 / dis * (y - mu)^2 - log(dis))

read

-2 * weights * (-0.5 / dis^2 * (y - mu)^2 - log(dis)) 

i.e. taking dis^2 instead of dis in the middle? If I understood projpred's internal code correctly, then dis is the standard deviation here, not the variance. And the deviance is -2 times the (weighted) log likelihood.
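The two candidate formulas can be checked numerically against R's dnorm(): since dis is the standard deviation, only the dis^2 version matches -2 times the Gaussian log-likelihood, up to the constant log(2 * pi), which cancels in deviance comparisons:

```r
y <- 2.5; mu <- 1.0; dis <- 0.5; weights <- 1

# Proposed corrected term (constant -0.5 * log(2 * pi) dropped):
dev_corr <- -2 * weights * (-0.5 / dis^2 * (y - mu)^2 - log(dis))

# Full -2 * weighted log-likelihood from dnorm():
dev_full <- -2 * weights * dnorm(y, mu, dis, log = TRUE)

dev_full - dev_corr  # log(2 * pi): the two agree up to the constant
```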

check the length of the penalty vector

Currently the penalty argument of varsel() is not checked to be of the expected length, which raises a confusing error and some warnings.

 > vs <- varsel(fit, method='l1', penalty=c(1, 0))

Error in seq.default(log(lambda_min), log(lambda_max), len = nlam) :
  'from' must be a finite number
In addition: Warning messages:
1: In max(lambda_max_cand[is.finite(lambda_max_cand)]) :
  no non-missing arguments to max; returning -Inf
2: In log(lambda_min) : NaNs produced
3: In log(lambda_max) : NaNs produced
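A length check along these lines would turn the confusing failure into a clear error; check_penalty is a hypothetical helper sketched here, not projpred API:

```r
# Validate the penalty vector against the design matrix up front.
check_penalty <- function(penalty, x) {
  if (!is.null(penalty) && length(penalty) != ncol(x)) {
    stop(sprintf("penalty must have length %d (one per column of x), got %d",
                 ncol(x), length(penalty)))
  }
  invisible(penalty)
}

x <- matrix(rnorm(30), ncol = 3)
check_penalty(c(1, 1, 0), x)    # OK
try(check_penalty(c(1, 0), x))  # clear error instead of NaN warnings
```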

projpred removed from CRAN

I just realized due to failing CRAN checks of one of my packages that the "projpred" package is no longer on CRAN. Are you in contact with the CRAN team regarding this issue (planning to put it back on CRAN), or will this package become orphaned?
