pbreheny / biglasso Goto Github PK

biglasso: Extending Lasso Model Fitting to Big Data in R

Home Page: http://pbreheny.github.io/biglasso/

R 32.48% C 2.88% C++ 64.59% Shell 0.05%

lasso r out-of-core bigdata parallel-computing

biglasso's Introduction

biglasso: Extend Lasso Model Fitting to Big Data in R

biglasso extends lasso and elastic-net linear and logistic regression models for ultrahigh-dimensional, multi-gigabyte data sets that cannot be loaded into memory. It utilizes memory-mapped files to store the massive data on the disk and only read those into memory whenever necessary during model fitting. Moreover, some advanced feature screening rules are proposed and implemented to accelerate the model fitting. As a result, this package is much more memory- and computation-efficient and highly scalable as compared to existing lasso-fitting packages such as glmnet and ncvreg. Bechmarking experiments using both simulated and real data sets show that biglasso is not only 1.5x to 4x times faster than existing packages, but also at least 2x more memory-efficient. More importantly, to the best of our knowledge, biglasso is the first R package that enables users to fit lasso models with data sets that are larger than available RAM, thus allowing for powerful big data analysis on an ordinary laptop.

Installation

To install the latest stable release version from CRAN:

install.packages("biglasso")

To install the latest development version from GitHub:

remotes::install_github("pbreheny/biglasso")

News

See NEWS.md for latest news.
The technical paper of this package was selected as a Winner of 2017 ASA Student Paper Competiton from Section on Statistical Computing.
This package finished in the top 3 for 2017 ASA Chambers Statistical Software Award.

Documentation

Here are the R Reference manual and Package Website
Here are the technical papers of the package: i) The software paper; and ii) the paper of hybrid safe-strong rules

Features

It utilizes memory-mapped files to store the massive data on the disk, only loading data into memory when necessary during model fitting. Consequently, it's able to seamlessly handle out-of-core computation.
It is built upon pathwise coordinate descent algorithm with warm start, active set cycling, and feature screening strategies, which has been proven to be one of fastest lasso solvers.
We develop new, adaptive feature screening rules that outperform state-of-the-art screening rules such as the sequential strong rule (SSR) and the sequential EDPP rule (SEDPP) with additional 1.5x to 4x speedup.
The implementation is designed to be as memory-efficient as possible by eliminating extra copies of the data created by other R packages, making biglasso at least 2x more memory-efficient than glmnet.
The underlying computation is implemented in C++, and parallel computing with OpenMP is also supported.

Benchmarks:

Simulated data:

Packages to be compared: biglasso (1.4-0), glmnet (4.0-2), ncvreg (3.12-0), and picasso (1.3-1).
Platform: AMD Ryzen 5 5600X @ 4.2 GHz and 32 GB RAM.
Experiments: solving lasso-penalized linear regression over the entire path of 100 lambda values equally spaced on the log scale of lambda / lambda_max from 0.1 to 1; varying number of observations n and number of features p; 20 replications, the mean computing time (in seconds) are reported.
Data generating model: y = X * beta + 0.1 eps, where X and eps are i.i.d. sampled from N(0, 1).

(1) `biglasso` is more computation-efficient:

In all the settings, biglasso (1 core) is uniformly faster than picasso, glmnet and ncvreg. When the data gets bigger, biglasso achieves 6-9x speed-up compared to other packages. Moreover, the computing time of biglasso can be further reduced by half via parallel-computation of multiple cores.

(2) `biglasso` is more memory-efficient:

To prove that biglasso is much more memory-efficient, we simulate a 1000 X 100000 large feature matrix. The raw data is 0.75 GB. We used Syrupy to measure the memory used in RAM (i.e. the resident set size, RSS) every 1 second during lasso model fitting by each of the packages.

The maximum RSS (in GB) used by a single fit and 10-fold cross validation is reported in the Table below. In the single fit case, biglasso consumes 0.60 GB memory in RAM, 23% of that used by glmnet and 24% of that used by ncvreg. Note that the memory consumed by glmnet and ncvreg are respectively 3.4x and 3.3x larger than the size of the raw data. biglasso also requires less additional memory to perform cross-validation, compared other packages. For serial 10-fold cross-validation, biglasso requires just 31% of the memory used by glmnet and 11% of that used by ncvreg, making it 3.2x and 9.4x more memory-efficient compared to these two, respectively.

Package	picasso	ncvreg	glmnet	biglasso
Single fit	0.74	2.47	2.57	0.60
10-fold CV	-	4.62	3.11	0.96

Note: ..* the memory savings offered by biglasso would be even more significant if cross-validation were conducted in parallel. However, measuring memory usage across parallel processes is not straightforward and not implemented in Syrupy; ..* cross-validation is not implemented in picasso at this point.

Real data:

The performance of the packages are also tested using diverse real data sets:

Breast cancer gene expression data (GENE);
MNIST handwritten image data (MNIST);
Cardiac fibrosis genome-wide association study data (GWAS);
Subset of New York Times bag-of-words data (NYT).

The following table summarizes the mean (SE) computing time (in seconds) of solving the lasso along the entire path of 100 lambda values equally spaced on the log scale of lambda / lambda_max from 0.1 to 1 over 20 replications.

Package	GENE	MNIST	GWAS	NYT
	`n=536`	`n=784`	`n=313`	`n=5,000`
	`p=17,322`	`p=60,000`	`p=660,495`	`p=55,000`
picasso	0.67 (0.02)	2.94 (0.01)	14.96 (0.01)	15.91 (0.16)
ncvreg	0.87 (0.01)	4.22 (0.00)	19.78 (0.01)	25.59 (0.12)
glmnet	0.74 (0.01)	3.82 (0.01)	16.19 (0.01)	24.94 (0.16)
biglasso	0.31 (0.01)	0.61 (0.02)	4.82 (0.01)	5.91 (0.78)

Big data: Out-of-core computation

To demonstrate the out-of-core computing capability of biglasso, a 96 GB real data set from a large-scale genome-wide association study is analyzed. The dimensionality of the design matrix is: n = 973, p = 11,830,470. Note that the size of data is 3x larger than the installed 32 GB of RAM.

Since other three packages cannot handle this data-larger-than-RAM case, we compare the performance of screening rules SSR and Adaptive based on our package biglasso. In addition, two cases in terms of lambda_min are considered: (1) lam_min = 0.1 lam_max; and (2) lam_min = 0.5 lam_max, as in practice there is typically less interest in lower values of lambdafor very high-dimensional data such as this case. Again the entire solution path with 100 lambda values is obtained. The table below summarizes the overall computing time (in minutes) by screening rule SSR (which is what other three packages are using) and our new rule Adaptive. (No replication is conducted.)

Cases	SSR	Adaptive
`lam_min / lam_max = 0.1`, 1 core	189.67	66.05
`lam_min / lam_max = 0.1`, 4 cores	86.31	46.91
`lam_min / lam_max = 0.5`, 1 core	177.84	24.84
`lam_min / lam_max = 0.5`, 4 cores	85.67	15.14

Reference:

Zeng Y and Breheny P (2021). The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R. R Journal, 12: 6-19. URL https://doi.org/10.32614/RJ-2021-001
Zeng Y, Yang T, and Breheny P (2021). Hybrid safe-strong rules for efficient optimization in lasso-type problems. Computational Statistics and Data Analysis, 153: 107063. URL https://doi.org/10.1016/j.csda.2020.107063
Wang C and Breheny P (2022). Adaptive hybrid screening for efficient lasso optimization. Journal of Statistical Computation and Simulation, 92: 2233–2256. URL https://doi.org/10.1080/00949655.2021.2025376
Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. (2012). Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74 (2), 245-266.
Wang, J., Zhou, J., Wonka, P., and Ye, J. (2013). Lasso screening rules via dual polytope projection. In Advances in Neural Information Processing Systems, pp. 1070-1078.
Xiang, Z. J., and Ramadge, P. J. (2012, March). Fast lasso screening tests based on correlations. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on (pp. 2137-2140). IEEE.
Wang, J., Zhou, J., Liu, J., Wonka, P., and Ye, J. (2014). A safe screening rule for sparse logistic regression. In Advances in Neural Information Processing Systems, pp. 1053-1061.

Report bugs：

open an issue or send an email to Patrick Breheny at [email protected]

biglasso's People

Contributors

Stargazers

Watchers

biglasso's Issues

Positivity or box constraints on fitted coefficients?

In glmnet it is possible to impose positivity constraints on the fitted coefficients using the lower.limits argument and box constraints using the upper.limits argument. Would it be possible to incorporate this feature too in biglasso? Maybe just by clipping the coefficients to the specified bounds after each iteration or something?

Error R_Reprotect: only x protected items, can't reprotect index y

This might be related to #41 but as the error is different, I'm reporting this as a separate issue. In fact, this was the problem I came across first but when trying to boil everything down to a reproducible example, I ran into what is reported in #41. What seems to trigger this more severe issue is simply a larger problem size when compared to what is reported in #41.

set.seed(11)

nrow <- 3000000
ncol <- 1000

mat <- bigmemory::big.matrix(nrow, ncol, type = "double", shared = TRUE)

for (i in seq_len(ncol)) {
 mat[, i] <- sample(c(0, 1), nrow, replace = TRUE)
}

res <- biglasso::cv.biglasso(
 mat, sample(c(TRUE, FALSE), nrow, replace = TRUE), family = "binomial",
 ncores = 8, nlambda = 50, verbose = TRUE
)

yielding

Preprocessing start: 2021-09-21 13:23:31.000
Preprocessing end: 2021-09-21 13:23:41.000

-----------------------------------------------
Lambda 1. Now time: 2021-09-21 13:23:42.000
...
Lambda 49. Now time: 2021-09-21 13:51:23.000
Error in structure(new.time - time, class = "proc_time") : 
 R_Reprotect: only -69 protected items, can't reprotect index -71
Calls: <Anonymous> -> biglasso -> system.time -> structure
Error during wrapup: R_Reprotect: only -53 protected items, can't reprotect index -54

and my sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /cluster/spack/apps/linux-centos7-x86_64/gcc-8.2.0/openblas-0.3.7-aapr6wps44sbmhbedmmigxjk5bhfy5p3/lib/libopenblas_haswell-r0.3.7.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] bigmemory.sri_0.1.3 bigmemory_4.5.36    compiler_4.0.2     
[4] Matrix_1.2-18       biglasso_1.4.1      parallel_4.0.2     
[7] Rcpp_1.0.7          grid_4.0.2          ncvreg_3.13.0      
[10] lattice_0.20-41

Happy to share additional info and also to do some further probing of the issue if someone has any pointers.

Warning: stack imbalance

Using code as

set.seed(11)

mat <- matrix(sample(c(0, 1), 5000 * 500, replace = TRUE), nrow = 5000)
mat <- bigmemory::as.big.matrix(mat, type = "double", shared = TRUE)

res <- biglasso::cv.biglasso(
 mat, sample(c(TRUE, FALSE), 5000, replace = TRUE), family = "binomial",
 ncores = 4, nlambda = 50, verbose = TRUE
)

I get

Preprocessing start: 2021-09-20 21:29:25.000
Preprocessing end: 2021-09-20 21:29:25.000

-----------------------------------------------
Lambda 1. Now time: 2021-09-20 21:29:25.000
...
Lambda 49. Now time: 2021-09-20 21:29:27.000
Warning: stack imbalance in '<-', 2 then 21

Any my sessionInfo() is

R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /cluster/spack/apps/linux-centos7-x86_64/gcc-8.2.0/openblas-0.3.7-aapr6wps44sbmhbedmmigxjk5bhfy5p3/lib/libopenblas_haswell-r0.3.7.so

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] bigmemory.sri_0.1.3 bigmemory_4.5.36    compiler_4.0.2     
[4] Matrix_1.2-18       biglasso_1.4.1      parallel_4.0.2     
[7] Rcpp_1.0.7          grid_4.0.2          ncvreg_3.13.0      
[10] lattice_0.20-41

Happy to share further info if requested!

Segfault

I get the following error when using biglasso:

 *** caught segfault ***
address (nil), cause 'unknown'

Traceback:
 1: col.idx + 1
 2: biglasso(X.bm, Y, screen = "SSR-BEDPP", ncores = 16, verbose = TRUE,     dfmax = 50000, output.time = TRUE, max.iter = 100)
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault

It occurs when executing the following:

fit <- biglasso(X.bm, Y, screen = "SSR-BEDPP", ncores = 16, verbose=TRUE, dfmax=50000, output.time=TRUE, max.iter=100)

The data has 292074 features (SNPs) and 298769 individuals. biglasso runs for 92 lambdas. Excluding max.iter=100 makes the process run indefinitely, so we decided to set max.iter to 100 to see if we can get the training to complete (we assumed it ran indefinitely due be unable to converge on lambda 92 and thus run through all 1000 iterations, which we expect will take 5 days).

We have had no issues running biglasso on other similar datasets, albeit ones with many more individuals (~350,000) than features (~100,000).

I am using the stable version:

install.packages("biglasso")

Package versions:

R version 3.4.1 (2017-06-30) -- "Single Candle"
       Package     Version
            BH    1.66.0-1
    data.table      1.11.4
        ncvreg      3.10-0
          Rcpp     0.12.17
 RcppArmadillo   0.8.500.0
            BH    1.65.0-1
      biglasso       1.3-6
     bigmemory      4.5.33
 bigmemory.sri       0.1.3
    data.table    1.10.4-3
        ncvreg       3.9-1
          Rcpp     0.12.15
 RcppArmadillo 0.8.300.1.0

Thanks,

Vince

Split available cores in cv.biglasso()

I have a quick question: say you have 40 cores available for a 5-fold cv.biglasso() run. From having a quick look at your code, it seems to me that this would have 35 cores doing nothing. From a distance it seems straightforward, adding a couple of minor changes, to have each fold work with 8 cores (and thereby making such a scenario a bit more efficient). Or am I mistaken? I'm happy to prepare a PR adding such functionality.

.

Subsetting a

I think you should always subset a at https://github.com/YaohuiZeng/biglasso/blob/master/R/biglasso.R#L263.

An early stop can be the consequence of a saturated model (only in the logistic model) but also the consequence of the use of dfmax, which can appear in gaussian models too.
It then causes an error at https://github.com/YaohuiZeng/biglasso/blob/master/R/biglasso.R#L275 due to the length incompatibility between the two subtracted vectors.

Rigorous LASSO and causal inference

Hi.

I have two requests for future development:

I want to estimate causal LASSO on large datasets. Currently I do it with the hdm package. I think it internally uses glmnet, thus it quickly becomes infeasible as I increase the size of the data set. Could you incorporate a standardized procedure to estimate this?
In order to quickly manually implement a causal LASSO one can run a rigorous LASSO of X separately on Y and D, and then use the partialled-out results to estimate an average treatment effect. There is currently no option to estimate a rigorous LASSO in the package, could it be implemented?

Gal

are NCV penalties working correctly?

get latest version

devtools::install_github("pbreheny/biglasso")
#> Skipping install of 'biglasso' from a github remote, the SHA1 (d0d0ec2) has not changed since last install.
#> Use force = TRUE to force installation
library(biglasso)
#> Loading required package: bigmemory
#> Loading required package: Matrix
#> Loading required package: ncvreg
library(ncvreg)
library(glmnet)
#> Loaded glmnet 4.1-8
library(reprex)

data(colon)
X <- colon$X |> ncvreg::std()
X <- cbind(1, X)
xtx <- apply(X, 2, crossprod)
init <- rep(0, ncol(X)) # cold starts - use more iterations (default is 1000)
y <- colon$y
og_resid <- resid <- drop(y - X %*% init)
og_X <- X.bm <- as.big.matrix(X)

fit1b <- biglasso_fit(X.bm, y, lambda = 0.05, xtx=xtx, r = resid,
penalty = "MCP", max.iter = 10000)

fit2b <- ncvfit(X = X, y = y, lambda = 0.05, xtx = xtx, r = resid,
penalty = "MCP", max.iter = 10000)

these run fine...

tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.1)
#> ----- PASSED : <-->
#> call| tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.1)
tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.1)
#> ----- PASSED : <-->
#> call| tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.1)

...but the tests below fail -- does this signal a problem?

tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.01)
#> ----- FAILED[data]: <-->
#> call| tinytest::expect_equivalent(fit1b$resid, fit2b$resid, tolerance = 0.01)
#> diff| Mean relative difference: 0.2640539
tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.01)
#> ----- FAILED[data]: <-->
#> call| tinytest::expect_equivalent(fit1b$beta, fit2b$beta, tolerance = 0.01)
#> diff| Mean relative difference: 0.2418661
Created on 2024-04-20 with reprex v2.1.0

data storage mode affects model fit -- why??

# TKP 
# April 2024 
# Objective: explore differences between model fits in-memory and filebacked
# remotes::install_github("YaohuiZeng/biglasso")
library(reprex)
library(biglasso)
#> Loading required package: bigmemory
#> Loading required package: Matrix
#> Loading required package: ncvreg
library(ncvreg)
# load colon data ----------------------------------------------------------------
data(colon)
X <- colon$X |> ncvreg::std()
# X <- cbind(1, X)
xtx <- apply(X, 2, crossprod)
init <- rep(0, ncol(X)) # cold starts - use more iterations (default is 1000)
y <- colon$y
resid <- drop(y - X %*% init)
X.bm <- as.big.matrix(X)

# take 1 -----------------------------
# file-backed model fit 
fit1 <- biglasso_fit(X = X.bm, y = y, lambda = 0.05, xtx = xtx, r = resid,
                     penalty = "lasso", max.iter = 10000)

# compare with `ncvreg::ncvfit()` -- runs in memory 
fit2 <- ncvfit(X = X, y = y, lambda = 0.05, xtx = xtx, r = resid,
               penalty = "lasso", max.iter = 10000)

# test coefficients 
tinytest::expect_equal(fit1$beta, fit2$beta, tolerance = 0.01)
#> ----- FAILED[data]: <-->
#>  call| tinytest::expect_equal(fit1$beta, fit2$beta, tolerance = 0.01)
#>  diff| Mean absolute difference: 0.02752312

# test residuals
tinytest::expect_equal(fit1$resid, fit2$resid, tolerance = 0.01)
#> ----- PASSED      : <-->
#>  call| tinytest::expect_equal(fit1$resid, fit2$resid, tolerance = 0.01)

str(fit1) # note: iter = 3773
#> List of 11
#>  $ beta          : Named num [1:2000] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..- attr(*, "names")= chr [1:2000] "Hsa.3004" "Hsa.13491" "Hsa.13491.1" "Hsa.37254" ...
#>  $ iter          : int 3773
#>  $ resid         : num [1:62] 0.987 0.617 0.905 0.202 0.793 ...
#>  $ lambda        : num 0.05
#>  $ penalty       : chr "lasso"
#>  $ alpha         : num 1
#>  $ loss          : num 29.5
#>  $ penalty.factor: num [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ n             : num 62
#>  $ y             : num [1:62] 1 0 1 0 1 0 1 0 1 0 ...
#>  $ time          : num 0.35
#>  - attr(*, "class")= chr [1:2] "biglasso" "ncvreg"
str(fit2) # note: iter = 2
#> List of 10
#>  $ beta          : Named num [1:2000] 0 0 0 0 0 0 0 0 0 0 ...
#>   ..- attr(*, "names")= chr [1:2000] "Hsa.3004" "Hsa.13491" "Hsa.13491.1" "Hsa.37254" ...
#>  $ iter          : int 2
#>  $ resid         : num [1:62] 0.987 0.617 0.905 0.202 0.793 ...
#>  $ lambda        : num 0.05
#>  $ penalty       : chr "lasso"
#>  $ gamma         : num 3
#>  $ alpha         : num 1
#>  $ loss          : num 29.5
#>  $ penalty.factor: num [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ n             : int 62

nz1 <- which(fit1$beta != 0)
fit1$beta[nz1]
#>      Hsa.8147     Hsa.36689      Hsa.3152     Hsa.36665     Hsa.42949 
#> -0.0322178250 -0.1131494353  0.0218359134  0.0224074245 -0.0043473699 
#>     Hsa.692.2      Hsa.1272       Hsa.166     Hsa.31801      Hsa.3648 
#> -0.1202222748 -0.0109727114  0.0107157905  0.0414907591  0.0007487836 
#>     Hsa.13628      Hsa.3016      Hsa.5392   Hsa.35496.1      Hsa.1832 
#>  0.0022106709  0.0157166112  0.0360233297  0.0107837452 -0.0180740994 
#>     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159     Hsa.33268 
#> -0.0082794639 -0.0176171036  0.0036267952  0.0105440705 -0.0482435225 
#>      Hsa.6814      Hsa.1660      Hsa.1491   Hsa.41098.1 
#>  0.0302496905  0.0469252260  0.0107054818 -0.0234533426

nz2 <- which(fit2$beta != 0)
fit2$beta[nz2]
#>      Hsa.3152     Hsa.36665     Hsa.42949     Hsa.692.2     Hsa.13628 
#>  1.722814e-06  4.248432e-08 -9.890970e-08 -1.908319e-07  9.282081e-07 
#>      Hsa.3016      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928 
#>  5.391427e-07 -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07 
#>     Hsa.41159     Hsa.33268      Hsa.6814      Hsa.1660 
#>  1.868370e-07 -2.609362e-07  2.094525e-07  2.229240e-07
# note: there are fewer nonzero betas here compared to fit1

fit1$beta[intersect(nz1, nz2)]
#>     Hsa.3152    Hsa.36665    Hsa.42949    Hsa.692.2    Hsa.13628     Hsa.3016 
#>  0.021835913  0.022407425 -0.004347370 -0.120222275  0.002210671  0.015716611 
#>     Hsa.1832    Hsa.12241    Hsa.44244     Hsa.2928    Hsa.41159    Hsa.33268 
#> -0.018074099 -0.008279464 -0.017617104  0.003626795  0.010544071 -0.048243523 
#>     Hsa.6814     Hsa.1660 
#>  0.030249691  0.046925226
fit2$beta[intersect(nz1, nz2)]
#>      Hsa.3152     Hsa.36665     Hsa.42949     Hsa.692.2     Hsa.13628 
#>  1.722814e-06  4.248432e-08 -9.890970e-08 -1.908319e-07  9.282081e-07 
#>      Hsa.3016      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928 
#>  5.391427e-07 -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07 
#>     Hsa.41159     Hsa.33268      Hsa.6814      Hsa.1660 
#>  1.868370e-07 -2.609362e-07  2.094525e-07  2.229240e-07
# note: these estimates are much smaller than fit1! 


# take 2 -----------------------------
fit1_take2 <- biglasso_fit(X.bm, y, lambda = 0.05, xtx = xtx, r = resid,
                           penalty = "lasso", max.iter = 10000)

# compare with `ncvreg::ncvfit()`
fit2_take2 <- ncvfit(X = X, y = y, lambda = 0.05, xtx = xtx, r = resid,
                     penalty = "lasso", max.iter = 10000)

# test coefficients 
tinytest::expect_equal(fit1_take2$beta, fit2_take2$beta, tolerance = 0.01)
#> ----- PASSED      : <-->
#>  call| tinytest::expect_equal(fit1_take2$beta, fit2_take2$beta, tolerance = 0.01)

# test residuals
tinytest::expect_equal(fit1_take2$resid, fit2_take2$resid, tolerance = 0.01)
#> ----- PASSED      : <-->
#>  call| tinytest::expect_equal(fit1_take2$resid, fit2_take2$resid, tolerance = 0.01)


nz1_take2 <- which(fit1_take2$beta != 0)
fit1_take2$beta[nz1_take2]
#>      Hsa.3152     Hsa.36665     Hsa.42949     Hsa.692.2     Hsa.13628 
#>  1.722814e-06  4.248432e-08 -9.890970e-08 -1.908319e-07  9.282081e-07 
#>      Hsa.3016      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928 
#>  5.391427e-07 -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07 
#>     Hsa.41159     Hsa.33268      Hsa.6814      Hsa.1660 
#>  1.868370e-07 -2.609362e-07  2.094525e-07  2.229240e-07

nz2_take2 <- which(fit2_take2$beta != 0)
fit1_take2$beta[nz2_take2]
#>      Hsa.3152     Hsa.42949     Hsa.692.2     Hsa.13628      Hsa.3016 
#>  1.722814e-06 -9.890970e-08 -1.908319e-07  9.282081e-07  5.391427e-07 
#>      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159 
#> -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07  1.868370e-07 
#>     Hsa.33268      Hsa.6814      Hsa.1660 
#> -2.609362e-07  2.094525e-07  2.229240e-07

fit1_take2$beta[intersect(nz1_take2, nz2_take2)]
#>      Hsa.3152     Hsa.42949     Hsa.692.2     Hsa.13628      Hsa.3016 
#>  1.722814e-06 -9.890970e-08 -1.908319e-07  9.282081e-07  5.391427e-07 
#>      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159 
#> -2.866869e-07 -5.207912e-07 -8.897555e-07  5.288186e-07  1.868370e-07 
#>     Hsa.33268      Hsa.6814      Hsa.1660 
#> -2.609362e-07  2.094525e-07  2.229240e-07
fit2_take2$beta[intersect(nz1_take2, nz2_take2)]
#>      Hsa.3152     Hsa.42949     Hsa.692.2     Hsa.13628      Hsa.3016 
#>  1.638265e-06 -7.992673e-08 -1.577083e-07  9.061937e-07  4.678115e-07 
#>      Hsa.1832     Hsa.12241     Hsa.44244      Hsa.2928     Hsa.41159 
#> -2.529658e-07 -5.153089e-07 -8.802163e-07  4.484431e-07  1.429543e-07 
#>     Hsa.33268      Hsa.6814      Hsa.1660 
#> -2.589283e-07  1.462757e-07  1.735474e-07
# now all the estimated coefficients are just close to 0


# Questions: 
#   (1) why is it that in take 1, the test of residuals passes but the test of beta doesn't?
#   (2) why is it that in take 2, the beta coefficients are different? 
#       The data passed to the functions is the same. Nothing is stochastic here....
#   (3) why does running the same code produce different answers the second time I run it? (Related to issue 2)

^{Created on 2024-04-12 with reprex v2.1.0}

Immediate crash on small problem

R MRO v3.5.3 and RStudio 1.2.1335 (both fresh installs), latest biglasso. Windows 10 (1903).

Biglasso works on the "colon" toy example with exact same script as below (except X and y are pulled out of the colon data object).

But, biglasso fails with a complete R crash (RStudio reports fatal error) with this:
minimal.gz
Contains a tiny matrix and outcome vector, X and y.

library(biglasso)
load('F:/minimal.gz')
X.bm <- as.big.matrix(X)
fit.bin.lasso <- biglasso(X.bm, y, family = "binomial")

Yet it works perfectly with glmnet:
fit.bin.lasso.glmnet <- glmnet(X, y, family = "binomial")

> str(X)
 int [1:242, 1:2000] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : NULL
> str(y)
 num [1:242] 1 1 1 1 1 1 1 1 1 1 ...

Error in calling table(y) on numeric variable

I hit an error ("unique() applies only to vectors") when trying to run cv.biglasso on a numeric response.

if (fit$family == "binomial" & (min(table(y)) > nfolds)) is not short circuiting when fit$family is gaussian and thus table(y) is being called regardless of family.

Changing & to && should take care of the error.

standardize = FALSE ?

Is there a way for the biglasso() function to not standardize the data? What I'm hoping for is the equivalent of standardize =FALSE in glmnet. I'm using penalty.factor to specify varying penalties on groups of coefficients and wan't more flexibility to effectively standardize groups of coefficients by grouping via the penalty.factor argument instead of individually. However, to do so I need to disable the default of standardization.

Benchmarks

It would be very nice if there were a set of benchmarks comparing this implementation to other implementations in R, such as:

As far as I can tell, right now there's only the author's word for the efficiency.

I can't reproduce your issue posted on SO.

Concerning your question on SO, I can't reproduce your error because I don't know exactly what you've changed compared to the current master branch.
Could you commit the changes you've tried?
It would be easier to try to fix the problem.

Moreover, you should consider using devtools and roxygen2 for package development, it's really nice and elegant. Following the instruction in the book R packages makes it easy.

Dealing with Extremely large datasets (750 GB)

Hi,

I have a very large dataset. The X matrix has 480k rows and around 205k cols. I have saved it as a filebacked.big.matrix and i am using attach.big.matrix to "load" at the time of performing the elastic net regression. Sadly, this is working really slow and took almost 2 days for 1 lambda (lambda is quite small). On this, i have 2 questions

Will it be faster if i load the data in memory? I can ask for upto 2-3TB on the server.
If yes, how can i load filebacked matrix in memory? I was trying using read.big.matrix, deepcopy etc but it doesnt look like it works
Any other suggestions on dealing with such a large dataset?

Thanks
Prateek Sasan

Memory not mapped

This package has been quite useful for my research and thank the author for writing it. However, sometimes I experienced the following problem:

In short words, I have a function which calls biglasso, and I was trying to use mclapply function in the package parallel to have multiple cores to run lasso with same X (created by setupX) and different y.

Here is my function, I think procedures other than biglasso are unimportant:

 if (k > 1) {
   y <-sample(realy)
 } else {
   y <-realy
 }
 tryCatch( {
   time <- system.time(  
     fit <- biglasso(X, y, family='gaussian', penalty.factor = penalty, 
                   lambda.min = 0.0001, nlambda = 250)
   )
   print(time)
   # do the "maxfrac" operation
   coef = as.matrix(fit$beta)
   coef[coef < 0] <- 0.0
   coef = round(coef, digits = 6)
   colnames(coef) = fit$lambda
   if (k == 1) {
     write.table(coef, file=paste(outfname, ".coef", sep=""), sep="\t")
   }
   norm_beta = scale(coef, center=FALSE, scale=colSums(coef))
   beta_max = apply(norm_beta, 1, max)
 }, error = function(e) {
   return(rep(-1, dim(X)[2]+1)) } 
 )
 return(beta_max)
}

Then I use beta_max_all <- mclapply(1:(batch_size+1), bottleneck, mc.cores=n_cores, mc.cleanup=TRUE) for multi-core jobs.

I'd be appreciate if the author can help me figure out what causes the problem. I cannot see any rules about when such issue happens; it seem to occur randomly and sometimes the script just runs smoothly. I run my computing on an Sun Grid Engine Cluster.

Here is all the error message.

*** caught segfault ***
address 0x30, cause 'memory not mapped'

Traceback:
 1: biglasso(X, y, family = "gaussian", penalty.factor = penalty,     lambda.min = 1e-04, nlambda = 250)
 2: system.time(fit <- biglasso(X, y, family = "gaussian", penalty.factor = penalty,     lambda.min = 1e-04, nlambda = 250))
 3: doTryCatch(return(expr), name, parentenv, handler)
 4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 5: tryCatchList(expr, classes, parentenv, handlers)
 6: tryCatch({    time <- system.time(fit <- biglasso(X, y, family = "gaussian",         penalty.factor = penalty, lambda.min = 1e-04, nlambda = 250))    print(time)    coef = as.matrix(fit$beta)    coef[coef < 0] <- 0    coef = round(coef, digits = 6)    colnames(coef) = fit$lambda    if (k == 1) {        write.table(coef, file = paste(outfname, ".coef", sep = ""),             sep = "\t")    }    norm_beta = scale(coef, center = FALSE, scale = colSums(coef))    beta_max = apply(norm_beta, 1, max)}, error = function(e) {    return(rep(-1, dim(X)[2] + 1))})
 7: FUN(X[[i]], ...)
 8: lapply(X = S, FUN = FUN, ...)
 9: doTryCatch(return(expr), name, parentenv, handler)
10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
11: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        msg <- conditionMessage(e)        sm <- strsplit(msg, "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && identical(getOption("show.error.messages"),         TRUE)) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
13: try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
14: sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))
15: FUN(X[[i]], ...)
16: lapply(seq_len(cores), inner.do)
17: mclapply(1:(batch_size + 1), bottleneck, mc.cores = n_cores,     mc.cleanup = TRUE)
18: system.time(beta_max_all <- mclapply(1:(batch_size + 1), bottleneck,     mc.cores = n_cores, mc.cleanup = TRUE))
An irrecoverable exception occurred. R is aborting now ...

Support methods to calculate AIC and BIC values of biglasso fit

I was wondering if it would be possible to also support the calculation of AIC and BIC values of a biglasso fit, similar to the way that the ncvreg package provides this (returning a vector of AIC and BIC values for each lambda value used). This would enable selection of the best lambda based on AIC or BIC, which is faster than based on cross validation.

Option for sample weights?

I was wondering if there was a way to put in sample weights for biglasso. In cv.glmnet there is a "weights" option and it would be very useful for econometrics work.

Possible to set the intercept to 0

Hi.

Thanks for making biglasso, it works well with bigmemory and memory-mapped files.

Is there a way, or hack, to set the intercept to 0? In glmnet you can set intercept=FALSE and the intercept(s) are set to zero.

Removed from CRAN

biglasso got removed from CRAN today due to issues with function signatures differing in some files:
https://www.stats.ox.ac.uk/pub/bdr/LTO/biglasso.out

row.idx doesn't work ?

I try to split X,y by row.idx parameter in biglasso and cv.biglasso
however when row.idx is been used, cv.biglasso will return error massage
"錯誤發生在 E[cv.ind == i, 1:res$nl] <- res$loss：
被替換的項目不是替換值長度的倍數"
sorry for chinese version massage,
this massage more likely indicate that two lengths "E[cv.ind == i, 1:res$nl]" and "res$loss" are different

code

library(biglasso)

data("colon")

X <- as.big.matrix(colon$X)
y <- rnorm(62, mean=50)

tr.ind <- sample(length(y), length(y)*0.85, replace = FALSE)
tt.ind <- seq_len(length(y))[-tr.ind]

## With Error massage after cv.biglasso
mdl <- biglasso(X,y,row.idx = tr.ind, penalty = "lasso", family = "gaussian")
cv.mdl <- cv.biglasso(X,y,row.idx = tr.ind, nfolds = 10, penalty = "lasso", family = "gaussian")

## Without Error massage after cv.biglasso
tr.X <- as.big.matrix(colon$X[tr.ind,])
tt.X <- as.big.matrix(colon$X[-tr.ind,])

tr.mdl <- biglasso(tr.X, y[tr.ind], penalty = "lasso", family = "gaussian")
cv.tr.mdl <- cv.biglasso(tr.X,y[tr.ind], nfolds = 10, penalty = "lasso", family = "gaussian")

session info

R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 20

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=zh_TW.UTF-8       LC_NUMERIC=C               LC_TIME=zh_TW.UTF-8        LC_COLLATE=zh_TW.UTF-8     LC_MONETARY=zh_TW.UTF-8   
 [6] LC_MESSAGES=zh_TW.UTF-8    LC_PAPER=zh_TW.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=zh_TW.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biglasso_1.4.1   ncvreg_3.13.0    Matrix_1.3-4     bigmemory_4.5.36

loaded via a namespace (and not attached):
[1] bigmemory.sri_0.1.3 compiler_4.1.1      parallel_4.1.1      tools_4.1.1         Rcpp_1.0.7          grid_4.1.1          lattice_0.20-45

does biglasso suport cox?

Hi,
I have seen in the vignette of biglasso that cox family isn't an option allowed for biglasso. Do you know a way to perform biglasso with cox model? If no, I'd be interested in making a collaboration possibly to develop a Cox biglasso model. I have read also that sparse matrix aren't available for Cox family in the glmnet package. Would using another techniques to allow Cox family conflict with the basic development of biglasso?
Thank you,

I have seen that #7 is very related to this as well.

Installation error on CRAN

It seems {biglasso} is getting lots of "Installation failed" on CRAN (cf. https://cran.r-project.org/web/checks/check_results_biglasso.html).

And this starts to propagate to my package as well.
It would be great if you could look into it.

Proper way to incorporate penalty.factor

Two questions/remarks:

I would have expected the same results (with rescaled lambda sequences)

data(colon); X.bm <- as.big.matrix(colon$X); y <- colon$y
fit.lasso  <- biglasso(X.bm, y, family = 'gaussian')
fit.lasso2 <- biglasso(X.bm, y, family = 'gaussian',
                       penalty.factor = rep(100, ncol(X.bm)))

Looking at the code of {glmnet}, it seems that they rescale the multiplicative factors (by dividing by their mean). Should {biglasso} do the same here?

What is the (implementation) problem with having some penalty factor as 0? (you don't allow unpenalized variables in the current version)

Code and data used to produce benchmarks?

Was just wondering if it would be possible to release the code you used to produce the given benchmarks? Just asking because this would help to compare with other methods, and have everything 100% reproducible... And providing direct access to the R formatted real-world datasets would also be handy, just to be sure that other people would be using the exact same data... Would this be possible?

Support for other families recognized by glmnet (poisson, multinomial, cox, mgaussian)

Great package! Are there any plans to extend it to also support the other families currently supported by glmnet, i.e. "poisson","multinomial","cox" and "mgaussian"? (I was currently working on a problem with big data that required the use of either multinomial or mgaussian)

Sparse matrices

Great work. Is it possible to extend package to allow sparse matrices as input?

Error when setting row.idx

When I use row.idx I get the error
Error in [<-(*tmp*, cv.ind == i, 1:res$nl, value = res$loss) : (subscript) logical subscript too long

Maybe this error is related to line 87 being commented?
y <- fit$y # this would cause error if eval.metric == "MAPE"

Code that causes error:
train <- sample(c(T,F), size, c(.5,.5), replace = TRUE)
fit$blasso <- cv.biglasso(X = x, y = y, row.idx = which(train), penalty = "lasso", family = "binomial", nfolds = 10)

If I take "row.idx = which(train)" out it runs without any errors.

Bug in cv.biglasso() triggered by edge case

I ran into a minor issue when doing some testing (for which i use a very short lambda sequence): if ind if only of length 1, E and Y loose a dimension, which in turn causes apply() later to trip. I believe adding drop = FALSE would make the situation more robust.

Lambda optimization with series of alpha values

Hi!

Currently, the script enables lambda parameter optimization for alpha = 1 (lasso) only. Is that correct? Is it possible to optimize the lambda.min value with crossvalidation using a range of alpha values (0, 0.05, 0.1, [...], 0.9, 0.95, 1.0)? This way I could compare lasso vs. ridge vs. elastic net approaches.

The code would look something like this
cv <- list()
a <- seq(0.0, 1.0, 0.05)
for (i in a) {
cv1 <- cv.biglasso(X, Y, family = "gaussian", nfolds = 5, alpha = i)
cv[i] <- cv1
}

Update benchmarks

The last two releases of glmnet have yielded significant speed increases, as legacy Fortran code has been replaced with modern C++ (see their NEWS file). In the interest of objectivity, I think it would be a good idea to update the benchmarks that are highlighted on the README here. AFAICT these benchmarks were last run in 2020.

Regardless, thanks for maintaining this excellent package. I'm a big fan.

Default CV error contradictory documentation

This states that the cross-validation error for classification is the binomial deviance:
https://github.com/YaohuiZeng/biglasso/blob/4e98d83e5cd0dc6ddb9c417faa1673a10e19eca5/R/cv.biglasso.R#L7-L9
but the eval.metric parameter' documentation says it defaults to misclassification:
https://github.com/YaohuiZeng/biglasso/blob/4e98d83e5cd0dc6ddb9c417faa1673a10e19eca5/R/cv.biglasso.R#L22-L24

Do these two passages refer to different things? To me it sounds like they are contradicting.

user-defined eval.metric function

Fantastic package. Any chance you might allow the user to provide a custom function to cv.biglasso for eval.metric? You probably don't want to get into the weeds in terms of writing a new function to accommodate everyone's favorite loss function (e.g. AUC, PPV, IDI, etc) -- but only having misclassification error in the binary setting can be limiting.

bug in setupX?

sessionInfo()
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/Chicago
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] styler_1.10.3     digest_0.6.35     fastmap_1.1.1     xfun_0.41        
#>  [5] magrittr_2.0.3    glue_1.7.0        R.utils_2.12.3    knitr_1.45       
#>  [9] htmltools_0.5.7   rmarkdown_2.26.1  lifecycle_1.0.4   cli_3.6.2        
#> [13] R.methodsS3_1.8.2 vctrs_0.6.5       reprex_2.1.0      withr_3.0.0      
#> [17] compiler_4.3.3    R.oo_1.26.0       R.cache_0.16.0    purrr_1.0.2      
#> [21] rstudioapi_0.15.0 tools_4.3.3       evaluate_0.23     yaml_2.3.8       
#> [25] rlang_1.1.3       fs_1.6.3
X <- matrix(rnorm(1000*10000), nrow = 1000, ncol = 10000)
class(X); dim(X)
#> [1] "matrix" "array"
#> [1]  1000 10000

dir <- tempdir()              
write.table(X, file = paste0(dir, "/X.txt"),
            sep = "\t",
            quote = FALSE,
            row.names = FALSE,
            col.names = FALSE)

# attempt to create backing files (.bin, .desc)
bm_X <- biglasso::setupX(filename = paste0(dir, "/X.txt"),
                      dir = dir,
                      backingfile = paste0(dir, "/X.bin"),
                      descriptorfile = paste0(dir, "/X.desc"),
                      sep = '\t') 
#> Reading data from file, and creating file-backed big.matrix...
#> This should take a while if the data is very large...
#> Start time:  2024-04-25 11:33:59
#> Error in filebacked.big.matrix(nrow = nrow, ncol = ncol, type = type, : The path to the descriptor and backing file are specified with the backingpath option

^{Created on 2024-04-25 with reprex v2.1.0}

Warnings/errors on run

I have run biglasso on a similar dataset as reported previously (mostly sparse, other whole numbers cast as numeric). I made sure nothing was NA, infinite, or NaN and type was numeric (I remember the integer bug from before).

glm.trained = cv.biglasso(as.big.matrix(X),factor(y),family="binomial"ncores=16)
Warning message:
In mean.default(y) : argument is not numeric or logical: returning NA
> plot(glm.trained)
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
> glm.trained$cve
numeric(0)
> glm.trained$cvse
numeric(0)
> glm.trained$lambda
numeric(0)

Is this expected? I have never been able to get bigLasso to work on real problems for me like glmnet -- maybe my use cases are too unconventional?

> sum(is.na(X))
[1] 0
> sum(is.infinite(X))
[1] 0
> sum(is.nan(X))
[1] 0
str(X)
 num [1:72, 1:1000926] 0 35 0 0 59 0 456 0 0 0 ...
sum(rowMeans(X)==0)
[1] 0

Glmnet runs fine. Other fields within the output, glm.trained, are populated, such as $y (with 0's and 1's that line up reasonably well with the actual classes), center and scale, and the coefficient matrix looks like it has numbers in it (quite sparse, as expected).

Thanks for any insight. I can try to finagle a minimal reproducible example if the problem isn't obvious with my command and the error messages!

Disabling intercept

Hi,

Thank you for maintaining and creating biglasso, it's been really helpful! We've run into an application where we need to disable the intercept, can you please provide some guidance on how to do this? (a hacky way to do this would be great even if its not standardly supported!)

predict.cv.biglasso - address 0x2b2da5f69ae0, cause 'memory not mapped'

I was trying to predict.cv.biglasso and got the following error :

I havent been able to figure why this error occurs. Any ideas?

*** caught segfault ***
address 0x2b2da5f69ae0, cause 'memory not mapped'

Traceback:
1: get_eta(X@address, as.integer(row.idx - 1), beta, beta.T@i, beta.T@j)
2: predict.biglasso(object$fit, X = X, row.idx = row.idx, type = type, lambda = lambda, which = which, ...)
3: predict.cv.biglasso(fit1, X, type = "response", lambda = fit1$lambda[fit1$min])
4: predict(fit1, X, type = "response", lambda = fit1$lambda[fit1$min])
5: as.vector(predict(fit1, X, type = "response", lambda = fit1$lambda[fit1$min]))
6: lasso_reg(test_sub, "101_anat_fmri.nii.gz", "afni_biglasso_selection_cv_fs+orig.BRIK.gz", "penalty_bglasso_obj_fs.RData", "fs")
An irrecoverable exception occurred. R is aborting now ...
/var/spool/torque/mom_priv/jobs/11529146.owens-batch.ten.osc.edu.SC: line 43: 7338 Segmentation fault Rscript /users/PAS1316/prateeksasan/hcp/wexner/controls_lasso_reg_oos.R

ncore problem

Hi,

I am having some problem with the ncores option in biglasso() and cv.biglasso() function. I compiled biglasso on my mac enabling openMP (4 cores). However, when I am trying to reproduce the worked out example of the cv.biglasso() function usng ncores=4, I am getting an error

Error in checkForRemoteErrors(val) : 
  4 nodes produced errors; first error: could not find function "attach.big.matrix"
In addition: Warning messages:
1: In .Internal(gc(verbose, reset)) :
  closing unused connection 6 (<-localhost:11359)
2: In .Internal(gc(verbose, reset)) :
  closing unused connection 5 (<-localhost:11359)
3: In .Internal(gc(verbose, reset)) :
  closing unused connection 4 (<-localhost:11359)
4: In .Internal(gc(verbose, reset)) :
  closing unused connection 3 (<-localhost:11359)

I got the same error in our cluster where biglasso is compiled using intel MKL. Can anyone help?

Thanks,
Subhomoy

Running biglasso is much slower than before

Before, with my CentOS 7.2, it was very fast.
Now, I've upgraded to 7.4. I also tried with Linux Mint and with another computer.

Every run of biglasso (or my function that has a very similar code) is 1.5-10 times slower (sometimes more).

Have you any idea what could cause this huge drop in performance?
A missing library? Some difference in compiler optimizations?

This prevents me from finishing my simulations for a paper, any insight would be MUCH appreciated.

Automatically detect and cast variables to expected type(s)

I have bumped into issues a few times where the expected formats for the inputs to biglasso turn out to be invalid, so I'd like to propose a set of usability enhancements that many could benefit from.

Matrix format. Suggestion: Detect normal matrix and cast to numeric big.matrix internally. (Example: I'm used to using normal matrices, and integer matrices. The former causes an error that is not clear, and the latter is a recipe for a crash! #22 )
If matrix is not big.matrix already, use the non-disk-spinning cast automatically or let user pick ( kaneplusplus/bigmemory#97 )
Response format. Cast to numeric. (Example: I'm used to using factors for classification, but biglasso performs numeric operations on them even when doing classification. This leads to strange errors and does not let cv.biglasso complete correctly: #25 ).

If changing the current behavior is not a good idea, it may still be nice to report an error about types so the user can make the necessary adjustments themselves.

Thanks for considering!

Will this package supports a weight vector?

Impressive package, would love to be able to see support for weights

Estimate run time

Hi,

Is there a way to estimate run time of biglasso()? I have a data set of ~341k individuals vs 640k SNPs. This took a days to run, resulting in coefficients for 45 lambdas. I have another data set that was started around the same time that consists of 341k individuals and 332k SNPs. It is on lambda 65 and still running.

So:

Is there a way to determine how many lambdas biglasso() will test? Particularly if I already know at what lambda produces the min RMSE and other parameters I put into the training (dfmax etc)?
If possible, a message when biglasso() starts processing that reports how many lambda it will test would be a useful addition.

Thanks again for such a useful R package!

Vince

Cross validation tests failing

When running the test_biglasso_linear tests I'm getting errors in the linear cross validation tests.

Platform: Ubuntu 18.04
R version: 4.1.2
big lasso version: devtools::install_github("YaohuiZeng/biglasso") on 2/3/2022

$ R -f test_biglasso_linear.R

── Failure (???): Test cross validation:  ──────────────────────────────────────
as.numeric(cvfit.ncv$cvse) not equal to as.numeric(cvfit.hybrid$cvse).
100/100 mismatches (average diff: 0.484)
[1] 4.14 - 3.69 == 0.449
[2] 4.12 - 3.69 == 0.429
[3] 4.10 - 3.72 == 0.379
[4] 4.08 - 3.74 == 0.340
[5] 4.05 - 3.73 == 0.321
[6] 4.02 - 3.71 == 0.306
[7] 4.00 - 3.69 == 0.304
[8] 3.99 - 3.68 == 0.306
[9] 3.98 - 3.67 == 0.310
...

── Failure (???): Test cross validation:  ──────────────────────────────────────
as.numeric(cvfit.ncv$cvse) not equal to as.numeric(cvfit.adaptive$cvse).
100/100 mismatches (average diff: 0.484)
[1] 4.14 - 3.69 == 0.449
[2] 4.12 - 3.69 == 0.429
[3] 4.10 - 3.72 == 0.379
[4] 4.08 - 3.74 == 0.340
[5] 4.05 - 3.73 == 0.321
[6] 4.02 - 3.71 == 0.306
[7] 4.00 - 3.69 == 0.304
[8] 3.99 - 3.68 == 0.306
[9] 3.98 - 3.67 == 0.310
...

Modifying the tests to include 1,000,000 rows and 1000 columns still results in errors:

── Failure (???): Test against ncvreg for entire path: ─────────────────────────
as.numeric(fit.ncv$beta) not equal to as.numeric(fit.hybrid$beta).
4090/100100 mismatches (average diff: 0.234)
[1004] 0.0390 -  0.1028 == -0.0638
[1019] 0.0507 -  0.1145 == -0.0639
[2005] 0.0881 -  0.2096 == -0.1215
[2020] 0.0998 -  0.2214 == -0.1216
[2039] 0.0000 - -0.0453 ==  0.0453
[3006] 0.1357 -  0.3092 == -0.1735
[3021] 0.1475 -  0.3210 == -0.1735
[3028] 0.0000 - -0.0945 ==  0.0945
[3032] 0.0000 - -0.0108 ==  0.0108
...

── Failure (???): Test against ncvreg for entire path: ─────────────────────────
as.numeric(fit.ncv$beta) not equal to as.numeric(fit.adaptive$beta).
4090/100100 mismatches (average diff: 0.234)
[1004] 0.0390 -  0.1028 == -0.0638
[1019] 0.0507 -  0.1145 == -0.0639
[2005] 0.0881 -  0.2096 == -0.1215
[2020] 0.0998 -  0.2214 == -0.1216
[2039] 0.0000 - -0.0453 ==  0.0453
[3006] 0.1357 -  0.3092 == -0.1735
[3021] 0.1475 -  0.3210 == -0.1735
[3028] 0.0000 - -0.0945 ==  0.0945
[3032] 0.0000 - -0.0108 ==  0.0108
...