
hal9001's Introduction

R/hal9001


The Scalable Highly Adaptive Lasso

Authors: Jeremy Coyle, Nima Hejazi, Rachael Phillips, Lars van der Laan, and Mark van der Laan


What’s hal9001?

hal9001 is an R package providing an implementation of the scalable highly adaptive lasso (HAL), a nonparametric regression estimator that applies L1-regularized lasso regression to a design matrix composed of indicator functions corresponding to the support of the functional over a set of covariates and interactions thereof. HAL regression allows for arbitrarily complex functional forms to be estimated at fast (near-parametric) convergence rates under only global smoothness assumptions (van der Laan 2017a; Bibaut and van der Laan 2019). For detailed theoretical discussions of the highly adaptive lasso estimator, consider consulting, for example, van der Laan (2017a), van der Laan (2017b), and van der Laan and Bibaut (2017). For a computational demonstration of the versatility of HAL regression, see Benkeser and van der Laan (2016). Recent theoretical works have demonstrated success in building efficient estimators of complex parameters when particular variations of HAL regression are used to estimate nuisance parameters (e.g., van der Laan, Benkeser, and Cai 2019; Ertefaie, Hejazi, and van der Laan 2020).
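
To make the estimator concrete, here is a minimal sketch of zero-order HAL done by hand for a single covariate, assuming only that the glmnet package is installed: indicator basis functions I(x >= knot) are placed at the observed values, and an L1-penalized regression is fit on the resulting design matrix. fit_hal automates and generalizes this, including interactions and higher-order splines.

# zero-order HAL by hand: one indicator basis function per observed knot
library(glmnet)
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- sin(2 * x) + rnorm(n, sd = 0.2)
basis <- outer(x, x, ">=") * 1          # basis[i, j] = I(x[i] >= x[j])
cv_fit <- cv.glmnet(x = basis, y = y)   # lasso with CV-selected penalty
preds <- predict(cv_fit, newx = basis, s = "lambda.min")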


Installation

For standard use, we recommend installing the package from CRAN via

install.packages("hal9001")

To contribute, install the development version of hal9001 from GitHub via remotes:

remotes::install_github("tlverse/hal9001")

Issues

If you encounter any bugs or have any specific feature requests, please file an issue.


Example

Consider the following minimal example using hal9001 to generate predictions via highly adaptive lasso regression:

# load the package and set a seed
library(hal9001)
#> Loading required package: Rcpp
#> hal9001 v0.4.5: The Scalable Highly Adaptive Lasso
#> note: fit_hal defaults have changed. See ?fit_hal for details
set.seed(385971)

# simulate data
n <- 100
p <- 3
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] * sin(x[, 2]) + rnorm(n, mean = 0, sd = 0.2)

# fit the HAL regression
hal_fit <- fit_hal(X = x, Y = y, yolo = TRUE)
#> [1] "I'm sorry, Dave. I'm afraid I can't do that."
hal_fit$times
#>                   user.self sys.self elapsed user.child sys.child
#> enumerate_basis       0.014    0.003   0.059          0         0
#> design_matrix         0.004    0.001   0.005          0         0
#> reduce_basis          0.000    0.000   0.000          0         0
#> remove_duplicates     0.000    0.000   0.000          0         0
#> lasso                 2.684    0.343   6.583          0         0
#> total                 2.703    0.348   6.655          0         0

# training sample prediction
preds <- predict(hal_fit, new_data = x)
(hal_mse <- mean((preds - y)^2))
#> [1] 0.03667466

Contributions

Contributions are very welcome. Interested contributors should consult our contribution guidelines prior to submitting a pull request.


Citation

After using the hal9001 R package, please cite both of the following:

    @software{coyle2022hal9001-rpkg,
      author = {Coyle, Jeremy R and Hejazi, Nima S and Phillips, Rachael V
        and {van der Laan}, Lars and {van der Laan}, Mark J},
      title = {{hal9001}: The scalable highly adaptive lasso},
      year  = {2022},
      url = {https://doi.org/10.5281/zenodo.3558313},
      doi = {10.5281/zenodo.3558313},
      note = {{R} package version 0.4.2}
    }

    @article{hejazi2020hal9001-joss,
      author = {Hejazi, Nima S and Coyle, Jeremy R and {van der Laan}, Mark
        J},
      title = {{hal9001}: Scalable highly adaptive lasso regression in
        {R}},
      year  = {2020},
      url = {https://doi.org/10.21105/joss.02526},
      doi = {10.21105/joss.02526},
      journal = {Journal of Open Source Software},
      publisher = {The Open Journal}
    }

License

© 2017-2022 Jeremy R. Coyle & Nima S. Hejazi

The contents of this repository are distributed under the GPL-3 license. See file LICENSE for details.


References

Benkeser, David, and Mark J van der Laan. 2016. “The Highly Adaptive Lasso Estimator.” In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE. https://doi.org/10.1109/dsaa.2016.93.

Bibaut, Aurélien F, and Mark J van der Laan. 2019. “Fast Rates for Empirical Risk Minimization over Càdlàg Functions with Bounded Sectional Variation Norm.” https://arxiv.org/abs/1907.09244.

Ertefaie, Ashkan, Nima S Hejazi, and Mark J van der Laan. 2020. “Nonparametric Inverse Probability Weighted Estimators Based on the Highly Adaptive Lasso.” https://arxiv.org/abs/2005.11303.

van der Laan, Mark J. 2017a. “A Generally Efficient Targeted Minimum Loss Based Estimator Based on the Highly Adaptive Lasso.” The International Journal of Biostatistics. https://doi.org/10.1515/ijb-2015-0097.

———. 2017b. “Finite Sample Inference for Targeted Learning.” https://arxiv.org/abs/1708.09502.

van der Laan, Mark J, David Benkeser, and Weixin Cai. 2019. “Efficient Estimation of Pathwise Differentiable Target Parameters with the Undersmoothed Highly Adaptive Lasso.” https://arxiv.org/abs/1908.05607.

van der Laan, Mark J, and Aurélien F Bibaut. 2017. “Uniform Consistency of the Highly Adaptive Lasso Estimator of Infinite-Dimensional Parameters.” https://arxiv.org/abs/1709.06256.

hal9001's People

Contributors

benkeser, blind-contours, jeremyrcoyle, katrinleinweber, larsvanderlaan, nhejazi, rachaelvp, wilsoncai1992, yuzhang712


hal9001's Issues

Update needed ahead of Matrix 1.6-2 release

I sent two e-mails to maintainer("hal9001"), on Sep 15 and Oct 7, but did not receive a response. Is GitHub preferred? The gist is that hal9001 fails its checks under forthcoming version 1.6-2 of Matrix. Matrix 1.6-2 is due to be released in November, so a patched version of hal9001 should be submitted to CRAN "soon" to avoid unnecessary archival. The relevant part of the first e-mail is pasted below.

Matrix 1.6-2 is stricter about the conformability of x and y in crossprod(x=<vector>, y=<Matrix>) and tcrossprod(x=<Matrix>, y=<vector>), throwing an error in exactly those cases where base R would throw an error if the Matrix were replaced with a traditional matrix. Your packages seem to have relied on the old base-incompatible behaviour, as they now fail with "non-conformable arguments" errors tracing back to crossprod or tcrossprod calls.

Please revise your packages and test that they pass their checks under the latest Matrix-devel, which you can install with

install.packages("Matrix", repos = "http://R-Forge.R-project.org")

IIRC only minimal changes are required. Do let me know if you are unsure how to adjust your code. Then I can investigate and try to provide a patch.
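
For reference, a small sketch of the rule being enforced (using only documented Matrix behaviour): a vector passed to crossprod is treated as a one-column matrix, so it must conform with the Matrix argument exactly as it would with a base matrix.

library(Matrix)
M <- Matrix(rnorm(6), nrow = 3)   # a 3 x 2 Matrix
v <- rnorm(3)
crossprod(v, M)                   # conformable: t(v) %*% M, a 1 x 2 result
# crossprod(rnorm(2), M) now signals "non-conformable arguments" under
# Matrix 1.6-2, exactly as base R does for a traditional matrix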

Symmetric reduce_basis

Request: add a feature to reduce the sparse basis in terms of both the proportion of zeros and the proportion of ones; one possible approach is sketched below.
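
Assuming x_basis is a sparse 0/1 design matrix (as returned by make_design_matrix) and cutoff is a proportion, the sketch below drops columns whose smaller class proportion falls under the cutoff; the function name is hypothetical, not part of the package.

# keep only basis functions whose smaller class proportion meets the cutoff
reduce_basis_symmetric <- function(x_basis, cutoff) {
  p_ones <- Matrix::colMeans(x_basis)          # proportion of ones per column
  keep <- pmin(p_ones, 1 - p_ones) >= cutoff   # symmetric in zeros and ones
  x_basis[, keep, drop = FALSE]
}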

family argument in fit_hal is funky

I am getting weird answers when inputting the family argument

devtools::install_github("jlstiles/Simulations")
library(hal9001)
library(Simulations)

n <- 200
data <- gendata(n, g0 = g0_linear, Q0 = Q0_trig1)
X <- data
X$Y <- NULL
Y <- data$Y
X0 <- X1 <- X
X0$A <- 0
X1$A <- 1
newdata <- rbind(X, X1, X0)

time <- proc.time()
halres9001 <- fit_hal(Y = Y, X = X, family = "binomial")
timehal9001 <- proc.time() - time

Qk <- predict(halres9001, new_data = newdata)[1:n]
Q1k <- predict(halres9001, new_data = newdata)[n + 1:n]
Q0k <- predict(halres9001, new_data = newdata)[2 * n + 1:n]
esthal9001 <- var(Q1k - Q0k)
esthal9001ATE <- mean(Q1k - Q0k)

esthal9001ATE

Conflict with data.table package

When I tried to install hal9001 this error came up
clang: error: unsupported option '-fopenmp'
make: *** [assign.o] Error 1
ERROR: compilation failed for package ‘data.table’

  • removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/data.table’
  • restoring previous ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/data.table’
    Error in i.p(...) :
    (converted from warning) installation of package ‘/var/folders/ph/l_06hx0d75x66h6f7dr3s72w0000gn/T//RtmpOicTJA/file5eb3aa883a7/data.table_1.12.3.tar.gz’ had non-zero exit status

R version 3.5.3 (2019-03-11) -- "Great Truth"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin15.6.0 (64-bit)

custom IRLS implementation

We need a custom implementation of iteratively re-weighted least squares (IRLS) in order to properly support the custom ("lassi") implementation of HAL when fitting GLMs other than with the Gaussian error family.

Currently, the custom implementation (fit_type = "origami") supports family = "gaussian" while cv.glmnet must be used when calling family = "binomial".
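
For orientation, a minimal IRLS sketch for unpenalized logistic regression follows; the lassi version would add an L1 proximal step inside each weighted least-squares solve. This is an illustration of the algorithm, not the package's implementation.

irls_logistic <- function(X, y, tol = 1e-8, max_iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(max_iter)) {
    eta <- drop(X %*% beta)
    mu <- plogis(eta)
    w <- mu * (1 - mu)                # IRLS weights
    z <- eta + (y - mu) / w           # working response
    beta_new <- drop(solve(crossprod(X * w, X), crossprod(X * w, z)))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}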

predict method not working when reduce_basis

Getting predictions from a hal fit that uses the "reduce_basis" argument fails. The error message is: "Error in .local(x, y, ...) : Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90".
Here is a reproducible example:

set.seed(385971)
n <- 100
p <- 3
x <- xmat <- matrix(rbinom(n * p, 1, 0.3), n, p)
y <- x[, 1] * sin(x[, 2]) + rnorm(n, mean = 0, sd = 0.2)

# fit the HAL regression two ways
hal_fit1 <- fit_hal(X = x, Y = y)
pred1 <- predict(hal_fit1, new_data = x)  # this works
pct <- 10^-12
hal_fit2 <- fit_hal(X = x, Y = y, reduce_basis = pct)
pred2 <- predict(hal_fit2, new_data = x)  # this fails for many values of pct, e.g., 0.1, 0.2

making HAL faster

clever things that might help us be fast:

  • lassi_fit_cd should accept an index and perform fitting on that subset
  • make things into a core class as per #20
  • benchmark further cases of n, p for how lassi is performing vs glmnet
  • coordinate descent wrt cv-risk -- only do things that minimize cv-risk
  • use of R code for cv_lasso might be (probably is) causing significant delays
  • could consider moving CV steps into RcppParallel instead of origami
  • Wainwright http://jmlr.csail.mit.edu/papers/volume17/15-605/15-605.pdf

binary prediction fails when `cv_select = FALSE`

Probably not a disastrous error, but it would be nice if this didn't happen:

n_obs <- 1000
w <- cbind(rbinom(n_obs, 1, 0.8), rbinom(n_obs, 1, 0.5),
           rbinom(n_obs, 1, 0.4), rbinom(n_obs, 1, 0.6))
colnames(w) <-  paste("W", seq_len(ncol(w)), sep = "_")
prob_a <- plogis(rowMeans(w[,-1]))
a <- rbinom(n_obs, 1, prob_a)

fit_gn <- fit_hal(X = as.matrix(w), Y = as.numeric(a),
                  fit_type = "glmnet", family = "binomial",
                  standardize = FALSE, cv_select = FALSE,
                  return_x_basis = TRUE, return_lasso = TRUE,
                  lambda = exp(seq(-1, -10, length = 1e4)),
                  yolo = FALSE)
est_gn <- predict(fit_gn, new_data = as.matrix(w))

where the call to predict fails with a glmnet error

Error in .local(x, y, ...) :
  Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90

Note that this does not happen if cv_select = TRUE in the above call to fit_hal.

logit link for binomial outcomes

The current custom implementation of the Lasso only supports continuous outcomes (that is, the case of the Gaussian error family). Some degree of revision will be necessary to support other types of outcomes, the most common case being that of binary outcomes.

For the time being, those interested in using the lasso with binary outcomes can simply use cv.glmnet with the option family = "binomial".
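
For example, the HAL basis can be built by hand with the package's exported helpers and handed to cv.glmnet directly (a sketch, assuming x is a numeric matrix and y a binary vector):

library(glmnet)
basis_list <- hal9001::enumerate_basis(x)
x_basis <- hal9001::make_design_matrix(x, basis_list)
cv_fit <- cv.glmnet(x = x_basis, y = y, family = "binomial")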

coupling lambda search with CV

A further step in the custom implementation of the lasso would be to more tightly integrate the CV and lasso steps so that time is not wasted in evaluating the full sequence of lambdas. This would be done by performing cross-validation on each lambda in the initial sequence one at a time.

By evaluating the CV-MSE at each lambda, we would be able to observe shifts in the CV-MSE and safely stop at a value of lambda where the CV-risk is minimized but before the full sequence of lambdas is evaluated. This saves time by avoiding fitting the model on very small values of lambda for which a disproportionate amount of time is required (since there are more coefficients to fit).
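
In pseudocode form, the proposal looks like the following sketch, where cv_mse() is a hypothetical helper returning the cross-validated MSE for a single value of lambda:

lambdas <- exp(seq(0, -10, length.out = 100))  # decreasing sequence
best <- Inf
for (lam in lambdas) {
  risk <- cv_mse(lam)          # hypothetical: CV-MSE at this lambda only
  if (risk > best) break       # risk curve has turned upward; stop early
  best <- risk
}
# a real implementation would tolerate small fluctuations before stopping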

predict method should break here but gives a wrong answer instead...

library(hal9001)
set.seed(10)

# defining a propensity score
g0 = function(W1) plogis(-.1-.5*sin(W1) - .4*(abs(W1)>1)*W1^2)

# generating pscores and outcomes for sample of 1000
n=1000
W = rnorm(n)
ps = g0(W)
A = rbinom(n, 1, ps)

# going to fit over 1/2 the data
df_train = data.frame(W = W[1:(n/2)])

# defining validation over other half, one df including A
df_valA = data.frame(A = A[(n/2+1):n], W = W[(n/2+1):n])
df_val = data.frame(W = W[(n/2+1):n])

fitg_hal = fit_hal(X = df_train, Y = A[1:(n/2)], degrees = NULL, fit_type = "glmnet",
                   n_folds = 10, use_min = TRUE, family = "binomial",
                   return_lasso = FALSE, yolo = TRUE)

# predicting with dataframe including A should give an error or better yet, 
# the right answer, instead of wrong answer
pred_halA = predict(fitg_hal, new_data = df_valA, type = 'response')
# prediction with dataframe not including A, gives right answer
pred_hal = predict(fitg_hal, new_data = df_val, type = 'response')

# big difference
pred_hal - pred_halA

# correct risk
risk_hal = mean(with(df_valA, -A*log(pred_hal)-(1-A)*log(1-pred_hal)))
risk_hal

# ridiculous risk for the wrong preds
risk_hal_wrong = mean(with(df_valA, -A*log(pred_halA)-(1-A)*log(1-pred_halA)))
risk_hal_wrong

# glm fcn gives correct preds either way
fit_glm = glm(A[1:(n/2)] ~ W, data = df_train, family = binomial)
pred_glmA = predict(fit_glm, newdata = df_valA, type = 'response')
pred_glm = predict(fit_glm, newdata = df_val, type = 'response')
pred_glmA - pred_glm

In David's hal, there is an error given when inputting df_valA. I think it should give the right answer like glm does, if the data.frame or matrix contains the right names.

centering of sparse design matrix

Scaling of the design matrix (of basis functions) provided considerable speedups in our custom implementation of the LASSO -- a ~2x speedup, in fact.

Based on the above observation, it appears that centering of the design matrix might provide further (hopefully considerable) speedups / efficiency gains. That said, it is not immediately clear how to go about efficiently centering sparse matrices. This will require some investigation.
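
One standard trick worth investigating (a sketch, not a settled design): leave the sparse matrix untouched, carry the column means separately, and apply the centering implicitly inside matrix-vector products.

library(Matrix)
X <- rsparsematrix(1000, 50, density = 0.05)
m <- colMeans(X)
v <- rnorm(50)
# the product of the centered matrix with v, without densifying X,
# since (X - 1 m')v = Xv - (m . v) 1:
centered_prod <- X %*% v - sum(m * v)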

custom gradient descent with momentum

Currently, we're implementing the LASSO by hand, including standard gradient descent. Theory has shown that faster convergence is possible using methods that incorporate momentum. Numerical experiments have shown that Nesterov momentum can be applied to the LASSO fairly easily, and with good results (see slides 27-29 of this deck for details). @jeremyrcoyle Would it be worthwhile for us to add in this method as an option and assess whether this would improve the performance of HAL?
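
A compact sketch of the idea, using the proximal-gradient form of the lasso with Nesterov momentum (FISTA); this illustrates the update, not the package's internals:

soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

fista_lasso <- function(X, y, lambda, iters = 200) {
  n <- nrow(X)
  L <- max(eigen(crossprod(X) / n, symmetric = TRUE, only.values = TRUE)$values)
  b <- v <- rep(0, ncol(X))
  t_old <- 1
  for (k in seq_len(iters)) {
    grad <- drop(crossprod(X, X %*% v - y)) / n
    b_new <- soft_threshold(v - grad / L, lambda / L)
    t_new <- (1 + sqrt(1 + 4 * t_old^2)) / 2
    v <- b_new + ((t_old - 1) / t_new) * (b_new - b)  # Nesterov momentum step
    b <- b_new
    t_old <- t_new
  }
  b
}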

Example usage

Ideally, example usage demonstrates the utility of the software in solving real-world problems. The examples using simulated data are good and helpful for the user, but it would be good to have examples that use this software with real, large datasets. If these examples do exist in the repo, then they should be more clearly labeled as they are not obvious. Maybe making an /examples directory would help.

`num_knots` default does not grow with sample size?

Thanks for all the work on this package.

If I am reading the documentation correctly, the new default values of num_knots uses a fixed number of knots for specifying a basis irrespective of sample size. Is that correct? If so, is there a motivation for this?
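
In the meantime, num_knots can be supplied directly when sample-size-dependent knot counts are wanted; for instance (a purely illustrative choice, not a package recommendation):

n <- nrow(x)
hal_fit <- fit_hal(X = x, Y = y, num_knots = round(c(n, n / 2, n / 4)))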

Competing work

The paper does not talk about competing work at all. If this paper is the first that has been written on software implementation of HAL, this should be clearly stated and then my concern about this will be ameliorated. Otherwise, the authors should more clearly highlight previous implementation of HAL. I see the van der Laan and Dudoit paper series, but it is unclear if these papers are the only prior work, if they address the same questions as this paper, etc.

Still have bases with interactions when setting max_degree = 1

Where I encountered the issue:

I have the following:
X: (W, A), an n x 2 numeric matrix
Y: binary outcome, a numeric vector of length n.

Then I fit HAL with the following code:

fit_obj <- fit_hal(X = X, Y = Y, family = "binomial",
                   return_x_basis = TRUE,
                   num_knots = hal9001:::num_knots_generator(
                     max_degree = 1,
                     smoothness_orders = 1
                   ))

I still have interactions. First few rows of the output:

summary(fit_obj)
Summary of non-zero coefficients is based on lambda of 0.002024853

          coef                                                                 term
 -5.575820e+00                                                          (Intercept)
  1.710380e+00                                              [ I(A >= 0)*(A - 0)^1 ]
 -3.430264e-01  [ I(W >= -0.804)*(W - -0.804)^1 ] * [ I(A >= 0.396)*(A - 0.396)^1 ]

fitting grouped binomial

Hi - just fitting a grouped binomial, which works great by setting Y to be a two-column matrix of negatives and positives. To align with other model-fitting functions (glm, gam, etc.), you might want to stick to the convention of positives then negatives, i.e., the columns the other way around.

Thanks for the package!

Use hal9001 in superlearner() in R

Hi,

I am trying to add the "SL.hal9001" function to the SuperLearner() function, and there is no problem fitting such a model. But when I try to predict on a new data set, using predict(mod, newdataset, onlySL = TRUE), it returns "Error in predict.hal9001(object$object, new_data = newX) : argument "newX" is missing, with no default" if the weight for hal9001 is not 0. I am wondering how to deal with this case?

Best,

Hao

custom gradient descent in Rcpp

Currently (as of ebf0084), the LASSO fitting step of the HAL algorithm is a call to the cv.glmnet function from the R package glmnet. The results of simple benchmarks indicate that the vast majority of time spent on execution of a call to hal9001::fit_hal is spent in the cv.glmnet step (in fact, up to 95% of the total execution time).

Re-writing the LASSO in C++ with an Rcpp wrapper is a simple solution to this problem.

unit tests and sanity checks

It should be a priority to implement unit tests for simple/easy statistical problems, examples including

  • does hal_fit return the same thing as cv.glmnet for an easy problem?
  • ...
  • ...

python version

Incredibly grateful for you releasing this.

Is there any plan for a python implementation?

Regards,

relaxing HAL constraints from `glmnet` defaults

It appears that a call to fit_hal, wrapping cv.glmnet, will lead to severely constrained fits that fail to capture a rich enough set of basis functions asymptotically. In a previous implementation, this was ameliorated by explicitly setting the lambda.min.ratio argument of glmnet to a low value (e.g., 0.01); we should consider something similar for fit_hal.

HAL fails to return a matrix of predictions when `cv_select=FALSE`

#79 introduced a regression: support for providing predictions along a grid of lambda values is broken, with predict instead returning a vector of apparently identical values. This is clear from the README example, which used to return:

r$> set.seed(385971)
r$> packageVersion("hal9001")
[1] ‘0.2.7’
r$> n <- 100
r$> p <- 3
r$> x <- matrix(rnorm(n * p), n, p)
r$> y <- x[, 1] * sin(x[, 2]) + rnorm(n, mean = 0, sd = 0.2)
r$> hal_fit <- fit_hal(X = x, Y = y, cv_select = FALSE)
[1] "I'm sorry, Dave. I'm afraid I can't do that."
r$> hal_fit$times
                  user.self sys.self elapsed user.child sys.child
enumerate_basis       0.003        0   0.003          0         0
design_matrix         0.006        0   0.006          0         0
reduce_basis          0.000        0   0.000          0         0
remove_duplicates     0.004        0   0.004          0         0
lasso                 0.028        0   0.028          0         0
total                 0.041        0   0.042          0         0
r$> preds <- predict(hal_fit, new_data = x)

r$> preds
               s0          s1          s2          s3          s4          s5         s6
  [1,] 0.06356236  0.06478106  0.06559040  0.06310132  0.06072537  0.05844328  0.0562756
                s7          s8          s9         s10          s11         s12
  [1,]  0.05463778  0.05627511  0.05339973  0.05226898  0.051860713  0.05198880
                s13          s14         s15         s16         s17         s18
  [1,]  0.052159151  0.052403705  0.05170359  0.05074108  0.04971674  0.04872899
               s19         s20         s21         s22          s23         s24
  [1,]  0.04718589  0.04224513  0.03713774  0.03313715  0.029578804  0.02611531
               s25         s26          s27          s28          s29          s30
  [1,]  0.02220966  0.01861265  0.016745656  0.014962968  0.014016019  0.013150862
                 s31           s32          s33          s34          s35          s36
  [1,]  0.0084326033  0.0016781764 -0.004379224 -0.009756481 -0.018208534 -0.027420644
                s37          s38          s39          s40          s41           s42
  [1,] -0.035692904 -0.042651339 -0.047731797 -0.057171332 -0.068810925 -0.0814902028
                s43           s44          s45           s46          s47          s48
  [1,] -0.094145457 -0.1063458006 -0.112258682 -0.1169236800 -0.122157452 -0.126232513
                s49          s50          s51          s52          s53          s54
  [1,] -0.128004295 -0.129774524 -0.129584798 -0.129109598 -0.128860797 -0.128431573
                s55          s56           s57           s58          s59          s60
  [1,] -0.128002734 -0.127860160 -1.276471e-01 -1.263560e-01 -0.124513722 -0.123259610
                s61           s62          s63          s64          s65          s66
  [1,] -0.122217608 -0.1212981615 -0.119173537 -0.117948634 -0.118493536 -0.118916254
               s67          s68          s69          s70          s71          s72
  [1,] -0.11921098 -0.119095745 -0.118599351 -0.116668209 -0.115855176 -0.115227862
                s73          s74          s75          s76          s77          s78
  [1,] -0.114295608 -0.113795088 -0.113272632 -0.113057443 -0.113127302 -0.112956354
               s79          s80          s81          s82          s83          s84
  [1,] -0.11290637 -0.114645294 -0.116617272 -0.118362758 -0.119785603 -0.121354928
                 s85          s86           s87          s88          s89          s90
  [1,] -0.1219591307 -0.121072044 -0.1201841224 -0.119764167 -0.119247460 -0.118394374
                s91          s92          s93          s94          s95          s96
  [1,] -0.117574887 -0.117310636 -0.116898537 -0.116803168 -0.117258712 -0.117751223
                s97         s98          s99
  [1,] -0.118306951 -0.11876637 -0.119148360
 [ reached getOption("max.print") -- omitted 99 rows ]

...but now returns

r$> set.seed(385971)
r$> packageVersion("hal9001")
[1] ‘0.3.0’
r$> n <- 100
r$> p <- 3
r$> x <- matrix(rnorm(n * p), n, p)
r$> y <- x[, 1] * sin(x[, 2]) + rnorm(n, mean = 0, sd = 0.2)
r$> hal_fit <- fit_hal(X = x, Y = y, yolo = TRUE, cv_select = FALSE)
[1] "I'm sorry, Dave. I'm afraid I can't do that."
r$> hal_fit$times
                  user.self sys.self elapsed user.child sys.child
enumerate_basis       0.188    0.000   0.187          0         0
design_matrix         0.010    0.000   0.010          0         0
reduce_basis          0.000    0.000   0.000          0         0
remove_duplicates     0.000    0.000   0.000          0         0
lasso                 0.054    0.001   0.054          0         0
total                 0.253    0.001   0.253          0         0
r$> preds <- predict(hal_fit, new_data = x)

r$> preds
  [1] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
  [8] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [15] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [22] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [29] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [36] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [43] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [50] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [57] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [64] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [71] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [78] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [85] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [92] 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236 0.06356236
 [99] 0.06356236 0.06356236
 [ reached getOption("max.print") -- omitted 9900 entries ]

lasso fits while bounding L1 norm

The implementation of Lasso available in the lasso2 package (GitHub | CRAN) appears to allow options for fitting the Lasso model while imposing bounds on the L1 norm. This could be a highly desirable feature for HAL regressions. Writing a backend/wrapper for hal9001 to incorporate the lasso2 package --- similar to the wrapper for glmnet --- would likely be worth the effort. It's possible that lasso2 does not directly allow for cross-validation, meaning some of the functionality of CV-Lasso (available in glmnet) will have to be replicated by hand. Idea originally from @benkeser.

additive models

It would be very cool to add a formula interface for fitting HAL additive models.

So if X has two columns x1, x2 you could write

fit_hal(..., formula = "x1 + x2")

and it would fit a 1-degree HAL

fit_hal(..., formula = "x1*x2")

would fit the 2-degree HAL, etc...

Dave, my faster replacement is giving different answers, I'm afraid, Dave ...

devtools::install_github("jlstiles/Simulations")
library(hal9001)
library(Simulations)

# using built-in functions in Simulations pack
n = 200
data  = gendata(n, g0=g0_linear,Q0=Q0_trig1)
X = data
X$Y = NULL
Y = data$Y
X0 = X1 = X
X0$A = 0
X1$A = 1
newdata = rbind(X, X1, X0)

time = proc.time()
halres <- hal(Y = Y, newX = newdata, X = X, family = binomial(), 
                                   verbose = FALSE, parallel = FALSE)
timehal = proc.time() - time

Qk = halres$pred[1:n]
Q1k = halres$pred[n+1:n]
Q0k = halres$pred[2*n+1:n]

# estimating blip var and ATE
esthal = var(Q1k-Q0k)
esthalATE = mean(Q1k-Q0k)

time = proc.time()
halres9001 <- fit_hal(Y = Y, X = X)
timehal9001 = proc.time() - time

Qk = predict(halres9001, new_data = newdata)[1:n]
Q1k = predict(halres9001, new_data = newdata)[n+1:n]
Q0k = predict(halres9001, new_data = newdata)[2*n+1:n]

# estimating blip var and ATE
esthal9001 = var(Q1k-Q0k)
esthal9001ATE = mean(Q1k-Q0k)

timehal
timehal9001

# answers are different, consistently hal9001 is underestimating both compared with hal
# blip variance is about 1/2 of what hal gives and ATE is considerably smaller, tried with n up
# to 500

esthal
esthal9001

esthalATE
esthal9001ATE

Default `max_degree = ifelse(ncol(X) >= 20, 2, 3)`

The number of interaction terms increases dramatically as the number of variables increases. This makes it almost impossible to produce model results in a reasonable time when ncol(X) is large. Would it be possible to change the default threshold ncol(X) >= 20 to something smaller?
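
Until any default changes, the interaction depth can be capped explicitly in the call:

hal_fit <- fit_hal(X = x, Y = y, max_degree = 1)  # main-terms-only basis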

Poisson family?

Is it possible to fit Poisson models using this package? It is not listed as a supported family in the documentation, and it doesn't work when I try it, but I wonder how hard it would be to add support. Would this require only R code (in which case I might be interested in working on a PR), or changes to the underlying Rcpp code?

Thanks in advance. This package is awesome.
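
One possible workaround in the meantime (a sketch, not a supported code path): build the HAL basis with the package's exported helpers and fit the penalized regression in glmnet directly, which does support family = "poisson". Here counts stands in for a vector of count outcomes:

library(glmnet)
basis_list <- hal9001::enumerate_basis(x)
x_basis <- hal9001::make_design_matrix(x, basis_list)
pois_fit <- cv.glmnet(x = x_basis, y = counts, family = "poisson")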

incorporating covariance updates

In this paper, Hastie et al. report significant gains in efficiency of their algorithmic implementation of the Lasso using covariance updates rather than naive updates of the coefficients. By incorporating this process into our implementation of the Lasso -- at least for the Gaussian error family case -- we ought to be able to obtain further gains in efficiency.
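
The gain is easy to see in a toy comparison: with X'y and the needed entries of X'X cached, the gradient for coordinate j no longer touches the length-n residual vector. A sketch:

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rnorm(n)
b <- rnorm(p)
Xty <- crossprod(X, y)   # cached once
XtX <- crossprod(X)      # cached lazily, column by column, in practice
j <- 3
g_naive <- drop(crossprod(X[, j], y - X %*% b))  # O(n) per update
g_cov <- Xty[j] - sum(XtX[j, ] * b)              # O(p), no residual needed
all.equal(g_naive, g_cov)                        # TRUE: identical gradient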

`lambda.min.ratio` ignored by `glmnet` for `family = "binomial"`

To achieve the quartic convergence rate dictated by theory, it is necessary to have the HAL regression function reach a saturated model asymptotically; however, this is currently not possible in the family = "binomial" implementation of glmnet(), as a call to glmnet() will ignore any specification of lambda.min.ratio in this case. This can be avoided in two ways: (1) setting family = "gaussian" and specifying a very low value of lambda.min.ratio (essentially a linear probability model in terms of HAL basis functions); or (2) directly passing in a sequence of L1 constraint parameters that includes a very low value through the lambda argument of glmnet(). Issue originally reported and confirmed by @idiazst.
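
Workaround (2) as a sketch, using the lambda argument that fit_hal already exposes:

lambda_grid <- exp(seq(log(1), log(1e-6), length.out = 2000))
hal_fit <- fit_hal(X = x, Y = y, family = "binomial", lambda = lambda_grid)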

degrees should not be NULL by default?

Currently, by default, fit_hal fits the lasso on the full set of interactions between all covariates presented to it. This is extremely slow and/or impractical for even modestly large data sets. Perhaps it ought to be changed?

supporting active-set convergence

In their Lasso implementation in the glmnet package, Hastie et al. note that

Considerable speedup is obtained by organizing the iterations around the active set of features—those with nonzero coefficients. After a complete cycle through all the variables, we iterate on only the active set till convergence. If another complete cycle does not change the active set, we are done, otherwise the process is repeated. Active-set convergence is also mentioned in Meier et al. [2008] and Krishnapuram and Hartemink [2005].

We could very likely speed up our implementation by incorporating this procedure.
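
A sketch of coordinate descent with active-set cycling for the Gaussian case follows (an illustration of the quoted strategy, not the glmnet or hal9001 implementation):

soft_threshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

cd_lasso_active <- function(X, y, lambda, tol = 1e-7) {
  n <- nrow(X); p <- ncol(X)
  xsq <- colSums(X^2) / n
  b <- rep(0, p)
  r <- y                                  # running residual
  cycle <- function(idx) {                # one pass of coordinate updates
    delta <- 0
    for (j in idx) {
      b_old <- b[j]
      rho <- sum(X[, j] * r) / n + xsq[j] * b_old
      b[j] <<- soft_threshold(rho, lambda) / xsq[j]
      if (b[j] != b_old) {
        r <<- r - X[, j] * (b[j] - b_old)
        delta <- max(delta, abs(b[j] - b_old))
      }
    }
    delta
  }
  repeat {
    cycle(seq_len(p))                     # complete cycle over all variables
    active <- which(b != 0)
    while (cycle(active) > tol) {}        # iterate on the active set only
    cycle(seq_len(p))                     # another complete cycle
    if (identical(which(b != 0), active)) break  # active set unchanged: done
  }
  b
}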

lambda parameter ignored by predict.hal9001

Issue
predict.hal9001 ignores the lambda parameter and always shows predictions for all lambdas specified during fitting. Having had a cursory look at the source code, lambda does not seem to be used at all in predict.hal9001 but is used in predict.lassi.

This is not clear from the documentation, and it led to confusion in my case, since it makes the lambda parameter redundant and diverges from glmnet's behaviour. There might be theoretical reasons behind this, but it would be good to either remove the parameter or state this behaviour clearly in the manual if it is needed for compatibility with predict.lassi.

Reproducible example

library(glmnet)
library(hal9001)

# Simulated data taken from the glmnet help page, ?glmnet::glmnet
x = matrix(rnorm(100 * 20), 100, 20)
y = sample(1:2, 100, replace = TRUE)

# Fit an elastic net and predict with different lambdas
net_fit = glmnet(x, y, family = "binomial", lambda = c(0.001, 0.01, 0.1))
predict(net_fit, head(x), type = "response")  # use lambdas from fitting
predict(net_fit, head(x), s = c(0.1, 0.05), type = "response")  # use custom lambdas
# --> glmnet correctly shows the specified lambdas, interpolating or refitting for previously unseen lambdas

# Fit a highly adaptive lasso model and try to predict with different lambdas
hal_fit = fit_hal(X = x, Y = y, family = "binomial", lambda = c(0.001, 0.01, 0.1),
                  cv_select = FALSE, fit_type = "glmnet")
predict(hal_fit, new_data = head(x), type = "response")  # fitted lambdas
predict(hal_fit, new_data = head(x), lambda = c(0.1, 0.05), type = "response")  # custom lambdas
# --> hal9001 ignores the lambda parameter provided to predict, showing the same result in each case

Quasibinomial family

Hi Nima,

No hurry, but if you could implement family = "quasibinomial" as an option in HAL now that cv.glmnet allows it, that would be great!

Thanks,
Lauren
