jepusto / clubsandwich Goto Github PK

View Code? Open in Web Editor NEW

46.0 46.0 8.0 126.51 MB

Cluster-robust (sandwich) variance estimators with small-sample corrections

Home Page: http://jepusto.github.io/clubSandwich/

R 26.10% TeX 14.36% CSS 0.01% Shell 0.08% HTML 59.45%

clubsandwich's People

Contributors

Stargazers

Watchers

Forkers

jimhester windshield999 guhjy chrinide tappek meghapsimatrix s3alfisc jayhyung

clubsandwich's Issues

Update docs for conf_int()

conf_int() does not support test = "saddlepoint". Docs should briefly explain why.

methods for ivreg objects

Methods for new ivreg package. Likely quite similar to AER::ivreg.

vcovCR() on a plm object doesn't work within a function

Calling vcovCR() on a plm object does not work properly when it's called within a function.

Here's a reproducible example:

example.of.problem <- function(){
  require(plm)
  require(clubSandwich)
  our.data <- iris
  model.output <- plm(formula = "Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width", data = our.data, model = "within", index = "Species")
  vcovCR(model.output, cluster = our.data$Species, type="CR1S")
}
example.of.problem()

This results in an error message, whereas running the code that's inside the function works just fine if it's not inside a function.

add facilities for post-hoc comparisons

Perhaps see what could be done to interface with the multicomp package.

add lm function for 3-stage fixed-effects model

lm_absorb(y ~ R, FE_between = ~ S, FE_within = ~ T, ...)

Incomplete vignette?

Seems like the vignette "Robust-Variance-Estimation-with-clubSandwich" is incomplete and should be either revised or deleted? It ends mid-sentence--as does the source RMD file.

According to the Stata documentation (Stata User's Guide, Release 13, section 20.21.2, 'Correlated errors: Cluster–robust standard errors'), the equation for "CR1S" should be (m (N-1)) / [(m - 1)(N - p)] (instead of (m N) / [(m - 1)(N - p)]).

Revise documentation with mathjax

Added mathjaxr to package dependencies, so documentation can now include mathjax equations. Revise docs to make the equations pretty.

Need unit tests for lme/gls with non-default clustering

unit tests for lme/gls with non-default clustering

tidy tidy

When I use the tidy = TRUE argument with conf_int() it adds the name of the variable (like genderFemale) is the rowname. Could you make it so that the variables are a separate column in the data - so it's easier to filter without having to pull rownames and add it to the data?

coxph

Would it be possible to extend clubSandwich to handle coxph objects as well? Currently (v0.3.2), this gives an error due to lacking nobs function. Adding such a function does not help, as the below toy example illustrates:

set.seed(123)
library(clubSandwich)

# Avoids errors from clubSandwich, but doesn't solve the problem.
nobs.coxph <- function(f) f$n

# Generate some data.
n <- 200
dd <- data.frame(x1=rnorm(n), x2=rnorm(n))
dd$lp <- dd$x1 + dd$x2 # Linear predictor.
dd$time <- rexp(n=n, rate=exp(dd$lp)) # Survival time.
dd$event <- T # No censoring.
dd$id <- 1:n # No clustering.

# Classical robust standard errors (works).
fit1 <- coxph(Surv(time, event) ~ x1 + x2 + cluster(id), data=dd)
vcov1 <- vcov(fit1)

# Robust standard errors from clubSandwich (doesn't work).
fit2 <- coxph(Surv(time, event) ~ x1 + x2, data=dd)
vcov2 <- vcovCR(fit2, cluster=dd$id, type="CR0")

# In theory, vcov2 should equal vcov2. However, it is much larger:
print(vcov2 / vcov1)

nobs.plm can be removed as there is now plm:::nobs.panelmodel on CRAN

Referring to (the beginning of) #13:
you can remove the nobs.plm method from clubSandwich for your next CRAN release (if you wish). There is now plm:::nobs.panelmodel in the latest CRAN release of plm (version 1.6-4) and you can rely on this if you want to (nobs.panelmodel also captures plm objects).

BTW: there is also support for clubSandwich's vcovs for plm::pwaldtest and hence one can pass a clubSandwich vcov in summary.plm and get identical results for the F statistic and partial wald tests on coefficients:

library(plm)
library(clubSandwich)
data(MortalityRates)

# partly from clubSandwich's vignette
# subset for deaths in motor vehicle accidents, 1970-1983
MV_deaths <- subset(MortalityRates, cause=="Motor Vehicle" & 
                      year <= 1983 & !is.na(beertaxa))

plm_unweighted <- plm(mrate ~ legal + beertaxa, data = MV_deaths, 
                      effect = "individual", index = c("state","year"))

vcov_club_id   <- vcovCR(plm_unweighted, type = "CR0", cluster = "individual")
vcov_plm_id    <- vcovHC(plm_unweighted, method="arellano", type="HC0", cluster = "group")

summary(plm_unweighted, vcov = vcov_plm_id) 
summary(plm_unweighted, vcov = vcov_club_id)
pwaldtest(plm_unweighted, test = "F", vcov = vcov_plm_id)
pwaldtest(plm_unweighted, test = "F", vcov = vcov_club_id)

Bug in findcluster.rma.mv()

clustering variable is not detected models that include both inner | outer and ~ 1 | id random effects structures.

Possible values of the "type" argument of vcovCR() are not documented

There is no documantion describing possible values of the "type" argument to vcovCR(), neither in the "Arguments" sectionnor in the "Details" section nor even in the (lacking) "Example" section of the help file.

Typo in Meta-analysis with CRVE

The first line has a typo: "This vignette dimeonstrates how to use the clubSandwich package"

add tests for rma.mv() with random slopes

Development vresion of metafor now allows for random slopes in rma.mv(). Add tests for whether current clubSandwich methods are adequate for such models (and if not, fix!).

add methods for mlm objects.

Validate methods for plm instrumental variables models.

targetVariance does not work for 3-level lme

Generalize targetVariance.lme to work on models with multiple levels of nesting.

Add option to display t-statistics in output

It would be useful to have the option to display cluster-robust t-statistics in the output of coef_test()

Use matrix-lists for block-diagonal target and weight matrices

targetVariance and weightMatrix should return matrix lists for gls, lme, plm-RE, rma-mv.

A more informative error message for missing data?

Hi!

Noticed this thing, which caused a bit of trouble. Could we maybe have a more informative message saying that you can't have missings, or maybe build in a way to deal with them?

This gives "Error in t(x) %*% w : non-conformable arguments":

data <- data.frame(value = rnorm(500, mean = 100, sd = 10),
cluster = sample(c(letters[1:10], NA), 500, replace = TRUE))

lm_object <- lm(value ~ 1, data = data)
clubSandwich::vcovCR(lm_object, cluster = data$cluster, type = "CR1S")

Removing NA helps:

data <- data.frame(value = rnorm(500, mean = 100, sd = 10),
cluster = sample(c(letters[1:10], NA), 500, replace = TRUE)) %>%
na.omit()

lm_object <- lm(value ~ 1, data = data)

clubSandwich::vcovCR(lm_object, cluster = data$cluster, type = "CR1S")

superfluous dot in printed output: "Std..Error"

Just a minor cosmetic "Std..Error" in printed output of coef_test, see:

library(clubSandwich)
data(MortalityRates)

MV_deaths <- subset(MortalityRates, cause=="Motor Vehicle" & 
                      year <= 1983 & !is.na(beertaxa))
plm_unweighted <- plm(mrate ~ legal + beertaxa, data = MV_deaths, 
                      effect = "individual", index = c("state","year"))
coef_test(plm_unweighted, vcov = "CR1", cluster = "individual", test = "naive-t")

# Coef Estimate Std..Error p-val (naive-t) Sig.
#1    legal     4.42       2.13           0.043    *
#2 beertaxa    30.25       5.99          <0.001  ***

support for MASS::polr

I have build an ordered logistic regression model with MASS::polr and would like to get clustered standard errors for the marginal predictions. However this throws the following error, suggesting that MASS::polr is currently not supported:
#> Error in split.default(resid, cluster): first argument must be a vector

add confidence intervals for regression coefficients.

add robust score test.

methods for glmmTMB objects?

Possible to add methods for models fitted with glmmTMB? The package supports linear and generalized linear mixed models where random effects are gaussian and on the scale of the linear predictor.

Possible bug in multi-level multi-variate model with lme()

library(RCurl)
library(nlme)
library(clubSandwich)

### get data
tmp <- getURL("https://raw.githubusercontent.com/wviechtb/multivariate_multilevel_models/master/data.dat")
dat <- read.table(text=tmp, header=TRUE, sep="\t")

### calculate PA and NA
dat$pa <- rowMeans(dat[, grepl("pa", names(dat))])
dat$na <- rowMeans(dat[, grepl("na", names(dat))])

### keep only variables that are needed
dat <- dat[, c("id", "sex", "beep", "pa", "na")]

### change into very long format
dat <- reshape(dat, direction="long", varying=c("pa","na"), v.names="y", idvar="obs", timevar="outcome")
dat$obs <- NULL
dat <- dat[order(dat$id, dat$beep, dat$outcome),]
rownames(dat) <- 1:nrow(dat)
dat$outnum <- dat$outcome
dat$outcome <- factor(dat$outcome)

### fit multivariate multilevel model
res1 <- lme(y ~ outcome - 1,
            random = ~ outcome - 1 | id,
            weights = varIdent(form = ~ 1 | outcome),
            correlation = corSymm(form = ~ outnum | id/beep),
            data = dat, na.action = na.omit)
summary(res1)

# Error in tcrossprod(s) * R : non-conformable arrays
coef_test(res1, vcov = "CR2")
coef_test(res1, vcov = "CR2", cluster = dat$id)

# the error stems from vcovCR()
vcovCR(res1, type = "CR2", cluster = dat$id)
vcovCR(res1, type = "CR0", cluster = dat$id)

The error goes away when I remove the correlation structure on the errors:

res2 <- lme(y ~ outcome - 1,
            random = ~ outcome - 1 | id,
            weights = varIdent(form = ~ 1 | outcome),
            data = dat, na.action = na.omit)
summary(res2)
coef_test(res2, vcov = "CR2")

But gls() with such a correlation structure works:

res3 <- gls(y ~ outcome - 1,
           weights = varIdent(form = ~ 1 | outcome),
           correlation = corSymm(form = ~ outnum | id/beep),
           data = dat, na.action = na.omit)
summary(res3)
coef_test(res3, vcov = "CR2")
coef_test(res3, vcov = "CR2", cluster = dat$id)

Maybe it's because 'res1' uses 'id' as the grouping variable for 'random' while the correlation structure is defined with respect to 'id/beep'?

error handling in Wald_test().

Wald_test() should return an informative error message if test does not match any of the available options.

Also, Wald_test() currently fails (in an uninformative way) for an intercept-only model with constrain_zero(1). Should return tests equivalent to the coef_test() result.

multiple correction

Can you adjust the p-value for multiple correction? That way you can have the symbols for the standard alpha values without having to mess with correcting alpha. Like what p.adjust() does.

Example:

if (mc_method == "Bonferroni"){

res$p_val <- res$p_val * q # q being number of contrasts
res$sig <- symnum(x$p_val, corr = FALSE, na = FALSE,
cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
symbols = c("", "", "", ".", " "))

}

Possibly incorrect finite sample adjustment for CR1S w/ multivariate model

Hi!

Was benchmarking an estimator I'm working on and noticed that I believe your finite sample adjustment of the "stata" flavor may be wrong when working with multiple outcomes. Here's a MWE:

lmo <- lm(cbind(mpg, hp) ~ wt, data = mtcars)
clubSandwich::vcovCR(lmo, cluster = mtcars$cyl, type = "CR0")

# Manually build correction
J <- length(unique(mtcars$cyl))
n <- nrow(mtcars)
r <- 2

# Not equivalent
clubSandwich::vcovCR(lmo, cluster = mtcars$cyl, type = "CR1S")
clubSandwich::vcovCR(lmo, cluster = mtcars$cyl, type = "CR0") * ((J * (n - 1)) / ((J - 1) * (n - r)))

# can see here that your N is actually n * ny, where ny is the number of models
# and r is actually r*ny, where r is the rank of X
debugonce(clubSandwich:::CR1S)
clubSandwich::vcovCR(lmo, cluster = mtcars$cyl, type = "CR1S")

Now, I'm not actually sure that the correction you are using is wrong, but it does mean that the block diagonals of your vcov matrix are not the same as you would get from running the regression separately for each outcome, which I believe it should be.

Specifying cluster variable sometimes fail

Hi all!
I'm trying to specify a cluster variable after plm for my simulated data (which I use for power simulation), but I get the following error message:
"Error in [.data.frame(eval(mf$data, envir), , index_names) : undefined columns selected"

I'm not sure if this is specific to vcovCR or something general about R, but could anyone tell me what's wrong with my code?
My example (data file is in the attached zip file):
library(clubSandwich)
library(plm)
load("Data.RData")
fe.fit <- plm(formula = Y ~ T, data = Data, model = "within", index = "id",effect = "individual", singular.ok = FALSE)
vcovCR(fe.fit,cluster=Data$id,type = "CR2") # doesn't work, but I can run this by not specifying cluster as in the next line
vcovCR(fe.fit,type = "CR2")
vcovCR(fe.fit,cluster=Data$gid,type = "CR2") # I ultimately want to run this

For some other datasets, for example Produc in plm package, this works:
library(clubSandwich)
library(plm)
data("Produc", package = "plm")
plm_FE <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"),
effect = "individual", model = "within")
vcovCR(plm_FE, type="CR2",cluster = Produc$state)

Thanks a lot!
Data.zip

detect crossed random effects in lme

Throw an error if lme model contains cross random effects (i.e., covariance matrix that is not block-diagonal).

nobs.plm

I noticed there is a nobs.plm function in this package. This makes sense because the current CRAN release of plm does not have this method. Since some time, it is implementend in the development version of plm (https://r-forge.r-project.org/projects/plm/).

Just a tiny note about clubSandwich's implementation:

 nobs.plm <- function(object, ...) {
   NROW(object$model)
 }

This should work for FE and RE models. For FD models, the original/undifferenced data is kept in object$model, thus this will likely yield a somewhat wrong number. However, I don't know if FD models are supported by clubSandwich. I recommend to use length(object$residuals) which will give the number of (differenced) observations used in the (FD) model estimation and does the trick for FE and RE models as well. This is likely the number we want here?

add complete data from Kalaian & Raudenbush (1996)

coef_test with fixed corelation matrices

Hi @jepusto

In ecology and evolution, meta-analysis often has correlation matrices for a random factor. But I tried coef_test with such a model. Then I get this:

> coef_test(mod4, vcov="CR2", cluster = dat$paper)
Error in vcovCR.rma.mv(obj, type = vcov, ...) : 
  vcovCR.rma.mv() does not work with fixed correlation matrices in the R argument.

Do you think you could make this work with such a model?

Thanks!

Control parallel processing

Great package. But I was struggling to find a way, without hacking all of the code, to control how many cores are utilized in simulations. Could you add that, please?

methods for lme4 objects.

merDeriv package (https://www.jstatsoft.org/article/view/v087c01) should make this quite a bit easier.

"CR1" Formula

Shouldn't the adjustment factor for "CR1" be m / (m - p), where m is the number of clusters and p the rank of the model matrix (instead of m / (m - 1))? This would be analogous to 'HC1'. See, for example, MacKinnon and White (1985).

break up bread and meat?

Hi @jepusto,

@josherrickson and I were wondering, does clubSandwich have a route to/user-facing mechanism for calculating meat only, ie assuming the user will construct his or her own bread matrix or get it from elsewhere? For us something like that would be pretty nifty.

Cool project! --ben

add cluster-Wild bootstrap test.

clubSandwich moved to archive on CRAN

Hi, clubSandwich was recently removed from CRAN (https://cran.r-project.org/web/packages/clubSandwich/) - are there any plans to fix the issues and re-submit? I'm asking because one of my packages suggests clubSandwich, and I would have to remove it from the suggested packages if it will not come back to CRAN any time soon...

Consider using sandwich::bread.mlm()

On suggestion from Dr. Zeileis, consider using the following:

bread.mlm <- function(x, ...)
{
   if(!is.null(x$na.action)) class(x$na.action) <- "omit"
   cf <- coef(x)
   rval <- summary.lm(x)
   rval <- kronecker(
     structure(diag(ncol(cf)), .Dimnames = rep.int(list(colnames(cf)), 2L)),
     structure(rval$cov.unscaled,  .Dimnames = rep.int(list(rownames(cf)), 2L)) * as.vector(sum(rval$df[1:2])),
     make.dimnames = TRUE
   )
   return(rval)
}

methods for fixest

Possible to add methods for models from fixest package?

Two-way clustering for plm object?

Hi! I'm trying to do two-way clustering with vcovCR() command after running fixed effects model with plm() by vcovCR(plm_FE,type="CR2",cluster=c("data$id","data$year")), but it returns the same covariance matrix estimate as one-way clustering. Is this covariance matrix estimate correct, or something is wrong with the syntax?

Sample code (adapted from the vcovCR.plm() help menu code):
library(plm)
data("Produc", package = "plm")
plm_FE <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"),
effect = "individual", model = "within")
vcovCR(plm_FE, type="CR2") % one-way clustering
vcovCR(plm_FE, type="CR2",cluster = c(Produc$state)) % same as above - one-way clustering
vcovCR(plm_FE, type="CR2",cluster = c(Produc$state,Produc$year)) % supposed to be two-way clustering, but returns the same covariance matrix as one-way clustering

model_matrix.plm for RE models

Could you check please: in model_matrix.plm for RE models, the model matrix is extracted by:
model.matrix(Formula::as.Formula(formula(obj)), model.frame(obj))

This will give the untransformed data (as it is in the model frame) - intended? What is the reason behind the special treatment of RE models here?

Write methods for ivreg objects.

Need to distinguish between projection matrix and second stage model matrix.

Let impute_covariance_matrix() take a patterned correlation matrix as input.

Modify impute_covariance_matrix() to take a vector describing categories of effect sizes and a matrix describing the correlations between effect sizes of each pair of categories (and within category). Return a variance-covariance matrix (or list of vcov matrices) with the implied correlations.

Here's an example dataset and pattern matrix:

set.seed(20200910)
cluster <- factor(rep(1:3, each = 4))
cati <- sample(letters[1:4], 12, replace = TRUE)
vi <- 1 /rchisq(12, df = 6)
dat <- data.frame(cluster, cati, vi)

r_pattern <- matrix(c(0.9, 0.8, 0.7, 0.6, 
                      0.8, 0.85, 0.7, 0.5,
                      0.7, 0.7, 0.88, 0.5,
                      0.6, 0.5, 0.5, 0.92), 4, 4,
                    dimnames = list(letters[1:4],letters[1:4]))

Here's how the syntax might look:

impute_covariance_matrix(vi, cluster, cati, pattern)

error with conf_int() when clustering on a single study

Hi James,

Many thanks for creating this fantastic package!
We aim at applying RVE to summarize R-squared values that we calculated based on data from several large-scale assessment studies. The sample size at the study level is quite small, ranging between 1 study and 6 studies, depending on the covariate set under investigation. To account both for within- and between-study variation and to model the dependencies within studies, we thought about fitting CHE models (as outlined in Pustejovsky & Tipton, 2020 [https://osf.io/preprints/metaarxiv/vyfcj/]) using metafor and clubSandwich.
As described above, there are situations where we would like to summarize across only a single study. In these cases, we still would like to account for the fact that the effect sizes stem from the same study. So we tried to fit the following model (using sample data here):

# install.packages("metafor")
# remotes::install_github("jepusto/clubSandwich")

library(metafor) # version 2.4-0
library(clubSandwich) # version 0.5.3

sessionInfo()
## R version 4.0.4 (2021-02-15)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19042)

# sample data with 1 study and 5 effect sizes
set.seed(1234)
dat <- data.frame(
  study = "study1", # study number
  est = runif(5, 0.1, 0.6), # R-squared values
  se = runif(5, 0.005, 0.025), # standard errors of R-squared values
  es_id = 1:5 # effect size ID
)

print(dat)
##    study       est          se es_id
## 1 study1 0.1568517 0.017806212     1
## 2 study1 0.4111497 0.005189915     2
## 3 study1 0.4046374 0.009651010     3
## 4 study1 0.4116897 0.018321675     4
## 5 study1 0.5304577 0.015285023     5

# sampling variance-covariance matrix
v_mat <- impute_covariance_matrix(dat$se^2, cluster = dat$study, r = 0.8)

# working model in metafor
res <- rma.mv(yi = est, V = v_mat, random = ~ 1 | study/es_id, data = dat)

metafor throws out the following warning:

In rma.mv(yi = est, V = v_mat, random = ~1 | study/es_id, data = dat) :
  Single-level factor(s) found in 'random' argument. Corresponding 'sigma2' value(s) fixed to 0.

Since fixing the between-cluster heterogeneity to zero is exactly what we intended to do here, we ignore this warning. However, when calling clubSandwich::conf_int(), we get the following error message:

# clustered SEs and CIs
ci <- conf_int(res, "CR2")

Error in `diag<-`(`*tmp*`, value = diag(P_array[, , i]) + P_diag[i, ]) : 
  only matrix diagonals can be replaced

Regardless of the fact that RVE with Satterthwaite correction might produce imprecise estimates with less than 5 studies anyway (as mentioned in Tipton, 2015), I assume that it simply won't work to cluster on only a single study, right? (Fixing the between-cluster variance with 2 or more studies by means of sigma2 = c(0, NA) in metafor::rma.mv() does not produce this error message.) If this is the case, is it possible to adjust the error message accordingly? Or are we overlooking something?

Many thanks in advance for your effort.
All the best,
Sophie

method for lfe::felm() objects

Really like the look of this package. Thanks for putting it together.

Would it be possible to add a method for lfe::felm()? For many of us, this is the go-to approach for estimating high-dimensional FE models in R. The function already includes native cluster-robust SE estimation, but I don't think it implements a small cluster correction of the type presented here.

One potential complication is that the fixed effects are "projected out" by the underlying algorithm... which is why felm() is able to able to estimate HDFE models so quickly in the first place. But this may limit how well it plays with clubSandwich.

PS - A link to the lfe package.