fbartos / zcurve

zcurve R package for assessing the reliability and trustworthiness of published literature with the z-curve method

Home Page: https://fbartos.github.io/zcurve

R 74.65% C++ 24.92% TeX 0.43%

zcurve's Introduction

zcurve

This package implements z-curves: methods for estimating expected discovery and replicability rates on the basis of test statistics of published studies. The package provides functions for fitting the new density and EM versions (Bartoš & Schimmack, in preparation) as well as the original density z-curve (Brunner & Schimmack, 2020). It also provides summary and plotting functions for fitted z-curve objects. See the aforementioned articles for more information about z-curves, expected discovery and replicability rates, validation studies, and limitations.

Installation

You can install the current version of zcurve from CRAN with:

install.packages("zcurve")

or the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("FBartos/zcurve")

Example

Z-curve can be used to estimate the expected replicability rate (ERR) and expected discovery rate (EDR) using z-scores from a set of significant findings. This example reproduces one from Bartoš and Schimmack (in preparation), where z-curve is used to estimate the ERR and EDR on a subset of studies from the Reproducibility Project (OSC, 2015). Only studies with unambiguous original outcomes are used, excluding studies with “marginally significant” original findings, which leaves 90 studies. Of these 90 studies, 35 were successfully replicated.

We included the recoded z-scores from the 90 OSC studies as a dataset in the package (‘OSC.z’). The expectation-maximization (EM) version of z-curve is the default method; it can be fitted (with 1000 bootstraps) and summarized using the ‘zcurve’ and ‘summary’ functions.

The first argument to the function call is a vector of z-scores. Alternatively, a vector of two-sided p-values can be supplied by specifying ‘zcurve(p = p.values)’.
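The link between the two input types can be sketched in base R: a two-sided p-value corresponds to an absolute z-score through the standard normal quantile function. The vector `p.values` below is made-up illustration data, and this is only the textbook conversion, not a claim about how the package handles the ‘p’ argument internally.

```r
# Hypothetical two-sided p-values (illustration only, not real data)
p.values <- c(0.001, 0.010, 0.049, 0.200)

# Two-sided p-value -> absolute z-score (upper tail of the standard normal)
z.from.p <- qnorm(p.values / 2, lower.tail = FALSE)

# Round trip: absolute z-score -> two-sided p-value
p.from.z <- 2 * pnorm(-abs(z.from.p))

# Which results are significant at the two-sided 5% level?
is.significant <- z.from.p > qnorm(0.975)

data.frame(p = p.values, z = round(z.from.p, 3), significant = is.significant)
```

A vector like this is far too short to fit; with real data, either form can be passed to ‘zcurve’, as z-scores in the first argument or as p-values through ‘p’.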

set.seed(666)
library(zcurve)
#> Please, note the following changes in version 1.0.9 (see NEWS for more details):
#> - The ERR estimate now takes the directionality of the expected replications into account, which might lead to slight changes in the estimates.

fit <- zcurve(OSC.z)

summary(fit)
#> Call:
#> zcurve(z = OSC.z)
#> 
#> model: EM via EM
#> 
#>     Estimate  l.CI  u.CI
#> ERR    0.615 0.443 0.740
#> EDR    0.388 0.070 0.699
#> 
#> Model converged in 27 + 783 iterations
#> Fitted using 73 z-values. 90 supplied, 85 significant (ODR = 0.94, 95% CI [0.87, 0.98]).
#> Q = -60.61, 95% CI[-72.24, -46.24]

More details can be extracted from the fitted object. For additional statistics, such as the expected number of conducted studies, the file drawer ratio, or Sorić’s FDR, specify ‘all = TRUE’.

summary(fit, all = TRUE)
#> Call:
#> zcurve(z = OSC.z)
#> 
#> model: EM via EM
#> 
#>               Estimate  l.CI   u.CI
#> ERR              0.615 0.443  0.740
#> EDR              0.388 0.070  0.699
#> Soric FDR        0.083 0.023  0.705
#> File Drawer R    1.574 0.430 13.387
#> Expected N         219   122   1223
#> Missing N          129    32   1133
#> 
#> Model converged in 27 + 783 iterations
#> Fitted using 73 z-values. 90 supplied, 85 significant (ODR = 0.94, 95% CI [0.87, 0.98]).
#> Q = -60.61, 95% CI[-72.24, -46.24]
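The additional rows follow arithmetically from the EDR and the study counts. The base R sketch below recomputes them from the rounded values printed above. The formulas are the usual definitions (Sorić's bound at alpha = 0.05, non-significant per significant result, significant results divided by the EDR) and are assumed here, not taken from the package source; the variable names are illustrative.

```r
# Values copied (rounded) from the summary output above
EDR           <- 0.388
n.supplied    <- 90
n.significant <- 85
alpha         <- 0.05

# Observed discovery rate: share of supplied results that are significant
ODR <- n.significant / n.supplied

# Soric's upper bound on the false discovery rate implied by the EDR
soric.FDR <- (1 / EDR - 1) * (alpha / (1 - alpha))

# File drawer ratio: expected non-significant results per significant result
file.drawer.ratio <- (1 - EDR) / EDR

# Expected number of conducted studies, and how many of them are missing
expected.N <- n.significant / EDR
missing.N  <- expected.N - n.supplied

round(c(ODR = ODR, Soric.FDR = soric.FDR, File.Drawer.R = file.drawer.ratio), 3)
round(c(Expected.N = expected.N, Missing.N = missing.N))
```

Because the input EDR is rounded to three decimals, the results agree with the printed estimates only up to rounding (the file drawer ratio comes out near 1.58 rather than 1.574).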

For more information about the fitted model weights, add ‘type = "parameters"’.

summary(fit, type = "parameters")
#> Call:
#> zcurve(z = OSC.z)
#> 
#> model: EM via EM
#> 
#>   Mean  Weight  l.CI  u.CI
#> 1 0.000  0.056 0.000 0.445
#> 2 1.000  0.005 0.000 0.374
#> 3 2.000  0.734 0.002 0.999
#> 4 3.000  0.205 0.000 0.640
#> 5 4.000  0.000 0.000 0.000
#> 6 5.000  0.000 0.000 0.000
#> 7 6.000  0.000 0.000 0.000
#> 
#> Model converged in 27 + 783 iterations
#> Fitted using 73 z-values. 90 supplied, 85 significant (ODR = 0.94, 95% CI [0.87, 0.98]).
#> Q = -60.61, 95% CI[-72.24, -46.24]

The package also provides a convenient plotting method for the z-curve fits.

plot(fit)

The default plot can be further modified using classic R plotting arguments such as ‘xlab’, ‘ylab’, ‘main’, ‘cex.axis’, and ‘cex.lab’. Furthermore, an annotation with the main test statistics can be added by specifying ‘annotation = TRUE’, and pointwise confidence intervals by specifying ‘CI = TRUE’. For more options regarding the annotation, see ‘?plot.zcurve’.

plot(fit, CI = TRUE, annotation = TRUE, main = "OSC 2015")

Other versions of z-curve can be fitted by changing the ‘method’ argument of the ‘zcurve’ function. Set ‘method = "density"’ to fit the new version of z-curve using the density method (KD2). The original version of the density method, as implemented in Brunner and Schimmack (2020), can be fitted by adding ‘list(model = "KD1")’ to the ‘control’ argument of ‘zcurve’.

(We omit bootstrapping to speed up fitting in this case.)

fit.KD2 <- zcurve(OSC.z, method = "density", bootstrap = FALSE)
fit.KD1 <- zcurve(OSC.z, method = "density", control = list(model = "KD1"), bootstrap = FALSE)

summary(fit.KD2)
#> Call:
#> zcurve(z = OSC.z, method = "density", bootstrap = FALSE)
#> 
#> model: KD2 via density
#> 
#>     Estimate
#> ERR    0.613
#> EDR    0.506
#> 
#> Model converged in 47 iterations
#> Fitted using 73 z-values. 90 supplied, 85 significant (ODR = 0.94, 95% CI [0.87, 0.98]).
#> RMSE = 0.11

summary(fit.KD1)
#> Call:
#> zcurve(z = OSC.z, method = "density", bootstrap = FALSE, control = list(model = "KD1"))
#> 
#> model: KD1 via density (version 1)
#> 
#>     Estimate
#> ERR    0.634
#> 
#> Model converged in 141 iterations
#> Fitted using 73 z-values. 90 supplied, 85 significant (ODR = 0.94, 95% CI [0.87, 0.98]).
#> MAE (*1e3) = 0.25

The ‘control’ argument can be used to change the number of iterations or to reduce the convergence criterion in cases of non-convergence. It can also be used to construct custom z-curves by changing the locations of the mean components, their number, or many other settings. However, bear in mind that such custom models need to be validated in simulation studies before they are used. For more information about the control settings, see ‘?control_EM’, ‘?control_density’, and ‘?control_density_v1’.

If you encounter any problems or bugs, please, contact me at f.bartos96[at]gmail.com or submit an issue at https://github.com/FBartos/zcurve/issues. If you like the package and use it in your work, please, cite it as:

citation(package = "zcurve")
#> 
#> To cite the zcurve package in publications use:
#> 
#> Bartoš F, Schimmack U (2020). "zcurve: An R Package for Fitting
#> Z-curves." R package version 1.0.9, <URL:
#> https://CRAN.R-project.org/package=zcurve>.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Misc{,
#>     title = {zcurve: An R Package for Fitting Z-curves},
#>     author = {František Bartoš and Ulrich Schimmack},
#>     year = {2020},
#>     note = {R package version 1.0.9},
#>     url = {https://CRAN.R-project.org/package=zcurve},
#>   }

Sources

Bartoš, F., & Schimmack, U. (2020, January 10). Z-Curve.2.0: Estimating Replication Rates and Discovery Rates. https://doi.org/10.31234/osf.io/urgtn

Brunner, J., & Schimmack, U. (2020). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology, 4.

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.


zcurve's Issues

False discovery rate

Hi zcurve developer

I am curious about how to get the false discovery rate from zcurve package, as described in this paper:
Schimmack U, Bartoš F (2023) Estimating the false discovery risk of (randomized) clinical trials in medical journals based on published p-values. PLOS ONE 18(8): e0290084. https://doi.org/10.1371/journal.pone.0290084

I could not find this estimate. See the reproducible example below:
set.seed(666)
library(zcurve)
fit <- zcurve(OSC.z)
summary(fit)

BTW, would you like to show how to use swfdr package to compute the false positive rate using the dataset OSC.z? I want to compare the two packages.

Best,
Yefeng

relationship between z value and replication rate

Hey @DominikVogel @FBartos, I am addicted to your method. I am curious whether there is a way to construct the relationship between z value and replication rate. I would like to make a plot with the z value on the x-axis and the replication rate estimate on the y-axis. My main point is that, for non-normal distributions, it is not that meaningful to calculate the average power or the so-called expected replication rate. Rather, we should visualize the relationship between z value and the replication rate estimate.

plot zcurve object using ggplot(2)

Hey, two great researchers @DominikVogel @FBartos, just want to know if the ggplot version of plotting zcurve object is available? If not, can you instruct me how to do it?

Best,
Yefeng

How does z-curve deal with censored p-values?

@FBartos thank you for this package!

I was wondering whether you describe anywhere how z-curve deals with censored p-values? I can't find it in the papers or the package docs, but might have overlooked something? We are in the process of writing a Registered Report using zcurve, and need to explain that there.

My understanding so far is that ps > .05 are just ignored. But what happens with ps < .05? Are they just used in the EM algorithm as is, with the bounds that are passed? I see some transformation steps in the code that I can't quite figure out - are they just about ensuring that the lower bound is above 0?

Finally (and feel free to ignore this part as it is not about zcurve per se), might you be able to sense-check my understanding of how EM deals with censored values? My understanding is that on each iteration, the model essentially predicts the exact value of the censored observations based on the current parameter estimates, then updates the model parameters to maximise the log-likelihood, and then iterates again. Does that sound right?

List of Articles - script?

Thank you for your work in making Psychology research more accountable and valid.

It is hinted on replicationindex.com that there is a script that extracts (pdf > doc > list) all the data from a list of an author's articles, so that this does not have to be done manually.

Is this available?

A bit unsure how to begin

Hi devs
I like your package and am looking to use it in my upcoming thesis
However, I have no experience with R or anything. I was wondering if you knew of a guide or something?

I have downloaded the package, I can run the arguments from your README file, but I am unsure how I should input the data I have collected. In csv?

I hope you can help
Dio

fit z-curve (mixture model) with all z-values rather than only statistically significant ones

@FBartos @gaborcsardi I would be grateful, if you would like to tell me how to fit a collection of z values without truncation at 1.96. I mean z-curve only uses the statistically significant z-values to fit the mixture model. But how to use all z values regardless of the statistical significance. The reason why I ask this is because I want to test if a dataset without publication bias (this can be guaranteed by Registered Reports), the EDR derived from a mixture model fitted with only statistically significant z-values should be similar to that fitted with all z-values regardless of the statistical significance.

Best,
Yefeng

Won't run even on the example dataset

I tried running this both on my own data, and then also just running the example code from the readme file, and both times got the same error message

set.seed(666)
library(zcurve)

fit <- zcurve(OSC.z)

Error in density.default(augZ, n = 100, bw = bw, from = 1.96, to = 6) : need at least 2 points to select a bandwidth automatically

Not explicitly labelling p-values as an argument

This might be a "unique to me" issue, but it took me ages to resolve. I was inputting p-values rather than Z-scores, but was using lazy evaluation and writing zcurve([vector of p-values])

Wonder if a line at Line 71 or so to check that the max(z) is over 1 would help any other idiots.
if (max(z) < 1) stop("It looks like you are entering p-values rather than z-scores. To use p-values, explicitly name your argument: zcurve(p = [vector of p-values])")

add observed discovery rate to output

Thanks for this very useful and well-made package.

The plot gives us the observed discovery rate and a 95% CI, but as far as I can see, these numbers are not available in the results returned by z-curve (not even when all = TRUE). To report results in R Markdown, it would be useful to have these numbers in the output of the z-curve function (maybe including the number of studies included in the analysis and the number of significant results, which are also in the plot but not in the output).

Misleading error message when empty vector is passed

I use zcurve in an app, and took quite a long time to figure out what was going on in a case like the below - maybe there could be a less misleading/more explicit error message?

zcurve::zcurve(z = numeric(0))
#> Error in zcurve::zcurve(z = numeric(0)): It looks like you are entering p-values rather than z-scores. To use p-values, explicitly name your argument 'zcurve(p = [vector of p-values])'

Created on 2024-05-23 with reprex v2.1.0
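One plausible explanation, assuming the input check compares ‘max(z)’ against 1 (as the check suggested in the issue above does), is that the maximum of an empty numeric vector is -Inf in R, so the p-value branch fires. The sketch below demonstrates this in base R and shows a hypothetical helper, `check_z_input`, that tests the length first. It is not part of zcurve.

```r
# max() of an empty numeric vector is -Inf (R also emits a warning),
# so a test like `max(z) < 1` is TRUE for empty input.
empty.max <- suppressWarnings(max(numeric(0)))
empty.max < 1

# Hypothetical helper (not part of zcurve): check the length before the range
check_z_input <- function(z) {
  if (!is.numeric(z)) stop("'z' must be a numeric vector.")
  if (length(z) == 0) stop("'z' is empty: no z-scores were supplied.")
  if (max(abs(z), na.rm = TRUE) < 1) {
    stop("All |z| < 1: these look like p-values; use the 'p' argument instead.")
  }
  invisible(z)
}

# Capture the error message produced for empty input
msg <- tryCatch(check_z_input(numeric(0)),
                error = function(e) conditionMessage(e))
msg
```

Checking the length first keeps the p-value hint for the case it was written for and gives empty input its own message.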

zcurve() not working

I've tried using a data frame column of 67 separate numeric values (z scores) and it throws the following error.

OUTPUT:
Error in zcurve(df$z_score) :
There must be at least 10 z-scores in the fitting range but a much larger number is recommended.

Additionally trying to use sample data provided throws an error the first time I run it and then stalls out if I run it again.

OUTPUT:

fit <- zcurve(OSC.z)
Error in .zcurve_EM_start_fast_RCpp(x = z, K = control$K, mu = control$mu, :
function 'Rcpp_precious_remove' not provided by package 'Rcpp'

Z-CURVE ANALYSIS

library(zcurve)
fit <- zcurve(OSC.z)
--- Rstudio stalls out after this.

Option to change the default CI

First of all, I want to thank you for this great package!

I have a question regarding the confidence intervals for the observed discovery rate (ODR) and the expected discovery rate (EDR). It would be great to test whether the EDR and ODR differ significantly. To do so, one could test whether the 90% confidence intervals overlap. However, zcurve() only offers 95% CIs. Is there a way to get 90% CIs? The alpha option does not seem to help, since it changes the whole estimation.
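For the ODR alone, an interval at any level can be computed outside the package, since the ODR is a simple binomial proportion. The sketch below uses base R's exact (Clopper-Pearson) interval with the counts from the README example (85 significant out of 90 supplied). Whether zcurve uses this interval type is an assumption, and the EDR interval cannot be obtained this way because it comes from the bootstrap.

```r
# Counts taken from the README example output
n.supplied    <- 90
n.significant <- 85

# Exact binomial (Clopper-Pearson) intervals for the ODR at two levels
ci.95 <- binom.test(n.significant, n.supplied, conf.level = 0.95)$conf.int
ci.90 <- binom.test(n.significant, n.supplied, conf.level = 0.90)$conf.int

rbind(
  "95%" = round(as.numeric(ci.95), 3),
  "90%" = round(as.numeric(ci.90), 3)
)
```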
