The following is a walk-through of the methods we used for the paper "Measuring Transparency in the Social Sciences." It is written in an informal style and goes into more detail than the paper allowed in order to explain the technical details and the specific code files that do the work for the project.
Much of the repository is messy. We beg the reader's clemency. The project was exploratory and iterative, and it was difficult to conceptualise at the beginning how the workflow could have been turned into an end-to-end pipeline while it was still being constructed. Future work in this area will not suffer the same deficit. Nevertheless, the following should give a sense for how the project came together and provide enough information for an interested reader to reconstruct our steps.
Our study design called for a comprehensive analysis of population-level data. Our two populations of interest --- (1) papers using data and statistics, and (2) papers reporting original experiments --- were drawn from all political science and international relations publications in target journals. We downloaded all of the journals' papers from 2010 to July 2022. Once we had these papers, we identified the data, statistical, and experimental papers through dictionary-based feature engineering and machine learning. We then used public APIs, web scraping, and text analysis to identify which of the studies had replication materials. We outline this process below.
Files:
- /ref/JCR_Political_Science.csv
- /ref/JCR_International_Relations.csv
These files were obtained directly from the Clarivate subscription-only website in August 2021.
Code:
- /code/R_updated/1.1_pull-crossref-data.R # Gets data from Crossref for each journal and dumps it into a named, date-stamped folder
- /code/R_updated/1.2_join-clean-crossref-data.R # Cleans that data
Identifying the papers that relied on data, statistical analysis, and experiments was an iterative process. In each case we read target papers and devised a dictionary of terms meant to uniquely identify others like them. We revised these dictionaries extensively to arrive at terms that best discriminated for the target papers.
Dictionary files:
- /output/experimental_dict.txt
- /output/quant_dict_w_cats.txt
The dictionaries were then used with custom functions to create document-feature matrices (DFMs), where each paper is an observation, each column a dictionary term, and each cell a count of that term.^[A custom function was preferable to existing text analysis libraries like quanteda because of our need to match regular-expression and wildcard (asterisk) patterns.] The DFM format made the papers amenable to large-scale analysis. In machine learning parlance, this process is known as feature engineering.
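To give a flavour of this feature-engineering step, here is a minimal sketch --- in Python rather than the project's R, with a two-term dictionary invented purely for the example --- of counting wildcard-bearing dictionary terms into a DFM:

```python
import re

# Hypothetical dictionary terms; "*" is a wildcard, as in the real
# dictionary files under /output/.
dictionary = ["randomi*ation", "pre-analysis plan"]

def term_to_regex(term):
    """Turn a dictionary term with '*' wildcards into a compiled regex."""
    pattern = re.escape(term).replace(r"\*", r"\w*")
    return re.compile(pattern, re.IGNORECASE)

def build_dfm(papers, terms):
    """Document-feature matrix: one row per paper, one column per
    dictionary term, each cell a count of that term in the paper."""
    regexes = [term_to_regex(t) for t in terms]
    return [[len(rx.findall(text)) for rx in regexes] for text in papers]

papers = [
    "We rely on randomization inference and a pre-analysis plan.",
    "A purely theoretical contribution.",
]
dfm = build_dfm(papers, dictionary)
```

The wildcard translation is what lets a single term like `randomi*ation` count both British and American spellings.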
Code files:
- /code/R_updated/3.1_classify-fulltext-papers.R
Output files:
For the first research question --- examining the presence of replication code and data in papers involving data analysis or statistical inference --- we hand-coded a total of 1,624 papers.
Hand-coded files:
- /output/handcoded_stat_papers.csv
Note that we hand-coded the files over a number of work sessions in Google Sheets and exported the combined result as a csv file.
Game theory dictionary:
- /output/game_theory_dict.txt
For the second question --- examining what proportion of experiments were preregistered --- we hand-coded 518 papers with a single boolean category: whether the paper reported one or more original experiments. We defined this as any article containing an experiment where the researchers had control over treatment.
Hand-coded files:
- /output/experiment_handcodes_round2.csv
We then trained two machine learning models --- Support Vector Machine (SVM) and Naive Bayes (NB) binary classifiers --- to arrive at estimates for the total number of data analysis/statistical inference and experimental papers.^[As an additional robustness check for predicting open data and statistical inference papers, we estimated a series of bivariate logistic regressions using the same DFMs. The predicted probability plots can be found in the appendix. These plots give a lower estimate than the machine learning models, though they are in the same broad range.] Note that the final manuscript includes only the SVM results, and only the SVM results remain visible in the code.
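As an illustration of the classification step, here is a self-contained sketch --- in Python rather than the project's R, on toy two-feature counts --- of the Naive Bayes variant (the simpler of the two models; the SVM was fit analogously on the same DFMs):

```python
import math

def train_nb(dfm_rows, labels, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing, trained on
    dictionary-count feature vectors (one row per paper)."""
    classes = sorted(set(labels))
    n_feats = len(dfm_rows[0])
    priors, loglik = {}, {}
    for c in classes:
        rows = [x for x, y in zip(dfm_rows, labels) if y == c]
        priors[c] = math.log(len(rows) / len(dfm_rows))
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + alpha * n_feats
        loglik[c] = [math.log((t + alpha) / denom) for t in totals]
    return priors, loglik

def predict_nb(model, counts):
    """Return the class with the highest posterior log-probability."""
    priors, loglik = model
    score = {c: priors[c] + sum(n * ll for n, ll in zip(counts, loglik[c]))
             for c in priors}
    return max(score, key=score.get)

# Toy DFM: counts of two hypothetical dictionary terms per paper,
# labelled 1 (experimental) / 0 (not) as if by hand-coding.
model = train_nb([[3, 0], [2, 1], [0, 3], [1, 4]], [1, 1, 0, 0])
```

In the project itself the feature vectors are the full dictionary DFMs and the labels come from the hand-coded files above.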
Code files:
- /code/R_updated/3.1_classify-fulltext-papers.R
We attempted to identify open data resources in seven ways.
- Using the Harvard Dataverse API, we downloaded all datasets held by the journals in our corpus that maintained their own, named dataverse (n=20);
- We queried the Dataverse for the titles of each of the papers in our corpus and linked them to their most likely match with the aid of a custom fuzzy string matching algorithm. We validated these matches and manually established a string-similarity cut-off, setting aside the remainder;
- We extracted from the full text of each paper in our corpus the link to its dataset on the Dataverse (note this had significant overlap with the results of the first and second queries);
For the above, the code files:
- /code/R_updated/4.1_query-dataverse-with-titles.R
- /code/R_updated/4.2_pull-dataverse-links-from-papers.R
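The title-matching idea can be sketched roughly as follows --- in Python with difflib rather than the project's R, and with the 0.9 cutoff invented for the example (the real cutoff was established manually, as described above):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalised similarity between two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_title_match(title, dataverse_titles, cutoff=0.9):
    """Return the most similar Dataverse title, or None when even the
    best candidate falls below the cutoff (i.e., set it aside)."""
    best = max(dataverse_titles, key=lambda t: similarity(title, t))
    return best if similarity(title, best) >= cutoff else None
```

Matches returned by a function like this still need validation, since near-identical titles (e.g. a paper and its corrigendum) can clear any automatic cutoff.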
- We downloaded the metadata listing the contents of these datasets, to confirm first that they contained data at all, and second that the data did not consist of only pdf or doc files. In cases where a metadata listing was not available via the Dataverse API, we scraped the html of the dataset entry and searched for text confirming the presence of data files;
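The file-type check amounts to something like the following (a Python sketch; the extension list here is hypothetical, and the criteria actually used live in the R Dataverse scripts):

```python
from pathlib import PurePosixPath

# Extensions treated as documentation rather than data --- an
# illustrative list, not the project's actual criteria.
DOC_ONLY = {".pdf", ".doc", ".docx"}

def has_real_data(file_listing):
    """True if a dataset's file listing contains at least one file
    that is not merely a pdf/doc document."""
    return any(PurePosixPath(name).suffix.lower() not in DOC_ONLY
               for name in file_listing)
```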
- We used regular expressions to extract from the full text of the papers references to "replication data," "replication materials," "supplementary files," and similar terms, then searched the surrounding text for any corresponding URLs or mentions of author websites;
Code files:
- /code/R_updated/4.4_precarious-data.R
We termed this 'precarious data' and have reported the results in the paper.
Output files:
- /output/papers_w_precarious_data.csv
As part of the exploratory process of inspecting them, we also generated three Excel files:
- /output/replication_links_author_website.xlsx
- /output/replication_links_author_website1.xlsx
- /output/replication_links_author_website2.xlsx
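A simplified Python analogue of that regex pass looks like this (the patterns are illustrative; the versions actually used are in 4.4_precarious-data.R):

```python
import re

# Illustrative patterns only.
MATERIALS = re.compile(
    r"replication (?:data|materials)|supplementary files", re.IGNORECASE)
URL = re.compile(r"https?://\S+")

def precarious_contexts(text, window=300):
    """Return (context, urls) pairs: a window of text around each
    replication-materials mention, plus any URLs inside that window."""
    hits = []
    for m in MATERIALS.finditer(text):
        context = text[max(0, m.start() - window): m.end() + window]
        hits.append((context, URL.findall(context)))
    return hits
```

Keeping the surrounding context, not just the URL, is what allows mentions of author websites with no link at all to be caught and coded by hand.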
- We searched all of the full text papers for references to other repositories, including Figshare, Dryad, and Code Ocean, in the context of references to supplementary and replication materials. Only about a dozen papers met these criteria, however, and we did not incorporate them further into the results.
This analysis is in ./code/R_disorganised/test_precarious_data.R
- As additional validation for DA-RT signatory journals specifically, we downloaded the html file corresponding to each article and/or the html file hosting supplemental material, then extracted all code- and data-related file extensions to establish each article's open data status.
Code files:
- /code/R_updated/4.5_query-dart-journals.R
- /code/R_updated/4.3_query-jcr-jpr-data.R
We attempted to identify preregistration of experiments in the following ways:
- We used regular expressions to extract from all of the experimental papers sentences that referred to "prereg" or "pre-reg", as well as any references to commonly used preregistration registries (OSF, EGAP, and AsPredicted), and then searched for the availability of the corresponding link to validate that the preregistration had taken place. Parts of this process --- for instance, searching author names in the Evidence in Governance and Politics (EGAP) registry to look for the corresponding paper --- involved time-consuming detective work;
Code files:
- /code/R_updated/5.2_identify-prereg.R
- We downloaded all EGAP preregistration metadata in JSON format from the Open Science Framework (OSF) Registry (https://osf.io/registries/discover), extracted from this file all osf.io links and unique EGAP registry IDs, and used command line utilities to search for them through the corpus of all the papers.
Code files:
- /code/bash/rg_for_prereg_osf_egap_papers.txt
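In spirit, the extraction-and-search steps look like this (a Python stand-in for the actual ripgrep commands; no assumptions are made about the registry JSON's field layout, we simply scan the raw text):

```python
import re
from pathlib import Path

OSF_LINK = re.compile(r"https?://osf\.io/\w+")

def osf_links(registry_json_text):
    """Extract unique osf.io links from raw registry JSON text."""
    return sorted(set(OSF_LINK.findall(registry_json_text)))

def papers_mentioning(needles, corpus_dir):
    """For each needle (an osf.io link or registry ID), list the
    corpus files containing it --- a Python stand-in for the ripgrep
    commands in rg_for_prereg_osf_egap_papers.txt."""
    hits = {n: [] for n in needles}
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(errors="ignore")
        for n in needles:
            if n in text:
                hits[n].append(path.name)
    return hits
```

On a corpus this size ripgrep is far faster than a pure-Python scan, which is why the project used the command line for this step.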
We did not examine whether the published report conformed to the preregistration plan.
This codebook identifies all of the variables in the primary dataset for our paper.
doi
The digital object identifier of the paper, as identified in Crossref
od_doi
The DOI -- or URL in some instances -- of the open data (OD) for the paper
ir_bool
Boolean value indicating whether the journal is an international relations journal, as identified by Clarivate's Journal Citation Reports
ps_bool
Boolean value indicating whether the journal is a political science journal, as identified by Clarivate's Journal Citation Reports
abbrev_name
Abbreviated name of the journal
prereg_score
The preregistration status of the paper. 0 = ; 1 = ; 2 = .
exp_bool
Boolean value indicating whether the paper is identified as experimental by our Support Vector Machine classifier
data_bool
Boolean value indicating whether the paper is identified as data analysis by our Support Vector Machine classifier
stat_bool
Boolean value indicating whether the paper is identified as statistical inference by our Support Vector Machine classifier
i.od_doi
The second DOI -- or URL in some instances -- of the open data (OD) for the paper
journal_name
Name of the journal as identified in Crossref
i.abbrev_name
The second abbreviated name of the journal
created
Creation date, per Crossref field
deposited
Deposit date, per Crossref field
published_online
Date the paper was published online, per Crossref field
published_print
Date the paper was published in print, per Crossref field. Note that we primarily filtered on this field, given the lack of correspondence between published_online and the time the paper usually first appeared.
indexed
Date the paper was indexed, per Crossref field.
issn
International Standard Serial Number of the journal, typically in an untidy format
pissn
The (print) International Standard Serial Number of the journal, clean
eissn
The (electronic) International Standard Serial Number of the journal, clean
publisher
Publisher of the paper as identified by Crossref
title
Title of the paper as identified by Crossref
od_bool
Boolean value, derived from whether there is text in the od_doi or i.od_doi fields
dart_year
Year the journal signed onto the Data Access and Research Transparency (DA-RT) statement
dart_bool
Boolean value for whether the journal signed DA-RT
published_print_ym
Year and month of publication, derived from the published_print variable using basic string manipulation
published_print_ym_date_format
Year and month of publication as a POSIXct date-time object, derived from the published_print variable using the lubridate package
published_print_year
Year of publication (string), derived from the published_print variable
published_print_month
Month of publication (string), derived from the published_print variable
pd_bool
Boolean value for 'precarious data', derived from whether pd_context had text or was empty
pd_context
A string extracted from the full text of a paper discussing the location of data hosted somewhere other than the Harvard Dataverse or a journal's own, clearly advertised data repository (i.e., 'precarious')
All packages were loaded with the groundhog library, so the R code should be reproducible. However, the full analysis cannot be reproduced without the ~6 GB of text files of the papers (which are based on ~90 GB of raw pdf and html); due to copyright concerns, we're not able to share these publicly. The rest of the analysis should be entirely reproducible.