r2d_study's Introduction

repo2docker study of NIPS 2017 papers

Presented at Learning to Be Reproducible ICML 2018

Data as part of Reproducible Research Environments with repo2docker

We collect data from the NIPS 2017 schedule to demonstrate the relationship between the presence of configuration files used by repo2docker and GitHub engagement. These results are reported in Section 4 of the paper.

We use repo2docker to publish a live version of the repo on binder.

To encourage the reproducibility of this work, we are including a link to a binder version of this repo to run our analysis in the browser. Click the button below to launch a live version of the repo on binder:

Binder

Local Installation

To install with conda:

conda env create -f environment.yml

We recommend exploring the repository with JupyterLab.

source activate r2d-study
jupyter lab

If you are running a recent version of Jupyter Notebook or Binder, you can switch to JupyterLab by replacing the /tree part of your URL with /lab.
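The URL change amounts to a single path substitution. Here is a minimal sketch, assuming a hypothetical helper named `to_lab_url` and a local server address used purely for illustration:

```python
from urllib.parse import urlsplit, urlunsplit


def to_lab_url(url: str) -> str:
    """Replace the first 'tree' path segment with 'lab' (hypothetical helper)."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    if "tree" in segments:
        # Only the first occurrence is swapped; query strings are preserved.
        segments[segments.index("tree")] = "lab"
    return urlunsplit(parts._replace(path="/".join(segments)))


# Illustrative address only; your server or Binder URL will differ.
print(to_lab_url("http://localhost:8888/tree?token=abc"))
```

URLs without a /tree segment are returned unchanged.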

File Descriptions

The majority of analysis occurs in get_data.ipynb. Helper functions used in the notebook are in collect_data.py.

Note: because we use GitHub's GraphQL API to collect GitHub metadata, you need to create a personal access token and insert it in the appropriate code block to recollect all the data. For simplicity, the relevant collected files are included as CSV files, as described in the notebook, and exception handlers skip data collection when necessary.
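For orientation, an authenticated call to GitHub's GraphQL endpoint might look like the sketch below. This is not the code in collect_data.py: the function names and the queried fields are assumptions chosen for illustration, and the fields the notebook actually requests may differ.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Illustrative query only; the study's real query is in the notebook.
REPO_QUERY = """
query($owner: String!, $name: String!) {
  repository(owner: $owner, name: $name) {
    stargazers { totalCount }
    watchers { totalCount }
    forkCount
  }
}
"""


def build_request(owner: str, name: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) a GraphQL request for one repository."""
    payload = json.dumps(
        {"query": REPO_QUERY, "variables": {"owner": owner, "name": name}}
    ).encode("utf-8")
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=payload,
        headers={
            # The personal access token goes in the Authorization header.
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_repo_metadata(owner: str, name: str, token: str) -> dict:
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(owner, name, token)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes it possible to inspect the payload without a network call or a valid token.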

NIPS 2017 Datasets

Data scraped from the NIPS 2017 schedule is reformatted as a CSV file, all_papers.csv.

The dataset with all GitHub repo metadata is gh_metadata_w_labeled.csv. The _w_labeled suffix indicates that the dataset includes URLs to GitHub research repos that were found through manual inspection.

The dataset with all config file information is r2d_w_labeled.csv. Similarly, some URLs were found through manual inspection.
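To make "config file information" concrete, the sketch below checks a repository's file listing for files that repo2docker treats as configuration. The set shown is a subset of the names documented by repo2docker; which files the study actually counted, and the helper name `find_config_files`, are assumptions.

```python
from pathlib import PurePosixPath

# Subset of file names repo2docker recognizes; see its documentation for the full list.
R2D_CONFIG_FILES = {
    "environment.yml",
    "requirements.txt",
    "Dockerfile",
    "apt.txt",
    "setup.py",
    "postBuild",
    "REQUIRE",
    "install.R",
    "runtime.txt",
}


def find_config_files(paths):
    """Return sorted config file names found at the repo root or in binder/."""
    found = set()
    for p in paths:
        path = PurePosixPath(p)
        # Only the repository root and a top-level binder/ directory are checked.
        if path.name in R2D_CONFIG_FILES and str(path.parent) in (".", "binder"):
            found.add(path.name)
    return sorted(found)


# Hypothetical file listing for illustration.
listing = ["README.md", "environment.yml", "binder/postBuild", "src/setup.py"]
print(find_config_files(listing))
```

Files nested elsewhere, such as src/setup.py in the listing above, are ignored by this sketch.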

Interim datasets concatenated to make gh_metadata_w_labeled.csv are called gh_metadata.csv and gh_labeled.csv. Interim datasets concatenated to make r2d_w_labeled.csv are called gh_r2d_data.csv and r2d_labeled.csv. The _labeled suffix signifies that the URL for the repo was found through manual inspection.
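The concatenation step can be sketched with the standard library alone. The helper `concat_csv` and the column names (`url`, `stars`) are hypothetical, and the notebook may well do this differently.

```python
import csv
import io


def concat_csv(*sources):
    """Concatenate CSV streams that share one header; return (header, rows)."""
    header = None
    rows = []
    for src in sources:
        reader = csv.DictReader(src)
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("CSV headers differ; refusing to concatenate")
        rows.extend(reader)
    return header, rows


# In-memory stand-ins for an automatically collected file and a manually labeled one.
collected = io.StringIO("url,stars\nhttps://github.com/a/b,10\n")
labeled = io.StringIO("url,stars\nhttps://github.com/c/d,3\nhttps://github.com/e/f,7\n")

header, rows = concat_csv(collected, labeled)
print(header, len(rows))
```

Checking that the headers match before appending guards against silently misaligned columns. With pandas installed, `pandas.concat` on the two data frames would achieve the same result.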

Two types of papers required manual inspection: papers whose GitHub repo was renamed, and papers with errors in their URL. Papers with URL errors are listed with their labeled URLs in validate_url_w_labels.csv. Papers whose repo was renamed are listed with their new repo in change_reponame_labeled.csv.
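One way to flag URLs that need manual inspection is to check whether an owner and repo name can be extracted at all. The sketch below is an assumption about how such a check could look, not the validation used in the study; `parse_github_repo` is a hypothetical name.

```python
from urllib.parse import urlsplit


def parse_github_repo(url):
    """Return (owner, repo) for a github.com repository URL, else None."""
    parts = urlsplit(url.strip())
    if parts.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) < 2:
        # A bare user or organization page is not a repository URL.
        return None
    owner, repo = segments[0], segments[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo


# Hypothetical inputs for illustration.
for candidate in ["https://github.com/owner/repo", "https://github.com/owner"]:
    print(candidate, parse_github_repo(candidate))
```

A syntactic check like this cannot detect renamed repos; those still require a request to GitHub or manual inspection.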

Libraries that were part of larger repositories were excluded from our analysis and are listed in larger_libraries.csv. Similarly, repositories that did not include any lines of programming code were excluded and are listed in no_code.csv.

r2d_study's People

Contributors

jzf2101

r2d_study's Issues

Label mismatch

Interesting statistics!

I noticed that the highlighted labels should swap places here.
image

SysML Paper

Hi Everyone,

I've submitted a paper to SysML. The rebuttal period is November 14-21. I'm trying to plan some revisions in anticipation of the rebuttal period. Also, I'd like to figure out how to get you set up on Microsoft CMT so we have you in the system as authors.

http://cmt3.research.microsoft.com

Let's discuss this here.

One thing I'd like to help figure out is how to spoof ourselves so we can put all our work online anonymously.

I'm out of town Nov 2 to 11 with limited internet so I think we should talk about the draft and what we want to do before then.

@ellisonbg @yuvipanda @minrk @betatim @consideRatio @willingc
