everpub / openscienceprize Goto Github PK

View Code? Open in Web Editor NEW

69.0 69.0 20.0 752 KB

:telescope: Everpub - Making reusability a first class citizen in the scientific workflow.

License: Other

Shell 51.27% Makefile 48.73%

openscienceprize's People

Contributors

Stargazers

Watchers

Forkers

ctb jackdapid cranmer rougier khinsen anaderi lukasheinrich minrk odewahn raoofphysics eamonnmag betatim daniel-mietchen researchastute jhrudey zenny

openscienceprize's Issues

What about adding also conda environments to the mix

Docker containers are great, but oftentimes the software is simple enough that the environment can be reproduced with conda. This approach has also the benefit to be multi-platform.

It is a slightly more fragile approach but I think is good enough in many cases and a big improvement from manual installation.

In the spirit of "not changing you tools" it makes sense to add conda to the mix.

Proponents

I have to submit information about each team member. Please make a PR adding the below template (with your details) to team.md to get yourself added.

As this will be run as an open-source project you can participate independently of being a proponent.

Who can be a proponent? Anyone who thinks they contributed significantly to the proposal (commenting, editing, writing, connecting, ...) and who can make a commitment to continue working on this (aka has free brain cycles and time). Some day, and that day will come, we will call upon you to do a service for the project. With open-source projects all the attractive topics get done quickly, the tedious and admin work gets done much more slowly. If you'd rather not have that kind of responsibility or have enough things cooking already, that is fine. The analogy for me is that a lot of people want to be an Ironman, but not many will do what it takes to be one.

I think we should find a way to give credit to all those that contributed (PRs, issues, and hypothes.is), I might start a contributors.md and list the names somewhere in proposal.md. Thoughts?

As always, if you disagree with this or have comments: let's discuss this.

First Name:
Last Name:
Email Address:
Country of permanent address:
The Area of Expertise Contributing to Your Project:
Age: (you can prefer not to say)
Sex: (you can prefer not to say)

Archiving the research object

This discussion is around principles and best practices around archiving of compound research objects, such as the model proposed in everpub

Renaming

@ctb suggested "everpub" as a name. Thoughts?

I like it, not only because it builds on the theme of everware.

Add conversion to PDF for submission

re #46, I'd like to ask @betatim for a makefile or something similar to do the final PDF conversion - I don't really know how to do it from md and am curious :)

lobbying for official / supported docker images from scientific software projects

I talked about this a bit with @cranmer et al and maybe this is a good forum, also this is related to #51 .

A lot of software products are already a good fit for the Docker paradigm of wrapping a single entry-point / program / command line tool, with all their dependencies. I think it lobbying for large, widely used software products (as opposed to e.g. library provides that are meant primarily for re-mixing) to build official docker images can help both by 1) give at the very least a reference dockerfile on how the authors of that project would install their own software and 2) give already a useful

A perfect example in HEP would be ROOT. I think a ROOT docker base image would already go a long way for a lot of the scientific code that exclusively lives in the ROOT ecosystem.

Other HEP examples are Monte Carlo generators. These are also almost exclusively (at least by experimenters) used as black boxes that eat a couple of configuration files and spit out events in some format. Maybe another example could be GEANT? Maybe there are similar examples in biomed fields?

should we approach such projects and try to get them to have official docker images?

Everpub for clinical trial pubs?

First off great project idea, thanks to all involved.

I was thinking if Everpub could be a nice option for publications of clinical trials. Sadly, I'm not an expert myself, but there's a fair amount of statistical analyses in these publications. Also, the "significance" of these studies can be quite close to the "thresholds".

For an example of a typical publication, see here: https://www.ncbi.nlm.nih.gov/pubmed/26406150

AFAIK, clinical trials nowadays have to be registered beforehand (https://clinicaltrials.gov/), this includes predetermining the outcome to inhibit statistical mischief. There are initiatives, e.g. AllTrails (http://www.alltrials.net/) from Ben Goldacre, to report the results of clinical trials openly.

So, firstly Everpub would be great to play around with the results of a Trial pub, e.g. trying different statistical tests/visualisations on the data etc.. And secondly it might be a tool of post-publication analysis (something like PPPR) if clinical trial data with the associated metadata can be pulled automatically.

Meta-issue re composability

I noticed that #18 veered into some really great discussions of composability and I want to close that issue (because most of it has been dealt with by #41) but retain a link to composability.

So, put links to good comments about composability in this issue and we'll revisit if/when people want to talk about it more :).

Collection of existing training initiatives

This proposal should generate a lot of training material and advice. That material should be contributed to an existing education effort. This issue collects those:

RIO grant proposal publication

Should we got for it ? (see https://twitter.com/RIOJournal/status/703141211612155904)
I can try to write it based on the proposal (but I'm not familiar with the RIO journal).

What are we actually proposing?

This thing is due in three days, folks :). I can do some of the writing on Saturday while traveling but we need to nail down what, exactly, we are proposing.

Based on our pitch,

I would argue for proposing the following deliverables:

a prototoype that demonstrates a vertical spike through some good practice in this area.
a detailed discussion & set of links around each feature of the prototype, explaining what we and others have done, an opinionated perspective on what approaches could be used to address each problem, and a brief on why we chose the approach we used in the prototype.
an exploration of what "big features" are missing from the prototoype

For the prototype, I think we've converged on establishment of initial directory structure; a declarative specification of dependencies, execution framework, and inputs/outputs for building a paper; support for CI CI; and integration with Zenodo for minting DOIs.

I'd strongly push for supporting the R ecosystem, since many biologists use R and this is a biology prize :). Between R and Python I think we get most modern biomedical scientists.

I think we need a brief discussion of the goal of enabling composition, without focusing on how -- I haven't seen convergence in that discussion yet. Please correct me if I'm wrong!

For the discussion, we just need to make sure that we discuss and document our decisions and link in projects and demos. But I don't think we need to do much about this for this round of the proposal, just point out that there are a lot of people who have done things related to our project and that we will engage with their ideas and demos and connect them into our project. This could even make a nice publication... ;)

For the third part on missing features, we should pay attention to topics like editing, diffing, and merging that are important for specific ecosystem members. The point to make in the proposal here is that we will inevitably run across great ideas that will need substantial work, so while we may not integrate them into our demo, we will record them and brainstorm about them.

The last question is how we propose to do this, or, basically, what we'll do with the money. I don't think we need to do more than sketch this out, but at least from my perspective all I'd want to do is run hackathons and support travel.

Name - this thing needs a name

What to call this project?

Links and existing "notebooks for journals"

Nature announcement and demo

Jupyter gallery of reproducible papers

Gigascience uses Galaxy, which is a nice example. Ties researchers to galaxy though (how much is it used in biomed?) and hard to adopt for outsiders (like particle phys). Clicking a random article I could find the supporting data, but no "run it right here, right now"

Check list (kind of)

To be done:

Abstract to be written (= executive summary for submission, 300 words max)
team.md to be completed (see #47)
Build the final PDF and review errors / glitches

There are trailing comments in the proposal, I've listed them here just to be sure to not forget to process them before actual submission:

"many others (XXX more here)" to be completed or removed
"remote containers; (XXX more here)." to be completed or removed
"be addressed. (refs)" to be completed or removed
"this proposal (XXX)" to be removed ?
"and N people" N = 5 ?
"place holder XXX" maybe it's time to remove it ?

Add a biomedical connection

We need a good story/link between our proposal and biomed and related fields that NIH and WT mainly operate in.

Licensing

This repo's README currently states the license as CC BY-SA 4.0.

The regulations for the prize state

Executive summaries for all applications will be shared via the prize Web site without exception and licensed under the Creative Commons Attribution 4.0 License (CC BY 4.0)

Of course, you could have the abstract under CC BY and the rest under CC BY-SA, but what about switching to CC BY for the entire application? Besides, the software is likely (to be) licensed differently, so I suggest to clarify which licensing applies to what, perhaps in a separate LICENSE file.

Contribute to and promote existing projects

One big problem is fragmentation, everyone invents their own thing. This project should try and work against that trend by contributing to existing solutions/open source projects.

One reason for fragmentation could be that people simply do not know about existing tools. This project should create some best practices/documentation/advice. This would spread the word about projects. More importantly lots of people using a 80% perfect tool will give that project momentum and let it get closer to 100% perfection. As opposed to people going away and building what they are missing somewhere else.

ROADMAP

This isn't a roadmap for writing the proposal but for what to do when the project starts.

Short term 🎉

Build a prototype based on thebe, theoj
Create an example analysis

Medium term ⏩

Show the MVP to publishers to find a partner to add this to a existing journal

Long term 🔮

Budget

It wasn't clear from my brief look at https://www.openscienceprize.org/res/p/FAQ/ what is required in terms of budgeting.

Workflow - Scientist

Outline the envisioned workflow for a scientist. With this we can build a better idea of what needs teaching, blue-printing, etc

First suggestion for a workflow:

start new data analysis by creating a empty directory
type openscience init to create a skeleton
- runs git init, creates a "sensible" Dockerfile
- setups up aliases for running things in the docker container?
create code, run it with openscience run <cmd> which executes it inside the docker container
create a notebook or .md with code blocks that has narrative mixed with steps for reproducing parts of the analysis
git commit all along
push repo to git hub at some point(?)
as analysis comes to an end create a new ipynb/md that is the paper, preview it with openscience paper(?)

(I will edit this entry as we iterate)

Reviewing published work

The toolbox should contain something to help with post publication review. Something like the arxiv to host the interactive papers and facilitate post pub review?

How to stay informed

Add a short section to the README.md to tell people how they can stay informed about what is happening here by "watching" the repo.

Click "notifications" button top right
select "watching"

Maybe add a screenshot of the button

Second round of thorough comments

Hello,

this is a proposal for http://openscienceprize.org. We propose to build tools to make reusability and composition a first-class citizen in computer aided research. Enabling the publication of dynamic and interactive scientific narratives that can be verified, altered, reused, and cited.

Comments and feedback welcome! Right now the focus should be on the structure of the proposal. What is unclear, missing, or superfluous. Language and spelling second.

To gather feedback directly on the document it would be ideal if you could use hypothes.is via this link: https://via.hypothes.is/https://github.com/betatim/openscienceprize/blob/7cd9fd5615b44daf9e720cdc486a4f9ec8054979/proposal.md. (simply highlight the text you want to comment on, there should be a toolbar on the right edge of your browser window, if too much hassle see next point) Next best is to leave a comment on this issue or email Tim [email protected].

Thanks, I owe all of you a drink of your choice,
T

GitHub and GitLab

We are all big fans of GitHub, Travis, etc., but GitLab also provides similar services.
Should we specifically focus on GitHub in the document, or include GitLab, bitbucket, in some way

Web rendering of papers

We need to add a section/bullet point back in to the proposal that mentions that one end of the vertical spike is a web app that renders these executable papers. It shows you a collection of existing papers, and let's you interact with individual ones. For me it is the "interface" you'd install if you wanted to run a journal/the arxiv using this. (via #41)

Make GitHub organization

Probably this is for later since time is tight, but perhaps something to clean up before submission....

Do we make an everpub organization now, move this repo there?
To avoid broken links @betatim could fork from organization.

mention http://www.researchobject.org

@eamonnmag mentioned
http://www.researchobject.org

It is very relevant and probably we should mention it like we do some of the other tools.
More importantly, I'm curious how what most clearly differentiates what we are proposing.

Your submission has been submitted.

Thank you everyone for your never ending enthusiasm! This was a real team effort. I can't quite believe that ten days ago this ideas was on hold and now it is a real proposal!

🏁 🎊 🎉 🎊 🚀 🎢

To all the commentators, writers, spellers, mergers, issue creators, issue resolvers, idea'ators: thank you. But keep coming back because this is just the start!

At the same time I'd like to point out our new home: https://github.com/everpub

ps. I think the amount of 📧 you will receive from this repo will reduce to a more sustainable level now 😀
pps. openscienceprize will announce their decision at the end of April

Everpub Trademark

So in googling around for this openscienceprize everpub site, it became clear that there is another 'EverPub' in the digital publishing space (see search result screen cap below). This raises a pesky question of trademark / service marks. Realizing that legalities would be of little concern in this amazingly productive, creative visioning process unfolding, just throwing it out there for consideration in the future. Trademarks are about avoiding brand confusion, so could the multiple everpubs - if unleashed fall into that trap?

P.S. Librarians are not necessarily legal Debbie Downers, but IP stuff really does come into play a lot in our world, as long-term stewards of the scholarly record!

Firming up the idea

Build the infrastructure required to create and publish scientific output that is more than a simple, static document. To make this a success two things are needed:

This openscienceprize project will create a web app that allows you to display a single, interactive notebook for your publication. (Build on top of experience from thebe). Use ORCID ID as auth. Build in the components of the "social web". Meaning ability to ⭐ and fork publications. (or alternative ways to have "the crowd" assign credit/fame/prestige)

In order for people to be able to publish in such a place they need to know how to create a publication that is reusable. The technical components required to build a data analysis that is geared towards reusability and richer publication format already exist (github + docker + jupyter + snakemake). To use these tools right now, you have to be more of a geek than your average researcher is. What is missing is the social component. This openscienceprize project will create (and contribute to existing training initiatives #6) a blue print for setting up such a environment.

The structure of a digital research object

A discussion started on another issue by @Repositorian, which deserves its own issue.

Thoughts and questions on a first thorough review

For continuous integration, we need some indication of what success is to be built in. Is that "zero exit code" or can we put in assertions of some sort?

Konrad Hinsen clearly has some thoughts on composability

We shouldn't tie things to mounting local directories because they don't work with most docker-machine types (see my approach with data volumes]. For a demo or prototype, of course it's ok :)

I really like this concept for some reason: "web based way to create an environment, try it and then download it".

Main reaction: we need to narrow down to some sort of hard focus for the OSP application, around which we build a fairy castle of air that spells out all the awesome things that could be done.

demo: notebook with attached cluster + declarative workflows

Hi all,

this touches on #16 #51 and other issues

as promised a small demo on how one could use both declaratively defined workflows + a docker swarm cluster to run workflows whose steps are each captured in different docker containers. This

https://github.com/lukasheinrich/yadage-binder/blob/master/example_three.ipynb

in the GIF, each of the yellow bubbles executes in its own container, in parallel if possible. All these containers and the container that the notebook runs in share a docker volume mounted at /workdir so that they can share at least a filesystem state. This keeps the execution itself isolated but allows steps to read the outputs of previous steps and take them as inputs.

let me explain the different parts:

adage:
https://github.com/lukasheinrich/adage

this is a small workflow tool I wrote in order to be able to execute arbitrary DAGs of python callables in cases where the full DAG is not known upfront, but only develops with time. I keeps track of a graph and has a set of rules of when and how to extend the graph

yadage
https://github.com/lukasheinrich/yadage

this is the same concept but adds a declarative layer. In effect it defines a callable based on a JSON file like this one

https://github.com/lukasheinrich/yadage-workflows/blob/master/lhcb_talk/dataacquisition.yml

that defines a process with a couple of parameters complete with its environment and a procedure how to determine the result.

this is already helpful to use docker container's basically as black-box python callables like here:

https://github.com/lukasheinrich/yadage-binder/blob/master/example_two.ipynb

On top of these callables, there is also a way to define complete workflows in a declarative manner like here:

https://github.com/lukasheinrich/yadage-workflows/blob/master/lhcb_talk/simple_mapreduce.yml

https://github.com/lukasheinrich/yadage-binder/blob/master/example_four.ipynb (try changing the number of input datasets, but don't forget to clean up the workdir using the cell above)

which then can be executed by the notebook. As a result we get the full resulting DAG (complete with execution times) as well as PROV-like graph of "entities" and "activities".

yadage-binder

just a small wrapper on top of yadage that install the ipython notebook.. it doesn't really work in binder as originally intended since I can't get binder to have writable VOLUMES. So currently you have to start it on carina instead like so

docker run -v /workdir -p 80:8888 -e YADAGE_WITHIN_DOCKER=true -e CARINA_USERNAME=$CARINA_USERNAME -e CARINA_APIKEY=$CARINA_APIKEY -e YADAGE_CLUSTER=yadage lukasheinrich/yadage-binder

where you pass your carina creadentials and the cluster name

Add @ctb as a collaborator

I'm on board if you don't mind this competing with my other application (which is quite different).

Notebooks as a research tool

This is a follow-up to a Twitter conversation which I think is relevant for everpub.

Logo

We need a small logo/sketch for the submission of the proposal.

@JackDapid how are your ✏️ skills?

Ideas:

sheet of paper with a computer on it
sheet of paper with two gears
sheet of paper with a factory drawn on it, data and code at the bottom being pumped through the factory, chart comes out the top

stronger explicit coupling of code and data

Nice proposal! Many things in the pitch are exactly what we try to achieve within the context of the CERN Open Data service and the CERN Analysis Preservation pilot.

One suggestion: the proposal seems to address running code more in length than it addresses its relation to data. It may be useful to promote the idea of coupling of code and data more closely, e.g. via git-annex or git-lfs tools, that permit researchers to maintain versioning of both software and data in the same place, even though the data is located on some remote storage service due to its size.

For services like Zenodo, this would open an easy possibility to archive not only software, but also (reasonably sized) datasets at the time of the release, for example.

Introduce yourself: Welcome! Willkommen! Bienvenue!

This isn't a one person project, probably not even a five person project. We need help!

If you want to become a co-proponent for submitting this to the openscienceprize make yourself known here. In particular I'd like to hear from you if you are an "individual or group based in the United States".

Independent of the openscienceprize, make yourself known in this issue if you want to collaborate on making this happen.

Scope and deliverables

In order to approach wider groups, I think we need a statement of scope & a rough idea of deliverables. If I were to one together it would say:

develop proposed simple/common denominator standard around specification of dependencies, execution of code, inputs and outputs, etc. so that services like mybinder/everware/thebe/etc know how to run papers.
implement alpha-level demo/vertical spike through idea, implement some use cases
expand mybinder/everware to deal with R stuff
provide & implement integration points with travisCI/github/pull requests
integrate with zenodo to provide DOIs
convene discussions with publishers and other potential stakeholders
think hard and prototype ideas around composition of workflows

(& I think this is all do-able within the size of the prize.)

thoughts?

Submission

I will submit the proposal at 10pm GMT+1 (Paris, Geneva) time. (The deadline for teams to submit entries for the Phase I prize is 11.59pm GMT on 29 February 2016)

Deadline for changes 8pm GMT+1 (Paris, Geneva) time
Anything not done by then will have to wait.

Bundle:

proposal.md converted to PDF
UI mockup
abstract.md as plain text.

To be done:

Abstract to be written (= executive summary for submission, 300 words max)
team.md to be completed (see #47)
Build the final PDF and review errors / glitches
- check #89
Check length (current is ~16000 characters)

There are trailing comments in the proposal, I've listed them here just to be sure to not forget to process them before actual submission:

"many others (XXX more here)" to be completed or removed
"remote containers; (XXX more here)." to be completed or removed
"be addressed. (refs)" to be completed or removed
"this proposal (XXX)" to be removed ?
"and N people" N = 7 ?
"place holder XXX" maybe it's time to remove it ?

reproducible research guidelines

there was a question in other issue (#36) about reproducible research guidelines/best practices.
If such thing exists we could mention it in our proposal so our proposal would not look very disconnected from reality, otherwise we could mention that we are going to contribute to development of such ourselves.

in v1 of proposal there was a deliverable

blue prints, tools and best practices guides for creating such a
publication; and

in v2 it is gone. But I guess it is an important part of making reproducible research go beyond 'nucleus of this project'.

anyway this issue might serve as a collection of existing reproducibility guidelines.

Track changes (if paper is the focus) or consider alternate project titles?

I'm responding to the request for comments by @ctb on Twitter. This project is very exciting and I look forward to following it and perhaps contributing. My main comment from reviewing the proposal and issues is, is the paper the focus or not?

The current name of the project (everpub) suggests that it is the focus. If so then more attention could be paid to the publication part. Right now the publication stack is only 1 of 8 focus points. As a scientist who is striving (desperately!) for a more reproducible workflow, a major stumbling block is the inability of git/Github to enable Track Changes-like functionality where individual comments within a PR can be accepted/rejected. I saw the latex package and perhaps something like this would be valuable to work toward. Could have wide-reaching value for git beyond publication as well.

On the other hand, if the paper is not the focus (e.g., https://github.com/betatim/openscienceprize/issues/50) then perhaps the everpub name is misleading? Or is that the point, that the publication and dissemination of results is not connected to the actual "paper"? Anyway, I was thrown here.

Maybe paper isn't the endpoint

Hi,

Just read the new draft proposal. Looks good. I like the beginning for sure.

One thing that strikes me is that I think it is a trap of the current scientific model to think of papers as the endpoint. That's part of the mind-set that lead to our reproducibility and openness problem in the first place. And it also makes composition harder.

I'd suggest that we think of the result of one of these reusable research products as workflow that can be extended or embedded into a larger workflow. The work that Lukas and I have been doing is based on the idea of a "workflow template" or "parametrized workflow", which has been quite a flexible model. As such, the paper might just be one of the outputs products ("entities" in the PROV language), but along with that would be other products. There could be multiple papers, multiple data products, etc.

I think this is all in agreement with the what others are thinking, but proposal has a lot of focus on papers:
" with tools to go from an empty directory to a fully rendered paper "
"Despite these advances, there are still many missing components of a system for executable papers."
"paper repository"

Nicer PDF

Too many PRs on the queue and I've to go:

For a nicer PDF output (for title and authors):

Makefile:

FORMAT   = markdown+mmd_title_block
pdf: proposal.md
        pandoc --standalone --from $(FORMAT) -V colorlinks proposal.md -o proposal.pdf

Proposal.md:

Title:  Everpub: reusable research, 21st century style
Author: Tim Head (Europe lead), Titus Brown (US lead)

## Introduction
...

Journal contacts

What should we ask from a journal if we make contact with them?

Move this thread of discussion here from #18

Screencast

We can submit up to three additional files. One of them should be a screencast of a researcher in the future browsing executable papers, highlighting the link to the code, showing the ⭐ing, then re-running a small part of the paper, forking it, modifying it, re-running, finally clicking "download to my computer"

Need some HTML skill to make mockups that work well enough for a screencast.