
5pillars's Introduction

Hi there 👋

I am a Lecturer and researcher in computational biology at Burnet Institute, Australia. Our group is focused on building data resources and software tools to accelerate biomedical discovery. We collaborate closely with clinicians and biologists to get the most out of their 'omics experiments. Our lab is committed to reproducibility, open science, and diversity.

I code mostly in R and bash. I'm looking to learn more about machine learning, web design and other languages used in bioinformatics, including Python.

Topic areas of interest:

  • Transcriptome analysis

  • Multi-omics/epigenomics

  • Enrichment analysis

  • Scientific rigour

List of scientific publications: Google Scholar and ORCID

Contact me:

  • twitter: @mdziemann

  • email: mark.ziemann αt gmail.com

Pronouns: he/him

5pillars's People

Contributors

anusuiyaxbora, markziemann, pierrepo


Forkers

drvenki

5pillars's Issues

Reviewer 2 points

I am currently working on these points:

  1. Inclusion of Case Studies or Examples: To strengthen your argument, consider incorporating case studies or practical examples that demonstrate the effectiveness of your recommendations. Highlighting specific instances where these best practices have enhanced reproducibility or solved particular issues could substantiate your points more effectively.

  2. Expanding on Current Shortcomings: The article mentions that current practices are not meeting the goal of reproducibility. An expanded discussion on the ramifications of these shortcomings on scientific progress could accentuate the urgency and significance of addressing this issue.

  3. Focus on Implementation Challenges: While the article acknowledges the importance of certain practices, an increased focus on potential obstacles researchers may face when trying to implement these suggestions, as well as strategies to overcome these challenges, would be insightful.

  4. Clear Structure: Consider organizing the article into clearly defined sections such as 'Introduction', 'Background', 'Recommendations', 'Challenges', and 'Conclusion'. This could significantly enhance readability and comprehensibility.

  5. Engaging Conclusion: Reworking the conclusion to be more impactful could be beneficial. Instead of merely summarizing the recommendations, it would be useful to indicate the potential trajectory of computational research if these recommendations are adopted, as well as highlighting the potential risks if they aren't.

Pyodide/Observable notebooks any good?

The ability to inspect the objects with an interactive code window is super important, which is why I love the approach here [https://observablehq.com/@gnestor/pyodide-demo]. I think this is worth mentioning in the 5 pillars paper.

Pre-flight check

Hi @pierrepo @sminot @anusuiyaxbora ,
The article is nearing completion so we need to go through some pre-flight checks. Here is the live version: https://ziemann-lab.net/public/5pillars/5pillars.html

My plan is to submit to Briefings in Bioinformatics, which hasn't published a best practices article like this for many years. I'll also be depositing the preprint at the same time - probably to https://osf.io/

Before I do that, there are some recent changes that I need you to check:

  • Your names, affiliations and ORCID information
  • The overall content
  • Conclusion
  • Key points (this is a requirement for BiB)
  • Author biographies (this is a requirement for BiB)
  • Funding: if you would like to acknowledge specific funding sources. Otherwise I will write "The authors received no specific funding for this work."

If you have noticed something that needs addressing, please raise a new issue and I will manage the changes.

We will provide a supplementary file listing some recommended learning resources. Here is the link (https://ziemann-lab.net/public/5pillars/practical_guides.html). If you have any suggestions, please add them to issue #15

I'm hoping to have this round of changes completed by around the 16th of June. If you approve of the manuscript, please reply "approve" below. After this, I will send out the PDF version for proofreading and approval.

Thanks for your contribution!

Software archiving

Regarding Software archiving:

Zenodo and Software Heritage repositories are both good options for long-term archiving of software, and links to these should be provided in the respective journal article/pre-print.

I would tend to push toward Software Heritage only -- or at least make a clear distinction between the two solutions -- since neither Zenodo nor Figshare was designed for software archiving.

Software Heritage's mission is to collect and archive source code (and source code only). It retains the full history of the project and eases citation of the archived software by providing an intrinsic identifier called the SWHID.

I can make a PR for this if you agree.

Anusuiya comments via email

  1. I am not sure if this is relevant, but recently, in one of our modules, we were asked to analyse metagenomic papers. We were asked to use bioinformatics tools on the Galaxy server similar to the stand-alone tools the researchers had originally used. I used the tools that were present in Galaxy (for example, Kraken2 for taxonomic classification, an updated version of Kraken1). The results were not very similar, even though taxonomic classification was done with Kraken2 in Galaxy and with the stand-alone tool (Kraken1) in the original study. Could this also mean that the Kraken1 results reported earlier by the researchers are problematic or may no longer be valid?

  2. Also, the authors did not mention some statistical criteria or the base database against which the datasets were compared. This was important information for analysing the datasets, and hence we went with the default settings. Even though stand-alone tools were used by the authors, I was able to perform similar processes using different tools in Galaxy; still, the results were very different. Again, this raises the importance of proper documentation.

  3. I understand that different tools cannot give exactly the same results, but they should be able to give a similar idea/picture of the results (as taxonomic classification was the ultimate goal), especially since Kraken2 is just an updated version of Kraken1. I was just wondering whether it would be too much to expect similar bioinformatics tools to produce similar, if not identical, results.

  4. Another point on documentation or "FAIR" is that most papers do not label their supplementary files properly, and sometimes even within the CSV file contents there is no mention of what the results are about. There was also a metagenomic study with 40+ datasets in which none of the datasets were labelled with an identifier that could be used to match them to the manuscript. Anonymity can still be maintained with patient samples; however, poor labelling sometimes makes it difficult to reproduce the results. I had to discard that paper altogether as reproducing it was not at all possible. I had similar issues with supplementary files and finding relevant results during my bachelor's project as well...

Practical guides

https://github.com/markziemann/5pillars/blob/41bb93e261a92a6c518cfdd82f3e2246cac1669b/guides/practical_guides.Rmd#LL58C1-L62C18

Hi @anusuiyaxbora I'm putting together a document containing some helpful links for folks wanting to implement the 5 pillars approach, so we can point them to these resources to get started. I'm looking for tutorials, methods, videos and articles for the following:

  • Intro to R and Rstudio (2-3 items)
  • Intro to Python and iPython IDE (2-3 items)
  • Intro to vscode (2-3 items)

On "End-to-end code coverage"

Dear @markziemann

I'm a bit confused about the expression 'End-to-end code coverage' (maybe influenced by the other expression 'test coverage'). If the idea is to fully automate an analysis procedure, could 'fully automated procedure' or 'end-to-end automated procedure' better represent this view?

Also, this idea of automation fits perfectly with either a simple 'run all' script that runs the entire analysis procedure, or a workflow file that is smarter in terms of task management.
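As a minimal sketch (not from the manuscript), a 'run all' script in R could look like the following; the file names are hypothetical and depend on how the project is organised:

```r
#!/usr/bin/env Rscript
# run_all.R -- hypothetical "run all" entry point for an analysis project.
# Executing this one script reproduces the whole analysis end to end.

# Step scripts (example names only), executed in order.
source("scripts/01_download_data.R")
source("scripts/02_process_counts.R")
source("scripts/03_differential_expression.R")

# Finally, render the literate-programming report, which captures figures,
# tables and sessionInfo() alongside the narrative.
rmarkdown::render("report.Rmd", output_dir = "results")
```

A workflow manager such as Snakemake or Nextflow would replace the linear source() calls with declared tasks and dependencies, which is the smarter task management mentioned above.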

Practical guides2

Mark to provide some tutorials, guides, instructional videos, articles and methods on the following:

  • Practical guides to data sharing: Learn about the FAIR principles and make the research data available in a public repository.
  • Extend the scripts to make them end-to-end processes
  • Practical guides for Conda, Guix and Docker
  • Practical guides for documenting computational research
  • Moving from β€œdev” to β€œprod”: testing and continuous analysis

Reviewer 3 comments

I will start addressing these comments.

Comments to the Author
The paper is mostly well written and addresses an important topic: computational reproducibility and tools to achieve it. The authors describe five "pillars" under which various tools and techniques can be categorized. The paper is pretty comprehensive in what it covers and has useful supplementary material. My main issue with the paper is that it reiterates many things that have been covered in other papers. The authors cite 11 such papers. Many of the topics covered in the submitted paper have already been covered well in other papers. I am most familiar with reference 13, which covers many of the same topics, although this paper goes into more detail on many issues. The authors could do more to differentiate their paper from previous ones and perhaps remove some topics that are already covered well elsewhere.

Aside from that, I have listed below some relatively minor issues that, if addressed, would improve the paper. When I list page numbers, I am using the PDF page numbers rather than the numbers shown in the top-left corner of the manuscript.

  • Page 3, Line 36: "bioinformatics data analysts (not tool developers)". The authors state these individuals as the primary audience, but some parts of the paper seem to be targeted at a more technical audience. Or maybe I am misunderstanding the intent. If the audience is "bioinformatics data analysts," that would imply people who are bioinformaticians but are analyzing data rather than creating tools. However, a much bigger (and perhaps more important) audience is non-bioinformaticians who analyze data.
  • Page 3, Line 43: "enshrined by code" (this language is awkward)
  • Page 3, Lines 45-47: What about tasks that cannot be automated? I see that this topic is addressed later. But this part implies that everything can be automated.
  • Page 3, Line 52: It says that spreadsheets are "overused and misused." This is subjective and not backed by evidence, other than the well-known examples of gene symbols being formatted as dates.
  • Page 3, Line 56: It is not necessarily true that analyses performed using web tools are not reproducible. Although rare, some web tools facilitate reproducibility by providing code or configuration files and/or allowing the apps to be executed locally.
  • Page 5, Line 15, typo: "authors provided along the"
  • Page 6, Line 21: "quantum leap" (this term is overly optimistic in this context)
  • How do you get from a notebook to an actual paper submission if you have to do custom formatting of the document, including references? My understanding is that this is still not possible, but please correct me if I'm wrong.
  • What about when your data files are too large to fit on a personal computer?
  • What about computationally intensive tasks that must be performed using specialized computing environments like the cloud or clusters?
  • Page 6, line 32: The master script idea was already mentioned earlier.
  • Section on version control: This section focuses mostly on using VC for software development (that is my interpretation). To be consistent with the introduction, it should focus more on data analyses. Although I use VC for analyses, I feel that simpler approaches are better in many cases. For example, Dropbox and Google Drive provide some version-control and backup functionality and do not require the same level of knowledge as git.
  • Page 7: Many data analysts will not know what JupyterLab or VS Code are. References are also needed.
  • Page 9, line 7: References are missing for these other tools.
  • Page 9, lines 7-8: I believe you, but I am not aware of evidence that supports this claim.
  • The Biocontainers project should be mentioned.
  • Page 10, lines 51-52: I disagree that the risk is small. There are many instances of using genomic data to identify individuals who have committed crimes.
  • Figures 2 and 4 are very similar to figures used in reference 13.
  • Figure 5: I don't think it's really necessary to resummarize the FAIR principles. People can go to the source article for that.
  • Page 11, line 13: I disagree that repositories like GEO and SRA are FAIR. There are lots of problems with FAIRness in these repositories.
  • Page 11, lines 13-15: That's not true for some of these disciplines. Ecology has NEON, evo bio has NCBI.
  • Page 11, line 34: Need evidence to back this up.
  • Page 11, line 39: It is not necessarily true that CSV files are better than Excel. Excel can retain information about data types, for example, whereas CSVs do not. It depends on what you are trying to accomplish.
  • One thing that could be added is something about Common Workflow Language. It's a community-supported specification for accomplishing many of the objectives described here. There are some recent papers about this.
  • Page 14, lines 9-10: The big question is how to make more progress. We have the tools to achieve reproducibility, but why are we rarely achieving it? The paper mentions incentives and lack of training, which are true. You might consider elaborating a bit. My cynical view is that writing more papers and tutorials will do little without strong incentives and more automation.

Anusuiya comments

I have gone through the manuscript and the rebuttal letter as well... it all seems great to me!
Just some points, not sure if they are of importance but just wanted to mention:

  1. In the ramification paragraph:

"The ramifications of irreproducible and unreliable research include misleading the community, wasting research funds, slowing scientific progress, ethical issues while handling clinical data, eroding public confidence in science and tarnishing the reputation of associated institutions and colleagues. In certain clinical research cases, irreproducible bioinformatics has the potential to place patient safety at risk." - maybe something along the words that I've mentioned in green can be added? - as it is a risky issue when it really comes to clinical research

  2. For R2 P3: Focus on Implementation Challenges: While the article acknowledges the importance of certain practices, an increased focus on potential obstacles researchers may face when trying to implement these suggestions, as well as strategies to overcome these challenges, would be insightful.

Obstacles while implementing:

  • Additional time required to learn a method that will be reproducible in the future - this can be difficult for novice researchers, and time is needed well in advance to get trained
  • Lack of guides that directly show how to perform reproducible analysis
  • Not having the information needed to replicate the original analysis - for example, version numbers, the options chosen to produce the results, or the original datasets, i.e., unavailability of publicly accessible datasets (I think this point is already mentioned)
  • Ultimately, creating a tool or code that is flexible, adaptable and reproducible for a variety of analysis types.
  3. For R2 P3: Adding more to the conclusion - Our paper provides recommendations that can ensure the reproducibility of original research is maintained. The methods should be validated continuously in order to confirm that they remain functional. Validation should also be done by different individuals to identify concordance levels. - Just a couple of additional sentences could be added

  2. "Common Workflow Language" - should I search more on this to add?

  5. Strong incentives and more automation? - When it comes to incentives for following the FAIR principles and our guidelines, we could add that automation for reproducibility could be a potential business avenue for companies (in the biomedical sciences field) to explore, and that researchers can avoid future lawsuits for un-FAIR means.

Comments from Pierre

To be truly honest, your manuscript is very well advanced and carries much valuable advice. I don't know if I could improve much on what you have written so far.

In any case, here are the subjects I may help you with:

These sound like great suggestions. I will work on incorporating them into the MS

Rebuttal comments from Pierre

R2P1.
In the rebuttal, in the response, typo: "reprodcibility" -> "reproducibility"
This is another interesting use case for reproducibility: https://academic.oup.com/gigascience/article/7/7/giy077/5046609
(but less controversial).

R2P3
In the 'Challenges' section we mention the need for increased education in data science. Maybe we could also mention the need for dedicated courses on reproducibility, at least at the postgraduate/master's level?

R3P1
Absolutely! This work is the most comprehensive and accessible I've read so far.

R3P9.
In the rebuttal, in the response, typo: "Modifcation" -> "Modification"
MyST also supports over 400 journal templates (https://mystmd.org/guide/creating-pdf-documents) and it should be straightforward to produce a PDF export (though I have not tried it yet).
By the way, we should probably rename MyST-NB to simply MyST in the manuscript.

R3P24
Maybe, to go in the direction of the reviewer's thinking, we could add:
5. Use file formats that are machine-readable and compatible with many different types of software. For example, the comma-separated values (CSV) and tab-separated values (TSV) formats are simple file specifications that are suitable for most cases. For very large data or when data type is important, HDF5 or Parquet are suitable, well-documented file formats.
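To illustrate the suggested wording (the packages here are my own suggestions, not part of the proposed text), writing the same table as CSV/TSV or as Parquet in R might look like this:

```r
# Illustrative only: plain-text vs type-aware formats for sharing tables.
library(readr)   # CSV/TSV writers
library(arrow)   # Parquet, preserves column data types

counts <- data.frame(gene = c("TP53", "BRCA1"), sample1 = c(120L, 45L))

# CSV/TSV: simple, human-readable, suitable for most cases
write_csv(counts, "counts.csv")
write_tsv(counts, "counts.tsv")

# Parquet: compressed and type-aware, better for very large data
write_parquet(counts, "counts.parquet")
str(read_parquet("counts.parquet"))  # integer column type is preserved
```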

Recognizing importance of testing and validation

Hello @pierrepo and @anusuiyaxbora. I'm starting to think that testing and continuous analysis deserve a dedicated section. I'm proposing to move the text from "future directions" to after the "Documentation" subheading, under the new subheading "Testing and continuous analysis". I propose changing the Fig 1 diagram so that the pediment on top of the pillars is "testing and continuous analysis". I'm happy to make the first draft of this if you are interested. Please let me know your thoughts.
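For the new section, a test of an analysis helper could be as small as the sketch below (testthat is one option; the cpm() helper is hypothetical). Such tests can then be run automatically on every commit by a continuous-integration service, which is the "continuous analysis" part.

```r
# Hypothetical example of testing a small analysis helper with testthat.
library(testthat)

# Helper: normalise a count matrix to counts per million
cpm <- function(counts) {
  t(t(counts) / colSums(counts)) * 1e6
}

test_that("cpm columns each sum to one million", {
  mat <- matrix(c(10, 90, 250, 750), nrow = 2)
  expect_equal(colSums(cpm(mat)), c(1e6, 1e6))
})
```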

Random seeds (from Martin)

They seem to be important for some workflows, such as community ecology (which uses permutational ANOVA) and scRNA-seq analysis with UMAP and t-SNE.
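As an illustrative sketch (not from the manuscript), fixing and reporting the seed before any stochastic step makes the run repeatable:

```r
# Illustrative only: fix the random seed before stochastic steps and record
# the value in the report/methods.
set.seed(42)
counts <- matrix(rpois(2000, lambda = 10), ncol = 20)
km1 <- kmeans(t(counts), centers = 3)   # stand-in for a stochastic step

# Re-running with the same seed reproduces the result; the same applies
# before calls such as uwot::umap() or Rtsne::Rtsne().
set.seed(42)
counts2 <- matrix(rpois(2000, lambda = 10), ncol = 20)
km2 <- kmeans(t(counts2), centers = 3)
identical(km1$cluster, km2$cluster)     # TRUE
```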

Make parallel with the 3 pillars of open science?

We often say that open science is based on 3 pillars (see for instance https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002506 figure 1):

  1. open data
  2. open source software
  3. open access to publications

I guess we could make a parallel between these 3 pillars and the 5 for reproducibility from this work:

  • open data -> FAIR & persistent data sharing
  • open source -> Code version control & sharing
  • open access -> Documentation

Are the extra two (computational environment control & literate programming) the gist of reproducibility? Making your code explicitly work the same for the former, and making a clear connection between code and data for the latter.

Notes on portability

  • Using relative paths instead of absolute paths
  • Specifying the package of each function used, e.g. dplyr::select() (see the sketch after this list)
  • Anything else?
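A tiny sketch of both points in R (the path and column names below are made up for illustration):

```r
# Fragile: an absolute path only works on the original machine.
# dat <- read.csv("/home/mark/projects/5pillars/data/counts.csv")

# Portable: a path relative to the project root travels with the repository.
dat <- read.csv("data/counts.csv")

# Namespacing the call avoids clashes such as dplyr::select() vs MASS::select()
# and documents which package each function comes from.
top <- dplyr::select(dat, gene, baseMean)
```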
