sic's Issues

A.1.1

i'd add a screenshot where you enter the relevant info.

todo before posting:

  • update figures 4 and 5 for new plotting scripts
  • update docker of ndmg
  • update docker of sic to use new ndmg docker
  • update figure 3 panel C in paper
  • reduce number of "lists" in the paper

how to know it is running

you write that the output is verbose.
hand-hold a bit more.
show a screenshot here of what they will see.

fig 3 panel C

  • what about scalar functions of the multivariate global graph stats?
    eg (see the sketch at the end of this issue):

  • nnz

  • avg/max degree

  • avg/max weight

  • avg/max cc

  • avg/max scan stat

  • avg/max centrality

  • max eigen

?

  • of note, technically "scan stat-1" is the max of "locality stat",
    so we should update the axis label in fig 4 & 5

  • ylabels i think should say "Unit (\times 10^x)", that is, add '\times' i think?

  • y-label should never be "value" :)
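
for concreteness, here is a minimal sketch of the scalar summaries i have in mind, assuming a networkx graph g loaded from one of the ndmg graph outputs (the locality/scan statistic is left out since it is not a networkx built-in, and the function name is just a placeholder):

import networkx as nx
import numpy as np

def scalar_summaries(g):
    # one number (or avg/max pair) per multivariate statistic
    degrees = np.array([d for _, d in g.degree()])
    weights = np.array([w for _, _, w in g.edges(data="weight", default=1)])
    ccs = np.array(list(nx.clustering(g).values()))
    cents = np.array(list(nx.betweenness_centrality(g).values()))
    eigvals = np.linalg.eigvalsh(nx.to_numpy_array(g))  # undirected graph -> symmetric matrix
    return {
        "nnz": g.number_of_edges(),
        "degree (avg, max)": (degrees.mean(), degrees.max()),
        "weight (avg, max)": (weights.mean(), weights.max()),
        "cc (avg, max)": (ccs.mean(), ccs.max()),
        "centrality (avg, max)": (cents.mean(), cents.max()),
        "max eigenvalue": eigvals.max(),
    }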

sic format

i totally love it.
one thought though: when reading the appendix, the color contrast between different heading levels is a salient feature on the page. i don't think we want the heading level to be so salient.
so, i recommend just a single color for all headings.
do you have a precedent image for other people doing something like that?

of note, i changed it to see how i felt and liked it better, but definitely this is asking your opinion...

AMI

@gkiar brother suggests that if we create an AMI (a cheap one) that has everything installed already, then the "reproduction" can merely be "click this link to be brought to our jupyter notebook in the cloud".

this is one step closer... what do you think?

@disa-mhembere @randalburns

the eventual goal would be to have a "launcher" that could link to a wide variety of different "scientific cloud containers for extensible and reproducible research" (siccer)
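
roughly what i am imagining, as a sketch only (the AMI id, key pair name, and region below are placeholders, not real values):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# launch one cheap instance from the pre-baked sic AMI; ids/names are placeholders
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="sic-demo",
)
print(resp["Instances"][0]["InstanceId"])
# the "click this link" step would then just point at http://<public-dns>:8888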

feedback edit

i made some comments, and i sent it to brett's person for more detailed minor edits.
in general, it is great.

our plan, iirc, is to submit post cluster deployment in the cloud, right?
i've now lost track of where we are documenting this plan; it does not seem to be github, is it asana?
please link to it in a comment, and then close this issue once you address my minor comments.

i did a

% @gkiar <blah>

for my comments in the text

paper updates

Continuing from: #5

coding

  • finalize instructions for simply installing and launching the pipeline from EC2

writing

  • add bioconductor link (#6)
  • cite mybinder
  • make reproducibility section: instructions for launching an instance, installing docker, and running
  • make methods more like what it was (i.e., a description of the tools and services used)

response letter

If you could please provide the following, it would be much appreciated 😄:

  • example letter of response to reviewers' feedback
  • detailed instructions of how to write one of these letters (potentially less important if an example is provided)

Thanks!!!

appendix

still a lot of "we". too many.
also, too many "you", and "our"

figure 4 numbers

i think i've commented on this before, but it doesn't seem right to me still.
we can only report numbers up to a reasonable number of significant digits,
especially in axis labels, which should never have more than 2 really.
more specifically:

  • degree \leq 70
  • cc \leq 1
  • edge weight too many sigdigs
  • scan-stat too many sigdigs, and it is actually "locality stat"

also, title and x-axis are confounded in some of these. for example, most of the panel titles are really just the x-axis label, but not all: eg, for "spectrum" the x-axis is vertex number. for all the others, i don't think we have a panel title, just an x-axis (i think we can drop the title for Spectrum).

the same goes for fig 5 obviously.

also, figs 4 & 5 are now the least pretty part of the paper. why not use plotly? making figures prettier is nice, but not having nonsense label numbers is important. let's definitely do that asap (since it's already online).
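
as a rough plotly sketch of what i mean (the degree values here are placeholder data, not real graph stats):

import numpy as np
import plotly.graph_objects as go

degrees = np.random.randint(1, 70, size=70)  # placeholder data

fig = go.Figure(go.Histogram(x=degrees))
fig.update_xaxes(title_text="Degree", tickformat=".2s")  # caps tick labels at ~2 significant digits
fig.update_yaxes(title_text="Count")                     # never just "value"
fig.write_html("fig4_degree.html")                       # interactive html, easy to keep online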

methods changes for sic

each bullet is a piece of feedback directly from a reviewer; the comment underneath each is the response. Once I have a response for each, I will bake the question and answer into the relevant paragraphs of the methods section.

Methods

  • Data Storage

    point to emphasize: have de-identified data and store it in any publicly accessible way that makes you happy.

    • what kind of protocols should be considered? Only HTTP?

      either

    • If we considered to virtualize the machines, the users might want to have different access points and applied mount for instance, via NFS or CIFS.

      sure

    • Moreover, could be another API used as for instance mount the Storage as a Volume?

      sure

  • Cloud environments

    point to emphasize: middleware provides flexibility for deployment across varied compute resources

    • do you consider to use API middleware to solve the problem of different providers? There are libraries that allow to run machines from multiple clouds.

      middleware can definitely solve the problem of multiple providers; in a single "cloud" (i.e. amazon or google, but not both), such middleware can be used if one chooses but is not necessary

  • Docker

    point to emphasize: the cloud and docker enables scalability in resources and consistent performance across resources. prebuilt images and packages make such deployment relatively easy (as compared to managing a local cluster/compute resource)

    • is proposed to run in AWS EC2 in the case study. But what are the differences between run in a local datacenter?

      compute is "infinitely" scalable, machines are isolated, and hardware is consistent, in the cloud --data centers are none of these.

    • Moreover, AWS has already a service dedicated to Docker containers. Could you consider to use this kind of tools in your approach?

      Yup, ECS is awesome and we will update our deployment strategy to use it

    • On the other hand, there are already tools like Totum that may facilitate the deployment of Docker containers. Could be a pre-installed machine help to deploy new containers?

      Sure, pick a machine with docker or install docker yourself, makes no difference

  • Open standards for data

    point to emphasize: data standards make tools interoperable and goodly; data should be anonymized or equivalent so that security is never an issue.

    • what are the standards and how they are used? It should be clarified in the manuscript.

      this doesn't really make sense to me, but my best guess at an answer: standards are documented, community-accepted schemas for organizing data, and when one's data complies with a standard, tools built against it apply out-of-the-box to a wider range of datasets.

    • Did you consider several levels of security? For instance, only allow the reviewers to access the container - online available?

      again I don't really get this sentence... General policy on security is that data should be anonymized or de-identified, and there is nothing to worry about.

  • What are the differences of this architecture comparing with only publishing a README with instructions? Easy for end-user, complex for developer/researcher.

    creating a docker container is not significantly harder for the developer/researcher: they already had to install all of the dependencies for their tool to run, and write them down in a readme for it to be documented. docker simply means writing them down in a script which a virtualization engine interprets to do the installation for you (see the sketch just after this list).

  • Docker vs Vagrant?

    answer: this one is discussion, not methods.
    vagrant is a layer on top of virtualization, and can even sit on top of docker. they are not really comparable in terms of execution, only in that they both document a set of installation requirements.

    • Could be a virtual machine do the same? What are the differences for the proposed pipeline? This kind of technical details should be addressed in the discussion, because in the end, the manuscript is placed as a technical research paper.

      answer: this one is discussion, not methods.
      virtual machines could do the same, but have considerably more overhead and "hard-drive" files which can bloat the system. the benefit of docker is that, if you are running pipelines, you are ultimately running a set of scripts and then exiting the environment; all else being equal, the less overhead the better, leaving more resources available to the pipeline.
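
to make the readme-vs-docker point above concrete, here is a sketch of what "running the container" collapses to, via the docker python SDK (the image name and paths are placeholders, not our actual ones):

import docker

client = docker.from_env()

# everything the readme would describe (dependencies, environment, entrypoint) is baked
# into the image, so reproduction is a single call; image name and paths are placeholders
client.containers.run(
    "bids/ndmg",
    command=["/data/in", "/data/out", "participant"],
    volumes={"/home/user/data": {"bind": "/data", "mode": "rw"}},
    remove=True,
)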

in methods section

for each decision, put in a paragraph explaining why we made the decision,
and the details of what it means to have made it.

eg, for S3, what commands do we use? (rough sketch below)
for BIDS, did we need to organize derivatives, subject level/group level, in some particular way to stay in accordance with the spec?

jupyter vs. R notebooks (which now exist): we don't need a strong reason, we can say other options would also be great.

etc.
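
eg, for the S3 paragraph, something like the following (bucket, key, and file names are placeholders):

import boto3

s3 = boto3.client("s3")

# push raw BIDS-organized data up; bucket/key names are placeholders
s3.upload_file("sub-01_ses-1_dwi.nii.gz",
               "sic-data", "NKI1/sub-01/ses-1/dwi/sub-01_ses-1_dwi.nii.gz")

# pull a derivative graph back down
s3.download_file("sic-data", "NKI1/derivatives/graphs/sub-01_ses-1.graphml",
                 "sub-01_ses-1.graphml")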

pre-sending to nicole

  • take out group level and see if it feels empty
  • fix figure 1 MECE <- more about scientist and less about six things, OR, vice versa
  • improve separability of computing and deployment
  • push deployment above computing
  • make 5, 6 for steps
  • tables from appendix in results
  • push demo data into docker

make so it can compile locally for jovo

not particularly urgent, but possibly related to arxiv not accepting it (note that fontspec generally requires xelatex or lualatex, so building with the pdflatex engine, as below, may be the culprit):

[Compiling /Users/jovo/Research/Projects/Greg/sic/sic.tex]

TraditionalBuilder: Engine: pdflatex. Invoking latexmk... done.

Errors:

/usr/local/texlive/2014/texmf-dist/tex/latex/fontspec/fontspec.sty:41: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [ }]

No warnings.

[Done!]

fig 1

challenges and address of challenges now in parallel structure in text.
but, figure does not allude to any of the challenges.
perhaps it could/should
(possibly post submission)

session?

is that common usage? i would have thought "scan"?
show me a precedent if you think session is more prevalent?

sic jupyter notebook

still says "group level analysis"
it is particularly confusing because it is literally impossible to do with only 1 subject.

update description of how to use

  • 'if you get "process interrupted" as an output for any step refresh the page and start again from the top; it's due to the server rebooting which it is scheduled to do every few hours.'

Lit Review

Read the following:

Summarize these works and provide context for ours:

  • write single paragraph to this end

SIC Updates

Next steps from: neurodata/m2g#99

code

  • setup aws micro instance ami which launches a jupyter server
  • lock the terminal of the jupyter server so they can only run the commands we want (@alexbaden)
  • launch SIC ECS job from CLI in jupyter server instance (see the sketch after this list)
  • determine where to store data derivatives/how to provide access for our users
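
For the ECS launch item above, a rough boto3 sketch (the cluster and task definition names are placeholders):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# kick off one SIC pipeline run as an ECS task; names are placeholders
resp = ecs.run_task(
    cluster="sic-cluster",
    taskDefinition="sic-ndmg:1",
    count=1,
)
print(resp["tasks"][0]["taskArn"])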

writing

  • add screenshot of terminal interface for users
  • rework intro to focus on SIC, not just scientific containers
  • Add AWS setup instructions/description to Methods
  • Push most of current Reproduction Instruction into Methods
  • add instructions for the new interface

results section

should have the following subsections:

  1. intro paragraphs
  2. subject level analysis - briefly describe each step, i think this has to include computing subject level graph stats. reference future manuscript.
  3. population level analysis - briefly describe each step, also reference future manuscript. i think this is just plotting them all together?
  4. extension options - describe the kinds of extensions, refer to figure.
  5. demo - explain what the demo does

i've added blank sections for you :)

"Other tools"

that does not belong in methods, that goes in discussion

figs feedback

fantastic draft.

fig 1: looks great

table 1: i'd left-align everything. i don't agree with all the pros and cons, let's discuss

fig 2: can't read things, and figure shown is a bit redundant. i presume the caption is just a stand in. let's brainstorm a bit about how to make it more clear, zooming in on crucial bits, for example.

fig 3: i think it should be fig 3 and 4 (separately). i never like (a) and (b) the way latex does it. i think axes should typically be log X, rather than X, so you don't need to write 10^n all the time. the nnz one isn't very clear; i'd use a violin plot. though color is gratuitous here, it might be worth it?

let's discuss more tomorrow

paragraph 2 of paper

there are several challenges, para 2 should enumerate them each in a sentence:

  • data not open access
  • data not organized in a fashion immediately amenable to analysis
  • code not open source
  • code not organized in a fashion immediately amenable to run
  • various software dependencies/installations (of particular versions)
  • various hardware dependencies

paragraph 3 provides a conceptual explanation of our resolution of all of these.

makes sense?

intro

literally enumerate the challenges in par 2.
first,
second,
third,
etc.

you write "first", but then no "second".

in par 3, build up the solution:
data standards solve some stuff
jupyter solves some stuff
containers solve some stuff
science in the cloud solves all stuff

paragraph 3 is where you get to put our work in the context of previous art, explicitly.

paper narrative

the paper is currently written like a "how to" manual that one might get from ikea.
it is very easy to read, but not exactly what technical journals want.
after figuring out very carefully what is the main gap that we are filling,
i think the following organization would be a big improvement

  1. a "methods" section, including how to (in general)
    a. organize data into standard specification,
    b. get data online,
    c. put code in virtual machine,
    d. get VM in cloud,
    e. run on data
    f. store derivatives in standard spec.

imagine a paragraph describing each of those steps, plus a figure/panel demonstrating each "in general".

  2. the main use case: sic:ndmg,
    which explains the choices we made for each step (BIDS, S3, docker, EC2, etc.), and why
    (again, with a figure making each decision clear).

why BIDS vs something else?
why S3 vs Google vs etc...

  3. extensions
    a. replace data with different data
    b. replace some code with different code
    c. update docker
    d. add analyses

(again, a figure showing the impact of each decision, eg, when replacing data, showing the results change, when updating code, results change, changing algorithm, results change, add analysis, results change, etc.)

  4. discussion points out that this sic framework can be applied quite widely, and how it can be improved (1-click auto-launch of stuff), etc.

i'm not sure whether 1&2 are completely separable, and i'm not sure whether we need 3 to get the paper into gigascience.

i am sure that the way it is currently written will raise eyebrows.
the thing that will be most difficult about this paper is that it is already so far from what people expect; we want the structure to be as familiar as possible, so they can really focus their cognitive energy on the contents, rather than the structure/tone.

right now, structure, tone, and contents are all unfamiliar. let's make only 1 of them unfamiliar (the contents)?

20/60/20

that is the ratio of words of intro/results/discussion.

our discussion should put our work back into the greater context,
explaining caveats, related efforts, next steps, each with a paragraph.

does the neurodata ocp_template repo make any of this clear? if not, i'll fix
(i'm not sure i ever pointed you to it).

Reviewer feedback

Overview

Organized below is the feedback we received in the first submission of the SIC manuscript to Gigascience. I attempted to break the suggestions out into bulleted lists where each bullet corresponds to an action I can take/item I can address. An indented quote block is text from the reviewer explaining the bulleted items nearby.

My plan is to address each of these in the manuscript, and as I do, add a comment of my own discussing how I addressed the changes, as I will need to upload that in resubmission.

My goal is to be done addressing all contents of this issue by January 15th, 2017, one month from today.


Web Service

  • clarifying instructions #42
  • clarifying instructions #41

Figures

  • potentially mirroring challenges of text in figure #39

  • fixing axes labels #37

  • While the authors have cost estimates spread throughout the paper, I believe further discussion is necessary.

    • Thus, perhaps it is advisable that the authors include, for the pipeline in Fig 2, how much time each step took, how much it cost, etc (maybe a table)?

It would help the readers to understand for a typically sized study how much does it cost to upload data, store them for X days/months, download them, and for computation. Based on our experience what was costly to store was the registration non-linear warps on the cloud and we had to keep special scripts to keep clean our data store.

Minor formatting

  • First line of discussion, there is a doubled "the".

Lit review

  • additional paper to cite #43
  • ported issue for handling lit review #45

In its current form, it suffers from a few main issues (that some could be remedied):

  • Lack of a fair literature review. The way the authors present it, it appears they are the first to have attempted this. For example, what is the relevance between what the authors present and:
    • G. B. Frisoni, A. Redolfi, D. Manset, M.-E. Rousseau, A. Toga, and A. C. Evans, "Virtual imaging laboratories for marker discovery in neurodegenerative diseases," Nature Reviews Neurology, vol. 7, no. 8, pp. 429-438, Jul. 2011.
    • I. Dinov, K. Lozev, P. Petrosyan, Z. Liu, P. Eggert, J. Pierce, A. Zamanyan, S. Chakrapani, J. Van Horn, D. S. Parker, R. Magsipoc, K. Leung, B. Gutman, R. Woods, and A. Toga, "Neuroimaging study designs, computational analyses and data provenance using the LONI pipeline," PLoS ONE, vol. 5, no. 9, pp. e13 070+, Sep. 2010.
    • neuGRID
    • outGRID
    • the effort on NeuroDebian
    • Neurodebian on AWS (EC2) https://www.nitrc.org/forum/forum.php?forum_id=3664
    • M. Minervini, M. Damiano, V. Tucci, A. Bifone, A. Gozzi, S.A. Tsaftaris, "Mouse Neuroimaging Phenotyping in the Cloud," 3rd International Conference on Image Processing Theory, Tools and Applications, Special Session on Special Session on High Performance Computing in Computer Vision Applications (HPC-CVA) , Istanbul, Turkey, Oct 15-18, 2012.
    • M. Minervini, C. Rusu, M. Damiano, V. Tucci, A. Bifone, A. Gozzi, S.A. Tsaftaris, "Large-Scale Analysis of Neuroimaging Data on Commercial Clouds with Content-Aware Resource Allocation Strategies," International Journal of High Performance Computing Applications, Jan 17, 2014.

I personally find relevance to the above methods at least in terms of motivation (albeit some may have used different methods). Obviously the last two were authored by my team a few years back, on the basis of a different Python based backbone that is now defunct (PiCloud). But the second one (last in the list), it went even beyond that: it considered optimization of resources (type of Amazon instance) with a machine learning method that predicted resource needs for non-linear registration in a pipeline of atlas based segmentation.
I am really fond of the approach of the authors as it adopts newer technologies (containers etc) that can perhaps make such systems future-proof. I should note that some of the technologies are used also by other systems on different applications. For example, there is US based initiative called CyVerse (iPlant) which the authors could explore as well.

Feasibility

  • Lack of discussion on how the current approach can be extended to use other tools such as freesurfer, ANTs etc

As I am sure you are aware, the same neuroimaging tools don't work for everyone. While I agree with the idea of having standardized pipelines, the ability to evolve said pipelines (as forks) can help the system evolve and (even) be maintained. Can you please expand on this.

Unfortunately, from at least how I understand the code, it appears that to do the same pipeline for the NKI1 dataset (40 scans) the process is linear (ie one scan after the others). This is enforced by the comment of the authors in the discussion, related to Kubernetes, "would help enable SIC to scale well when working with big-data or running many parallel jobs. " If this is true, the SIC framework loses one of the greatest aspects of cloud computing: that of scalability.

  • The authors should comment on this, particularly as this would make a proper fit for the GigaScience journal.

In my vision, the main difficulty to address in the proposed pipeline, is the inherent complexity. For instance, while the authors propose the use of Docker containers to create easily setup scripts and data loading, in a real scenario there are two main criticisms: 1) the complexity of creating the Docker container by the research groups, for instance, considering the data scientists associated to the MRI problem may not have that knowledge; 2) to run the containers, it is still needed some technology background.

  • Thus, the methodology and guidelines should be considered to approach the problem, and the strengths and weakness should be presented in discussion.

Methods

  • ported issue for handling methods #46

  • Data Storage

    • what kind of protocols should be considered? Only HTTP?
    • If we considered to virtualize the machines, the users might want to have different access points and applied mount for instance, via NFS or CIFS.
    • Moreover, could be another API used as for instance mount the Storage as a Volume?
  • Cloud environments

    • do you consider to use API middleware to solve the problem of different providers? There are libraries that allow to run machines from multiple clouds.
  • Docker

    • is proposed to run in AWS EC2 in the case study. But what are the differences between run in a local datacenter?
    • Moreover, AWS has already a service dedicated to Docker containers. Could you consider to use this kind of tools in your approach?
    • On the other hand, there are already tools like Totum that may facilitate the deployment of Docker containers. Could be a pre-installed machine help to deploy new containers?
  • Open standards for data

    • what are the standards and how they are used? It should be clarified in the manuscript.
  • Did you consider several levels of security? For instance, only allow the reviewers to access the container - online available?

  • What are the differences of this architecture comparing with only publishing a README with instructions? Easy for end-user, complex for developer/researcher.

  • Docker vs Vagrant?

    • Could be a virtual machine do the same? What are the differences for the proposed pipeline? This kind of technical details should be addressed in the discussion, because in the end, the manuscript is placed as a technical research paper.

gap

what is the main "gap" we are filling with "science in the cloud"

my feeling is that "reproducibility" is but 1 of many things,
and sells your work short.

science in the cloud solves many problems:

  1. i want to interact with data from anywhere
  2. i want anybody else to be able to interact with data from anywhere
  3. i want transparency of all the steps performed in the analysis to be able to evaluate other work
  4. i want to be able to run somebody else's analysis again on their data to verify their claims
  5. i want to perform a "sensitivity analysis" on their data by varying their procedures minimally
  6. i want to run the same analysis on my data to see whether i find the same thing
  7. i want to build on their work to catapult science into the next great discovery

in a certain sense, i think all of the above steps are just steps toward, or special cases of, the 7th step: extending science.

so, my sense is that the abstract & intro should focus on extending, rather than reproducing.

i know that you've designed things in a particular way that makes certain extensions harder, and certain reproductions easier. that is all true.
nonetheless, the gap we are trying to fill is about extending, and we are taking 1 step in that direction (1 huge step).

in the abstract and intro, we want to start as big as possible, and only like in the last sentence/paragraph provide details of which aspect of this monumental challenge we are addressing.

does that make sense?

clarify cell 1

what is happening that is taking 3-4 minutes?
seems like most of it is downloading?
does it make sense to break it up into multiple cells?
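
one way to split it, as a sketch (the URL is a placeholder): put the download in its own cell with visible progress, so the slow step is obvious, and keep the actual processing in the next cell.

# cell 1: download the demo data by itself, with progress, so the 3-4 minutes are explained
from urllib.request import urlretrieve

URL = "https://example.org/sic-demo/sub-01_dwi.nii.gz"  # placeholder URL

def report(blocks, block_size, total):
    done = min(blocks * block_size, total)
    print("\rdownloaded %.0f / %.0f MB" % (done / 1e6, total / 1e6), end="")

urlretrieve(URL, "sub-01_dwi.nii.gz", reporthook=report)

# cell 2 (and onward): run the pipeline on the file fetched above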
