sic's Issues

A.1.1

i'd add a screenshot where you enter the relevant info.

todo before posting:

  • update figures 4 and 5 for new plotting scripts
  • update docker of ndmg
  • update docker of sic to use new ndmg docker
  • update figure 3 panel C in paper
  • reduce number of "lists" in the paper

how to know it is running

you write that the output is verbose.
hand-hold a bit more.
show a screenshot here of what they will see.

fig 3 panel C

  • what about scalar functions of the multivariate global graph stats?
    eg (see the sketch at the end of this issue):

  • nnz

  • avg/max degree

  • avg/max weight

  • avg/max cc

  • avg/max scan stat

  • avg/max centrality

  • max eigen

?

  • of note, technically "scan stat-1" is the max of "locality stat",
    so we should update the axis label in fig 4 & 5

  • ylabels i think should say "Unit (\times 10^x)", that is, add '\times' i think?

  • y-label should never be "value" :)
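
for concreteness, here is a minimal sketch of the scalar summaries i have in mind, assuming a networkx graph g loaded from one of the ndmg graph outputs (the locality/scan statistic is left out since it is not a networkx built-in, and the function name is just a placeholder):

import networkx as nx
import numpy as np

def scalar_summaries(g):
    # one number (or avg/max pair) per multivariate statistic
    degrees = np.array([d for _, d in g.degree()])
    weights = np.array([w for _, _, w in g.edges(data="weight", default=1)])
    ccs = np.array(list(nx.clustering(g).values()))
    cents = np.array(list(nx.betweenness_centrality(g).values()))
    eigvals = np.linalg.eigvalsh(nx.to_numpy_array(g))  # undirected graph -> symmetric matrix
    return {
        "nnz": g.number_of_edges(),
        "degree (avg, max)": (degrees.mean(), degrees.max()),
        "weight (avg, max)": (weights.mean(), weights.max()),
        "cc (avg, max)": (ccs.mean(), ccs.max()),
        "centrality (avg, max)": (cents.mean(), cents.max()),
        "max eigenvalue": eigvals.max(),
    }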

sic format

i totally love it.
one thought though: when reading the appendix, the color contrast between different heading levels is a salient feature on the page. i don't think we want the heading level to be so salient.
so, i recommend just a single color for all headings.
do you have a precedent image for other people doing something like that?

of note, i changed it to see how i felt and liked it better, but definitely this is asking your opinion...

AMI

@gkiar brother suggests that if we create an AMI (a cheap one) that has everything installed already, then the "reproduction" can merely be "click this link to be brought to our jupyter notebook in the cloud".

this is one step closer... what do you think?

@disa-mhembere @randalburns

the eventual goal would be to have a "launcher" that could link to a wide variety of different "scientific cloud containers for extensible and reproducible research" (siccer)
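
roughly what i am imagining, as a sketch only (the AMI id, key pair name, and region below are placeholders, not real values):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# launch one cheap instance from the pre-baked sic AMI; ids/names are placeholders
resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="sic-demo",
)
print(resp["Instances"][0]["InstanceId"])
# the "click this link" step would then just point at http://<public-dns>:8888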

feedback edit

i made some comments, and i sent it to brett's person for more detailed minor edits.
in general, it is great.

our plan, iirc, is to submit post cluster deployment in the cloud, right?
i've now lost track of where we are documenting this plan; it does not seem to be github, is it asana?
please link to it in a comment, and then close this issue once you address my minor comments.

i did a

% @gkiar <blah>

for my comments in the text

paper updates

Continuing from: #5

coding

  • finalize instructions for simply installing and launching the pipeline from EC2

writing

  • add bioconductor link (#6)
  • cite mybinder
  • make reproducibility section: instructions for launching an instance, installing docker, and running
  • make methods more like what it was (i.e., a description of the tools and services used)

response letter

If you could please provide the following, it would be much appreciated 😄:

  • example letter of response to reviewers' feedback
  • detailed instructions of how to write one of these letters (potentially less important if an example is provided)

Thanks!!!

appendix

still a lot of "we". too many.
also, too many "you", and "our"

figure 4 numbers

i think i've commented on this before, but it doesn't seem right to me still.
we can only report numbers up to a reasonable number of significant digits,
especially in axis labels, which should never have more than 2 really.
more specifically:

  • degree \leq 70
  • cc \leq 1
  • edge weight too many sigdigs
  • scan-stat too many sigdigs, and it is actually "locality stat"

also, title and x-axis are confounded in some of these. for example, most of the panel titles are really just the x-axis label, but not all: eg, for "spectrum" the x-axis is vertex number. for all the others, i don't think we have a panel title, just an x-axis (i think we can drop the title for Spectrum).

the same goes for fig 5 obviously.

also, figs 4 & 5 are now the least pretty part of the paper. why not use plotly? making figures prettier is nice, but not having nonsense label numbers is important. let's definitely do that asap (since it's already online).
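
as a rough plotly sketch of what i mean (the degree values here are placeholder data, not real graph stats):

import numpy as np
import plotly.graph_objects as go

degrees = np.random.randint(1, 70, size=70)  # placeholder data

fig = go.Figure(go.Histogram(x=degrees))
fig.update_xaxes(title_text="Degree", tickformat=".2s")  # caps tick labels at ~2 significant digits
fig.update_yaxes(title_text="Count")                     # never just "value"
fig.write_html("fig4_degree.html")                       # interactive html, easy to keep online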

methods changes for sic

each bullet is a piece of feedback directly from a reviewer; the comment underneath each is the response. Once I have a response for each, I will bake the question and answer into the relevant paragraphs of the methods section.

Methods

  • Data Storage

    point to emphasize: have de-identified data and store it in any publicly accessible way that makes you happy.

    • what kind of protocols should be considered? Only HTTP?

      either

    • If we considered to virtualize the machines, the users might want to have different access points and applied mount for instance, via NFS or CIFS.

      sure

    • Moreover, could be another API used as for instance mount the Storage as a Volume?

      sure

  • Cloud environments

    point to emphasize: middleware provides flexibility for deployment across varied compute resources

    • do you consider to use API middleware to solve the problem of different providers? There are libraries that allow to run machines from multiple clouds.

      middleware can definitely solve the problem of multiple providers; in a single "cloud" (i.e. amazon or google, but not both), such middleware can be used if one chooses but is not necessary

  • Docker

    point to emphasize: the cloud and docker enables scalability in resources and consistent performance across resources. prebuilt images and packages make such deployment relatively easy (as compared to managing a local cluster/compute resource)

    • is proposed to run in AWS EC2 in the case study. But what are the differences between run in a local datacenter?

      compute is "infinitely" scalable, machines are isolated, and hardware is consistent, in the cloud --data centers are none of these.

    • Moreover, AWS has already a service dedicated to Docker containers. Could you consider to use this kind of tools in your approach?

      Yup, ECS is awesome and we will update our deployment strategy to use it

    • On the other hand, there are already tools like Totum that may facilitate the deployment of Docker containers. Could be a pre-installed machine help to deploy new containers?

      Sure, pick a machine with docker or install docker yourself, makes no difference

  • Open standards for data

    point to emphasize: data standards make tools interoperable and goodly; data should be anonymized or equivalent so that security is never an issue.

    • what are the standards and how they are used? It should be clarified in the manuscript.

      this doesn't really make sense to me, but my best guess at an answer: standards are documented, community-accepted schemas for organizing data, and when one's data complies with a standard, tools built against it apply out-of-the-box to a wider range of datasets.

    • Did you consider several levels of security? For instance, only allow the reviewers to access the container - online available?

      again I don't really get this sentence... General policy on security is that data should be anonymized or de-identified, and there is nothing to worry about.

  • What are the differences of this architecture comparing with only publishing a README with instructions? Easy for end-user, complex for developer/researcher.

    creating a docker container is not significantly harder for the developer/researcher: they already had to install all of the dependencies for their tool to run, and write them down in a readme for it to be documented. docker simply means writing them down in a script which a virtualization engine interprets to do the installation for you (see the sketch just after this list).

  • Docker vs Vagrant?

    answer: this one is discussion, not methods.
    vagrant is a layer on top of virtualization, and can even sit on top of docker. they are not really comparable in terms of execution, only in that they both document a set of installation requirements.

    • Could be a virtual machine do the same? What are the differences for the proposed pipeline? This kind of technical details should be addressed in the discussion, because in the end, the manuscript is placed as a technical research paper.

      answer: this one is discussion, not methods.
      virtual machines could do the same, but have considerably more overhead and "hard-drive" files which can bloat the system. the benefit of docker is that, if you are running pipelines, you are ultimately running a set of scripts and then exiting the environment; all else being equal, the less overhead the better, leaving more resources available to the pipeline.
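
to make the readme-vs-docker point above concrete, here is a sketch of what "running the container" collapses to, via the docker python SDK (the image name and paths are placeholders, not our actual ones):

import docker

client = docker.from_env()

# everything the readme would describe (dependencies, environment, entrypoint) is baked
# into the image, so reproduction is a single call; image name and paths are placeholders
client.containers.run(
    "bids/ndmg",
    command=["/data/in", "/data/out", "participant"],
    volumes={"/home/user/data": {"bind": "/data", "mode": "rw"}},
    remove=True,
)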

in methods section

for each decision, put in a paragraph explaining why we made the decision,
and the details of what it means to have made it.

eg, for S3, what commands do we use? (rough sketch below)
for BIDS, did we need to organize derivatives, subject level/group level, in some particular way to stay in accordance with the spec?

jupyter vs. R notebooks (which now exist): we don't need a strong reason, we can say other options would also be great.

etc.
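
eg, for the S3 paragraph, something like the following (bucket, key, and file names are placeholders):

import boto3

s3 = boto3.client("s3")

# push raw BIDS-organized data up; bucket/key names are placeholders
s3.upload_file("sub-01_ses-1_dwi.nii.gz",
               "sic-data", "NKI1/sub-01/ses-1/dwi/sub-01_ses-1_dwi.nii.gz")

# pull a derivative graph back down
s3.download_file("sic-data", "NKI1/derivatives/graphs/sub-01_ses-1.graphml",
                 "sub-01_ses-1.graphml")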

pre-sending to nicole

  • take out group level and see if it feels empty
  • fix figure 1 MECE <- more about scientist and less about six things, OR, vice versa
  • improve separability of computing and deployment
  • push deployment above computing
  • make 5, 6 for steps
  • tables from appendix in results
  • push demo data into docker

make so it can compile locally for jovo

not particularly urgent, but possibly related to arxiv not accepting it (note that fontspec generally requires xelatex or lualatex, so building with the pdflatex engine, as below, may be the culprit):

[Compiling /Users/jovo/Research/Projects/Greg/sic/sic.tex]

TraditionalBuilder: Engine: pdflatex. Invoking latexmk... done.

Errors:

/usr/local/texlive/2014/texmf-dist/tex/latex/fontspec/fontspec.sty:41: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! [ }]

No warnings.

[Done!]

fig 1

challenges and address of challenges now in parallel structure in text.
but, figure does not allude to any of the challenges.
perhaps it could/should
(possibly post submission)

session?

is that common usage? i would have thought "scan"?
show me a precedent if you think session is more prevalent?

sic jupyter notebook

still says "group level analysis"
it is particularly confusing because it is literally impossible to do with only 1 subject.

update description of how to use

  • 'if you get "process interrupted" as an output for any step refresh the page and start again from the top; it's due to the server rebooting which it is scheduled to do every few hours.'

Lit Review

Read the following:

Summarize these works and provide context for ours:

  • write single paragraph to this end

SIC Updates

Next steps from: neurodata/m2g#99

code

  • setup aws micro instance ami which launches a jupyter server
  • lock the terminal of the jupyter server so they can only run the commands we want (@alexbaden)
  • launch SIC ECS job from CLI in jupyter server instance (see the sketch after this list)
  • determine where to store data derivatives/how to provide access for our users
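
For the ECS launch item above, a rough boto3 sketch (the cluster and task definition names are placeholders):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# kick off one SIC pipeline run as an ECS task; names are placeholders
resp = ecs.run_task(
    cluster="sic-cluster",
    taskDefinition="sic-ndmg:1",
    count=1,
)
print(resp["tasks"][0]["taskArn"])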

writing

  • add screenshot of terminal interface for users
  • rework intro to focus on SIC, not just scientific containers
  • Add AWS setup instructions/description to Methods
  • Push most of current Reproduction Instruction into Methods
  • add instructions for the new interface

results section

should have the following subsections:

  1. intro paragraphs
  2. subject level analysis - briefly describe each step, i think this has to include computing subject level graph stats. reference future manuscript.
  3. population level analysis - briefly describe each step, also reference future manuscript. i think this is just plotting them all together?
  4. extension options - describe the kinds of extensions, refer to figure.
  5. demo - explain what the demo does

i've added blank sections for you :)

"Other tools"

that does not belong in methods, that goes in discussion

figs feedback

fantastic draft.

fig 1: looks great

table 1: i'd left-align everything. i don't agree with all the pros and cons, let's discuss

fig 2: can't read things, and figure shown is a bit redundant. i presume the caption is just a stand in. let's brainstorm a bit about how to make it more clear, zooming in on crucial bits, for example.

fig 3: i think it should be fig 3 and 4 (separately). i never like (a) and (b) the way latex does it. i think axes should typically be log X, rather than X, so you don't need to write 10^n all the time. the nnz one isn't very clear; i'd use a violin plot. though color is gratuitous here, it might be worth it?

let's discuss more tomorrow

paragraph 2 of paper

there are several challenges, para 2 should enumerate them each in a sentence:

  • data not open access
  • data not organized in a fashion immediately amenable to analysis
  • code not open source
  • code not organized in a fashion immediately amenable to run
  • various software dependencies/installations (of particular versions)
  • various hardware dependencies

paragraph 3 provides a conceptual explanation of our resolution of all of these.

makes sense?

intro

literally enumerate the challenges in par 2.
first,
second,
third,
etc.

you write "first", but then no "second".

in par 3, build up the solution:
data standards solve some stuff
jupyter solves some stuff
containers solve some stuff
science in the cloud solves all stuff

paragraph 3 is where you get to put our work in the context of previous art, explicitly.

paper narrative

the paper is currently written like a "how to" manual that one might get from ikea.
it is very easy to read, but not exactly what technical journals want.
after figuring out very carefully what is the main gap that we are filling,
i think the following organization would be a big improvement

  1. a "methods" section, including how to (in general)
    a. organize data into standard specification,
    b. get data online,
    c. put code in virtual machine,
    d. get VM in cloud,
    e. run on data
    f. store derivatives in standard spec.

imagine a paragraph describing each of those steps, plus a figure/panel demonstrating each "in general".

  2. the main use case: sic:ndmg,
    which explains the choices we made for each step (BIDS, S3, docker, EC2, etc.), and why
    (again, with a figure making each decision clear).

why BIDS vs something else?
why S3 vs Google vs etc...

  3. extensions
    a. replace data with different data
    b. replace some code with different code
    c. update docker
    d. add analyses

(again, a figure showing the impact of each decision, eg, when replacing data, showing the results change, when updating code, results change, changing algorithm, results change, add analysis, results change, etc.)

  4. discussion points out that this sic framework can be applied quite widely, and how it can be improved (1-click auto-launch of stuff), etc.

i'm not sure whether 1&2 are completely separable, and i'm not sure whether we need 3 to get the paper into gigascience.

i am sure that the way it is currently written will raise eyebrows.
the thing that will be most difficult about this paper is that it is already so far from what people expect; we want the structure to be as familiar as possible, so they can really focus their cognitive energy on the contents, rather than the structure/tone.

right now, structure, tone, and contents are all unfamiliar. let's make only 1 of them unfamiliar (the contents)?

20/60/20

that is the ratio of words of intro/results/discussion.

our discussion should put our work back into the greater context,
explaining caveats, related efforts, next steps, each with a paragraph.

does the neurodata ocp_template repo make any of this clear? if not, i'll fix
(i'm not sure i ever pointed you to it).

Reviewer feedback

Overview

Organized below is the feedback we received in the first submission of the SIC manuscript to Gigascience. I attempted to break the suggestions out into bulleted lists where each bullet corresponds to an action I can take/item I can address. An indented quote block is text from the reviewer explaining the bulleted items nearby.

My plan is to address each of these in the manuscript, and as I do, add a comment of my own discussing how I addressed the changes, as I will need to upload that in resubmission.

My goal is to be done addressing all contents of this issue by January 15th, 2017, one month from today.


Web Service

  • clarifying instructions #42
  • clarifying instructions #41

Figures

  • potentially mirroring challenges of text in figure #39

  • fixing axes labels #37

  • While the authors have cost estimates spread throughout the paper, I believe further discussion is necessary.

    • Thus, perhaps it is advisable that the authors include, for the pipeline in Fig 2, how much time each step took, how much it cost, etc (maybe a table)?

It would help the readers to understand for a typically sized study how much does it cost to upload data, store them for X days/months, download them, and for computation. Based on our experience what was costly to store was the registration non-linear warps on the cloud and we had to keep special scripts to keep clean our data store.

Minor formatting

  • First line of discussion, there is a doubled "the".

Lit review

  • additional paper to cite #43
  • ported issue for handling lit review #45

In its current form, it suffers from a few main issues (that some could be remedied):

  • Lack of a fair literature review. The way the authors present it, it appears they are the first to have attempted this. For example, what is the relevance between what the authors present and:
    • G. B. Frisoni, A. Redolfi, D. Manset, M.-E. Rousseau, A. Toga, and A. C. Evans, "Virtual imaging laboratories for marker discovery in neurodegenerative diseases," Nature Reviews Neurology, vol. 7, no. 8, pp. 429-438, Jul. 2011.
    • I. Dinov, K. Lozev, P. Petrosyan, Z. Liu, P. Eggert, J. Pierce, A. Zamanyan, S. Chakrapani, J. Van Horn, D. S. Parker, R. Magsipoc, K. Leung, B. Gutman, R. Woods, and A. Toga, "Neuroimaging study designs, computational analyses and data provenance using the LONI pipeline," PLoS ONE, vol. 5, no. 9, pp. e13 070+, Sep. 2010.
    • neuGRID
    • outGRID
    • the effort on NeuroDebian
    • Neurodebian on AWS (EC2) https://www.nitrc.org/forum/forum.php?forum_id=3664
    • M. Minervini, M. Damiano, V. Tucci, A. Bifone, A. Gozzi, S.A. Tsaftaris, "Mouse Neuroimaging Phenotyping in the Cloud," 3rd International Conference on Image Processing Theory, Tools and Applications, Special Session on Special Session on High Performance Computing in Computer Vision Applications (HPC-CVA) , Istanbul, Turkey, Oct 15-18, 2012.
    • M. Minervini, C. Rusu, M. Damiano, V. Tucci, A. Bifone, A. Gozzi, S.A. Tsaftaris, "Large-Scale Analysis of Neuroimaging Data on Commercial Clouds with Content-Aware Resource Allocation Strategies," International Journal of High Performance Computing Applications, Jan 17, 2014.

I personally find relevance to the above methods at least in terms of motivation (albeit some may have used different methods). Obviously the last two were authored by my team a few years back, on the basis of a different Python based backbone that is now defunct (PiCloud). But the second one (last in the list), it went even beyond that: it considered optimization of resources (type of Amazon instance) with a machine learning method that predicted resource needs for non-linear registration in a pipeline of atlas based segmentation.
I am really fond of the approach of the authors as it adopts newer technologies (containers etc) that can perhaps make such systems future-proof. I should note that some of the technologies are used also by other systems on different applications. For example, there is US based initiative called CyVerse (iPlant) which the authors could explore as well.

Feasibility

  • Lack of discussion on how the current approach can be extended to use other tools such as freesurfer, ANTs etc

As I am sure you are aware, the same neuroimaging tools don't work for everyone. While I agree with the idea of having standardized pipelines, the ability to evolve said pipelines (as forks) can help the system evolve and (even) be maintained. Can you please expand on this.

Unfortunately, from at least how I understand the code, it appears that to do the same pipeline for the NKI1 dataset (40 scans) the process is linear (ie one scan after the others). This is enforced by the comment of the authors in the discussion, related to Kubernetes, "would help enable SIC to scale well when working with big-data or running many parallel jobs. " If this is true, the SIC framework loses one of the greatest aspects of cloud computing: that of scalability.

  • The authors should comment on this, particularly as this would make a proper fit for the GigaScience journal.

In my vision, the main difficulty to address in the proposed pipeline, is the inherent complexity. For instance, while the authors propose the use of Docker containers to create easily setup scripts and data loading, in a real scenario there are two main criticisms: 1) the complexity of creating the Docker container by the research groups, for instance, considering the data scientists associated to the MRI problem may not have that knowledge; 2) to run the containers, it is still needed some technology background.

  • Thus, the methodology and guidelines should be considered to approach the problem, and the strengths and weakness should be presented in discussion.

Methods

  • ported issue for handling methods #46

  • Data Storage

    • what kind of protocols should be considered? Only HTTP?
    • If we considered to virtualize the machines, the users might want to have different access points and applied mount for instance, via NFS or CIFS.
    • Moreover, could be another API used as for instance mount the Storage as a Volume?
  • Cloud environments

    • do you consider to use API middleware to solve the problem of different providers? There are libraries that allow to run machines from multiple clouds.
  • Docker

    • is proposed to run in AWS EC2 in the case study. But what are the differences between run in a local datacenter?
    • Moreover, AWS has already a service dedicated to Docker containers. Could you consider to use this kind of tools in your approach?
    • On the other hand, there are already tools like Totum that may facilitate the deployment of Docker containers. Could be a pre-installed machine help to deploy new containers?
  • Open standards for data

    • what are the standards and how they are used? It should be clarified in the manuscript.
  • Did you consider several levels of security? For instance, only allow the reviewers to access the container - online available?

  • What are the differences of this architecture comparing with only publishing a README with instructions? Easy for end-user, complex for developer/researcher.

  • Docker vs Vagrant?

    • Could be a virtual machine do the same? What are the differences for the proposed pipeline? This kind of technical details should be addressed in the discussion, because in the end, the manuscript is placed as a technical research paper.

gap

what is the main "gap" we are filling with "science in the cloud"

my feeling is that "reproducibility" is but 1 of many things,
and sells your work short.

science in the cloud solves many problems:

  1. i want to interact with data from anywhere
  2. i want anybody else to be able to interact with data from anywhere
  3. i want transparency of all the steps performed in the analysis to be able to evaluate other work
  4. i want to be able to run somebody else's analysis again on their data to verify their claims
  5. i want to perform a "sensitivity analysis" on their data by varying their procedures minimally
  6. i want to run the same analysis on my data to see whether i find the same thing
  7. i want to build on their work to catapult science into the next great discovery

in a certain sense, i think all of the above steps are just steps toward, or special cases of, the 7th step: extending science.

so, my sense is that the abstract & intro should focus on extending, rather than reproducing.

i know that you've designed things in a particular way that makes certain extensions harder, and certain reproductions easier. that is all true.
nonetheless, the gap we are trying to fill is about extending, and we are taking 1 step in that direction (1 huge step).

in the abstract and intro, we want to start as big as possible, and only like in the last sentence/paragraph provide details of which aspect of this monumental challenge we are addressing.

does that make sense?

clarify cell 1

what is happening that is taking 3-4 minutes?
seems like most of it is downloading?
does it make sense to break it up into multiple cells?
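
one way to split it, as a sketch (the URL is a placeholder): put the download in its own cell with visible progress, so the slow step is obvious, and keep the actual processing in the next cell.

# cell 1: download the demo data by itself, with progress, so the 3-4 minutes are explained
from urllib.request import urlretrieve

URL = "https://example.org/sic-demo/sub-01_dwi.nii.gz"  # placeholder URL

def report(blocks, block_size, total):
    done = min(blocks * block_size, total)
    print("\rdownloaded %.0f / %.0f MB" % (done / 1e6, total / 1e6), end="")

urlretrieve(URL, "sub-01_dwi.nii.gz", reporthook=report)

# cell 2 (and onward): run the pipeline on the file fetched above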
