fh_wdl101_cromwell's Introduction

WDL 101: Running WDL workflows at Fred Hutch using Cromwell

This course was created from this GitHub template.

You can see the rendered course material here: WDL 101

If you would like to contribute to this course material, take a look at the getting started GitHub wiki pages.

About this course

This course introduces how to run WDL workflows with Cromwell on the Fred Hutch HPC cluster.

Encountering problems?

If you are encountering any problems with this course, please file a GitHub issue.

Creative Commons License
All materials in this course are licensed under a Creative Commons Attribution 4.0 International License unless noted otherwise.

fh_wdl101_cromwell's People

Contributors

actions-user, avahoffman, cansavvy, jayoung, jhudsl-robot, vortexing

Forkers

jayoung

fh_wdl101_cromwell's Issues

add big picture section?

hey there,

I know I already suggested finishing up with a 'what next?' section, but here's an additional idea.

Now that I know how to run workflows, troubleshoot them, etc., I still find myself a bit confused about how I would put this into practice. The material as currently presented gets deep into the weeds, which is useful/necessary. But I'm finding it hard to keep the big picture in mind.

Here's my current understanding of the big picture of how this will work in practice:

  1. (one-time only for each user) set up a cromwell database
  2. (once a week or so, when you're running workflows) fire up a cromwell server to interact with the database
  3. use the Shiny app and/or the R+fh.wdlR to run/troubleshoot/monitor workflows
  4. after a workflow runs successfully, obtain the "Workflow Output Data" table, use that to locate the output files, and likely copy them to a more permanent directory, rather than /fh/scratch

Maybe it's useful to put that in a 'recap' section, to help the reader tie together everything we've learned in the course.

Janet

move WDL language links?

in the 'submit-jobs-tab' section of the tutorial, I read the following info:
"There is emerging documentation about the WDL specification itself being generated by the openWDL community here. Also, there is some useful, though very detailed, information in the openWDL GitHub repo for the specification itself where you can learn more."

That's great! But it seems weird to me to put it in that section - it's about how to write WDLs, not about how to submit jobs.

In fact, I have a very big picture comment - I had misunderstood what this course was going to cover. I thought I'd be learning how to write WDLs. It's totally fine that I'm not, but it would be great to set expectations, and to help me figure out where I can learn that.

Suggestion on how: at the very beginning, either in the front page or the intro it would be good to have a "What is WDL?" section. There, we set expectations that we WON'T be learning WDL here, but provide links to places we can learn WDL, and the links to the WDL specs that are currently in the 'submit jobs tab' section of the guide.

Add some help to understand 'validate workflow'

in this section I'd love a few words to help me understand what to look for when I click 'Validate workflow' in the Shiny app.

I think all I need to know at this point in the course is that the output looks a bit complex (especially for people who don't use R), but if I see valid=TRUE (or something like that), then I can proceed, and if I don't, then the other output will help me troubleshoot my WDL file.

Is that pretty much what you look for when you validate a workflow? what's the key marker of a good WDL here - valid, and/or validWorkflow and/or isRunnableWorkflow?

warn user that server connection gets dropped sometimes

Not sure if it's just me, but periodically the shiny server drops my connection (it just happened).

The symptom is that pretty much anything I try to do yields a red error message: "An error has occurred. Check your logs or contact the app author for clarification."

Easy to fix - I just reconnect to the server - but beginners will find this mysterious. Suggest adding a heads-up about that.

New Course - Template Update Enrollment

The original template: https://github.com/jhudsl/OTTR_Template is always a work in progress.
We are working on adding more features and smoothing out bugs as we go.

If you want this course to receive updates from the original template, you will need to enroll this repository in template updates by adding it to the sync.yml file.

tiny edit on runtime variables section

tiniest of tiny edits: in this section, on runtime variables:

I see this written:
Other formats that are accepted include: "memory: 2GB"

I think (at least VS Code's linter thinks) that the 2GB needs quotes around it to work. To make it clearer, also drop the quotes around the entire thing and write it like this:
Other formats that are accepted include: memory: "2GB",

clarify workflow labels

I find something confusing when I get to this step:
https://hutchdatascience.org/FH_WDL101_Cromwell/using-shiny-to-manage-workflows.html#submit-a-workflow

In the 'submit workflow' section, the shiny app lets us put in a label and a secondary label.

question - what's the point of those labels? where do they appear for a job?

I tried adding a label (hello2a) and a secondary label (hello2b) to a test job (the hello_hostname test workflow), and I don't see those labels appear anywhere in the track jobs data of the shiny app, or anywhere in the job output in /fh/scratch. I also don't find anything if I use the labels to filter in any of the track jobs filtering sections.

add a note in the fh.wdlR page about how to upload subworkflows?

maybe this is a more advanced thing, but maybe it still goes in WDL101? it seems like the WDL101 guide is converging on "how to do cromwell at the Hutch" in contrast to WDL102 covering "how to code WDL more generally"?

It was at first a mystery to me how to provide sub-workflow wdls, but you helped me in Slack. Might be good to have it in the WDL101 to refer back to.

Here's how I noted that for myself:

We can make a zip bundle of the extra WDLs on the linux command line like this:

zip subwdls sub1.wdl sub2.wdl

And we can submit that zip bundle to Cromwell via fh.wdlR using the Dependencies option of cromwellSubmitBatch():

thisJob <- cromwellSubmitBatch(WDL = "my_workflow.wdl", 
                               Params = "my_inputs.json", 
                               Dependencies = "subwdls.zip")
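
A small sketch of the bundling step, assuming the main WDL imports its sub-workflows by bare file name (for example, import "sub1.wdl"), so the files would need to sit at the top level of the zip. The function name bundle_subwdls and the paths are placeholders, not part of fh.wdlR or Cromwell. -j is zip's "junk paths" option and unzip -l only lists what ended up in the archive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: bundle sub-workflow WDLs for the Dependencies argument.
# Usage: bundle_subwdls OUTPUT.zip FILE.wdl [FILE.wdl ...]
bundle_subwdls() {
  local out="$1"
  shift
  # -j drops directory components so each WDL is stored under its bare name;
  # -q keeps zip quiet.
  zip -j -q "$out" "$@" || return 1
  # List the archive contents so the stored names can be checked by eye.
  unzip -l "$out"
}

# Example (paths are placeholders):
#   bundle_subwdls subwdls.zip workflows/sub1.wdl workflows/sub2.wdl
```

If the main workflow imports sub-workflows with relative paths instead, the -j flag would probably need to go, so that the directory layout inside the zip matches the import statements.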

chapter 4.1 - docker isn't the Hutch default config

in this section: https://hutchdatascience.org/FH_WDL101_Cromwell/fred-hutch-customizations.html#standard-runtime-variables

the top bit says "These runtime variables are both the defaults for our Fred Hutch configuration" and it lists below docker: "ubuntu:latest".

Sounds from that like I will be working in a container by default, but I think I can tell (from some tests I just ran) that the default is NOT to run within a docker container. My test is this: I add which codeml to diy-cromwell-server/testWorkflows/helloHostname.wdl and I run it (codeml is something I installed in my own gizmo environment). When the runtime block is empty, which codeml succeeds in finding my executable (/home/jayoung/malik_lab_shared/linux_gizmo/bin/codeml), so I think it can't be running within a container. But if I add which codeml to helloSingularityHostname.wdl, where the docker container is specifically requested using docker: "ubuntu:latest", it returns nothing, as I would expect when running in the container.

Solution could be just to rewrite the top bit. Maybe a separate spot would be a good place to summarize the default Hutch config - simply discuss here common config options the user MIGHT want to mess with. Or just show after each of these options what the default actually is.

Also in the Hutch-specific section: is dockerSL: "ubuntu:latest" actually the default, as implied by that section's header? Again, maybe split out list of possible config variables from the list of defaults.

I also totally don't understand the thing about the soft links right now. It might get clearer once I actually start running stuff within containers/scratch. Examples might be the best way to help me see how I'd use that.

wrap-up section for guide?

some sort of wrap-up section might be helpful. something like
'Chapter 6. What next?'

for me, 'what next?' is definitely that I want to learn wdl syntax. this wrap-up section could be a good spot to list the sources of help you linked to elsewhere in the guide (right now they're buried somewhere in the middle). Both Hutch-based help, as well as external links to WDL syntax/language definition.

I don't know what Amanda's owl meme referred to, but I can totally imagine - right now I have a big gap in my ability to actually use WDL. The guide was very helpful in seeing how I can submit/monitor/troubleshoot a workflow here at the Hutch, but figuring out the workflow itself is a huge black box for me.

chapter 4 confusion

hey,

chapter 4 (Fred Hutch customizations) is confusing to me right now, for some inter-related reasons:

  1. a lot of this stuff doesn't make a lot of sense to me YET because I know nothing about the WDL language. Not sure how best to handle that - maybe a chapter 3.5 that covers the very basics of WDL? or simply direct the learner to an external WDL basics tutorial? Also maybe more obviously point me to examples within the diy-cromwell-server/testWorkflows files where we actually use these parameters. I do find them when I go digging for them.

  2. there's a statement in section 4.2: "you can edit these in the config file if you’d like OR you can specify these variables in your runtime block in each task to change only the variables you want to change from the default for that particular task.". Presumably that applies to the standard variables AND the Hutch custom variables? Might make sense to split out discussion of the variables themselves from HOW we can specify those variables.

  3. what is the config file mentioned in 4.2? is it the cromUserConfig.txt file we used when we spun up our cromwell server? or something that's specific to an individual workflow? I think I can see examples in diy-cromwell-server/testWorkflows where you customize within the runtime blocks, but I'm not sure whether there's an example using a config file.

  4. Section 4.4: perhaps split that to a separate chapter? it's not about customizing the workflows. This would be a good location for the links to external WDL docs that I was already suggesting you move from their current location

  5. "We'll discuss some of the available customizations to help you run WDLs on our cluster in a simple way that still allows those workflows to be portable to other computing platforms." If I understand this right, I can have something like partition: "campus-new" in a WDL and it would still run on a non-Hutch system? Or do you mean I could run it using the fh-S3-AWS configuration just as easily as I can on the in-house cluster, but maybe that campus-new setting would make things crash on a totally external system?

In terms of my own learning, I think I kind of understand how to customize, but I think this doc could be laid out more clearly.

thanks!

j

add instruction for how to merge >1 json?

One thing it took me a minute to figure out is what to do when I have >1 json file. We need to concatenate them for validation (the workflow submission allows you to upload >1 json, but the validation doesn't)

Here's a note I wrote for myself. Could maybe add something like this somewhere in WDL101. Not sure where the best spot for it would be.

To merge two or more json files from the command line, we can use jq

jq -s '.[0] * .[1]' file1.json file2.json > combined.json
jq -s '.[0] * .[1] * .[2]' file1.json file2.json file3.json > combined.json

Maybe there's a better way - I would hope there's a more generalized solution where you don't need to mess with the bit in quotes when you change the number of files you're merging.
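
One more general option, sketched here on the assumption that every input file holds a single top-level JSON object: jq's reduce can fold over however many files get slurped, so the bit in quotes stays the same no matter how many files are merged. The function name merge_json and the file names are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge any number of JSON files into one object.
# -s (slurp) reads all input files into a single array; reduce then folds
# that array left to right with jq's recursive object merge operator (*),
# so a key in a later file overrides the same key in an earlier file.
merge_json() {
  jq -s 'reduce .[] as $item ({}; . * $item)' "$@"
}

# Example (file names are placeholders):
#   merge_json file1.json file2.json file3.json > combined.json
```

As with the two-file version, later files win when the same key appears more than once, and nested objects are merged key by key rather than replaced wholesale.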

add some help to interpret mystery errors

a suggestion for this spot, or nearby

as a new user who makes mistakes, I keep getting a mystery error (see suggestion here). "Error: An error has occurred. Check your logs or contact the app author for clarification".

it's because I don't have an active database connection.

maybe you'll figure out a way to make the error more informative (that would be great), or maybe it's fine to just help us out in the tutorial. Something like 'don't panic if you see this, you probably just need to re-connect to the server'

Idea: troubleshooting example

hi,

The troubleshooting I did yesterday to get annovar working (together with an in-person troubleshoot with Amy last month) made me appreciate the value of understanding failures.

How about having a section of the tutorial where we supply a workflow we know isn't going to work?

We make it fail for some very simple reason, perhaps a problem we think will crop up often - maybe input files aren't found. The tutorial could walk the user through how they would troubleshoot that.

Troubleshooting example 1: today I specified the WDL file but forgot to specify the associated JSON input file before I clicked 'submit workflow'. It took me a while to figure out what I did wrong (and yes, now I know how to troubleshoot that, but I think some people would appreciate a walk-through of what to do with that 'submit jobs-troubleshoot' button).

Troubleshooting example 2: let's say the WDL and JSON are fine, but there's something wrong with the WDL code or perhaps the input files aren't available. Some people will figure out the troubleshooting from the 'track jobs' tab just fine, but I bet others would benefit from a walk-through.

what do you think?

Janet

New Course - Set Repository Settings

For more information on these settings see instructions in Starting a new OTTR course.

New Course - Templates to Edit

Follow the instructions here on ottrproject.org for details on how to start editing your OTTR course.

The following files need to be edited to get this new course started!

Files that need to be edited upon creating a new course:

  • README.md - Fill in all the { }.
  • index.Rmd - title: should be updated.
  • 01-intro.Rmd - replace the information there with information pertinent to this new course.
  • 02-chapter_of_course.Rmd - This Rmd has examples of how to set things up, if you don't need it as a reference, it can be deleted.

Files that need to be edited upon adding each new chapter (including upon creating a new course):

  • _bookdown.yml - The list of Rmd files that need to be rendered needs to be updated. See instructions.
  • book.bib - any citations need to be added. See instructions.

Picking a style

See more about customizing style on this page in the guide.
By default this course template will use the jhudsl data science lab style. However, you can customize and switch this to another style set.

Using a style set

Read more about the style sets here.

  • On a new branch, copy the style-sets/<set-name>/index.Rmd and style-sets/<set-name>/_output.yml to the top of the repository to overwrite the default index.Rmd and _output.yml.
  • Copy over all the files in the style-sets/<set-name>/copy-to-assets to the assets folder in the top of the repository.
  • Create a pull request with these changes, and double check the rendered preview to make sure that the style is what you are looking for.

Files that need to be edited upon adding new packages that the book's code uses:

  • docker/Dockerfile needs to have the new package added so it will be installed. See instructions.
  • The code chunk in index.Rmd should be edited to add the new package.
