fhdsl / fh_wdl101_cromwell

An introduction to using Cromwell and WDL at the Fred Hutch

Home Page: https://hutchdatascience.org/FH_WDL101_Cromwell/

License: Creative Commons Attribution 4.0 International

HTML 5.18% JavaScript 9.63% CSS 46.27% TeX 4.36% Dockerfile 2.53% R 32.03%

course wdl fredhutch hutch-course

fh_wdl101_cromwell's Introduction

WDL 101: Running WDL workflows at Fred Hutch using Cromwell

This course was created from this GitHub template.

You can see the rendered course material here: WDL 101

If you would like to contribute to this course material, take a look at the getting started GitHub wiki pages.

About this course

This course introduces using Cromwell to run WDL workflows using the Fred Hutch HPC cluster.

Encountering problems?

If you are encountering any problems with this course, please file a GitHub issue.

All materials in this course are licensed under a Creative Commons Attribution 4.0 International License unless noted otherwise.

fh_wdl101_cromwell's People

Contributors

Watchers

Forkers

jayoung

fh_wdl101_cromwell's Issues

link to WDL spec v 1 rather than development version?

in this section you provide a link to the spec page for the development version of WDL:
"Also, there is some useful, though very detailed, information in the openWDL GitHub repo for the specification itself where you can learn more."

but perhaps for this purpose it's better to link to the version 1 spec, not the development version (here - https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md)

chapter numbering minor fix

in this section it says "Chapter 3 showed you how to use the Hutch Shiny app" but it should say "Chapter 4"

add big picture section?

hey there,

I know I already suggested finishing up with a 'what next?' section, but here's an additional idea.

Now that I know how to run workflows, troubleshoot them etc, I still find myself a bit confused about how I would put this into practice. The material as currently presented gets deep into the weeds, which is useful/necessary. But I'm finding it hard to keep the big picture in mind.

Here's my current understanding of the big picture of how this will work in practice:

(one-time only for each user) set up a cromwell database
(once a week or so, when you're running workflows) fire up a cromwell server to interact with the database
use the Shiny app and/or the R+fh.wdlR to run/troubleshoot/monitor workflows
after a workflow runs successfully, obtain the "Workflow Output Data" table, use that to locate the output files, and likely copy them to a more permanent directory, rather than /fh/scratch

Maybe it's useful to put that in a 'recap' section, to help the reader tie together everything we've learned in the course.
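
As an illustration of the last step in the list above, here is a minimal Python sketch of copying finished outputs out of /fh/scratch. It assumes you have already collected the output file paths (for example from the "Workflow Output Data" table) into a plain list; the function name `copy_outputs` and its arguments are hypothetical, not part of the Shiny app or fh.wdlR.

```python
import shutil
from pathlib import Path


def copy_outputs(output_paths, destination):
    """Copy each workflow output file into `destination`; return the new paths."""
    dest = Path(destination)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    copied = []
    for src in map(Path, output_paths):
        target = dest / src.name
        shutil.copy2(src, target)  # copy2 also preserves file timestamps
        copied.append(target)
    return copied
```

Note that files sharing the same name would overwrite each other in this sketch, so outputs from scattered tasks may need a sub-folder per task.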

Janet

Add developing and testing workflows section on this page?

Link out to new wdl-docs but perhaps include this information in this course? Or is it better in a different course?

womtoolValidate function doesn't work for me

in this section of the guide, the womtoolValidate function doesn't work for me - I filed an issue over in the fh.wdlR repo.

move WDL language links?

in the 'submit-jobs-tab' section of the tutorial, I read the following info:
"There is emerging documentation about the WDL specification itself being generated by the openWDL community here. Also, there is some useful, though very detailed, information in the openWDL GitHub repo for the specification itself where you can learn more."

That's great! But it seems weird to me to put it in that section - it's about how to write WDLs, not about how to submit jobs.

In fact, I have a very big picture comment - I had misunderstood what this course was going to cover. I thought I'd be learning how to write WDLs. It's totally fine that I'm not, but it would be great to set expectations, and to help me figure out where I can learn that.

Suggestion on how: at the very beginning, either in the front page or the intro it would be good to have a "What is WDL?" section. There, we set expectations that we WON'T be learning WDL here, but provide links to places we can learn WDL, and the links to the WDL specs that are currently in the 'submit jobs tab' section of the guide.

Reminder - Add user feedback method

To help users report issues or areas of improvement for your course, you should provide a clear method of feedback for your users to route their concerns through.

Read this chapter from an OTTR-made course about how to obtain user feedback.

Add some help to understand 'validate workflow'

in this section I'd love a few words to help me understand what to look for when I click 'Validate workflow' in the Shiny app.

I think all I need to know at this point in the course is that the output looks a bit complex (especially for people who don't use R), but if I see valid=TRUE (or something like that), then I can proceed, and if I don't, then the other output will help me troubleshoot my WDL file.

Is that pretty much what you look for when you validate a workflow? what's the key marker of a good WDL here - valid, and/or validWorkflow and/or isRunnableWorkflow?
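
To make the question concrete, here is a small Python sketch of the check described above. It assumes the validation result has been parsed into a dict (for example from JSON) and that `valid` is the flag to look at, which is exactly what this issue asks the maintainers to confirm. The `errors` key and the function name `summarize_validation` are hypothetical.

```python
def summarize_validation(result):
    """Return a short verdict for a validation result given as a dict."""
    # Assumption: a truthy `valid` field means the WDL can be submitted.
    if result.get("valid"):
        return "WDL is valid: OK to submit."
    # Assumption: error messages, if any, arrive as a list under `errors`.
    errors = result.get("errors") or ["(no error messages returned)"]
    return "WDL is NOT valid:\n" + "\n".join(f"  - {msg}" for msg in errors)
```

If `validWorkflow` or `isRunnableWorkflow` turn out to be the more informative markers, the same pattern applies with a different key.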

warn user that server connection gets dropped sometimes

Not sure if it's just me, but periodically the shiny server drops my connection (it just happened).

The symptom is that pretty much anything I try to do yields a red error message: "An error has occurred. Check your logs or contact the app author for clarification."

Easy to fix - I just reconnect to the server - but beginners will find this mysterious. Suggest adding a heads-up about that

New Course - Template Update Enrollment

The original template: https://github.com/jhudsl/OTTR_Template is always a work in progress.
We are working on adding more features and smoothing out bugs as we go.

If you want to receive updates from the original template to your course template, you will need to enroll this repository to the template updates by adding it to the sync.yml file.

Follow these instructions to enroll your course repository to receive these updates.
Ensure that you have followed these instructions to add the jhudsl-robot as a collaborator to your repository.

tiny edit on runtime variables section

tiniest of tiny edits: in this section, on runtime variables:

I see this written:
Other formats that are accepted include: "memory: 2GB"

I think (at least VS Code's linter thinks) that the 2GB needs quotes around it to work. To make it clearer, also drop the quotes around the entire thing and write it like this:
Other formats that are accepted include: memory: "2GB",

clarify workflow labels

I find something confusing when I get to this step:
https://hutchdatascience.org/FH_WDL101_Cromwell/using-shiny-to-manage-workflows.html#submit-a-workflow

In the 'submit workflow' section, the shiny app lets us put in a label and a secondary label.

question - what's the point of those labels? where do they appear for a job?

I tried adding a label (hello2a) and a secondary label (hello2b) to a test job (the hello_hostname test workflow), and I don't see those labels appear anywhere in the track jobs data of the shiny app, or anywhere in the job output in /fh/scratch. I also don't find anything if I use the labels to filter in any of the track jobs filtering sections.

add a note in the fh.wdlR page about how to upload subworkflows?

maybe this is a more advanced thing, but maybe it still goes in WDL101? it seems like the WDL101 guide is converging on "how to do cromwell at the Hutch" in contrast to WDL102 covering "how to code WDL more generally"?

It was at first a mystery to me how to provide sub-workflow wdls, but you helped me in Slack. Might be good to have it in the WDL101 to refer back to.

Here's how I noted that for myself:

We can make a zip bundle of the extra WDLs on the linux command line like this:

zip subwdls sub1.wdl sub2.wdl

And we can submit that zip bundle to Cromwell via fh.wdlR using the Dependencies option of cromwellSubmitBatch():

thisJob <- cromwellSubmitBatch(WDL = "my_workflow.wdl", 
                               Params = "my_inputs.json", 
                               Dependencies = "subwdls.zip")
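
For anyone who prefers to build the bundle from a script rather than the `zip` command, here is a minimal Python sketch of the same step using the standard-library `zipfile` module. The function name `bundle_subworkflows` is hypothetical, and the file names simply mirror the example above.

```python
import zipfile
from pathlib import Path


def bundle_subworkflows(wdl_paths, zip_path="subwdls.zip"):
    """Write the given WDL files into a flat zip archive and return its path."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        for wdl in map(Path, wdl_paths):
            # Store each file under its bare name, matching what
            # `zip subwdls sub1.wdl sub2.wdl` produces in the same directory.
            bundle.write(wdl, arcname=wdl.name)
    return Path(zip_path)
```

Calling `bundle_subworkflows(["sub1.wdl", "sub2.wdl"])` should leave a `subwdls.zip` in the working directory, which can then be passed to the Dependencies option as shown above.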

chapter 4.1 - docker isn't the Hutch default config

in this section: https://hutchdatascience.org/FH_WDL101_Cromwell/fred-hutch-customizations.html#standard-runtime-variables

the top bit says "These runtime variables are both the defaults for our Fred Hutch configuration" and it lists below docker: "ubuntu:latest".

Sounds from that like I will be working in a container by default, but I think I can tell (from some tests I just ran) that the default is NOT to run within a docker container. My test is this: I add which codeml to diy-cromwell-server/testWorkflows/helloHostname.wdl and I run it (codeml is something I installed in my own gizmo environment). When the runtime block is empty, which codeml succeeds in finding my executable (/home/jayoung/malik_lab_shared/linux_gizmo/bin/codeml), so I think it can't be running within a container. But if I add which codeml to helloSingularityHostname.wdl, where the docker container is specifically requested using docker: "ubuntu:latest", it returns nothing, as I would expect when running in the container.

Solution could be just to rewrite the top bit. Maybe a separate spot would be a good place to summarize the default Hutch config - simply discuss here common config options the user MIGHT want to mess with. Or just show after each of these options what the default actually is.

Also in the Hutch-specific section: is dockerSL: "ubuntu:latest" actually the default, as implied by that section's header? Again, maybe split out list of possible config variables from the list of defaults.

I also totally don't understand the thing about the soft links right now. It might get clearer once I actually start running stuff within containers/scratch. Examples might be the best way to help me see how I'd use that.

wrap-up section for guide?

some sort of wrap-up section might be helpful. something like
'Chapter 6. What next?'

for me, 'what next?' is definitely that I want to learn wdl syntax. this wrap-up section could be a good spot to list the sources of help you linked to elsewhere in the guide (right now they're buried somewhere in the middle). Both Hutch-based help, as well as external links to WDL syntax/language definition.

I don't know what Amanda's owl meme referred to, but I can totally imagine - right now I have a big gap in my ability to actually use WDL. The guide was very helpful in seeing how I can submit/monitor/troubleshoot a workflow here at the Hutch, but figuring out the workflow itself is a huge black box for me.

chapter 4 confusion

hey,

chapter 4 (Fred Hutch customizations) is confusing to me right now, for some inter-related reasons:

a lot of this stuff doesn't make a lot of sense to me YET because I know nothing about the WDL language. Not sure how best to handle that - maybe a chapter 3.5 that covers the very basics of WDL? or simply direct the learner to an external WDL basics tutorial? Also maybe more obviously point me to examples within the diy-cromwell-server/testWorkflows files where we actually use these parameters. I do find them when I go digging for them.
there's a statement in section 4.2: "you can edit these in the config file if you’d like OR you can specify these variables in your runtime block in each task to change only the variables you want to change from the default for that particular task.". Presumably that applies to the standard variables AND the Hutch custom variables? Might make sense to split out discussion of the variables themselves from HOW we can specify those variables.
what is the config file mentioned in 4.2? is it the cromUserConfig.txt file we used when we spun up our cromwell server? or something that's specific to an individual workflow? I think I can see examples in diy-cromwell-server/testWorkflows where you customize within the runtime blocks, but I'm not sure whether there's an example using a config file.
Section 4.4: perhaps split that to a separate chapter? it's not about customizing the workflows. This would be a good location for the links to external WDL docs that I was already suggesting you move from their current location
"We'll discuss some of the available customizations to help you run WDLs on our cluster in a simple way that still allows those workflows to be portable to other computing platforms." If I understand this right, I can have something like partition: "campus-new" in a WDL and it would still run on a non-Hutch system? Or do you mean I could run it using the fh-S3-AWS configuration just as easily as I can on the in-house cluster, but maybe that campus-new setting would make things crash on a totally external system?

In terms of my own learning, I think I kind of understand how to customize, but I think this doc could be laid out more clearly.

thanks!

add instruction for how to merge >1 json?

One thing it took me a minute to figure out is what to do when I have >1 json file. We need to concatenate them for validation (the workflow submission allows you to upload >1 json, but the validation doesn't)

Here's a note I wrote for myself. Could maybe add something like this somewhere in WDL101. Not sure where the best spot for it would be.

To merge two or more json files from the command line, we can use jq

jq -s '.[0] * .[1]' file1.json file2.json > combined.json
jq -s '.[0] * .[1] * .[2]' file1.json file2.json file3.json > combined.json

Maybe there's a better way - I would hope there's a more generalized solution where you don't need to mess with the bit in quotes when you change the number of files you're merging.
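
One possible generalization, sketched in Python rather than jq: fold any number of JSON files into one object, so nothing needs editing when the file count changes. The recursive merge is meant to mimic jq's `*` operator on objects (later files win on conflicting keys); the function names `deep_merge` and `merge_json_files` are hypothetical.

```python
import json
from functools import reduce


def deep_merge(left, right):
    """Recursively merge two dicts; values from `right` win on conflicts."""
    merged = dict(left)
    for key, value in right.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def merge_json_files(paths):
    """Load each JSON file in order and fold them into a single dict."""
    documents = []
    for path in paths:
        with open(path) as handle:
            documents.append(json.load(handle))
    return reduce(deep_merge, documents, {})
```

A typical use would be to call `merge_json_files(["file1.json", "file2.json", "file3.json"])` and write the result to `combined.json` with `json.dump`. This sketch assumes each file holds a JSON object at the top level, as workflow input files normally do.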

broken link

in this section there's a broken link.

It's this one:
"but check SciWiki for updated information"
I get "403 Forbidden" at that link

maybe you want to link to this page? https://sciwiki.fredhutch.org/compdemos/gizmo_partition_index/

add some help to interpret mystery errors

a suggestion for this spot, or nearby

as a new user who makes mistakes, I keep getting a mystery error (see suggestion here). "Error: An error has occurred. Check your logs or contact the app author for clarification".

it's because I don't have an active database connection.

maybe you'll figure out a way to make the error more informative (that would be great), or maybe it's fine to just help us out in the tutorial. Something like 'don't panic if you see this, you probably just need to re-connect to the server'

formatting

in the very last paragraph here, the bullet point formatting didn't come out right in the web-rendered version:
https://hutchdatascience.org/FH_WDL101_Cromwell/introduction.html#using-cromwell

Idea: troubleshooting example

hi,

The troubleshooting I did yesterday to get annovar working (together with an in-person troubleshoot with Amy last month) made me appreciate the value of understanding failures.

How about having a section of the tutorial where we supply a workflow we know isn't going to work?

We make it fail for some very simple reason, perhaps a problem we think will crop up often - maybe input files aren't found. The tutorial could walk the user through how they would troubleshoot that.

Troubleshooting example 1: today I specified the WDL file but forgot to specify the associated JSON input file before I clicked 'submit workflow'. It took me a while to figure out what I did wrong (and yes, now I know how to troubleshoot that, but I think some people would appreciate a walk-through of what to do with that 'submit jobs-troubleshoot' button).

Troubleshooting example 2: let's say the WDL and JSON are fine, but there's something wrong with the WDL code or perhaps the input files aren't available. Some people will figure out the troubleshooting from the 'track jobs' tab just fine, but I bet others would benefit from a walk-through.

what do you think?

Janet

Files that need to be edited upon creating a new course:

README.md - Fill in all the { }.
index.Rmd - title: should be updated.
01-intro.Rmd - replace the information there with information pertinent to this new course.
02-chapter_of_course.Rmd - This Rmd has examples of how to set things up, if you don't need it as a reference, it can be deleted.

Files that need to be edited upon adding each new chapter (including upon creating a new course):

_bookdown.yml - The list of Rmd files that need to be rendered needs to be updated. See instructions.
book.bib - any citations need to be added. See instructions.

Picking a style

See more about customizing style on this page in the guide.
By default this course template will use the jhudsl data science lab style. However, you can customize and switch this to another style set.

Using a style set

Files that need to be edited upon adding new packages that the book's code uses:

docker/Dockerfile needs to have the new package added so it will be installed. See instructions.
The code chunk in index.Rmd should be edited to add the new package.