vertebrateresequencing / wr
High performance Workflow Runner
License: GNU General Public License v3.0
To save on port usage (which may have a small quota for users), spawned servers do not need the web interface port to be open. Solving this probably involves having a special security group for spawned servers, so we can also disallow access from world, restricting to same network.
This likely causes #38.
@dkj reports that if a new server is requested with a delete-on-terminate volume, but wr's own timeout on the server becoming READY is reached, wr's request to delete the server does not result in the deletion of the volume.
Testing with testjobsOpenStackDelta, things ramped up normally but then hit a ceiling of 27 instances (54 cores), and simultaneous jobs actually dropped to 13.
It appears that when failures to contact services via the API occur, there is no backoff in requests, resulting in relatively rapid retries:
$ tail -f ~/.wr_production/log
2017/01/26 16:29:37 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:38 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:39 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:40 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:40 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:41 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
Hosts stopped spawning successfully after 41 workers (each with --disk 500).
wr manager stop, when running with a cloud scheduler, needs to wait more than 15s and have better detection of things working or not.
We need automatic database backup to the configured location, as well as a command for manual backup by the user to a desired location at any time.
Remote mode env vars should be the env vars of the runner, not the manager, since the runner could be on a different OS image (with e.g. a different $HOME) than the manager!
Adding new commands while runners from a previous add are still in LSF seems to result in nothing running, those runners exiting, and no new ones spawning.
Also, we probably want to filter blocks of '\r'-terminated lines (keeping only the first/last), e.g. progress bars...
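For the progress-bar case, filtering could keep only the text after the final carriage return in a block, since that is what a terminal would end up displaying. A minimal sketch of that idea (not wr's actual filtering code):

```go
package main

import (
	"fmt"
	"strings"
)

// lastCRSegment collapses a block of '\r'-terminated updates (e.g. a
// progress bar repeatedly rewriting one line) down to the text a
// terminal would finally display: whatever follows the last '\r'.
func lastCRSegment(line string) string {
	if i := strings.LastIndexByte(line, '\r'); i >= 0 {
		return line[i+1:]
	}
	return line
}

func main() {
	fmt.Println(lastCRSegment("10%\r20%\r100%")) // only the final update survives
}
```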
wr status references wr kick, but this has not yet been implemented. Remove the reference until it is implemented. Kick does a retry on buried jobs.
Likely because many tests rely on tight millisecond timings, the test suite almost never passes while running in Travis. We may need to completely rethink how tests are done.
Example failures follow (line numbers may be out of date).
Since not all schedulers will kill the wr runner when the command it runs goes over its memory/time reservations, the runner itself should track this and issue the kill signal in a friendly LSF-style way, but only if it is actually threatening machine memory.
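The "only when actually threatening the machine" policy could be sketched as a pure decision function: exceeding the reservation alone is tolerated, and the runner escalates only when free machine memory is at risk. All names and thresholds here are hypothetical:

```go
package main

import "fmt"

// shouldKill implements a lenient policy: a command exceeding its
// reservation alone is tolerated; the runner only kills when the
// overage also threatens the machine (less than minFreeBytes of
// machine memory would remain free).
func shouldKill(usedBytes, reservedBytes, machineFreeBytes, minFreeBytes uint64) bool {
	if usedBytes <= reservedBytes {
		return false // within reservation: never kill
	}
	return machineFreeBytes < minFreeBytes
}

func main() {
	// over reservation, but the machine still has plenty free
	fmt.Println(shouldKill(2<<30, 1<<30, 10<<30, 1<<30))
	// over reservation and the machine is nearly out of memory
	fmt.Println(shouldKill(2<<30, 1<<30, 512<<20, 1<<30))
}
```

The actual kill could then follow the usual friendly pattern: SIGTERM first, SIGKILL after a grace period.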
The --disk option for commands should be displayed, and, if we can figure it out, the actual disk usage as well.
There are no tests for the command line or web interfaces. They are required.
Errors and other useful information about what is happening in OpenStack should appear on the status web page. E.g. the user needs to know that their command will never run because:
jobqueue scheduler runCmd error: no OS image with prefix [CentOS 6] was found
After vrpipe add -f testjobs3 -i foo10k, the web interface can end up with a few jobs pending yet 10000 complete; a refresh fixes it.
Need to investigate checkpointing; is there a way to do it automagically? Or at least allow the user to specify an option to turn checkpointing on for particular commands that support it.
http://mediawiki.internal.sanger.ac.uk/index.php/Checkpoint_and_Restart
Only seems to happen with our Precise (and Trusty) images:
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ wr cloud deploy -f '^m.*$' -o 'Ubuntu Precise' --network_dns 172.18.255.1,172.18.255.3 -k 2400
info: please wait while openstack resources are created...
info: please wait while a server is spawned on openstack...
info: please wait while I start 'wr manager' on the openstack server at 172.27.82.221...
info: wr manager remotely started on wr-production-830a25c6-c9ee-480b-8735-e38bc2731c75:47353
info: wr's web interface can be reached locally at http://localhost:47354
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ wr version
v0.5.0-5-g7eccab5
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ echo w | wr add -r 0 -m 16G --cpus 16
warning: command working directories defaulting to /tmp since the manager is running remotely
info: Added 1 new commands (0 were duplicates) to the queue using default identifier 'manually_added'
Works okay if I use Xenial instead.
(Being forced to stick with Precise for the moment due to threading problems with other parts of my stack on Xenial)
Currently the timeout on new servers is 20mins, to allow for what can sometimes become very slow volume creation. If we have waited longer than 2 mins, however, we should only continue waiting if we can confirm a volume is being created and is in a normal state.
If hosts are spawned using the --disk option, more checking is required before reuse; options include creating a scratch directory with mktemp -d -t <prefix>
(bad dockers writing files as root will cause this to fail).

Cobra is used for the command-line frontends, which apparently supports bash completion of the sub-commands. The documentation is poor, however, and I couldn't figure it out.
https://github.com/spf13/cobra/blob/master/bash_completions.md
After a vrpipe add -f testjobs5 -i foofails and hitting retry, I managed to end up with 1 incomplete job stuck in pending state: the manager somehow lost track of the counts and didn't spawn a runner! 3 were delayed, 2 completed, 1 stuck in pending.
For now it can be recovered by manually running 'vrpipe runner', but this is not acceptable!
A wr cloud deployed manager should try to sync up its database with the local one somehow, while the local is connected. It should also try to back up its database to the configured persistent cloud storage (e.g. S3-like) in OpenStack mode.
Our OpenStack flavors mostly have only 20TB volumes - how about specifying a minimum disk size (for "boot from image, copy to volume" if the flavor's disk is less than that requested)? I am anticipating wanting between 40TB and 300TB for temporary space.... Perhaps there is a better way of achieving this?
Once wr cloud deploy -m 15 has been used, it would be useful to be able to change your mind and alter the max servers without having to do a teardown.
This would need to be some sort of command proxied through to the remote manager that got it to alter its scheduler settings.
Currently wr cloud deploy works by just copying the local wr executable over to the remote host. However, this will fail if it's a macOS wr executable and the remote OS is Linux. We need an alternative way of doing things; not sure what that might be, other than attempting a download or build of wr from GitHub on the remote.
We should not be showing previous run stats like Std*, peak mem and exit code while a job is in the running state.
Testing with testjobsSmall, cmds 2, 4 and 7 get stuck pending. These need additional servers to be spawned, and they do get spawned (more or less; the volume does not seem to get made for 7). They just don't get run. This is repeatable.
The status webpage should make buried and delayed commands editable so you can override desired environment variables and resource requirements such as memory, time and cpu.
Building wr from source, I get this error message with a fresh binary install of Go:
$ go version
go version go1.7.5 linux/amd64
[kdj@sf2-farm-srv2 go]$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/nfs/users/nfs_k/kdj/dev/external/go"
GORACE=""
GOROOT="/nfs/users/nfs_k/kdj/.local/lib/go"
GOTOOLDIR="/nfs/users/nfs_k/kdj/.local/lib/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build000434645=/tmp/go-build"
CXX="g++"
CGO_ENABLED="1"
$ go get -u -d -tags netgo github.com/VertebrateResequencing/wr
package github.com/pivotal-golang/bytefmt: code in directory /nfs/users/nfs_k/kdj/dev/external/go/src/github.com/pivotal-golang/bytefmt expects import "code.cloudfoundry.org/bytefmt"
Looks like a change to the canonical import path.
If the manager panics for any reason, it drops dead silently with nothing in the log. It would be nice to somehow implement a catch-all to get the panic into the log file.
Some user commands may need to use some limited external resource, and so the user may need to limit the number of commands accessing that resource at once, or specify a delay between the starting of commands that use the resource. Example resources are databases or large data downloads etc.
We could have something like a "resource token name" option on commands added with wr add, and a new wr token command to set max simultaneous and/or delay on tokens, with the ability to specify an arbitrary command to run that exits non-zero if resources are depleted and a new command shouldn't start.
Potentially you could spawn a token manager at a given port, which maintains its own separate bolt db, for use by multiple people, and have a special option that uses that separate token manager. Maybe wr manager start defaults to using an internal token system, with the option of specifying the port and IP of an external one. For sanity, the external token server shouldn't auto-start, but should need to be manually brought up before wr manager start --token_server_ip ... will work. You'd have an admin bring it up somewhere central.
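Internally, a token with a max-simultaneous limit could be as simple as a buffered channel used as a counting semaphore, with the optional start delay applied on acquire. A single-goroutine sketch of the idea, not a proposed API:

```go
package main

import (
	"fmt"
	"time"
)

// token limits how many commands may hold it at once, and can impose
// a minimum delay between successive acquisitions.
type token struct {
	slots chan struct{}
	delay time.Duration
	last  time.Time
}

func newToken(max int, delay time.Duration) *token {
	return &token{slots: make(chan struct{}, max), delay: delay}
}

// Acquire blocks until a slot is free, then enforces the inter-start delay.
func (t *token) Acquire() {
	t.slots <- struct{}{}
	if wait := t.delay - time.Since(t.last); wait > 0 {
		time.Sleep(wait)
	}
	t.last = time.Now()
}

// Release frees a slot for the next command.
func (t *token) Release() { <-t.slots }

func main() {
	tok := newToken(2, 0)
	tok.Acquire()
	tok.Acquire()
	fmt.Println(len(tok.slots)) // both slots in use; a third Acquire would block
	tok.Release()
	fmt.Println(len(tok.slots))
}
```

A real implementation would need to make the delay bookkeeping concurrency-safe and persist state in the bolt db.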
You should be able to search for commands by RepGroup or any DepGroup, or by the command line itself. We need to implement paging of these results to look through them all, and be able to do something to all of them (like restart them). Paging should also apply to the normal status bars, letting you see all of the e.g. "running" commands instead of just a random one of them.
wr manager drain in local mode continues to spawn new runners afterwards. It shouldn't.
Jobs using the --disk option can appear to be running when the device has filled.
Once the manager learns how much memory a command uses, it is supposed to kill the initial job array in LSF and bsub a new one with lower memory reservation. But this does not always happen.
Refreshing the status webpage in the middle of progress bars updating on e.g. wr add -f testjobsDeps4 results in random total numbers of ~250 on each step, instead of 200.
If doing a repeat set of jobs on testjobsDeps3, letting ~1000 complete and refreshing leaves the final complete count ~1000 less than reality. Actually, it is the 4000 complete that is wrong, since as repeats the final total should be 2000 complete. Numbers still aren't quite right, and random numbers can be left stuck in various states.
On a fresh db, this resulted in step1 having expected 2000 complete, and other steps having 10-49 extra complete.
Refreshing in the middle of updating can also seem to break the scheduler, resulting in real stuck pending jobs, requiring a manual wr runner to fix.
There should be a rerun button on any completed command that gets displayed.
Implement "proper" workflows by adding CWL support, which would parse the CWL and resolve it down into the existing way of adding commands with dependencies.
Possibly useful:
https://github.com/robertkrimen/otto javascript in go
https://github.com/sbinet/go-python python bindings in go
http://spikeekips.tumblr.com/post/97743913387/embedding-python-in-golang-part-2 example of above being used (https://github.com/spikeekips/embedding-python-in-golang)
https://github.com/MG-RAST/AWE is written in Go and may have CWL support; the CWL webpage points to this fork as having alpha CWL support: https://github.com/wgerlach/AWE, specifically https://github.com/wgerlach/AWE/tree/master/lib/core/cwl
You can test conformance of a cwl runner with:
https://github.com/common-workflow-language/cwltest
Example workflows:
https://github.com/common-workflow-language/workflows/
I have this cloned to ~/src/git/cwl-example-workflows
Images can specify a minimum disk requirement. If this is higher than the disk of a flavor that gets picked, we get a generic error from gophercloud and deployment fails. It needs to be taken into account when picking a flavor.
All arguments to wr cloud deploy and wr manager start should have defaults coming from the config file, so they don't have to be specified on the command line all the time.
wr cloud deploy -f '^m.*$' -o 'Ubuntu Xenial'
echo 'sleep 120' | wr add
echo 'sleep 120 && echo 2' | wr add
Results in 1 running, 1 pending on a 2 core m1.small.
If I keep adding more and more it eventually spawns a new m1.small.
echo -e 'sleep 120\nsleep 120 && echo 2' | wr add
Works as expected, with both commands starting immediately.
Both the command line and web interfaces should be restricted to the user who started the manager (or a configurable set of people, or people in unix groups). This involves making the web interface use an SSL cert and go https.
The latter is important because the web interface displays environment variables, which may contain the user's passwords to something.
wr is not currently fully compatible with macOS due to the expectation of /proc/meminfo. Detect macOS and use whatever memory info system it has instead.
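Detection could branch on runtime.GOOS, reading /proc/meminfo on Linux and falling back to sysctl hw.memsize on macOS (darwin), where /proc does not exist. A hedged sketch:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"runtime"
	"strconv"
	"strings"
)

// totalMemBytes returns total system memory: from /proc/meminfo on
// Linux, or via `sysctl -n hw.memsize` on macOS, which has no /proc.
func totalMemBytes() (uint64, error) {
	if runtime.GOOS == "darwin" {
		out, err := exec.Command("sysctl", "-n", "hw.memsize").Output()
		if err != nil {
			return 0, err
		}
		return strconv.ParseUint(strings.TrimSpace(string(out)), 10, 64)
	}
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		// /proc/meminfo reports "MemTotal: <n> kB"
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			return kb * 1024, err
		}
	}
	return 0, fmt.Errorf("MemTotal not found in /proc/meminfo")
}

func main() {
	n, err := totalMemBytes()
	fmt.Println(n > 0, err == nil)
}
```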
Hosts aren't shutting down when jobs complete.
Using v0.5, keepalive default of 120 sec.
Currently only the OpenStack scheduler pays attention to the disk requirement of commands by spawning a server with a large enough disk. We should also make this part of the LSF reservation, and check disk locally for local mode.
Is it possible to reliably do something useful to learn disk space usage? We can go part-way there for commands with a $wr_output:/.../file specifier (when that gets implemented)?
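For local mode, the free-disk check could use statfs on the working directory before accepting a command's --disk requirement. A sketch, using the field names of Linux's syscall.Statfs_t:

```go
package main

import (
	"fmt"
	"syscall"
)

// availDiskBytes reports the bytes available to unprivileged users on
// the filesystem containing path, so the local scheduler could refuse
// commands whose --disk requirement cannot be met.
func availDiskBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	// available blocks (for non-root) times the block size
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	n, err := availDiskBytes("/")
	fmt.Println(n > 0, err == nil)
}
```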
Keiran reports that the memory usage reported for his jobs is wrong when running in OpenStack; may be related to use of Docker?
Jobs report a peak of ~700MB RAM, but on the running host it's more like 7-8GB:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1493 ubuntu 20 0 95524 2784 1636 S 0.0 0.0 0:00.08 sshd: ubuntu@notty
1500 ubuntu 20 0 22108 5152 888 S 0.0 0.0 0:43.28 /tmp/wr runner -q cmds -s 16384:60:8:
19710 ubuntu 20 0 11248 1808 1616 S 0.0 0.0 0:00.00 bash -c echo '{ "reference": { "path"
19711 ubuntu 20 0 11312 1980 1700 S 0.0 0.0 0:00.00 bash /usr/local/bin/dockstore tool la
19740 ubuntu 20 0 19.198g 1.104g 8020 S 0.0 1.8 19:44.48 java io.dockstore.client.cli.Client t
19859 ubuntu 20 0 69888 30496 4928 S 0.0 0.0 0:00.63 /usr/bin/python /usr/local/bin/cwltoo
19882 ubuntu 20 0 382048 15340 10160 S 0.0 0.0 0:01.03 docker run -i --volume=/tmp/./datasto
19954 ubuntu 20 0 18052 2828 2564 S 0.0 0.0 0:00.13 /bin/bash /opt/wtsi-cgp/bin/mapping.s
19989 ubuntu 20 0 4364 652 580 S 0.0 0.0 0:00.00 /usr/bin/time -f command:%C\nreal:%e
19990 ubuntu 20 0 158312 35384 4600 S 0.0 0.1 0:00.54 /usr/bin/perl /opt/wtsi-cgp/bin/bwa_m
20016 ubuntu 20 0 4508 844 764 S 0.0 0.0 0:00.00 sh -c /usr/bin/time /var/spool/cwl/tm
20017 ubuntu 20 0 4232 624 556 S 0.0 0.0 0:00.00 /usr/bin/time /var/spool/cwl/tmpMap_L
20018 ubuntu 20 0 9576 2536 2336 S 0.0 0.0 0:00.00 /bin/bash /var/spool/cwl/tmpMap_LP600
20019 ubuntu 20 0 14.027g 5.985g 3696 S 691.7 9.5 2110:14 /opt/wtsi-cgp/bin/bwa mem -Y -K 10000
20020 ubuntu 20 0 7652 4040 1228 S 6.3 0.0 3:58.94 /opt/wtsi-cgp/bin/reheadSQ -d /var/sp
20021 ubuntu 20 0 1254028 1.045g 8144 R 55.1 1.7 21:41.23 /opt/wtsi-cgp/bin/bamsort fixmate=1 i
If you mistype your password when sourcing the OpenStack API file and then call teardown, you get an authentication error, as you would expect. This results in having to do a --force teardown.
Could the command be a little more resilient and check for auth before going into an unrecoverable state?
Use case: this could also catch user error, as if the password is wrong you may be attempting to tear down the wrong tenant.
We need a way to say "don't make this dependency live" during wr add, and a way to turn this property on and off for selected (by rep_group or dep_group etc.) cmds at any time (the equivalent of de/reactivating setups in vrpipe). Stuff would still come up as "dependent" in the web interface, but instead of starting, the user would be able to dismiss the re-run, which would dismiss it in all downstream ones as well.
It appears that when submitting a large number of jobs requiring the --disk option, as the system scales up wr abandons the volume request and starts a new one.
This results in unattached volumes periodically during the scale-up process, which can eat the tenant allocation.
Seems to be reproducible with 1.5TB volumes.