vertebrateresequencing / wr
High performance Workflow Runner
License: GNU General Public License v3.0
To save on port usage (which may have a small quota for users), spawned servers do not need the web interface port to be open. Solving this probably involves having a special security group for spawned servers, so we can also disallow access from world, restricting to same network.
This likely causes #38.
@dkj reports that if a new server is requested with a delete-on-terminate volume, but wr's own timeout on the server becoming READY is reached, wr's request to delete the server does not result in the deletion of the volume.
Testing with testjobsOpenStackDelta, things ramped up normally but then hit a ceiling of 27 instances (54 cores), and simultaneous jobs actually dropped to 13.
It appears that when failures to contact services via the API occur, there is no backoff in requests, resulting in relatively rapid retries:
$ tail -f ~/.wr_production/log
2017/01/26 16:29:37 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:38 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:39 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:40 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:40 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:41 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
Hosts stopped spawning successfully after 41 workers (each with --disk 500).
wr manager stop, when running with a cloud scheduler, needs to wait more than 15s and have better detection of things working or not.
We need automatic database backup to the configured location, as well as a command for manual backup by the user to a desired location at any time.
Remote mode env vars should be the env vars of the runner, not the manager, since the runner could be on a different OS image (with e.g. a different $HOME) than the manager!
Adding new commands while runners from a previous add are still in LSF seems to result in nothing running, those runners exiting, and no new ones spawning.
Also, we probably want to filter blocks of '\r'-terminated lines (keeping only the first/last), e.g. progress bars...
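For the progress-bar case, filtering could keep only the text after the final carriage return in a block, since that is what a terminal would end up displaying. A minimal sketch of that idea (not wr's actual filtering code):

```go
package main

import (
	"fmt"
	"strings"
)

// lastCRSegment collapses a block of '\r'-terminated updates (e.g. a
// progress bar repeatedly rewriting one line) down to the text a
// terminal would finally display: whatever follows the last '\r'.
func lastCRSegment(line string) string {
	if i := strings.LastIndexByte(line, '\r'); i >= 0 {
		return line[i+1:]
	}
	return line
}

func main() {
	fmt.Println(lastCRSegment("10%\r20%\r100%")) // only the final update survives
}
```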
wr status references wr kick, but this has not yet been implemented. Remove the reference until it is implemented. Kick does a retry on buried jobs.
Likely because many tests rely on tight millisecond timings, the test suite almost never passes while running in Travis. We may need to completely rethink how tests are done.
Example failures follow (line numbers may be out of date).
Since not all schedulers will kill the wr runner when the command it runs goes over its memory/time reservations, the runner itself should track this and issue the kill signal in a friendly LSF-style way, but only if it is actually threatening machine memory.
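The "only when actually threatening the machine" policy could be sketched as a pure decision function: exceeding the reservation alone is tolerated, and the runner escalates only when free machine memory is at risk. All names and thresholds here are hypothetical:

```go
package main

import "fmt"

// shouldKill implements a lenient policy: a command exceeding its
// reservation alone is tolerated; the runner only kills when the
// overage also threatens the machine (less than minFreeBytes of
// machine memory would remain free).
func shouldKill(usedBytes, reservedBytes, machineFreeBytes, minFreeBytes uint64) bool {
	if usedBytes <= reservedBytes {
		return false // within reservation: never kill
	}
	return machineFreeBytes < minFreeBytes
}

func main() {
	// over reservation, but the machine still has plenty free
	fmt.Println(shouldKill(2<<30, 1<<30, 10<<30, 1<<30))
	// over reservation and the machine is nearly out of memory
	fmt.Println(shouldKill(2<<30, 1<<30, 512<<20, 1<<30))
}
```

The actual kill could then follow the usual friendly pattern: SIGTERM first, SIGKILL after a grace period.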
The --disk option for commands should be displayed, and, if we can figure it out, the actual disk usage as well.
There are no tests for the command line or web interfaces. They are required.
Errors and other useful information about what is happening in OpenStack should appear on the status web page. E.g. the user needs to know that their command will never run because:
jobqueue scheduler runCmd error: no OS image with prefix [CentOS 6] was found
After vrpipe add -f testjobs3 -i foo10k, the web interface can end up with a few jobs pending yet 10000 complete; a refresh fixes it.
Need to investigate checkpointing; is there a way to do it automagically? Or at least allow the user to specify an option to turn checkpointing on for particular commands that support it.
http://mediawiki.internal.sanger.ac.uk/index.php/Checkpoint_and_Restart
Only seems to happen with our Precise (and Trusty) images:
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ wr cloud deploy -f '^m.*$' -o 'Ubuntu Precise' --network_dns 172.18.255.1,172.18.255.3 -k 2400
info: please wait while openstack resources are created...
info: please wait while a server is spawned on openstack...
info: please wait while I start 'wr manager' on the openstack server at 172.27.82.221...
info: wr manager remotely started on wr-production-830a25c6-c9ee-480b-8735-e38bc2731c75:47353
info: wr's web interface can be reached locally at http://localhost:47354
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ wr version
v0.5.0-5-g7eccab5
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ echo w | wr add -r 0 -m 16G --cpus 16
warning: command working directories defaulting to /tmp since the manager is running remotely
info: Added 1 new commands (0 were duplicates) to the queue using default identifier 'manually_added'
Works okay if I use Xenial instead.
(Being forced to stick with Precise for the moment due to threading problems with other parts of my stack on Xenial)
Currently the timeout on new servers is 20mins, to allow for what can sometimes become very slow volume creation. If we have waited longer than 2 mins, however, we should only continue waiting if we can confirm a volume is being created and is in a normal state.
If hosts are spawned using the --disk option, more checking is required before reuse; options include creating a scratch directory with mktemp -d -t <prefix>
(bad dockers writing files as root will cause this to fail).

Cobra is used for the command-line frontends, which apparently supports bash completion of the sub-commands. The documentation is poor, however, and I couldn't figure it out.
https://github.com/spf13/cobra/blob/master/bash_completions.md
After a vrpipe add -f testjobs5 -i foofails and hitting retry, I managed to end up with 1 incomplete job stuck in pending state: the manager somehow lost track of the counts and didn't spawn a runner! 3 were delayed, 2 completed, 1 stuck in pending.
For now it can be recovered by manually running 'vrpipe runner', but this is not acceptable!
A wr cloud deployed manager should try to sync up its database with the local one somehow, while the local is connected. It should also try to back up its database to the configured persistent cloud storage (e.g. S3-like) in OpenStack mode.
Our OpenStack flavors mostly have only 20TB volumes - how about specifying a minimum disk size (for "boot from image, copy to volume" if the flavor's disk is less than that requested)? I am anticipating wanting between 40TB and 300TB for temporary space.... Perhaps there is a better way of achieving this?
Once wr cloud deploy -m 15 has been used, it would be useful to be able to change your mind and alter the max servers without having to do a teardown.
This would need to be some sort of command proxied through to the remote manager that got it to alter its scheduler settings.
Currently wr cloud deploy works by just copying the local wr executable over to the remote host. However, this will fail if it's a macOS wr executable and the remote OS is Linux. We need an alternative way of doing things; not sure what that might be, other than attempting a download or build of wr from GitHub on the remote.
We should not be showing previous run stats like Std*, peak mem and exit code while a job is in the running state.
Testing with testjobsSmall, cmds 2, 4 and 7 get stuck pending. These need additional servers to be spawned, and they do get spawned (more or less; the volume does not seem to get made for 7). They just don't get run. This is repeatable.
The status webpage should make buried and delayed commands editable so you can override desired environment variables and resource requirements such as memory, time and cpu.
Building wr from source, I get this error message with a fresh binary install of Go:
$ go version
go version go1.7.5 linux/amd64
[kdj@sf2-farm-srv2 go]$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/nfs/users/nfs_k/kdj/dev/external/go"
GORACE=""
GOROOT="/nfs/users/nfs_k/kdj/.local/lib/go"
GOTOOLDIR="/nfs/users/nfs_k/kdj/.local/lib/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build000434645=/tmp/go-build"
CXX="g++"
CGO_ENABLED="1"
$ go get -u -d -tags netgo github.com/VertebrateResequencing/wr
package github.com/pivotal-golang/bytefmt: code in directory /nfs/users/nfs_k/kdj/dev/external/go/src/github.com/pivotal-golang/bytefmt expects import "code.cloudfoundry.org/bytefmt"
Looks like a change to the canonical import path.
If the manager panics for any reason, it drops dead silently with nothing in the log. It would be nice to somehow implement a catch-all to get the panic into the log file.
Some user commands may need to use some limited external resource, and so the user may need to limit the number of commands accessing that resource at once, or specify a delay between the starting of commands that use the resource. Example resources are databases or large data downloads etc.
We could have something like a "resource token name" option on commands added with wr add, and a new wr token command to set max simultaneous and/or delay on tokens, with the ability to specify an arbitrary command to run that exits non-zero if resources are depleted and a new command shouldn't start.
Potentially you could spawn a token manager at a given port, which maintains its own separate bolt db, for use by multiple people, and have a special option that uses that separate token manager. Maybe wr manager start defaults to using an internal token system, with the option of specifying the port and IP of an external one. For sanity, the external token server shouldn't auto-start, but should need to be manually brought up before wr manager start --token_server_ip ... will work. You'd have an admin bring it up somewhere central.
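Internally, a token with a max-simultaneous limit could be as simple as a buffered channel used as a counting semaphore, with the optional start delay applied on acquire. A single-goroutine sketch of the idea, not a proposed API:

```go
package main

import (
	"fmt"
	"time"
)

// token limits how many commands may hold it at once, and can impose
// a minimum delay between successive acquisitions.
type token struct {
	slots chan struct{}
	delay time.Duration
	last  time.Time
}

func newToken(max int, delay time.Duration) *token {
	return &token{slots: make(chan struct{}, max), delay: delay}
}

// Acquire blocks until a slot is free, then enforces the inter-start delay.
func (t *token) Acquire() {
	t.slots <- struct{}{}
	if wait := t.delay - time.Since(t.last); wait > 0 {
		time.Sleep(wait)
	}
	t.last = time.Now()
}

// Release frees a slot for the next command.
func (t *token) Release() { <-t.slots }

func main() {
	tok := newToken(2, 0)
	tok.Acquire()
	tok.Acquire()
	fmt.Println(len(tok.slots)) // both slots in use; a third Acquire would block
	tok.Release()
	fmt.Println(len(tok.slots))
}
```

A real implementation would need to make the delay bookkeeping concurrency-safe and persist state in the bolt db.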
You should be able to search for commands by RepGroup or any DepGroup, or by the command line itself. We need to implement paging of these results to look through them all, and be able to do something to all of them (like restart them). Paging should also apply to the normal status bars, letting you see all of the e.g. "running" commands instead of just a random one of them.
wr manager drain in local mode continues to spawn new runners afterwards. It shouldn't.
Jobs using the --disk option can appear to be running when the device has filled.
Once the manager learns how much memory a command uses, it is supposed to kill the initial job array in LSF and bsub a new one with lower memory reservation. But this does not always happen.
Refreshing the status webpage in the middle of progress bars updating on e.g. wr add -f testjobsDeps4 results in random total numbers of ~250 on each step, instead of 200.
If doing a repeat set of jobs on testjobsDeps3, letting ~1000 complete and refreshing leaves the final complete count ~1000 less than reality. Actually, it is the 4000 complete that is wrong, since as repeats the final total should be 2000 complete. Numbers still aren't quite right, and random numbers can be left stuck in various states.
On a fresh db, this resulted in step1 having expected 2000 complete, and other steps having 10-49 extra complete.
Refreshing in the middle of updating can also seem to break the scheduler, resulting in real stuck pending jobs, requiring a manual wr runner to fix.
There should be a rerun button on any completed command that gets displayed.
Implement "proper" workflows by adding CWL support, which would parse the CWL and resolve it down into the existing way of adding commands with dependencies.
Possibly useful:
https://github.com/robertkrimen/otto javascript in go
https://github.com/sbinet/go-python python bindings in go
http://spikeekips.tumblr.com/post/97743913387/embedding-python-in-golang-part-2 example of above being used (https://github.com/spikeekips/embedding-python-in-golang)
https://github.com/MG-RAST/AWE is written in Go and may have CWL support; the CWL webpage points to this fork as having alpha CWL support: https://github.com/wgerlach/AWE, specifically https://github.com/wgerlach/AWE/tree/master/lib/core/cwl
You can test conformance of a cwl runner with:
https://github.com/common-workflow-language/cwltest
Example workflows:
https://github.com/common-workflow-language/workflows/
I have this cloned to ~/src/git/cwl-example-workflows
Images can specify a minimum disk requirement. If this is higher than the disk of a flavor that gets picked, we get a generic error from gophercloud and deployment fails. It needs to be taken into account when picking a flavor.
All arguments to wr cloud deploy and wr manager start should have defaults coming from the config file, so they don't have to be specified on the command line all the time.
wr cloud deploy -f '^m.*$' -o 'Ubuntu Xenial'
echo 'sleep 120' | wr add
echo 'sleep 120 && echo 2' | wr add
Results in 1 running, 1 pending on a 2 core m1.small.
If I keep adding more and more it eventually spawns a new m1.small.
echo -e 'sleep 120\nsleep 120 && echo 2' | wr add
Works as expected, with both commands starting immediately.
Both the command line and web interfaces should be restricted to the user who started the manager (or a configurable set of people, or people in unix groups). This involves making the web interface use an SSL cert and go https.
The latter is important because the web interface displays environment variables, which may contain the user's passwords to something.
wr is not currently fully compatible with macOS due to the expectation of /proc/meminfo. Detect macOS and use whatever memory info system it has instead.
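Detection could branch on runtime.GOOS, reading /proc/meminfo on Linux and falling back to sysctl hw.memsize on macOS (darwin), where /proc does not exist. A hedged sketch:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"os/exec"
	"runtime"
	"strconv"
	"strings"
)

// totalMemBytes returns total system memory: from /proc/meminfo on
// Linux, or via `sysctl -n hw.memsize` on macOS, which has no /proc.
func totalMemBytes() (uint64, error) {
	if runtime.GOOS == "darwin" {
		out, err := exec.Command("sysctl", "-n", "hw.memsize").Output()
		if err != nil {
			return 0, err
		}
		return strconv.ParseUint(strings.TrimSpace(string(out)), 10, 64)
	}
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		// /proc/meminfo reports "MemTotal: <n> kB"
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			return kb * 1024, err
		}
	}
	return 0, fmt.Errorf("MemTotal not found in /proc/meminfo")
}

func main() {
	n, err := totalMemBytes()
	fmt.Println(n > 0, err == nil)
}
```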
Hosts aren't shutting down when jobs complete.
Using v0.5, keepalive default of 120 sec.
Currently only the OpenStack scheduler pays attention to the disk requirement of commands by spawning a server with a large enough disk. We should also make this part of the LSF reservation, and check disk locally for local mode.
Is it possible to reliably do something useful to learn disk space usage? We can go part-way there for commands with a $wr_output:/.../file specifier (when that gets implemented)?
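For local mode, the free-disk check could use statfs on the working directory before accepting a command's --disk requirement. A sketch, using the field names of Linux's syscall.Statfs_t:

```go
package main

import (
	"fmt"
	"syscall"
)

// availDiskBytes reports the bytes available to unprivileged users on
// the filesystem containing path, so the local scheduler could refuse
// commands whose --disk requirement cannot be met.
func availDiskBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	// available blocks (for non-root) times the block size
	return st.Bavail * uint64(st.Bsize), nil
}

func main() {
	n, err := availDiskBytes("/")
	fmt.Println(n > 0, err == nil)
}
```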
Keiran reports that the memory usage reported for his jobs is wrong when running in OpenStack; may be related to use of Docker?
Jobs report a peak of ~700MB RAM, but on the running host it's more like 7-8GB:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1493 ubuntu 20 0 95524 2784 1636 S 0.0 0.0 0:00.08 sshd: ubuntu@notty
1500 ubuntu 20 0 22108 5152 888 S 0.0 0.0 0:43.28 /tmp/wr runner -q cmds -s 16384:60:8:
19710 ubuntu 20 0 11248 1808 1616 S 0.0 0.0 0:00.00 bash -c echo '{ "reference": { "path"
19711 ubuntu 20 0 11312 1980 1700 S 0.0 0.0 0:00.00 bash /usr/local/bin/dockstore tool la
19740 ubuntu 20 0 19.198g 1.104g 8020 S 0.0 1.8 19:44.48 java io.dockstore.client.cli.Client t
19859 ubuntu 20 0 69888 30496 4928 S 0.0 0.0 0:00.63 /usr/bin/python /usr/local/bin/cwltoo
19882 ubuntu 20 0 382048 15340 10160 S 0.0 0.0 0:01.03 docker run -i --volume=/tmp/./datasto
19954 ubuntu 20 0 18052 2828 2564 S 0.0 0.0 0:00.13 /bin/bash /opt/wtsi-cgp/bin/mapping.s
19989 ubuntu 20 0 4364 652 580 S 0.0 0.0 0:00.00 /usr/bin/time -f command:%C\nreal:%e
19990 ubuntu 20 0 158312 35384 4600 S 0.0 0.1 0:00.54 /usr/bin/perl /opt/wtsi-cgp/bin/bwa_m
20016 ubuntu 20 0 4508 844 764 S 0.0 0.0 0:00.00 sh -c /usr/bin/time /var/spool/cwl/tm
20017 ubuntu 20 0 4232 624 556 S 0.0 0.0 0:00.00 /usr/bin/time /var/spool/cwl/tmpMap_L
20018 ubuntu 20 0 9576 2536 2336 S 0.0 0.0 0:00.00 /bin/bash /var/spool/cwl/tmpMap_LP600
20019 ubuntu 20 0 14.027g 5.985g 3696 S 691.7 9.5 2110:14 /opt/wtsi-cgp/bin/bwa mem -Y -K 10000
20020 ubuntu 20 0 7652 4040 1228 S 6.3 0.0 3:58.94 /opt/wtsi-cgp/bin/reheadSQ -d /var/sp
20021 ubuntu 20 0 1254028 1.045g 8144 R 55.1 1.7 21:41.23 /opt/wtsi-cgp/bin/bamsort fixmate=1 i
If you mistype your password when sourcing the OpenStack API file and then call teardown, you get an authentication error, as you would expect. This results in having to do a --force teardown.
Could the command be a little more resilient and check for auth before going into an unrecoverable state?
Use case: this could also catch user error, as if the password is wrong you may be attempting to tear down the wrong tenant.
We need a way to say "don't make this dependency live" during wr add, and a way to turn this property on and off for selected (by rep_group or dep_group etc.) cmds at any time (the equivalent of de/reactivating setups in vrpipe). Stuff would still come up as "dependent" in the web interface, but instead of starting, the user would be able to dismiss the re-run, which would dismiss it in all downstream ones as well.
It appears that when submitting a large number of jobs requiring the --disk option, as the system scales up wr abandons the volume request and starts a new one.
This results in unattached volumes periodically during the scale-up process, which can eat the tenant allocation.
Seems to be reproducible with 1.5TB volumes.