
wr's People

Contributors

ac55-sanger, dkj, kjsanger, sb10, theobarberbany

wr's Issues

OpenStack scheduler: spawned servers only need 2 ports

To save on port usage (users may have only a small port quota), spawned servers do not need the web interface port to be open. Solving this probably involves having a special security group for spawned servers, so that we can also disallow access from the outside world, restricting access to the same network.

This likely causes #38.

OpenStack scheduler: back-off on new server spawns on encountering failures

It appears that when failures occur while contacting services via the API, there is no back-off in requests, resulting in relatively rapid retries:

$ tail -f ~/.wr_production/log
2017/01/26 16:29:37 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:38 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:39 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:40 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:40 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}
2017/01/26 16:29:41 jobqueue scheduler runCmd error: Expected HTTP response code [200 202] when accessing [POST http://172.27.66.32:8774/v2.1/5377aa07ba904065b10bfceeda021fdc/os-volumes_boot], but got 403 instead
{"forbidden": {"message": "Maximum number of ports exceeded", "code": 403}}

Hosts stopped spawning successfully after 41 workers (each with --disk 500).
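
A minimal sketch of the kind of back-off that could wrap the spawn attempt (the function names and delay values below are illustrative assumptions, not wr's actual code):

package scheduler

import (
    "fmt"
    "log"
    "time"
)

// spawnServer stands in for the real OpenStack server-spawning call.
func spawnServer() error { return fmt.Errorf("403: Maximum number of ports exceeded") }

// spawnWithBackoff retries with exponentially increasing delays instead of
// hammering the API every second as in the log above.
func spawnWithBackoff(maxAttempts int) error {
    delay := time.Second
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        err := spawnServer()
        if err == nil {
            return nil
        }
        log.Printf("spawn attempt %d failed: %s; retrying in %s", attempt, err, delay)
        time.Sleep(delay)
        delay *= 2 // double the wait each time...
        if delay > 5*time.Minute {
            delay = 5 * time.Minute // ...up to a cap
        }
    }
    return fmt.Errorf("giving up after %d failed spawn attempts", maxAttempts)
}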

Manager: implement database backup

We need automatic database backup to the configured location, as well as a command for manual backup by the user to a desired location at any time.
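
Since the manager keeps its state in a bolt db (as mentioned elsewhere in these issues), a hot backup could be as simple as copying a consistent snapshot inside a read transaction; a minimal sketch, with the function name assumed:

package manager

import (
    "os"

    "github.com/boltdb/bolt"
)

// backupDB writes a consistent snapshot of the live database to path; bolt
// allows this inside a read-only transaction, so the manager can keep
// running while the backup is taken.
func backupDB(db *bolt.DB, path string) error {
    return db.View(func(tx *bolt.Tx) error {
        return tx.CopyFile(path, os.FileMode(0600))
    })
}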

kick should be implemented

wr status references wr kick, but kick has not yet been implemented. Remove the reference until it is implemented. Kick retries buried jobs.

Test suite fails randomly in Travis

Likely because many tests rely on tight millisecond timings, the test suite almost never passes when run in Travis. We may need to completely rethink how the tests are done.

Example failures follow (line numbers may be out of date).

  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 1389:
    Expected: true
    Actual: false
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 799:
    Expected: nil
    Actual: 'command [sleep 0.1 && true | true] finished running, but will need to be rerun due to a jobqueue server error: receive time out'
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 1134:
    Expected: nil
    Actual: 'receive time out'
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 418:
    Expected: true
    Actual: false
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 945:
    Expected: 'print'
    Actual: ''
    (Should be equal)
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 1039:
    Expected: nil
    Actual: 'command [perl -e 'for (1..3) { sleep(1) }'] finished running, but will need to be rerun due to a jobqueue server error: receive time out'
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 1279:
    Expected: nil
    Actual: 'command [echo added] finished running, but will need to be rerun due to a jobqueue server error: receive time out'
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 513:
    Expected: nil
    Actual: 'listen tcp 0.0.0.0:11301: bind: address already in use'
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 278:
    Expected: nil
    Actual: 'receive time out'
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 346:
    Expected: '9500'
    Actual: '100'
    (Should be equal)
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 245:
    Expected '' to NOT be nil (but it was)!
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/jobqueue_test.go
    Line 1361:
    Expected: nil
    Actual: 'receive time out'
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/scheduler/scheduler_test.go
    Line 119:
    Expected: false
    Actual: true
  • /home/travis/gopath/src/github.com/VertebrateResequencing/wr/jobqueue/scheduler/scheduler_test.go
    Line 113:
    Expected: true
    Actual: false

Runner should kill commands that go over reserved resources

Since not all schedulers will kill the wr runner when the command it runs goes over its memory/time reservations, the runner itself should track this and issue the kill signal in a friendly, LSF-like way, but only if machine memory is actually threatened.
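
A rough sketch of the idea; currentRSS and machineMemoryLow are hypothetical helpers standing in for whatever memory parsing the runner would need to do, and the timings are arbitrary:

package runner

import (
    "os/exec"
    "syscall"
    "time"
)

// currentRSS and machineMemoryLow are hypothetical helpers standing in for
// the memory accounting the runner would need.
func currentRSS(pid int) uint64 { return 0 }
func machineMemoryLow() bool    { return false }

// watchAndKill periodically checks the running command and, if it is over its
// memory reservation while machine memory really is threatened, gives it a
// friendly SIGTERM before resorting to a hard kill, LSF-style.
func watchAndKill(cmd *exec.Cmd, reservedBytes uint64) {
    for range time.Tick(30 * time.Second) {
        if cmd.ProcessState != nil {
            return // command already finished
        }
        if currentRSS(cmd.Process.Pid) > reservedBytes && machineMemoryLow() {
            cmd.Process.Signal(syscall.SIGTERM) // polite warning first
            time.Sleep(30 * time.Second)
            cmd.Process.Kill() // hard kill if the warning was ignored
            return
        }
    }
}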

OpenStack scheduler: surface errors to web interface

Errors and other useful information regarding what is happening in OpenStack should appear on the status web page. E.g. the user needs to know that their command will never run because:

jobqueue scheduler runCmd error: no OS image with prefix [CentOS 6] was found

Repeated instance deployment attempts on failure to launch runners?

Only seems to happen with our Precise (and Trusty) images:

(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ wr cloud deploy -f '^m.*$' -o 'Ubuntu Precise'    --network_dns 172.18.255.1,172.18.255.3  -k 2400
info: please wait while openstack resources are created...
info: please wait while a server is spawned on openstack...
info: please wait while I start 'wr manager' on the openstack server at 172.27.82.221...
info: wr manager remotely started on wr-production-830a25c6-c9ee-480b-8735-e38bc2731c75:47353
info: wr's web interface can be reached locally at http://localhost:47354
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ wr version
v0.5.0-5-g7eccab5
(venv)dj3@sf2-farm-srv1:~/team117/cloud/npg_cloud_realign_project$ echo w | wr add  -r 0 -m 16G --cpus 16
warning: command working directories defaulting to /tmp since the manager is running remotely
info: Added 1 new commands (0 were duplicates) to the queue using default identifier 'manually_added'

Works okay if I use Xenial instead.

(Being forced to stick with Precise for the moment due to threading problems with other parts of my stack on Xenial)

OpenStack cloud: tie new server timeout to volume creation

Currently the timeout on new servers is 20 mins, to allow for what can sometimes be very slow volume creation. If we have waited longer than 2 mins, however, we should only continue waiting if we can confirm that a volume is being created and is in a normal state.
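
A sketch of the extra check, assuming gophercloud's block storage v2 API is (or is close to) what the cloud package uses; the statuses are standard Cinder volume states:

package cloud

import (
    "github.com/gophercloud/gophercloud"
    "github.com/gophercloud/gophercloud/openstack/blockstorage/v2/volumes"
)

// volumeStillHealthy reports whether the server's boot volume is in a normal
// state, i.e. whether it is worth continuing to wait beyond the first 2 mins.
func volumeStillHealthy(client *gophercloud.ServiceClient, volumeID string) bool {
    v, err := volumes.Get(client, volumeID).Extract()
    if err != nil {
        return false
    }
    switch v.Status {
    case "creating", "downloading", "attaching", "available", "in-use":
        return true // still progressing normally; keep waiting
    default:
        return false // e.g. "error": stop waiting rather than sit out the full 20 mins
    }
}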

OpenStack --disk needs to handle cleanup

If hosts are spawned using the --disk option, more checking is required before reuse. Options include:

  • Automatically delete the work space before a new job is started (use a true tmp location via mktemp -d -t <prefix>), as in the sketch after this list. Badly behaved docker containers writing files as root will cause this to fail.
  • Automatically teardown the host on completion.
  • Check the available space before reuse.
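
A minimal sketch of the first option (all names illustrative):

package runner

import (
    "io/ioutil"
    "os"
)

// withFreshWorkspace runs a job in a brand new mktemp-style directory and
// removes it afterwards, so a reused host starts each job with an empty work
// space. The RemoveAll will fail on root-owned files left behind by badly
// behaved docker containers, which is the caveat noted above.
func withFreshWorkspace(prefix string, run func(dir string) error) error {
    dir, err := ioutil.TempDir("", prefix) // like mktemp -d -t <prefix>
    if err != nil {
        return err
    }
    defer os.RemoveAll(dir) // best-effort cleanup before host reuse
    return run(dir)
}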

Sometimes commands get stuck pending

After a vrpipe add -f testjobs5 -i foofails and hitting retry, I managed to end up with 1 incomplete job stuck in a pending state: the manager somehow lost track of the counts and didn't spawn a runner! 3 were delayed, 2 completed, 1 stuck in pending.

For now it can be recovered by manually running 'vrpipe runner', but this is not acceptable!

Remote manager should back up its db automatically

A manager started via wr cloud deploy should try to sync its database with the local one somehow, while the local machine is connected. It should also try to back up its database to the configured persistent cloud storage (e.g. S3-like) when in OpenStack mode.

Minimum disk option for cloud deploy?

Our OpenStack flavors mostly have only 20TB volumes - how about specifying a minimum disk size (triggering "boot from image, copy to volume" if the flavor's disk is less than that requested)? I am anticipating wanting between 40TB and 300TB for temporary space. Perhaps there is a better way of achieving this?

Alter cloud deployment max servers while live

Once wr cloud deploy -m 15 has been used, it would be useful to be able to change your mind and alter the max servers without having to do a teardown.

This would need to be some sort of command proxied through to the remote manager that gets it to alter its scheduler settings.

Cloud deployment should work from MacOS -> linux

Currently wr cloud deploy works by just copying the local wr executable over to the remote host. However, this will fail if it is a MacOS wr executable and the remote OS is linux. We need an alternative way of doing things; not sure what that might be, other than attempting a download or build of wr from github on the remote.
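
One other possibility (just a suggestion, not a decided approach) would be for deploy to cross-compile a linux binary locally before copying it over, which Go supports without extra tooling, along the lines of:

GOOS=linux GOARCH=amd64 go build -tags netgo -o wr_linux github.com/VertebrateResequencing/wr

The resulting wr_linux (name illustrative) would then be what gets copied to the remote host instead of the local MacOS executable.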

OpenStack scheduler: some jobs can get stuck pending

Testing with testjobsSmall, cmds 2, 4 and 7 get stuck pending. These need additional servers to be spawned, and they do get spawned (more or less; the volume does not seem to get made for 7). They just don't get run. This is repeatable.

wr build fails to 'go get'

Building wr from source I get this error message with a fresh binary install
of Go:

$ go version
go version go1.7.5 linux/amd64

[kdj@sf2-farm-srv2 go]$ go env
GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/nfs/users/nfs_k/kdj/dev/external/go"
GORACE=""
GOROOT="/nfs/users/nfs_k/kdj/.local/lib/go"
GOTOOLDIR="/nfs/users/nfs_k/kdj/.local/lib/go/pkg/tool/linux_amd64"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build000434645=/tmp/go-build"
CXX="g++"
CGO_ENABLED="1"

$ go get -u -d -tags netgo github.com/VertebrateResequencing/wr
package github.com/pivotal-golang/bytefmt: code in directory /nfs/users/nfs_k/kdj/dev/external/go/src/github.com/pivotal-golang/bytefmt expects import "code.cloudfoundry.org/bytefmt"

Looks like a change to the canonical import path:

cloudfoundry/bytefmt@4ca94c8
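
The likely fix is for wr (or whatever vendors it) to import the package under its canonical path; a tiny sketch just to show the rename, with the usage line only there to make it a complete program:

package main

import (
    "fmt"

    "code.cloudfoundry.org/bytefmt" // canonical path; was github.com/pivotal-golang/bytefmt
)

func main() {
    fmt.Println(bytefmt.ByteSize(16 * 1024 * 1024 * 1024)) // prints "16G"
}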

Capture all manager panics to log file

If the manager panics for any reason, it drops dead silently with nothing in the log. It would be nice to somehow implement a catch-all to get the panic into the log file.
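
A sketch of one way to do it; note that a deferred recover only catches panics in its own goroutine, so this would need to be deferred in main and in each long-lived goroutine:

package manager

import (
    "log"
    "runtime/debug"
)

// logPanics is meant to be deferred as the first statement of main() and of
// each long-lived goroutine, with the log package already pointed at the
// manager's log file (e.g. ~/.wr_production/log).
func logPanics() {
    if r := recover(); r != nil {
        log.Printf("manager panic: %v\n%s", r, debug.Stack())
        panic(r) // re-panic so the crash is still visible, but now logged
    }
}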

Resource Token system

Some user commands may need to use some limited external resource, and so the user may need to limit the number of commands accessing that resource at once, or specify a delay between the starting of commands that use the resource. Example resources are databases or large data downloads etc.

We can have something like a "resource token name" option on commands added via wr add, and a new wr token command to set the maximum simultaneous commands and/or a delay between starts for each token, along with the ability to specify an arbitrary command to run that exits non-zero if resources are depleted and a new command shouldn't start.

Potentially you could spawn a token manager at a given port, maintaining its own separate bolt db, for use by multiple people, and have a special option that uses the separate token manager. Maybe wr manager start defaults to using an internal token system, with the option of specifying the port and IP of an external one. For sanity, the external token server shouldn't auto-start, but should need to be manually brought up before wr manager start --token_server_ip ... will work. You'd have an admin bring it up somewhere central.
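
A very rough sketch of what the internal token bookkeeping could look like (all names hypothetical):

package limiter

import (
    "sync"
    "time"
)

// Token limits how many commands may use a named resource at once, and
// enforces a minimum delay between command starts for that resource.
type Token struct {
    slots     chan struct{} // capacity = max simultaneous commands
    delay     time.Duration // minimum gap between starts
    mu        sync.Mutex
    lastStart time.Time
}

func NewToken(maxSimultaneous int, delay time.Duration) *Token {
    return &Token{slots: make(chan struct{}, maxSimultaneous), delay: delay}
}

// Acquire blocks until a slot is free and the delay since the last start has
// passed; the runner would call this before executing a command that names
// this token.
func (t *Token) Acquire() {
    t.slots <- struct{}{}
    t.mu.Lock()
    if wait := t.delay - time.Since(t.lastStart); wait > 0 {
        time.Sleep(wait)
    }
    t.lastStart = time.Now()
    t.mu.Unlock()
}

// Release frees the slot once the command has finished with the resource.
func (t *Token) Release() { <-t.slots }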

Status: bring up past commands and add paging

You should be able to search for commands by RepGroup, by any DepGroup, or by the command line itself. We need to implement paging of these results to look through them all, and be able to do something to all of them (like restart them). Paging should also apply to the normal status bars, letting you see all of the e.g. "running" commands instead of just a random one of them.

Status webpage: refreshing while running causes problems

Refreshing the status webpage in the middle of progress bars updating on e.g. wr add -f testjobsDeps4 results in random total numbers of ~250 on each step, instead of 200.

When doing a repeat set of jobs on testjobsDeps3, if I let ~1000 complete and then refresh, the final complete count will be ~1000 less than reality. Actually, it is the 4000 complete count that is wrong, since, as repeats, the final total should be 2000 complete. Numbers still aren't quite right, and random numbers can be left stuck in various states.

On a fresh db, this resulted in step1 having expected 2000 complete, and other steps having 10-49 extra complete.

Refreshing in the middle of updating can also seem to break the scheduler, resulting in jobs genuinely stuck pending, requiring a manual wr runner to fix.

Workflows: add CWL support

Implement "proper" workflows by adding CWL support, which would parse the CWL and resolve it down into the existing way of adding commands with dependencies.

Possibly useful:

  • https://github.com/robertkrimen/otto (JavaScript in Go)
  • https://github.com/sbinet/go-python (Python bindings in Go)
  • http://spikeekips.tumblr.com/post/97743913387/embedding-python-in-golang-part-2 (an example of the above in use: https://github.com/spikeekips/embedding-python-in-golang)

https://github.com/MG-RAST/AWE is written in Go and has CWL support? The CWL webpage points to this fork as having alpha CWL support: https://github.com/wgerlach/AWE, specifically https://github.com/wgerlach/AWE/tree/master/lib/core/cwl

You can test conformance of a cwl runner with:
https://github.com/common-workflow-language/cwltest

Example workflows:
https://github.com/common-workflow-language/workflows/
I have this cloned to ~/src/git/cwl-example-workflows

OpenStack scheduler: subsequent invocations of wr add result in pending

wr cloud deploy -f '^m.*$' -o 'Ubuntu Xenial'
echo 'sleep 120' | wr add
echo 'sleep 120 && echo 2' | wr add

Results in 1 running, 1 pending on a 2 core m1.small.
If I keep adding more and more it eventually spawns a new m1.small.

echo -e 'sleep 120\nsleep 120 && echo 2' | wr add

Works as expected, with both commands starting immediately.

Security: restrict usage of the manager to the owner

Both the command line and web interfaces should be restricted to the user who started the manager (or a configurable set of people, or people in particular unix groups). This involves making the web interface use an SSL cert and go HTTPS.

The latter is important because the web interface displays environment variables, which may contain the user's passwords to something.
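
For the HTTPS part, Go's standard library makes the serving side straightforward; a minimal sketch, with placeholder cert paths and port, and noting that restricting access to the owning user would still need authentication on top:

package web

import "net/http"

// serveSecure serves the status pages over TLS so that the displayed
// environment variables (which may contain passwords) are not sent in the
// clear; cert.pem/key.pem would be a generated or user-supplied certificate.
func serveSecure(mux *http.ServeMux) error {
    return http.ListenAndServeTLS(":47354", "cert.pem", "key.pem", mux)
}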

MacOS: make compatible

wr is not currently fully compatible with MacOS due to the expectation of /proc/meminfo. Detect MacOS and use whatever memory info system it has instead.
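
A sketch of the detection; the darwin branch is an assumption based on the standard hw.memsize sysctl key, not tested wr code:

package internal

import (
    "os/exec"
    "runtime"
    "strconv"
    "strings"
)

// totalMemoryBytes returns the machine's total RAM. On darwin there is no
// /proc/meminfo, but `sysctl -n hw.memsize` prints the total in bytes.
func totalMemoryBytes() (uint64, error) {
    if runtime.GOOS == "darwin" {
        out, err := exec.Command("sysctl", "-n", "hw.memsize").Output()
        if err != nil {
            return 0, err
        }
        return strconv.ParseUint(strings.TrimSpace(string(out)), 10, 64)
    }
    // fall through to the existing /proc/meminfo parsing on linux
    return 0, nil
}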

Disk space learning and checking

Currently only the OpenStack scheduler pays attention to the disk requirement of commands by spawning a server with a large enough disk. We should also make this part of the LSF reservation, and check disk locally for local mode.

Is it possible to reliably do something useful to learn disk space usage? We can go part-way there for commands with a $wr_output:/.../file specifier (when that gets implemented)?
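
For the local-mode check, a statfs call on the job's working directory would be enough; a minimal sketch:

package internal

import "syscall"

// availableDiskMB reports how many MB are free on the filesystem containing
// dir, so that local mode can refuse to run a command whose disk requirement
// cannot currently be met.
func availableDiskMB(dir string) (uint64, error) {
    var s syscall.Statfs_t
    if err := syscall.Statfs(dir, &s); err != nil {
        return 0, err
    }
    return (s.Bavail * uint64(s.Bsize)) / (1024 * 1024), nil
}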

OpenStack scheduler: memory usage is wrong

Keiran reports that the memory usage reported for his jobs is wrong when running in OpenStack; may be related to use of Docker?

Jobs report a peak of ~700MB RAM, but on the running host it's more like 7-8GB (possibly because the heavy processes below were started via docker run, so they are children of the Docker daemon rather than of the runner's command and their memory is not being tracked):

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1493 ubuntu 20 0 95524 2784 1636 S 0.0 0.0 0:00.08 sshd: ubuntu@notty
1500 ubuntu 20 0 22108 5152 888 S 0.0 0.0 0:43.28 /tmp/wr runner -q cmds -s 16384:60:8:
19710 ubuntu 20 0 11248 1808 1616 S 0.0 0.0 0:00.00 bash -c echo '{ "reference": { "path"
19711 ubuntu 20 0 11312 1980 1700 S 0.0 0.0 0:00.00 bash /usr/local/bin/dockstore tool la
19740 ubuntu 20 0 19.198g 1.104g 8020 S 0.0 1.8 19:44.48 java io.dockstore.client.cli.Client t
19859 ubuntu 20 0 69888 30496 4928 S 0.0 0.0 0:00.63 /usr/bin/python /usr/local/bin/cwltoo
19882 ubuntu 20 0 382048 15340 10160 S 0.0 0.0 0:01.03 docker run -i --volume=/tmp/./datasto
19954 ubuntu 20 0 18052 2828 2564 S 0.0 0.0 0:00.13 /bin/bash /opt/wtsi-cgp/bin/mapping.s
19989 ubuntu 20 0 4364 652 580 S 0.0 0.0 0:00.00 /usr/bin/time -f command:%C\nreal:%e
19990 ubuntu 20 0 158312 35384 4600 S 0.0 0.1 0:00.54 /usr/bin/perl /opt/wtsi-cgp/bin/bwa_m
20016 ubuntu 20 0 4508 844 764 S 0.0 0.0 0:00.00 sh -c /usr/bin/time /var/spool/cwl/tm
20017 ubuntu 20 0 4232 624 556 S 0.0 0.0 0:00.00 /usr/bin/time /var/spool/cwl/tmpMap_L
20018 ubuntu 20 0 9576 2536 2336 S 0.0 0.0 0:00.00 /bin/bash /var/spool/cwl/tmpMap_LP600
20019 ubuntu 20 0 14.027g 5.985g 3696 S 691.7 9.5 2110:14 /opt/wtsi-cgp/bin/bwa mem -Y -K 10000
20020 ubuntu 20 0 7652 4040 1228 S 6.3 0.0 3:58.94 /opt/wtsi-cgp/bin/reheadSQ -d /var/sp
20021 ubuntu 20 0 1254028 1.045g 8144 R 55.1 1.7 21:41.23 /opt/wtsi-cgp/bin/bamsort fixmate=1 i

OpenStack - teardown when bad password

If you mistype your password when sourcing the OpenStack API file and then call teardown you get an authentication error as you would expect. This results in having to do a --force teardown.

Could the command be a little more resilient and check for auth before going into an unrecoverable state?

Use case: this could catch user error, since if the password is wrong you may be attempting to tear down the wrong tenant.
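
A sketch of such a pre-flight check, assuming gophercloud is the client library in use:

package cloud

import "github.com/gophercloud/gophercloud/openstack"

// checkAuth validates the credentials sourced from the environment before any
// destructive teardown step is attempted, so a mistyped password (or the
// wrong tenant) fails fast with a clear error instead of needing --force.
func checkAuth() error {
    opts, err := openstack.AuthOptionsFromEnv()
    if err != nil {
        return err
    }
    _, err = openstack.AuthenticatedClient(opts) // fails here on a bad password
    return err
}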

Dependencies: choose to make them un-"live"

We need a way to say "don't make this dependency live" during wr add, and a way to turn this property on and off for selected (by rep_group or dep_group etc.) cmds at any time (the equivalent of de/reactivating setups in vrpipe). Stuff would still come up as "dependent" in the web interface, but instead of it starting, the user would be able to dismiss the re-run, which would dismiss it in all downstream commands as well.

OpenStack scheduler: timed-out servers leave behind volumes

It appears that, when submitting a large number of jobs requiring the --disk option, as the system scales up wr abandons the volume request and starts a new one.

This periodically results in unattached volumes during the scale-up process, which can eat the tenant's allocation.

Seems to be reproducible with 1.5TB volumes.
