ohsu-comp-bio / funnel Goto Github PK

View Code? Open in Web Editor NEW

117.0 21.0 30.0 46.88 MB

Funnel is a toolkit for distributed task execution via a simple, standard API.

Home Page: https://ohsu-comp-bio.github.io/funnel

License: MIT License

Makefile 1.06% Shell 2.42% JavaScript 6.23% HTML 1.75% CSS 4.20% Go 83.46% Dockerfile 0.88%

ga4gh google-cloud gridengine slurm pbs-torque docker aws-batch kubernetes

funnel's Introduction

Funnel

https://ohsu-comp-bio.github.io/funnel/

Funnel is a toolkit for distributed, batch task execution, including a server, worker, and a set of compute, storage, and database backends. Given a task description, Funnel will find a worker to execute the task, download inputs, run a series of (Docker) containers, upload outputs, capture logs, and track the whole process.

Funnel is an implementation of the GA4GH Task Execution Schemas, an effort to standardize the APIs used for task execution across many platforms.

Funnel provides an API server, multiple storage backends (local FS, S3, Google Bucket, etc.), multiple compute backends (local, HTCondor, Google Cloud, etc.), and a web dashboard.

funnel's People

Contributors

Stargazers

Watchers

funnel's Issues

Codebase cleanup

Things we'd like to clean up:

Rename "tes" -> "funnel" where appropriate #2
move python code to "python" directory (py_tes.py, tes-client, tes-runner.py, etc) #53
rename "src" to "funnel"? Could be issues with GOPATH here
move "gce" directory to "deployments/gce"
move internal protobuf files to live next to the files they generate
rename generated TES protobuf package from "ga4gh_task_exec" to "tes"
rename generated Funnel internal protobuf package from "ga4gh_task_ref" to "api"? Not sure on best name yet.
use cobra and move tes-server and tes-worker commands to src/cmd/ (or funnel/cmd/)
rename "share" to "web"

Local worker WorkDir config

While running funnel in local mode, I had config including the server's WorkDir, e.g.

Storage:
  - Local:
      AllowedDirs:
      - /home/strucka/mc3
      - /home/ubuntu/data-mount/mc3-bunny
WorkDir: /home/ubuntu/data-mount/mc3-bunny/funnel-work

During debugging, I was surprised to find that the worker's files were not being written to this WorkDir path. Instead the worker config defaults to "tes-work-dir", and this doesn't get passed on to the worker (again, local mode).

Perhaps we can simplify this config and make it more obvious?

Local workers can stay alive if parent panics

I noticed a case where I wrote some bad code in tes-worker, which caused a panic. But the pool/slots had already started and when the parent died, the worker subprocesses stayed running. There should be a way to catch a panic at the top-level, run some cleanup code, and return a nice error message with the panic maybe.

Simplify deployment

It would be great to make deployment so easy that anyone can launch a Funnel service in < 5 min.

Perhaps Funnel could be packaged as a Docker image, with instructions on how to run it on your laptop or server.

On Google Cloud, the ideal might be to run as an AppEngine app, since most users would probably remain within the free tier. By contrast, running a VM just for Funnel could incur charges of $25/mo for a basic VM, which wouldn't scale as easily as AppEngine. With AppEngine, it's possible to write code that can work outside of AppEngine too.

Support task dependency

During the earlier discussion of the TES spec, one idea was to add support for task dependency -- ie, run this task after this other task id completes. That got removed, or at least postponed. It would be a super useful feature though. All of the popular job schedulers -- SGE, Slurm, HTCondor -- support it, and for a reason.

Possible implementations:

client side, in a command-line; we're doing this in dsub right now
server side, with a new field in the TES payload

For the command-line, we're calling it --after taskId1[,taskId2,...]

funnel task commands should have reasonable timeouts

Use cgroups

Funnel is not actually setting resource limits on docker containers. So containers might take more resources than they scheduled.

Support task arrays (horizontally scaled)

Popular job schedulers like SGE, Slurm, and HTCondor support a single submission of an array of like tasks to be executed in parallel that differ only in their input parameters. Eg, an array of like tasks that each run on a different input file. Or a parameter sweep across all values of a variable from 1 to 100.

Of course, users can do this today with a loop over the input values. But then it's harder to track the status and retry failures.

In dsub, we're built some basic support for batches of like tasks, using a TSV with one row per input parameter set. In CWL, task arrays are handled with dot products and inner products.

This can probably be handled client-side. It's worth discussing though.

Note that this is different from the task list that TES/Funnel currently supports, where a sequence of tasks is executed on a single VM in series. Maybe we could call it task arrays (parallel) and task lists (sequential) or task sequences.

code: find a better mock library

mockery/testify isn't working out. The mocks use interface{} in their function signatures, which leads to very unfriendly error messages coming from the tests.

I'd like to have a mock generator that generates functions with signatures that match interfaces.

https://github.com/golang/mock

Mount read-only inputs into read-write directories

The status quo is that tools are expected to symlink files into a writable directory, to work around readonly input directories. Tools frequently forget to do this, and that will probably always be true. It also clogs up tools with symlinking code.

We could implement a simple feature that automatically creates a directory with those symlinks set up, which would save us time fixing tools, and save authors the trouble of symlinking input files.

Investigate worker sync RPC error

Couldn't save worker update.

I see this error more than I would expect, even in simple, lightweight workloads. Need to investigate what's going on.

This error is logged when a gRPC request in Worker.Sync times out (default is 1 second). Possibly the timeout should be longer. Possibly this is happening because the database is locked by another request/transaction. These are just off-the-top ideas.

Document Configuration Options

I think we should have a section the the README that lists all the configuration options.

Additionally, the example config needs to be updated.

Shared file system backend: don't localize input files

"localizing input files" means to copy/link task input files from their original location into a TES-managed job execution directory. This is similar to Cromwell behavior. Probably, this is done in order to simplify the process of mounting the input files into a Docker container – instead of mounting each file individually, you can just mount one directory containing all inputs.

Localizing files adds complexity to the code implementing TES, and possibly introduces some unwanted behavior, and we ( @adamstruck and I) were debating whether it's necessary.

One example of unwanted behavior that I have run into in Cromwell is that if the execution directory and the original file live on different physical drives (e.g. the execution directory is an NFS mount, and the original file is on a local drive), then the input file cannot be hard linked and must be copied. This added some overhead to setting up jobs under Cromwell and is one of those non-obvious gotchas that could easily be missed and cause some degraded performance.

One important edge case to consider is: what if the input file is not on the shared file system? If the file isn't localized, then the file won't be correctly shared with the worker and the job would fail. Some configuration either in TES or in the client-side library could probably do some sanity checking to prevent this, by ensuring all input files are a child of the shared file system.

Originally:
https://github.com/bmeg/task-execution-server/issues/27

Default resource requirements

A Funnel worker needs some minimum amount of resources to run.

If a task message is submitted that doesn't include resources Funnel should pick some reasonable defaults.

Handle dead workers and nodes

Incomplete jobs on dead workers should probably be rescheduled. Also, should dead/gone workers be automatically cleaned up from the database at some point (after a configurable time)?

Change exported function/method/type to be private where possible

If it isn't necessary for the function/method/type to be exported, we should refactor it to be private.

Originally:
https://github.com/bmeg/task-execution-server/issues/11

Funnel protobuf style

Captialize enums
GetWorkers -> ListWorkers
Remove GetQueueInfo
Job -> Task

Protect against tools which log data to stdout/err

While debugging mc3, I realized the pileup tool logs to standard out, and probably funnel is tailing this and writing back tons of updates to the server.

We have one layer of protection in that we only store a limited amount of the output stream, but I suspect this would still be a big problem if a few tasks are allowed to write lots of updates to the server.

Object storage download caching

Implement the worker's --swift path/to/cache/dir option, which would download Swift files to the specified directory. Each job can then hardlink or copy files in the cache directory to its own folder in the volume directory, e.g. volumes/job_123456qwerty/, which will be mounted onto Docker.

Originally:
https://github.com/bmeg/task-execution-server/issues/12

Getting hostIP in worker can crash the job when there's no internet connection

The worker/job runner gets the worker's host IP (externalIP() helper) and sends this back as part of the job logs. When there is no internet connection (for me, I was testing on my laptop without internet) the job will fail hard when getting this IP.

Expected behavior is to handle the error gracefully and not return a host IP in the job logs.

Branding TES -> Funnel

Change program names, documentation, etc where appropriate.

Dashboard hardcoded to "localhost:8000"

In the job detail page, there's a curl command at the bottom of the page which is hard-coded, e.g.

curl localhost:8000/v1/jobs/b3qir37ntf6gcu8gcd30

Low-level technical diagram

Add a visualization of how the different components come together through TES. This will be valuable for new comers to understand the big picture of TES.

Originally:
https://github.com/bmeg/task-execution-server/issues/10

Possible to infer relative path for volume

Found while submitting this task:

{
    "name": "Test Google Storage",
    "project": "Funnel",
    "description": "Simple Echo Command",
    "resources": {},
    "executors": [
      {
        "image_name": "ubuntu",
        "cmd": ["md5sum", "/tmp/test_file"],
        "stdout": "stdout",
        "stderr": "stderr"
      }
    ],
    "inputs": [
      {
        "name": "infile",
        "description": "File to be MD5ed",
        "url": "gs://smc-rna-funnel/md5-input.txt",
        "type": "FILE",
        "path": "/tmp/test_file"
      }
    ],
    "outputs": [
      {
        "name": "outfile",
        "description": "MD5 output",
        "url": "gs://smc-rna-funnel/md5-output.txt",
        "type": "FILE",
        "path": "stdout"
      }
    ]
}

And the output. The key is in the stderr logs:

{
  "id": "b3v7vtfntf6g2u90l810",
  "state": "ERROR",
  "name": "Test Google Storage",
  "project": "Funnel",
  "description": "Simple Echo Command",
  "inputs": [
    {
      "name": "infile",
      "description": "File to be MD5ed",
      "url": "gs://smc-rna-funnel/md5-input.txt",
      "path": "/tmp/test_file"
    }
  ],
  "outputs": [
    {
      "name": "outfile",
      "description": "MD5 output",
      "url": "gs://smc-rna-funnel/md5-output.txt",
      "path": "stdout"
    }
  ],
  "resources": {},
  "executors": [
    {
      "image_name": "ubuntu",
      "cmd": [
        "md5sum",
        "/tmp/test_file",
        ">",
        "/tmp/md5_out"
      ],
      "stdout": "stdout",
      "stderr": "stderr"
    }
  ],
  "logs": [
    {
      "logs": [
        {
          "start_time": "2017-04-24T22:45:13Z",
          "end_time": "2017-04-24T22:45:13Z",
          "stderr": "invalid value \"/opt/funnel/funnel-work-dir/b3v7vtfntf6g2u90l810:.:rw\" for flag -v: .:rw is not an absolute path\nSee 'docker run --help'.\n",
          "exit_code": 2,
          "host_ip": "10.138.0.66"
        }
      ]
    }
  ]
}

Support email or SMS notification of job completion

Job schedulers like SGE I think have support for emailing a user when a job succeeds or fails. For long-running jobs, this is pretty useful.

Since this is the 21st century, SMS would be cool. Email is useful too. The email/SMS could provide a link to the log file or output folder.

Maybe this is better done from a UI, since then the email/SMS could contain a link to the UI.

Support task dependency on input file existence

No job scheduler I know of has this feature. The idea is that a task can be submitted now, and it will run when its input files exist.

The use case we've encountered for it is when the input files are generated by an out-of-band process, like a user uploading files to a bucket, or a lab automation system generating outputs and writing them to a defined location. Or different users are responsible for generating the inputs and running the task, so that the task can't know the task id that has to complete first.

This is maybe an edge case, and there are workarounds. Still, it would be pretty cool, and if we implement task dependency, it's not that hard to add this as well.

Possible implementations:

client side, in the CLI; eg, --after files-exist vs --after task-id=taskId1[,taskId2,...]
server-side, taking tasks from the queue, checking their preconditions, and running only if met

Initial Contact Logging for Worker

Have the worker announce "I can talk to the server" upon start-up.

https://github.com/bmeg/task-execution-server/issues/13

Add command-line interface

One thing we've found super useful at Google/Verily is to have a qsub-style command-line interface to the Pipelines API. We've got a version called dsub now at:

https://github.com/googlegenomics/task-submission-tools

We'd like to enhance dsub so that it supports TES as a backend. Then it could support local execution, job schedulers on HPC clusters, and multiple cloud providers.

Some possibilities:

we could add Funnel as a backend to dsub.
we could fork dsub and add it as a CLI to Funnel
we could contribute dsub to the ga4gh repo once it supports pluggable backends

Build web dashboard bundle

The dashboard currently requires all of node_modules since there is not javascript bundle. When copying around Funnel server code for deployment, this overhead is too much. We can easily use ~~babel~~ browserify/webpack/other to build a small javascript bundle.

Determine the conformance tests we want

This is more of a meta/planning issue. We should come up with a list of conformance tests we'd like to write and run.

https://github.com/bmeg/task-execution-server/issues/46

funnel task create request is cacheable?

I'm not sure why, but funnel task create is behaving differently than the equivalent curl command when run against GCE.

With some debugging to dump the HTTP response body:

./bin/funnel task create --server http://35.185.217.211/ examples/hello-world.json
DEBU tes        BODY                            unknown={"jobs":[{"jobID":"b3mkkmnntf6gd19v3t10","state":"Initializing","task":{"name":"Hello world","description":"Simple Echo Command"}}]}

The body contains a cached response from a previous request to ListTasks. GCE is caching http responses. It is strange that GCE is caching POST responses, but curl is doing something that prevents this.

Inspect container log spam

While testing on GCE today I saw the worker spam the logs with this message:

Apr 17 20:51:39 funnel-worker dockerd[2359]: time="2017-04-17T20:51:39.763366662Z" level=error msg="Handler for GET /v1.24/containers/b3qijhvntf6gcu8gcd2g-0/json returned error: No such container: b3qijhvntf6gcu8gcd2g-0"

The task did run successfully, but I'm guessing that while the container was starting these messages were spamming.

Worker should get storage information from the GetServerConfig endpoint

Currently some of the storage information is hard-coded into the worker engine code. The worker should be querying the TES server endpoint GetServerConfig and using this information.

https://github.com/bmeg/task-execution-server/issues/45

funnel task get should handle HTTP 500 gracefully

./bin/funnel task -S http://35.185.237.209/ get b3r98kvntf6gc4l5648g | jq
Error: [STATUS CODE - 500]	<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8" />
    <title>Site Not Found</title>
    <style type="text/css">/**
 * @provides phabricator-fatal-config-template-css
 */
body {
  overflow-y: scroll;
  background: #f9f9f9;
  margin: 0;
  padding: 0;
  font: 13px/1.231 'Segoe UI', 'Segoe UI Web Regular', 'Segoe UI Symbol',
    'Helvetica Neue', Helvetica, Arial, sans-serif;
  text-align: left;
  -webkit-text-size-adjust: none;
}

body.in-flight {
  background: #41506e;
  color: #e0e0e0;
}

.in-flight-error-detail {
  max-width: 760px;
  margin: 72px auto;
  background: rgba(255, 255, 255, 0.25);
  border-radius: 3px;
  padding: 8px 16px;
}

.in-flight-error-title {
   padding: 12px 8px;
   font-size: 24px;
   font-weight: 500;
   margin: 0;
}

.in-flight-error-body {
   padding: 4px 12px 12px;
}
</style>
<style type="text/css">/**
 * @provides unhandled-exception-css
 */

.unhandled-exception-detail {
  max-width: 760px;
  margin: 24px auto;
  background: #fff;
  border: 1px solid #c0392b;
  border-radius: 3px;
  padding: 0 8px;
}

.unhandled-exception-detail .unhandled-exception-title {
  color: #c0392b;
  padding: 12px 8px;
  border-bottom: 1px solid #f4dddb;
  font-size: 16px;
  font-weight: 500;
  margin: 0;
}

.unhandled-exception-detail .unhandled-exception-body {
  padding: 16px 12px;
}
</style>
  </head>
  <body class="unhandled-exception"><div class="unhandled-exception-detail"><h1 class="unhandled-exception-title">Site Not Found</h1><div class="unhandled-exception-body">This request asked for &quot;/v1/jobs/b3r98kvntf6gc4l5648g&quot; on host &quot;35.185.237.209&quot;, but no site is configured which can serve this request.</div></div></body>
</html>
Usage:
  funnel task get [taskID ...] [flags]

Global Flags:
  -S, --server string    (default "http://localhost:8000")

parse error: Invalid numeric literal at line 1, column 8

Resolve symlinked outputs

There's a common case where tools will symlink files into an output directory, which Funnel then tries to upload. The upload fails since the symlinked path is relative to the container's filesystem and Funnel runs on the host filesystem.

We can improve this by doing some simple path rewriting to map the container path to a volume mounted on the host filesystem.

Command plural forms, e.g. task/tasks

One thing about docker/kubectl/gcloud/etc. that I've always struggled with is that I can never remember if the command is singular or plural. I have a terrible memory.

Maybe we can make funnel commands support both, e.g. funnel task and funnel tasks

Refine examples

Since we are planning on bundling these into the binary we should make sure the names of the examples are clear and that each is defining a clear concept.

Godoc is broken

The go get command cannot install this package because of the following issues:

    Unrecognized import path "funnel/config" (wrapper.go:5:2)
    Unrecognized import path "funnel/proto/funnel" (sched.go:15:2)
    Unrecognized import path "funnel/logger" (sched.go:14:2)
    Unrecognized import path "funnel/proto/tes" (sched.go:16:2)
    Unrecognized import path "funnel/scheduler" (sched.go:17:2)

https://godoc.org/github.com/ohsu-comp-bio/funnel/src/funnel/scheduler/gce

Possibly this is another reason to make this project go-gettable

storage/local: strategy for user/group permissions

Task outputs often become inputs to other tasks, such as when TES/Funnel is being used by a workflow engine. When containers write output files, they are usually owned as root, which creates a tricky problem of file ownership on the host.

The best solution we've found is to use setgid on a directory to ensure the group owning the file is the group the Funnel server is in.

Funnel code should do the setgid automatically when creating a working directory, with logs explaining what is happening.

What happens when the working directory already exists? Do we modify it to have these permissions? Do we log a warning without modifying anything?

Watch for jobs that never start

Possibly closely related to #25

Task Message Validation

There are several fields in the TES Schema that are documented as being required, but not treated as such by Funnel.

Desired behavior:

message validation occurs after POST to /v1/jobs
- 2xx response returned upon successful validation otherwise returns 4xx.

Worker shutdown does not completely flush logs

The worker shutdown is not integrated with the job and step runner very well, which can result in an incomplete shutdown. For example, if there are buffered stdout/err logs from in the step runner, worker shutdown may not flush these to the server.

Remove "buildtools" config in Makefile

I added this extra "buildtools" config to the Makefile because I was getting packages showing up in the source where I didn't expect/want them, but now I understand that I just didn't have my GOPATH set up correctly. This issue is reminder for me to clean up the mess I've made and remove that "buildtools" stuff.

Notably, Go 1.8 will set a default GOPATH, so this shouldn't be as confusing for noobs like me in the future, yay. golang/go#17262

Generic cluster scheduler

One thing lost from the smarter-sched branch is the ability to start workers in a cluster manually and have them register with a scheduler backend. The "local" scheduler was the pre-existing scheduler backend, and this now starts and manages only a single worker.

Support pub/sub for task status

The model of polling for task status works fine for Pipelines API and TES in most cases. For very large scale use, pub/sub is a nicer way to handle it, imposing less load on the polling server. This is definitely lower priority than a bunch of other features. It only really applies at scale.

For a Google Cloud implementation, PubSub is a simple enough way to go. AppEngine Task Queues can handle the backlog of tasks to remain in quota, and to queue up dependent jobs.

For a platform-independent solution, something like RabbitMQ or Celery might be a good solution. On Google Cloud, it would likely be more expensive and harder to maintain than Cloud PubSub though.

Upgrade TES

New features:

Test annoyances

During testing, I frequently run into the same issues and because I'm running the tests dozens of times an hour, these get pretty annoying.

logs for all tests are all merged together with no clear separator.
the 'tes-wait' container doesn't stop and keeps the 5000 port locked, which causes failures
referring to a single test isn't simple: nosetests tests/test_client.py:TestTaskREST.test_cancel

Config validation

There are parts of the config that are not validated, and if they're not correct everything blows up. Need some simple validation for:

GCE scheduler backend

cmd/config: --config flag loading bug

../../bin/funnel server -config funnel.config.yml
INFO tes        Using config file               path=/Users/buchanae/projects/funnel/deployments/gce/onfig
ERRO tes        Failure reading config          error=open /Users/buchanae/projects/funnel/deployments/gce/onfig: no such file or directory path=/Users/buchanae/projects/funnel/deployments/gce/onfig

Explore stateless Funnel/TES

The Task Execution Service in the ga4gh is a service with a few endpoints that can provide a thin wrapper over one or more backend job schedulers.

Funnel looks to be that plus a UI plus multiple implementations of backend job schedulers and connectors for existing job schedulers.

What's the roadmap? How should this fit with ga4gh and TES?

Possibilities:

TES is a toy reference implementation that isn't meant to be used for real work; for real work, look to other implementations
OR
TES in the ga4gh repo becomes a useable implementation with plugins for multiple backends, including a local runner, major cloud providers, and common job schedulers.

For UI and Funnel:

ga4gh later adds a UI that calls TES plus related services; ie, maybe Funnel turns into this
OR
Funnel is a separate project with a UI; it relies only on TES for backends
OR
Funnel is a separate implementation of TES + a UI that's useable. The ga4gh TES remains a reference implementation that isn't meant to be useable for any real work.

ohsu-comp-bio / funnel Goto Github PK

funnel's Introduction

Funnel

funnel's People

Contributors

Stargazers

Watchers

Forkers

funnel's Issues

Recommend Projects

Recommend Topics

Recommend Org