hammerlab / coclobas
Configurable Cloudy Batch Scheduler
Logs need clean-up once in a while: we just created 21 GB of them overnight and ran out of disk space ☺
Right now it's here:
https://github.com/hammerlab/coclobas/blob/add-loop-backoff/src/ketrew_backend/plugin.ml#L10
so it makes the jobs non-“restartable” (a restart tries to re-run with the same ID).
Needs a readable representation of the command that was run, like YARN's "status" tab.
Here:
Line 65 in dee34cf
We see weird errors where `kubectl create ...` works but then pod setup/scheduling fails for obscure reasons (`google-fluentd` out of memory; filesystems not mounted properly; ...).
If we can detect those (i.e. if we know that the job's actual commands didn't begin) we could try to restart them a few times.
If `kubectl get ...` never succeeds, can we assume the job didn't start?

Kubernetes keeps causing us a lot of grief and instability. We could build a backend that uses GCP instances directly (which would also give us more flexibility in providing just the resources a job requests, cf. #19).
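The detection idea above (restart a job only when we know its actual commands never began) could be sketched as follows in Python. This is a hedged sketch, not coclobas's actual logic: the helper names are hypothetical, and it assumes that a pod `kubectl` can't find, or one still `Pending`, means the job's commands never ran.

```python
import json
import subprocess

def pod_status_phase(pod_id):
    """Return the pod's .status.phase, or None if kubectl can't find it
    (e.g. 'Error from server: pods "..." not found')."""
    proc = subprocess.run(
        ["kubectl", "get", "pod", pod_id, "-o=json"],
        capture_output=True, text=True)
    if proc.returncode != 0:
        return None
    return json.loads(proc.stdout).get("status", {}).get("phase")

def job_never_started(phase):
    """If the pod is missing or still Pending, its commands never ran,
    so re-submitting it (a few times, with backoff) should be safe."""
    return phase is None or phase == "Pending"
```

Anything that reached `Running` or a terminal phase would not be blindly restarted, since its commands may have had side effects.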
Finding the standard out for a workflow currently seems to require opening the root node of the workflow, clicking "Backend details/queries", and then clicking "kubectl-logs". The process is cumbersome overall, and "kubectl-logs" seems particularly opaque. Why can't there just be "stdout" and "stderr" links associated with a workflow?
Curiously, the logs call returns successfully (no output, but it's green: http://link.isaachodes.io/3A0V472e1M1R/Screen%20Shot%202016-10-18%20at%202.28.29%20AM.png)
Pending: http://link.isaachodes.io/370S3s0b1z00/Screen%20Shot%202016-10-18%20at%202.28.44%20AM.png
```
Client.Response: URI: http://127.0.0.1:8082/job/describe?id=1ddb3227-d1ad-5ffb-8b57-fbc26e66fada, Meth: GET, Resp: ((encoding (Fixed 242)) (headers ((content-length 242))) (version HTTP_1_1)
Reason: "Error-from-coclobas: Updating failed 11 times: [ Shell-command failed:
Command:
kubectl get pod 1ddb3227-d1ad-5ffb-8b57-fbc26e66fada -o=json
Status: Exited with 1
Standard-output: empty.
Standard-error:
Error from server: pods "1ddb3227-d1ad-5ffb-8b57-fbc26e66fada" not found
]"
```
Got this on a bunch (not all) of my jobs in a recent workflow, on the newest version of coclobas.
Right now we're stuck with the default "n1-highmem-8".
`kubectl` on Google Cloud has so many ways of failing that going to find the logs has become too cumbersome.
Context: #57
Basically, the issue is that the new default `gci` image manages the node in a completely new way, using a minimal distribution that lacks many helpful utilities (e.g. `mount.nfs`). We are currently working around this problem by opting out of `gci` when creating clusters, but Google has officially deprecated the `container-vm` image.

From Release Notes - September 27, 2016:

> ... The old container-vm is now deprecated; it will be supported for a limited time...
In the long run, we probably have to learn how to deal with the new image and revert #57.
The script should start the server and then provide some form of visual feedback when Coclobas returns `Ready`.
It would be nice if every job didn't have to run on, e.g., a 50 GB pod when most jobs probably require far less. For example, a job that makes an HTTP API call should run on a 1 GB node, not a 50 GB node.
If it were possible to schedule multiple jobs at the same time on the same pod (am I using this word correctly?) this would be less of a problem, but still not ideal.
Cf. #68
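One direction here: Kubernetes itself supports per-container resource requests, so the generated pod JSON could carry a `resources.requests` stanza sized to the job instead of sizing every node for the biggest job. A hedged sketch (the container name, image, and sizes below are purely illustrative):

```json
"containers": [
  {
    "name": "http-caller",
    "image": "ubuntu",
    "command": ["bash", "-c", "curl -s http://example.com/"],
    "resources": {
      "requests": { "memory": "1Gi", "cpu": "250m" }
    }
  }
]
```

The Kubernetes scheduler can then pack several small pods onto one node rather than dedicating a large node to each job.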
Error while calling kubectl-logs:

```
Client.Response: URI: http://127.0.0.1:8082/job/logs?id=3466a59c-9c4b-5a69-a8e8-7ad3763f2c55, Meth: GET, Resp: ((encoding (Fixed 467)) (headers ((content-length 467))) (version HTTP_1_1)
(status Bad_request) (flush false))
```
from @smondet
kube says last seen 13m and does not seem to try anything new
`git clone`-ing a DB that had some (fewer than 50) ~5k-node jobs submitted to it doesn't seem to work, so there's no way to interact with the DB without the client itself.
Is there a lot that Irmin buys us?
Just opening a conversation here about the issue, as we've talked a bit offline in the past but should record this dialogue somewhere.
Cf. #73.
ready to run like tests #4
This should be noted somewhere (and coco should give some feedback). Additionally, if the cluster is already up, we shouldn't require `max-nodes` to be passed via the CLI when configuring the cluster.
Would be nice to have an explanation of why we need Coclobas, and what it offers over just Kubernetes, for example.
```
Reason: "Error-from-coclobas: Starting failed: Shell-command failed:
Command:
kubectl create -f /tmp/coclojob84240d.json
Status: Exited with 1
Standard-output: empty.
Standard-error:
The connection to the server 104.196.31.136 was refused - did you specify the right host or port?
"
```
This was fixed with the workaround from kubernetes/kubernetes#28612 (comment):

```
gcloud config set container/use_client_certificate True
```
To create a cluster:

```
gcloud container clusters create <name> --zone <zone> --num-nodes=<num> --min-nodes=<num> --max-nodes=<num> --enable-autoscaling
```

e.g.

```
gcloud container clusters create pici-cioc-test-2 --zone us-east1-c --num-nodes=1 --min-nodes=1 --max-nodes=5 --enable-autoscaling
```
To launch a job:

```
kubectl create -f <path to my pod file>
```

e.g.

```
kubectl create -f pod.json
```

with `pod.json`:
```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "<name here>",
    "labels": {
      "app": "<put label here>"
    }
  },
  "spec": {
    "restartPolicy": "Never",
    "containers": [
      {
        "name": "<container name here>",
        "image": "<image name>",
        "command": ["cmd", "param", "param"]
      }
    ]
  }
}
```
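Generating such a manifest programmatically (as coclobas presumably does before handing a `/tmp/coclojob*.json` file to `kubectl create -f`) can be sketched like this; the function name is hypothetical, not coclobas's API:

```python
import json

def make_pod_spec(name, image, command, label=None):
    """Build a Kubernetes Pod manifest matching the template above.
    restartPolicy is Never so Kubernetes doesn't silently re-run jobs."""
    return {
        "kind": "Pod",
        "apiVersion": "v1",
        "metadata": {
            "name": name,
            "labels": {"app": label or name},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": name, "image": image, "command": command}
            ],
        },
    }

# Serialize it for `kubectl create -f`:
spec = make_pod_spec("test-ubuntu-noop", "ubuntu", ["bash", "-c", "exit 1"])
manifest = json.dumps(spec, indent=2)
```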
E.g. (I used this to test failing commands with `restartPolicy = Never`):

```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "test-ubuntu-noop",
    "labels": {
      "app": "test-ubuntu-noop"
    }
  },
  "spec": {
    "restartPolicy": "Never",
    "containers": [
      {
        "name": "ubuntu-noop",
        "image": "ubuntu",
        "command": ["bash", "-c", "exit 1"]
      }
    ]
  }
}
```

(Note: the command must be `["bash", "-c", "exit 1"]`; quoting it as `"'exit 1'"` makes bash look for a literal command named `exit 1`, which fails with "command not found" instead of exit status 1.)
E.g. with an NFS mount:

```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "test-ubuntu-noop",
    "labels": {
      "app": "test-ubuntu-noop"
    }
  },
  "spec": {
    "restartPolicy": "Never",
    "containers": [
      {
        "name": "ubuntu-noop",
        "image": "ubuntu",
        "command": ["ls", "/datasets"],
        "volumeMounts": [{
          "name": "nfs-volume",
          "mountPath": "/datasets"
        }]
      }
    ],
    "volumes": [{
      "name": "nfs-volume",
      "nfs": {
        "server": "cioc-test-server",
        "path": "/nfs-pool",
        "readOnly": false
      }
    }]
  }
}
```
To get job status:

```
kubectl describe pod <pod name>
```

To get job status for programmatic manipulation:

```
kubectl get pod <pod name> -o=json
```

`.status.phase` has high-level state information such as pending, running, succeeded, or failed (http://kubernetes.io/docs/user-guide/pod-states/).
`.status.containerStatuses[0].state` contains detailed state info about each container in the pod, including exitCode, start time, stop time...

To get stdout/stderr:

```
kubectl logs <pod name>
```

To kill a job (note `kubectl delete` needs the resource type):

```
kubectl delete pod <pod name>
```
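The JSON from `kubectl get pod <pod name> -o=json` can be reduced to the few fields a scheduler actually cares about. A small Python sketch using the field paths documented above (the function is illustrative, not part of coclobas):

```python
def summarize_pod(pod):
    """Extract the high-level phase plus each container's terminal exit
    code from the dict parsed from `kubectl get pod <name> -o=json`.
    A container that hasn't terminated yet yields None."""
    status = pod.get("status", {})
    phase = status.get("phase")
    exit_codes = []
    for cs in status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        exit_codes.append(terminated.get("exitCode") if terminated else None)
    return phase, exit_codes

# Example: the relevant subset of a failed pod's JSON.
example = {
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {"state": {"terminated": {"exitCode": 1}}}
        ],
    }
}
```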
Kubernetes forgets about jobs very fast; we should save that info, e.g. to be better able to report issues like hammerlab/ketrew#509 (comment).
Investigate how kubernetes queues pod requests.
After installing/configuring `gcloud` somewhat manually on a GCE node, I had Coclobas refusing to start jobs:

```
Error from kubernetes:
error: You must be logged in to the server (the server has asked for the client to provide credentials)
```

(cf. issue kubernetes/kubernetes#30617)

I ran:

```
gcloud config set container/use_client_certificate True
```

and then `gcloud container clusters get-credentials $CLUSTER` needs to happen (make sure by restarting Coclobas).

I'm not sure if my `gcloud` config was the problem, or if we should investigate this further.
Since `kubectl` loses logs so quickly, it would be nice to keep them around.
Now that the logs are no longer in Irmin, it's easier to delete older entries. We should also be able to delete only the logs that provide little information (nobody cares about commands that succeeded a while ago).
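A possible pruning policy, sketched in Python. This assumes (hypothetically; the real log schema may differ) that each entry records a status and a Unix completion time: keep everything recent and everything that failed, drop old successes.

```python
import time

WEEK = 7 * 24 * 3600

def keep_log_entry(entry, now=None):
    """Decide whether to keep a log entry. Entries for jobs that
    succeeded more than a week ago are dropped; failures and recent
    entries are kept for debugging."""
    now = now if now is not None else time.time()
    if entry["status"] != "succeeded":
        return True
    return now - entry["finished"] < WEEK
```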