hammerlab / coclobas
Configurable Cloudy Batch Scheduler
Logs need clean-up once in a while: we just created 21 GB of them overnight and ran out of disk space ☺
Right now it's here:
https://github.com/hammerlab/coclobas/blob/add-loop-backoff/src/ketrew_backend/plugin.ml#L10
so it makes the jobs non-“restartable” (a restart tries to re-run with the same ID).
Needs a readable representation of the command that was run, like YARN's "status" tab.
Here:
Line 65 in dee34cf
We see weird errors where `kubectl create ...` works but then pod setup/scheduling fails for obscure reasons (`google-fluentd` out of memory; filesystems not mounted properly; ...).
If we can detect those (i.e. if we know that the job's actual commands didn't begin) we could try to restart them a few times.
If `kubectl get ...` never succeeds, can we assume the job didn't start?

Kubernetes keeps causing us a lot of grief and instability. We could build a backend that uses GCP instances directly (which would also give us more flexibility in providing just the resources a job requests, cf. #19).
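The detection idea above (restart a job only when we know its actual commands never began) could be sketched as follows in Python. This is a hedged sketch, not coclobas's actual logic: the helper names are hypothetical, and it assumes that a pod `kubectl` can't find, or one still `Pending`, means the job's commands never ran.

```python
import json
import subprocess

def pod_status_phase(pod_id):
    """Return the pod's .status.phase, or None if kubectl can't find it
    (e.g. 'Error from server: pods "..." not found')."""
    proc = subprocess.run(
        ["kubectl", "get", "pod", pod_id, "-o=json"],
        capture_output=True, text=True)
    if proc.returncode != 0:
        return None
    return json.loads(proc.stdout).get("status", {}).get("phase")

def job_never_started(phase):
    """If the pod is missing or still Pending, its commands never ran,
    so re-submitting it (a few times, with backoff) should be safe."""
    return phase is None or phase == "Pending"
```

Anything that reached `Running` or a terminal phase would not be blindly restarted, since its commands may have had side effects.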
Finding the standard out for a workflow currently seems to require opening the root node of the workflow, clicking "Backend details/queries", and then clicking "kubectl-logs". The process is cumbersome overall, and "kubectl-logs" seems particularly opaque. Why can't there just be "stdout" and "stderr" links associated with a workflow?
Curiously, the logs call returns successfully (no output, but it's green: http://link.isaachodes.io/3A0V472e1M1R/Screen%20Shot%202016-10-18%20at%202.28.29%20AM.png)
Pending: http://link.isaachodes.io/370S3s0b1z00/Screen%20Shot%202016-10-18%20at%202.28.44%20AM.png
```
Client.Response: URI: http://127.0.0.1:8082/job/describe?id=1ddb3227-d1ad-5ffb-8b57-fbc26e66fada, Meth: GET, Resp: ((encoding (Fixed 242)) (headers ((content-length 242))) (version HTTP_1_1)
Reason: "Error-from-coclobas: Updating failed 11 times: [ Shell-command failed:
Command:
kubectl get pod 1ddb3227-d1ad-5ffb-8b57-fbc26e66fada -o=json
Status: Exited with 1
Standard-output: empty.
Standard-error:
Error from server: pods "1ddb3227-d1ad-5ffb-8b57-fbc26e66fada" not found
]"
```
Got this on a bunch (not all) of my jobs in a recent workflow, on the newest version of coclobas.
Right now we're stuck with the default "n1-highmem-8".
`kubectl` on Google Cloud has so many ways of failing that going to find the logs has become too cumbersome.
Context: #57
Basically, the issue is that the new default `gci` image manages the node in a completely new way, using a minimal distribution that lacks many helpful utilities (e.g. `mount.nfs`). We are currently working around this problem by opting out of `gci` when creating clusters, but Google has officially deprecated the `container-vm` image.

From Release Notes - September 27, 2016:

> ... The old container-vm is now deprecated; it will be supported for a limited time...
In the long run, we probably have to learn how to deal with the new image and revert #57.
The script should start the server and then provide some form of visual feedback when Coclobas returns `Ready`.
It would be nice if every job didn't have to run on, e.g., a 50 GB pod when most jobs probably require far less. For example, a job that makes an HTTP API call should run on a 1 GB node, not a 50 GB node.
If it were possible to schedule multiple jobs at the same time on the same pod (am I using this word correctly?) this would be less of a problem, but still not ideal.
Cf. #68
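One direction here: Kubernetes itself supports per-container resource requests, so the generated pod JSON could carry a `resources.requests` stanza sized to the job instead of sizing every node for the biggest job. A hedged sketch (the container name, image, and sizes below are purely illustrative):

```json
"containers": [
  {
    "name": "http-caller",
    "image": "ubuntu",
    "command": ["bash", "-c", "curl -s http://example.com/"],
    "resources": {
      "requests": { "memory": "1Gi", "cpu": "250m" }
    }
  }
]
```

The Kubernetes scheduler can then pack several small pods onto one node rather than dedicating a large node to each job.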
Error while calling kubectl-logs:

```
Client.Response: URI: http://127.0.0.1:8082/job/logs?id=3466a59c-9c4b-5a69-a8e8-7ad3763f2c55, Meth: GET, Resp: ((encoding (Fixed 467)) (headers ((content-length 467))) (version HTTP_1_1)
(status Bad_request) (flush false))
```
from @smondet
kube says last seen 13m and does not seem to try anything new
`git clone`-ing a DB that had some (fewer than 50) ~5k-node jobs submitted to it doesn't seem to work, so there's no way to interact with the DB without the client itself.
Is there a lot that Irmin buys us?
Just opening a conversation here about the issue, as we've talked a bit offline in the past but should record this dialogue somewhere.
Cf. #73.
ready to run like tests #4
This should be noted somewhere (and coco should give some feedback). Additionally, if the cluster is already up, we shouldn't require `max-nodes` to be passed via the CLI when configuring the cluster.
Would be nice to have an explanation of why we need Coclobas, and what it offers over just Kubernetes, for example.
```
Reason: "Error-from-coclobas: Starting failed: Shell-command failed:
Command:
kubectl create -f /tmp/coclojob84240d.json
Status: Exited with 1
Standard-output: empty.
Standard-error:
The connection to the server 104.196.31.136 was refused - did you specify the right host or port?
"
```
This was fixed with the workaround from kubernetes/kubernetes#28612 (comment):

```
gcloud config set container/use_client_certificate True
```
To create a cluster:

```
gcloud container clusters create <name> --zone <zone> --num-nodes=<num> --min-nodes=<num> --max-nodes=<num> --enable-autoscaling
```

e.g.

```
gcloud container clusters create pici-cioc-test-2 --zone us-east1-c --num-nodes=1 --min-nodes=1 --max-nodes=5 --enable-autoscaling
```
To launch a job:

```
kubectl create -f <path to my pod file>
```

e.g.

```
kubectl create -f pod.json
```

with `pod.json`:
```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "<name here>",
    "labels": {
      "app": "<put label here>"
    }
  },
  "spec": {
    "restartPolicy": "Never",
    "containers": [
      {
        "name": "<container name here>",
        "image": "<image name>",
        "command": ["cmd", "param", "param"]
      }
    ]
  }
}
```
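Generating such a manifest programmatically (as coclobas presumably does before handing a `/tmp/coclojob*.json` file to `kubectl create -f`) can be sketched like this; the function name is hypothetical, not coclobas's API:

```python
import json

def make_pod_spec(name, image, command, label=None):
    """Build a Kubernetes Pod manifest matching the template above.
    restartPolicy is Never so Kubernetes doesn't silently re-run jobs."""
    return {
        "kind": "Pod",
        "apiVersion": "v1",
        "metadata": {
            "name": name,
            "labels": {"app": label or name},
        },
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {"name": name, "image": image, "command": command}
            ],
        },
    }

# Serialize it for `kubectl create -f`:
spec = make_pod_spec("test-ubuntu-noop", "ubuntu", ["bash", "-c", "exit 1"])
manifest = json.dumps(spec, indent=2)
```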
E.g. (I used this to test failing commands with `restartPolicy = Never`):

```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "test-ubuntu-noop",
    "labels": {
      "app": "test-ubuntu-noop"
    }
  },
  "spec": {
    "restartPolicy": "Never",
    "containers": [
      {
        "name": "ubuntu-noop",
        "image": "ubuntu",
        "command": ["bash", "-c", "exit 1"]
      }
    ]
  }
}
```

(Note: the command must be `["bash", "-c", "exit 1"]`; quoting it as `"'exit 1'"` makes bash look for a literal command named `exit 1`, which fails with "command not found" instead of exit status 1.)
E.g. with an NFS mount:

```json
{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "test-ubuntu-noop",
    "labels": {
      "app": "test-ubuntu-noop"
    }
  },
  "spec": {
    "restartPolicy": "Never",
    "containers": [
      {
        "name": "ubuntu-noop",
        "image": "ubuntu",
        "command": ["ls", "/datasets"],
        "volumeMounts": [{
          "name": "nfs-volume",
          "mountPath": "/datasets"
        }]
      }
    ],
    "volumes": [{
      "name": "nfs-volume",
      "nfs": {
        "server": "cioc-test-server",
        "path": "/nfs-pool",
        "readOnly": false
      }
    }]
  }
}
```
To get job status:

```
kubectl describe pod <pod name>
```

To get job status for programmatic manipulation:

```
kubectl get pod <pod name> -o=json
```

`.status.phase` has high-level state information such as pending, running, succeeded, or failed (http://kubernetes.io/docs/user-guide/pod-states/).
`.status.containerStatuses[0].state` contains detailed state info about each container in the pod, including exitCode, start time, stop time...

To get stdout/stderr:

```
kubectl logs <pod name>
```

To kill a job (note `kubectl delete` needs the resource type):

```
kubectl delete pod <pod name>
```
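The JSON from `kubectl get pod <pod name> -o=json` can be reduced to the few fields a scheduler actually cares about. A small Python sketch using the field paths documented above (the function is illustrative, not part of coclobas):

```python
def summarize_pod(pod):
    """Extract the high-level phase plus each container's terminal exit
    code from the dict parsed from `kubectl get pod <name> -o=json`.
    A container that hasn't terminated yet yields None."""
    status = pod.get("status", {})
    phase = status.get("phase")
    exit_codes = []
    for cs in status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        exit_codes.append(terminated.get("exitCode") if terminated else None)
    return phase, exit_codes

# Example: the relevant subset of a failed pod's JSON.
example = {
    "status": {
        "phase": "Failed",
        "containerStatuses": [
            {"state": {"terminated": {"exitCode": 1}}}
        ],
    }
}
```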
Kubernetes forgets about jobs very fast; we should save that info, e.g. to be better able to report issues like hammerlab/ketrew#509 (comment).
Investigate how kubernetes queues pod requests.
After installing/configuring `gcloud` somewhat manually on a GCE node, I had Coclobas refusing to start jobs:

```
Error from kubernetes:
error: You must be logged in to the server (the server has asked for the client to provide credentials)
```

(cf. issue kubernetes/kubernetes#30617)

I ran:

```
gcloud config set container/use_client_certificate True
```

and then `gcloud container clusters get-credentials $CLUSTER` needs to happen (make sure by restarting Coclobas).

I'm not sure if my `gcloud` config was the problem, or if we should investigate this further.
Since `kubectl` loses logs so quickly, it would be nice to keep them around.
Now that the logs are no longer in Irmin, it's easier to delete older entries. We should also be able to delete only the logs that provide little information (nobody cares about commands that succeeded a while ago).
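A possible pruning policy, sketched in Python. This assumes (hypothetically; the real log schema may differ) that each entry records a status and a Unix completion time: keep everything recent and everything that failed, drop old successes.

```python
import time

WEEK = 7 * 24 * 3600

def keep_log_entry(entry, now=None):
    """Decide whether to keep a log entry. Entries for jobs that
    succeeded more than a week ago are dropped; failures and recent
    entries are kept for debugging."""
    now = now if now is not None else time.time()
    if entry["status"] != "succeeded":
        return True
    return now - entry["finished"] < WEEK
```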