
coclobas's People

Contributors

armish, ihodes, smondet, tavinathanson

Forkers

eric-czech

coclobas's Issues

Losing logs

This used to be a huge logfile; we've seen this a few times now: somehow the logs disappear, and later kubectl logs reports nothing either.

Ketrew plugin

Needs a readable representation of the command that was run, like YARN's "status" tab.

Detect and retry jobs that Kubernetes failed to start

We see weird errors where kubectl create ... works but then pod setup/scheduling fails for obscure reasons (google-fluentd out of memory; filesystems not properly mounted; ...).

If we can detect those (i.e. if we know that the job's actual commands didn't begin) we could try to restart them a few times.

  • all the jobs could start with a phone-home, or by creating a witness file? (see the sketch below)
  • if kubectl get ... never succeeds, can we assume the job didn't start?
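
A minimal sketch of the witness-file idea, assuming every pod mounts a shared filesystem (like the NFS volume in the kubectl notes below); the witness path and the COCLOBAS_JOB_ID variable are hypothetical:

# wrap the real command so that a file proves the job actually started
sh -c "touch /nfs-pool/witness-$COCLOBAS_JOB_ID && exec <actual command>"

If the witness file never appears after a few polls, the scheduler could assume the job's commands never began and resubmit.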

Non-kube backend for Coclobas

Kubernetes keeps causing us a lot of grief and instability. We could build a backend that uses GCP instances directly (also gives us more flexibility in terms of providing just the resources a job requests, cf. #19).

Document what coclobas does

  1. OCaml interface to Google's Kube (ideally, more, eventually)
  2. HTTP server and CLI client for scheduling kube jobs
  3. Keeps logs (that Kube loses)
  4. Throttles submission to keep Kube from crashing
  5. … what else does it "fix" about plain Kube?

Hard to find stdout

Finding the standard output for a workflow currently seems to require opening the root node of the workflow, clicking "Backend details/queries", and then clicking "kubectl-logs". The process is cumbersome overall, and "kubectl-logs" in particular is opaque. Why can't there just be "stdout" and "stderr" links associated with a workflow?

Errors getting pods crashing workflow

Client.Response: URI: http://127.0.0.1:8082/job/describe?id=1ddb3227-d1ad-5ffb-8b57-fbc26e66fada, Meth: GET, Resp: ((encoding (Fixed 242)) (headers ((content-length 242))) (version HTTP_1_1)

Reason: "Error-from-coclobas: Updating failed 11 times: [ Shell-command failed:\nCommand:\n\nkubectl get pod 1ddb3227-d1ad-5ffb-8b57-fbc26e66fada -o=json\n\nStatus: Exited with 1\n\nStandard-output: empty.\n\nStandard-error:\n\nError from server: pods \"1ddb3227-d1ad-5ffb-8b57-fbc26e66fada\" not found\n\n ]"

Got this on a bunch (not all) of my jobs in a recent workflow, on the newest version of coclobas.

Use `gci` (or another alternative) as the base node image

Context: #57

Basically, the issue is that the new default gci image manages the node in a completely new way, using a minimal distribution that lacks many helpful utilities (e.g. mount.nfs). We are currently working around this problem by opting out of gci when creating clusters, but Google has officially deprecated container-vm:

From Release Notes - September 27, 2016:

... The old container-vm is now deprecated; it will be supported for a limited time...

In the long run, we probably have to learn how to deal with the new image and revert #57.

Allow for multiple machine-type clusters

It would be nice if every job didn't have to run on, e.g., a 50 GB pod when most jobs probably require far less.

E.g. a job that makes an HTTP API call should run on a 1 GB node, not a 50 GB node.

If it were possible to schedule multiple jobs at the same time on the same pod (am I using this word correctly?), this would be less of a problem, but still not ideal.
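
One possible direction (a sketch only, not something Coclobas supports today): GKE clusters can carry several node pools with different machine types, and a pod can be steered to a pool via a nodeSelector on the cloud.google.com/gke-nodepool label. The pool name small-pool and machine type g1-small below are made up for illustration:

gcloud container node-pools create small-pool --cluster pici-cioc-test-2 --zone us-east1-c --machine-type g1-small --num-nodes 1

A job that only makes HTTP API calls could then add {"nodeSelector": {"cloud.google.com/gke-nodepool": "small-pool"}} to its pod spec.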

Can't check kubectl-logs for some jobs

Error while calling kubectl-logs:
Client.Response: URI: http://127.0.0.1:8082/job/logs?id=3466a59c-9c4b-5a69-a8e8-7ad3763f2c55, Meth: GET, Resp: ((encoding (Fixed 467)) (headers ((content-length 467))) (version HTTP_1_1)
 (status Bad_request) (flush false))

from @smondet

kube says "last seen 13m" and does not seem to try anything new

Consider new DB

git-cloning a DB to which some (fewer than 50) jobs of ~5k nodes each were submitted doesn't seem to work, so there's no way to interact with the DB without the client itself.

Is there a lot that Irmin buys us?

Just opening a conversation here about the issue, as we've talked a bit offline in the past but should record this dialogue somewhere.

README

Would be nice to have an explanation of why we need Coclobas, and what it offers over just Kubernetes, for example.

`kubectl create -f` failing intermittently

Reason: "Error-from-coclobas: Starting failed: Shell-command failed:\nCommand:\n\nkubectl create -f /tmp/coclojob84240d.json\n\nStatus: Exited with 1\n\nStandard-output: empty.\n\nStandard-error:\n\nThe connection to the server 104.196.31.136 was refused - did you specify the right host or port?\n\n"

kubectl commands

To create a cluster:

gcloud container clusters create <name> --zone <zone> --num-nodes=<num> --min-nodes=<num> --max-nodes=<num> --enable-autoscaling

e.g.

gcloud container clusters create pici-cioc-test-2 --zone us-east1-c --num-nodes=1 --min-nodes=1 --max-nodes=5 --enable-autoscaling
  • Set num-nodes = min-nodes when using one node-pool.
  • We have done scaling tests showing that the node pool will expand and contract based on resource demand.

To launch a job:

kubectl create -f <path to my pod file>

e.g.

kubectl create -f pod.json 

pod.json

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "<name here>",
    "labels": {
      "app": "<put label here>"
    }
  },
  "spec": {
    "restartPolicy" : "Never",
    "containers": [
      {
        "name": "<container name here>",
        "image": "<image name>",
        "command": ["cmd", "param", "param"]
      }
    ]
  }
}

E.g.  (I used this to test failing commands with restartPolicy = Never)

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "test-ubuntu-noop",
    "labels": {
      "app": "test-ubuntu-noop"
    }
  },
  "spec": {
    "restartPolicy" : "Never",
    "containers": [
      {
        "name": "ubuntu-noop",
        "image": "ubuntu",
        "command": ["bash", "-c", "'exit 1'"]
      }
    ]
  }
}

  • for pod metadata.name, I would generate a unique id for each run (a GUID is probably random enough that you don't need to verify uniqueness; see the sketch below).
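
Pod names must be DNS-compatible (lowercase alphanumerics and dashes), so a generated id needs minor massaging; a sketch using uuidgen:

# hypothetical: derive a unique, DNS-safe pod name from a random UUID
POD_NAME="job-$(uuidgen | tr '[:upper:]' '[:lower:]')"

That value can then be substituted into the metadata.name and metadata.labels.app fields of the pod file before running kubectl create -f.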

E.g. With mounting NFS:

{
  "kind": "Pod",
  "apiVersion": "v1",
  "metadata": {
    "name": "test-ubuntu-noop",
    "labels": {
      "app": "test-ubuntu-noop"
    }
  },
  "spec": {
    "restartPolicy" : "Never",
    "containers": [
      {
        "name": "ubuntu-noop",
        "image": "ubuntu",
        "command": ["ls", "/datasets"],
        "volumeMounts" : [{
          "name" : "nfs-volume",
          "mountPath" : "/datasets"
        }]
      }
    ],
    "volumes" : [{
      "name" : "nfs-volume",
      "nfs" : {
        "server" : "cioc-test-server",
        "path": "/nfs-pool",
        "readOnly": false
      }
    }]
  }
}

  • Note that volumeMounts.mountPath determines where the directory will appear in the container's file system, while volumes.nfs.path references the NFS server's exported path.

To get job status:

kubectl describe pod <pod name>

To get job status for programmatic manipulation:

kubectl get pod <pod name> -o=json

.status.phase has high-level state information such as pending, running, succeeded, or failed (http://kubernetes.io/docs/user-guide/pod-states/).

.status.containerStatuses[0].state contains detailed state info about each container in the pod including exitCode, start time, stop time...
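
For quick shell-level inspection of those fields, a sketch assuming jq is installed (kubectl's built-in -o=jsonpath would work too):

# extract the high-level phase and the first container's detailed state
kubectl get pod <pod name> -o=json | jq -r '.status.phase'
kubectl get pod <pod name> -o=json | jq '.status.containerStatuses[0].state'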

To get stdout, stderr:

kubectl logs <pod name>

To kill a job:

kubectl delete pod <pod name>
  • we may need to delete successfully completed pods as well, to keep the metadata the master is managing in check (see the sketch below).
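
A sketch of such a clean-up, using a jsonpath filter to select pods whose phase is Succeeded and deleting them in bulk:

# delete every pod that has already completed successfully
kubectl get pods -o jsonpath='{range .items[?(@.status.phase=="Succeeded")]}{.metadata.name}{"\n"}{end}' | xargs -r kubectl delete pod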

Credentials error with Kubernetes

By installing/configuring gcloud a bit manually on a GCE node, I ended up with coclobas refusing to start jobs:

Error from kubernetes:

error: You must be logged in to the server (the server has asked for the client to provide credentials)

(cf. issue kubernetes/kubernetes#30617)

I ran this:

gcloud config set container/use_client_certificate True

and then gcloud container clusters get-credentials $CLUSTER needs to run (restart Coclobas to make sure the new credentials take effect).
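
Putting the workaround in one place (cluster name and zone are placeholders):

gcloud config set container/use_client_certificate True
gcloud container clusters get-credentials $CLUSTER --zone $ZONE
# then restart Coclobas so it picks up the refreshed credentials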

I'm not sure if my config of gcloud was the problem, or if we should investigate this further.

Store kube logs

Since Kubernetes loses them so quickly, it would be nice to keep them around.
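
A minimal stop-gap while the pod still exists (a sketch; the logs/ directory is hypothetical):

# snapshot a pod's output before Kubernetes garbage-collects it
kubectl logs <pod name> > logs/<pod name>.txt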

Automatic/partial/selective clean-up of logs

Now that the logs are no longer in Irmin, it's easier to delete older entries.

We should also be able to delete only the logs that provide little information (nobody cares about commands that succeeded a while ago).
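
For the "older entries" part, a sketch assuming the logs live as plain files under a hypothetical $COCLOBAS_ROOT/logs directory:

# delete log files untouched for more than 30 days
find "$COCLOBAS_ROOT/logs" -type f -mtime +30 -delete

Deleting only the logs of succeeded commands would additionally need job status recorded alongside the files, which this sketch doesn't cover.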
