oaat-operator

oaat-operator is a Kubernetes operator intended to manage a group of tasks of which only one should be running at any time. I make it available in case someone else is interested in using such an operator.

Features

  • Define a list of items to run in a group – only one item in that group will run at a time.
  • Each item is run in a Kubernetes pod.
  • Pass the item name in an environment variable (OAAT_ITEM) or as a string substitution in the command, args or env of the pod's spec.
  • Define the frequency at which each item is run.
  • Detect whether an item has succeeded or failed, based on the return value from the pod.
  • Continue attempting to run an item until it succeeds.
  • Track last failure and last success times for each item.
  • Track the number of failures since the last success.
  • Specify a cool-off period for an item which has failed (the item will not be selected to run again until the cool-off period has expired).

Approach

oaat-operator is based on the kopf framework and uses two Kubernetes Custom Resource Definitions:

  • OaatType – defines a type of item to be run and the definition of what 'run' means. Currently oaat-operator only supports a Pod as the mechanism to run an item.
  • OaatGroup – defines a group of items which are to be run 'one at a time', including the frequency that each item should be run, cool-off timers for item failures, etc.

The operator keeps track of item failures and endeavours to retry failures without blocking un-run items. The intention is to run each item approximately in line with the frequency setting for the OaatGroup.

The operator sets up a timer on the OaatGroup and each time the timer triggers, it will then:

  • if an item is currently running, quit the cycle to wait for the timer to expire again.
  • if an item is not running, determine whether an item is ready to run and, if so, run it.

The operator selects an item to run using the following algorithm:

  • phase one: choose valid item candidates:

    • start with a list of all possible items to run
    • remove from the list items which have been successful within the period in the frequency setting in the OaatGroup
    • remove from the list items which have failed within the period in the failureCoolOff setting in the OaatGroup
  • phase two: choose the item to run from the valid item candidates:

    • if there is just one item, choose it
    • find the item(s) with the oldest success (an item which has never succeeded counts as oldest)
    • if there is just one item that is "oldest", choose it
    • of the items with the oldest success, find the item with the oldest failure
    • if there is just one item that has both the oldest success and the oldest failure, choose it
    • choose at random (this is likely to occur if no items have been run – i.e. first iteration)
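
The two phases above can be sketched in plain Python. This is an illustration only, not the operator's own code: `ItemState`, `choose_item` and their parameters are hypothetical names, timestamps are assumed to be naive `datetime` values, "within the period" is assumed to mean strictly less than the period, and an item which has never failed is assumed to count as having the oldest failure.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Stand-in for "never happened": older than any real timestamp.
NEVER = datetime.min


@dataclass
class ItemState:
    """Hypothetical per-item record; the operator's real status layout may differ."""
    name: str
    last_success: Optional[datetime] = None
    last_failure: Optional[datetime] = None


def choose_item(items, now, frequency, failure_cool_off, rng=random):
    """Return the ItemState to run next, or None if no item is ready."""
    # Phase one: drop items which succeeded or failed too recently.
    candidates = [
        item for item in items
        if not (item.last_success and now - item.last_success < frequency)
        and not (item.last_failure and now - item.last_failure < failure_cool_off)
    ]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]

    # Phase two: keep only the items with the oldest success.
    oldest_success = min(i.last_success or NEVER for i in candidates)
    candidates = [i for i in candidates if (i.last_success or NEVER) == oldest_success]
    if len(candidates) == 1:
        return candidates[0]

    # Still tied: keep only the items with the oldest failure.
    oldest_failure = min(i.last_failure or NEVER for i in candidates)
    candidates = [i for i in candidates if (i.last_failure or NEVER) == oldest_failure]
    if len(candidates) == 1:
        return candidates[0]

    # Still tied (typical on the first iteration): pick at random.
    return rng.choice(candidates)
```

Passing `rng` as a parameter is a design choice of this sketch: it lets a test replace the random tie-break with something deterministic.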

Quick Start

Create the CRDs

kubectl apply -f manifests/01-oaat-operator-crd.yaml

This creates two CRDs: OaatType and OaatGroup.

Create an OaatType

apiVersion: kawaja.net/v1
kind: OaatType
metadata:
  name: oaattest
spec:
  type: pod
  podspec:
    container:
      name: test
      image: busybox
      command: ["sh", "-x", "-c"]
      args:
        - |
          echo "OAAT_ITEM={{oaat_item}}"
          sleep $(shuf -i 10-180 -n 1)
          exit $(shuf -i 0-1 -n 1)

This one sleeps for a random time (between 10 seconds and 3 minutes) and randomly succeeds (50%) or fails (50%).
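
As a rough illustration of the item-name substitution described under Features, the sketch below replaces a placeholder in the command, args and env of a container spec and adds the OAAT_ITEM environment variable. `substitute_item` is a hypothetical helper, not the operator's implementation. The placeholder spelling is taken from the example above and is left as a parameter, since the test list in the issues below writes it as %%oaat_item%%.

```python
import copy


def substitute_item(container, item_name, placeholder="{{oaat_item}}"):
    """Return a copy of a container spec (a dict) with the item name filled in."""
    result = copy.deepcopy(container)  # leave the caller's spec untouched

    # Replace the placeholder in every command and args string.
    for key in ("command", "args"):
        if key in result:
            result[key] = [part.replace(placeholder, item_name) for part in result[key]]

    # Replace the placeholder in env values which are plain strings.
    env = [
        {**var, "value": var["value"].replace(placeholder, item_name)}
        if isinstance(var.get("value"), str) else var
        for var in result.get("env", [])
    ]
    # Always expose the item name as OAAT_ITEM.
    env.append({"name": "OAAT_ITEM", "value": item_name})
    result["env"] = env
    return result
```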

Create an OaatGroup

apiVersion: kawaja.net/v1
kind: OaatGroup
metadata:
  name: testsimple-oaat
spec:
  frequency: 5m
  oaatType: oaattest
  oaatItems:
    - item1
    - item2

This creates a group of two items. Each item will be run approximately every 5 minutes, and only one item will run at a time.
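
To make the frequency value concrete, here is a minimal sketch of turning a string such as 5m into a `timedelta`. Only the 5m form appears in this README; the s, h and d suffixes and the `parse_duration` name are assumptions made for illustration, and the operator may accept a different set of formats.

```python
import re
from datetime import timedelta

# Assumed suffixes; only "m" is confirmed by the example above.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text):
    """Convert a string like '5m' into a timedelta, or raise ValueError."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})
```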

Start the operator

kubectl apply -f manifests/02-oaat-operator-deployment.yaml

Watch item progress

kubectl get oaatgroup -w

Testing

To run the test suite under pytest, a Kubernetes environment such as minikube or k3s is required on the developer's workstation. k3s is currently used for the CI pipeline in GitHub.

See the minikube documentation for details on how to install and set up minikube.

See the k3s documentation for details on how to install and set up k3s.

Additionally, the CRDs must be installed:

kubectl apply -f manifests/01-oaat-operator-crd.yaml

Without this Kubernetes testing environment and the CRDs, many of the tests will not succeed.

Limitations

  • oaat-operator is not intended for precise timing of item start times – the check for whether items are ready to run occurs every 60 seconds.
  • Each item in the group will use the same pod specification (other than the string substitutions in the command, args or env). If you want to run different commands, this must be done within the specified pod.
  • The list of items is currently fixed, specified in the OaatGroup.
  • Item pods can only have a single container.
  • oaat-operator has only been tested on Kubernetes 1.23 and 1.24.

Roadmap

  • Documentation
  • Blackout windows (#2) – time windows during which no items will be started. Potentially also provide an option where running items could be stopped during the blackout window.
  • EachOnce (#3) – ensure each item runs once successfully and then stop.
  • Exponential backoff (#4) – rather than just providing a fixed cool-off period, exponentially increase the wait.
  • Dynamic item list – use other mechanisms to create the list of items:
    • output of a container (#5)
    • contents of a configmap (#6)
    • result of an API call? (#7)
  • Schema validation (#8) – currently uses some spot checks of certain critical fields; instead, use json-schema to validate the CRD objects against a schema.
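
The exponential backoff item (#4) is not implemented. One possible shape for it, purely as a sketch, is to double the cool-off for each failure since the last success, up to a cap. `backoff_cool_off`, the doubling factor and the `maximum` cap are all assumptions made for illustration.

```python
from datetime import timedelta


def backoff_cool_off(base, failures, maximum):
    """Cool-off after `failures` consecutive failures: base, 2x base, 4x base, ... capped."""
    delay = base
    # The first failure waits `base`; each further failure doubles the wait.
    for _ in range(max(failures - 1, 0)):
        if delay >= maximum:
            break  # already at the cap, no need to keep doubling
        delay *= 2
    return min(delay, maximum)
```

With a base of 5 minutes and a cap of 1 hour, successive failures would wait 5, 10, 20 and 40 minutes, then 1 hour from the fifth failure onwards.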

History

This started as a "spare time" COVID-19 lockdown project to improve my knowledge of Python and the reliability and completeness of the backups of my home lab.

Previously, I used a big script which sequentially backed up various directories to cloud providers. However, if the backup of a directory failed, it would not retry and the issue would often be lost in the logs. When a backup failure was detected, I could run the script again, but it would re-run backups that had already been successful.

I decided I wanted to move to more of a continuous backup approach, with smaller backup jobs which could be easily retried without re-running successful backups.

The features evolved from there.

oaat-operator's Issues

Unit test suite is incomplete

Unit tests for OaatGroup

  • find_job_to_run()
    • no items
    • one item
    • no items within 'frequency'
    • failed items within 'cool off'
    • single oldest success
    • multiple oldest success single oldest failure
    • multiple oldest success multiple oldest failure (mock random)
  • run_item()
    • valid spec
    • invalid spec
    • %%oaat_item%% substitution
  • validate_items()
    • no items
    • set annotations
  • validate_state()
    • invalid state (failed pod creation)
    • valid state
  • validate_running_pod()
    • nothing expected to be running
    • expected to be running, but not
    • expected to be running, and is
    • running with phase update
    • succeeded, but not yet acknowledged
    • unexpected state

Unit tests for Pod

  • update_failure_status()
    • missing pod
    • pod not terminated
    • pod finished before most recent failure
    • pod finished after most recent failure
  • update_success_status()
    • missing pod
    • pod not terminated
    • pod finished before most recent success
    • pod finished after most recent success

Unit tests of Overseer

  • logging
  • get_status()
    • non-existent status
    • non-existent status with default
    • existing status non-None
    • existing status with default
    • existing status None (?)
  • set_status()
    • empty status
    • non-empty status
  • get_label()
    • non-existent label
    • non-existent label with default
    • existing status non-None
    • existing status with default
  • get_kubeobj()
    • without specifying my_pykube_objtype
    • missing object
    • object exists
  • set_annotation()
    • empty annotation
    • non-empty annotation
  • delete()
  • handle_processing_complete()
    • error/warning/info
    • message
    • state
    • no messages

Unit tests for Handler

  • startup() ?
  • oaat_timer() (oaatgroups)
    • loops updated
    • status set
    • phase set
    • start item
    • don't start item when paused
    • call validate_items
    • set error message
    • error if validate_expected_pod_is_running returns (i.e. no exception)
    • call validate_expected_pod_is_running when pod_expected
    • don't start item when one is already running
  • pod_phasechange()
    • return error on phase change failure
    • phase updated
    • error if update_phase returns (i.e. no exception)
  • pod_succeeded()
    • return error if unable to create pod object
    • status updated
    • error if update_success_status returns (i.e. no exception)
  • pod_failed()
    • return error if unable to create pod object
    • status updated
    • error if update_failure_status returns (i.e. no exception)
  • cleanup_pod()
    • return error if unable to create pod object
    • delete pod if called
    • return error if pod delete fails
    • KOPF tests
      • old pods in failed state exist
      • old pods in succeeded state exist
      • old pods in other states exist
      • no old pods exist
  • oaat_action() (oaatgroups)
    • perform validations
    • return error if unable to create og object
    • return error if oaattype validation fails
    • perform annotation

Blackout windows

Specify time windows during which no items will be started. Potentially also provide an option where running items could be stopped during the blackout window.

Publish container in Docker Hub

In order for the quick start in the README to work, the operator needs a container image in a registry. Add one, with a test/release process, to Docker Hub.

Implement EachOnce

Ensure each item runs once successfully and then stop, deleting the OaatGroup.

Implement dynamic item list: container output

Use the output of a container/pod to provide the list of items.

How often to run the container? Once, then store the item list in the metadata of the OaatGroup? Each run, and update the list each time?

Implement CRD schema validation

Currently the operator uses some spot checks of certain critical fields; instead, use json-schema to validate the CRD objects against a schema.

Detection of field changes is not working

oaat-operator depends on being able to detect field changes to work effectively. This is not currently working, so the operator falls back on the kopf timer function.

Unit testing should use mocking, not minikube

The unit tests currently take a short-cut by running against minikube, so they are not really unit tests. It seems reasonable to use minikube for system testing, but unit testing should move to a mocking approach.
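
As a small illustration of the mocking approach, the sketch below uses the standard library's `unittest.mock` to make a random tie-break deterministic, along the lines of the "(mock random)" case in the test list above. `pick_tied_item` is a hypothetical stand-in written for this example, not a function from the operator.

```python
import random
import unittest
from unittest import mock


def pick_tied_item(names):
    """Hypothetical stand-in for the final 'choose at random' step."""
    return random.choice(names)


class TieBreakTest(unittest.TestCase):
    def test_random_choice_is_mocked(self):
        # Replace random.choice so the outcome no longer depends on chance.
        with mock.patch("random.choice", return_value="item2") as fake_choice:
            self.assertEqual(pick_tied_item(["item1", "item2"]), "item2")
        # The code under test must have delegated the decision to random.choice.
        fake_choice.assert_called_once_with(["item1", "item2"])
```

The same pattern could be applied to calls into the Kubernetes API, so that tests of the selection and status logic do not need a cluster.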
