azure-k8s-autopilot's Introduction

Azure Kubernetes Autopilot

Kubernetes service for automatic maintenance of an Azure cluster.

  • auto repair (repair nodes if NotReady; VM and VMSS support)
  • auto update (update VMSS instances automatically to latest model; only VMSS)

Supports Azure AKS and custom Azure Kubernetes clusters.

Supports shoutrrr notifications.

(Successor of azure-k8s-autorepair)

Configuration

Usage:
  azure-k8s-autopilot [OPTIONS]

Application Options:
      --log.debug                                                  debug mode [$LOG_DEBUG]
      --log.devel                                                  development mode [$LOG_DEVEL]
      --log.json                                                   Switch log output to json format [$LOG_JSON]
      --dry-run                                                    Dry run (no redeploy triggered) [$DRY_RUN]
      --instance.nodename=                                         Name of node where autopilot is running [$INSTANCE_NODENAME]
      --instance.namespace=                                        Name of namespace where autopilot is running [$INSTANCE_NAMESPACE]
      --instance.pod=                                              Name of pod where autopilot is running [$INSTANCE_POD]
      --azure.environment=                                         Azure environment name (default: AZUREPUBLICCLOUD) [$AZURE_ENVIRONMENT]
      --autoscaler.scaledown-locktime=                             Prevents cluster autoscaler from scaling down the affected node after
                                                                   update and repair (default: 60m) [$AUTOSCALER_SCALEDOWN_LOCKTIME]
      --kube.node.labelselector=                                   Node Label selector which nodes should be checked
                                                                   [$KUBE_NODE_LABELSELECTOR]
      --lease.enable                                               Enable lease (leader election; enabled by default in docker images)
                                                                   [$LEASE_ENABLE]
      --lease.name=                                                Name of lease lock (default: azure-k8s-autopilot-leader) [$LEASE_NAME]
      --repair.crontab=                                            Crontab of check runs (default: @every 2m) [$REPAIR_CRONTAB]
      --repair.notready-threshold=                                 Threshold (duration) when the automatic repair should be tried (eg.
                                                                   after 10 mins of NotReady state after last successful heartbeat)
                                                                   (default: 10m) [$REPAIR_NOTREADY_THRESHOLD]
      --repair.concurrency=                                        How many VMs should be redeployed concurrently (default: 1)
                                                                   [$REPAIR_CONCURRENCY]
      --repair.lock-duration=                                      Duration how long should be waited for another redeploy (default: 30m)
                                                                   [$REPAIR_LOCK_DURATION]
      --repair.lock-duration-error=                                Duration how long should be waited for another redeploy in case an error
                                                                   occurred (default: 5m) [$REPAIR_LOCK_DURATION_ERROR]
      --repair.azure.vmss.action=[restart|redeploy|reimage|delete] Defines the action which should be tried to repair the node (VMSS)
                                                                   (default: redeploy) [$REPAIR_AZURE_VMSS_ACTION]
      --repair.azure.vm.action=[restart|redeploy]                  Defines the action which should be tried to repair the node (VM)
                                                                   (default: redeploy) [$REPAIR_AZURE_VM_ACTION]
      --repair.azure.provisioningstate=                            Azure VM provisioning states where repair should be tried (eg. avoid
                                                                   repair in "upgrading" state; "*" to accept all states) (default:
                                                                   succeeded, failed) [$REPAIR_AZURE_PROVISIONINGSTATE]
      --repair.lock-annotation=                                    Node annotation for repair lock time (default:
                                                                   autopilot.webdevops.io/repair-lock) [$REPAIR_LOCK_ANNOTATION]
      --update.crontab=                                            Crontab of check runs (default: @every 15m) [$UPDATE_CRONTAB]
      --update.concurrency=                                        How many VMs should be updated concurrently (default: 1)
                                                                   [$UPDATE_CONCURRENCY]
      --update.lock-duration=                                      Duration how long should be waited for another update (default: 15m)
                                                                   [$UPDATE_LOCK_DURATION]
      --update.lock-duration-error=                                Duration how long should be waited for another update in case an error
                                                                   occurred (default: 5m) [$UPDATE_LOCK_DURATION_ERROR]
      --update.lock-annotation=                                    Node annotation for update lock time (default:
                                                                   autopilot.webdevops.io/update-lock) [$UPDATE_LOCK_ANNOTATION]
      --update.ongoing-annotation=                                 Node annotation for ongoing update lock (default:
                                                                   autopilot.webdevops.io/update-ongoing) [$UPDATE_ONGOING_ANNOTATION]
      --update.exclude-annotation=                                 Node annotation for excluding node for updates (default:
                                                                   autopilot.webdevops.io/exclude) [$UPDATE_EXCLUDE_ANNOTATION]
      --update.azure.vmss.action=[update|update+reimage|delete]    Defines the action which should be tried to update the node (VMSS)
                                                                   (default: update+reimage) [$UPDATE_AZURE_VMSS_ACTION]
      --update.azure.provisioningstate=                            Azure VM provisioning states where update should be tried (eg. avoid
                                                                   repair in "upgrading" state; "*" to accept all states) (default:
                                                                   succeeded, failed) [$UPDATE_AZURE_PROVISIONINGSTATE]
      --update.failed-threshold=                                   Failed node threshold when node update is stopped (default: 2)
                                                                   [$UPDATE_FAILED_THRESHOLD]
      --drain.kubectl=                                             Path to kubectl binary (default: kubectl) [$DRAIN_KUBECTL]
      --drain.enable                                               Enable drain handling [$DRAIN_ENABLE]
      --drain.delete-emptydir-data                                 Continue even if there are pods using emptyDir (local emptydir that will
                                                                   be deleted when the node is drained) [$DRAIN_DELETE_EMPTYDIR_DATA]
      --drain.force                                                Continue even if there are pods not managed by a ReplicationController,
                                                                   ReplicaSet, Job, DaemonSet or StatefulSet [$DRAIN_FORCE]
      --drain.grace-period=                                        Period of time in seconds given to each pod to terminate gracefully. If
                                                                   negative, the default value specified in the pod will be used.
                                                                   [$DRAIN_GRACE_PERIOD]
      --drain.ignore-daemonsets                                    Ignore DaemonSet-managed pods. [$DRAIN_IGNORE_DAEMONSETS]
      --drain.pod-selector=                                        Label selector to filter pods on the node [$DRAIN_POD_SELECTOR]
      --drain.timeout=                                             The length of time to wait before giving up, zero means infinite
                                                                   (default: 0s) [$DRAIN_TIMEOUT]
      --drain.wait-after=                                          Wait after drain to let Kubernetes detach volumes etc (default: 30s)
                                                                   [$DRAIN_WAIT_AFTER]
      --drain.dry-run                                              Do not drain, uncordon or label any node [$DRAIN_DRY_RUN]
      --drain.disable-eviction                                     Force drain to use delete, even if eviction is supported. This will
                                                                   bypass checking PodDisruptionBudgets, use with caution.
                                                                   [$DRAIN_DISABLE_EVICTION]
      --drain.retry-without-eviction                               Retry drain without eviction if first drain failed
                                                                   [$DRAIN_RETRY_WITHOUT_EVICTION]
      --drain.ignore-failure                                       Ignore failed drain and continue with actions [$DRAIN_IGNORE_FAILURE]
      --notification=                                              Shoutrrr url for notifications (https://containrrr.github.io/shoutrrr/)
                                                                   [$NOTIFICATION]
      --server.bind=                                               Server address (default: :8080) [$SERVER_BIND]
      --server.timeout.read=                                       Server read timeout (default: 5s) [$SERVER_TIMEOUT_READ]
      --server.timeout.write=                                      Server write timeout (default: 10s) [$SERVER_TIMEOUT_WRITE]

Help Options:
  -h, --help                                                       Show this help message
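
The repair options above can be read as a simple decision rule: a node is a repair candidate once it has been NotReady for longer than `--repair.notready-threshold` and is not inside a repair lock window. The following is a minimal Python sketch of that rule, not the project's Go implementation. The helper names `parse_duration` and `should_repair` are hypothetical, and it assumes the lock is expressed as a point in time before which no further repair is attempted.

```python
from datetime import datetime, timedelta, timezone
import re

# Seconds per unit for the simple Go-style durations used in the help text.
_UNITS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Parse a simple duration such as '10m', '30s' or '1h30m' (assumed subset)."""
    parts = re.findall(r"(\d+)([smh])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNITS[u] for n, u in parts))


def should_repair(ready, last_heartbeat, lock_until, now, threshold="10m"):
    """Return True when a node qualifies for a repair attempt (illustrative rule)."""
    if ready:
        return False  # healthy nodes are never repaired
    if lock_until is not None and now < lock_until:
        return False  # still inside the repair lock window
    return now - last_heartbeat >= parse_duration(threshold)


# Usage: a node whose last heartbeat is 12 minutes old, with no lock set.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(should_repair(False, now - timedelta(minutes=12), None, now))
```

Passing `lock_until=now + parse_duration("30m")` models the default `--repair.lock-duration` and makes the same node ineligible until the lock expires.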

For Azure API authentication (using environment variables), see https://docs.microsoft.com/en-us/azure/developer/go/azure-sdk-authentication

For Kubernetes, the ServiceAccount is discovered automatically (or you can set the KUBECONFIG environment variable to the path of your kubeconfig file).
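
As a rough illustration of the Kubernetes credential lookup described above, the Python sketch below picks a credential source from the environment. It is an assumption-laden sketch, not the project's code: the function name `kube_config_source` is hypothetical, the precedence (KUBECONFIG first, then the in-cluster ServiceAccount) is assumed, and the token path and `KUBERNETES_SERVICE_HOST` variable are the standard in-cluster conventions.

```python
import os

# Standard location of the mounted ServiceAccount token inside a pod.
SA_TOKEN = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def kube_config_source(env=None, path_exists=os.path.exists):
    """Return (kind, path) describing where Kubernetes credentials would come from."""
    env = os.environ if env is None else env
    kubeconfig = env.get("KUBECONFIG")
    if kubeconfig:
        # Assumed precedence: an explicit kubeconfig wins.
        return ("kubeconfig", kubeconfig)
    if env.get("KUBERNETES_SERVICE_HOST") and path_exists(SA_TOKEN):
        # Running inside a pod with a mounted ServiceAccount token.
        return ("in-cluster", SA_TOKEN)
    return ("none", "")


# Usage: inspect the current process environment.
print(kube_config_source())
```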

Metrics

(see :8080/metrics)

Metric                        Description
autopilot_repair_count        Count of repair actions
autopilot_repair_node_status  Node status
autopilot_repair_duration     Duration of repair task
autopilot_update_count        Count of update actions
autopilot_update_duration     Duration of last exec
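
To check these metrics by hand, the Prometheus text output of the metrics endpoint can be filtered for the `autopilot_` prefix. The Python sketch below shows one way to do that. The sample payload, the `nodeName` label, and the function name `autopilot_samples` are assumptions for illustration; the real label sets are not documented here.

```python
import re

# Hypothetical sample of the text exposition format; labels are assumed.
SAMPLE = """\
# HELP autopilot_repair_count Count of repair actions
# TYPE autopilot_repair_count counter
autopilot_repair_count{nodeName="node-a"} 3
autopilot_update_count{nodeName="node-b"} 1
go_goroutines 42
"""

# metric name, optional raw label block, numeric value
_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def autopilot_samples(text):
    """Return (name, raw_labels, value) tuples for autopilot_* metrics only."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match and match.group(1).startswith("autopilot_"):
            samples.append((match.group(1), match.group(2) or "", float(match.group(3))))
    return samples


for name, labels, value in autopilot_samples(SAMPLE):
    print(name, labels, value)
```

In practice the text would come from an HTTP GET against the address set with `--server.bind`, for example after port-forwarding the pod.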

AzureTracing metrics

See the armclient tracing documentation.

azure-k8s-autopilot's People

Contributors

aslafy-z, h4wkmoon, mblaschke

azure-k8s-autopilot's Issues

Is this project still maintained?

I'd love to use this project for my infrastructure. Since there have been no updates since last year: do you still use it, and will you continue maintaining it? I would love to contribute to it if I find anything to do.

go stack trace when losing election

Hi,
With an image built from "master", I get the message below when using the lease with 2 pods.
BTW, the lease feature is not enabled by default in the latest Docker image; I had to set LEASE_ENABLE=true.

This is the log of the election loser. The other pod won and works properly.
{"level":"info","caller":"autopilot/main.go:360","msg":"trying to become leader"}
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 97 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24 +0x65
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/log.go:59 +0xbd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Enabled(0xc0003c64c0, 0xedc939edd?)
/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/log/deleg.go:111 +0x3d
github.com/go-logr/logr.Logger.Enabled(...)
/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:261
github.com/go-logr/logr.Logger.Info({{0x1b6d368?, 0xc0003c64c0?}, 0x1b71d80?}, {0x190fddc, 0x18}, {0x0, 0x0, 0x0})
/go/pkg/mod/github.com/go-logr/[email protected]/logr.go:274 +0x78
github.com/operator-framework/operator-lib/leader.Become({0x1b6a490, 0xc000138028}, {0xc00004e03b, 0x25}, {0x0, 0x0, 0x0?})
/go/pkg/mod/github.com/operator-framework/[email protected]/leader/leader.go:192 +0xa0e
github.com/webdevopos/azure-k8s-autopilot/autopilot.(*AzureK8sAutopilot).leaderElect(0xc000250840)
/go/src/github.com/webdevops/azure-k8s-autopilot/autopilot/main.go:369 +0x167
github.com/webdevopos/azure-k8s-autopilot/autopilot.(*AzureK8sAutopilot).Start.func1()
/go/src/github.com/webdevops/azure-k8s-autopilot/autopilot/main.go:248 +0x26
created by github.com/webdevopos/azure-k8s-autopilot/autopilot.(*AzureK8sAutopilot).Start
/go/src/github.com/webdevops/azure-k8s-autopilot/autopilot/main.go:247 +0x56

Crashes on API server timeout

When the API server is not reachable (timeout) for any reason, this controller crashes. It is then restarted and resumes its work, so this is not a critical issue. I think it could retry with backoff instead of crashing. I'll post some logs here when I reproduce the issue.
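
The retry-with-backoff idea suggested in this issue could look roughly like the Python sketch below. It only illustrates the technique (exponential backoff with a cap) and is not a patch for the controller, which is written in Go. The function name `retry_with_backoff` and all defaults are hypothetical.

```python
import time


def retry_with_backoff(call, attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call `call()` and retry on transient errors, doubling the wait each time."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(min(delay, max_delay))  # wait, capped at max_delay
            delay *= 2


# Usage: a stand-in for an API call that times out twice, then succeeds.
state = {"calls": 0}


def flaky_api_call():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("api server timeout")
    return "ok"


print(retry_with_backoff(flaky_api_call, sleep=lambda seconds: None))
```

Injecting `sleep` keeps the example fast to run and makes the wait sequence easy to verify in a test.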
