Kubernetes Rollout Operator

This operator coordinates the rollout of pods between different StatefulSets within a specific namespace and can be used to manage multi-AZ deployments where pods running in each AZ are managed by a dedicated StatefulSet.

How updates work

The operator coordinates the rollout of pods belonging to StatefulSets that have the rollout-group label and the OnDelete update strategy. The label value should identify the group of StatefulSets the StatefulSet belongs to. Make sure the StatefulSet also has a name label in its spec.template; the operator uses it to find the pods belonging to that StatefulSet.

For example, given the following StatefulSets in a namespace:

  • ingester-zone-a with rollout-group: ingester
  • ingester-zone-b with rollout-group: ingester
  • compactor-zone-a with rollout-group: compactor
  • compactor-zone-b with rollout-group: compactor

The operator independently coordinates the rollout of pods of each group:

  • Rollout group: ingester
    • ingester-zone-a
    • ingester-zone-b
  • Rollout group: compactor
    • compactor-zone-a
    • compactor-zone-b

For each rollout group, the operator guarantees:

  1. Pods in 2 different StatefulSets are not rolled out at the same time
  2. Pods in a StatefulSet are rolled out if and only if all pods in all other StatefulSets of the same group are Ready (otherwise it will start or continue the rollout once this check is satisfied)
  3. Pods are rolled out if and only if all StatefulSets in the same group have OnDelete update strategy (otherwise the operator will skip the group and log an error)
  4. The maximum number of not-Ready pods in a StatefulSet doesn't exceed the value configured in the rollout-max-unavailable annotation (if not set, it defaults to 1). Values:
  • <= 0: invalid (will default to 1 and log a warning)
  • 1: pods are rolled out sequentially
  • > 1: pods are rolled out in parallel (honoring the configured number of max unavailable pods)
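
For illustration, a minimal StatefulSet skeleton satisfying these requirements might look like this (names and values are examples only):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingester-zone-a
  labels:
    rollout-group: ingester          # groups this StatefulSet with the other zones
  annotations:
    rollout-max-unavailable: "2"     # roll out up to 2 pods of this StatefulSet in parallel
spec:
  updateStrategy:
    type: OnDelete                   # required; the operator deletes pods itself to roll them out
  template:
    metadata:
      labels:
        rollout-group: ingester
        name: ingester-zone-a        # the operator uses the name label to find this StatefulSet's pods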

How scaling up and down works

The operator can also optionally coordinate scaling up and down of StatefulSets that are part of the same rollout-group based on the grafana.com/rollout-downscale-leader annotation. When using this feature, the grafana.com/min-time-between-zones-downscale label must also be set on each StatefulSet.

This can be useful for automating the tedious scaling of stateful services like Mimir ingesters. Making use of this feature requires adding a few annotations and labels to configure how it works. Examples for a multi-AZ ingester group are given below, followed by a manifest sketch.

  • For ingester-zone-a, add the following:
    • Labels:
      • grafana.com/min-time-between-zones-downscale=12h (change the value here to an appropriate duration)
      • grafana.com/prepare-downscale=true (to allow the service to be notified when it will be scaled down)
    • Annotations:
      • grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown (to call a specific endpoint on the service)
      • grafana.com/prepare-downscale-http-port=80 (to call a specific endpoint on the service)
  • For ingester-zone-b, add the following:
    • Labels:
      • grafana.com/min-time-between-zones-downscale=12h (change the value here to an appropriate duration)
      • grafana.com/prepare-downscale=true (to allow the service to be notified when it will be scaled down)
    • Annotations:
      • grafana.com/rollout-downscale-leader=ingester-zone-a (zone b will follow zone a, after a delay)
      • grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown (to call a specific endpoint on the service)
      • grafana.com/prepare-downscale-http-port=80 (to call a specific endpoint on the service)
  • For ingester-zone-c, add the following:
    • Labels:
      • grafana.com/min-time-between-zones-downscale=12h (change the value here to an appropriate duration)
      • grafana.com/prepare-downscale=true (to allow the service to be notified when it will be scaled down)
    • Annotations:
      • grafana.com/rollout-downscale-leader=ingester-zone-b (zone c will follow zone b, after a delay)
      • grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown (to call a specific endpoint on the service)
      • grafana.com/prepare-downscale-http-port=80 (to call a specific endpoint on the service)
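
Put together, the metadata for ingester-zone-b from the list above would look roughly like this (a sketch; adjust the values to your deployment):

metadata:
  name: ingester-zone-b
  labels:
    grafana.com/min-time-between-zones-downscale: 12h
    grafana.com/prepare-downscale: "true"
  annotations:
    grafana.com/rollout-downscale-leader: ingester-zone-a              # zone b follows zone a, after a delay
    grafana.com/prepare-downscale-http-path: ingester/prepare-shutdown
    grafana.com/prepare-downscale-http-port: "80"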

Scaling based on reference resource

The rollout-operator can use a custom resource with scale and status subresources as a "source of truth" for the number of replicas of a target StatefulSet. The "source of truth" resource (or "reference resource") is configured using the following annotations:

  • grafana.com/rollout-mirror-replicas-from-resource-name
  • grafana.com/rollout-mirror-replicas-from-resource-kind
  • grafana.com/rollout-mirror-replicas-from-resource-api-version

These annotations must be set on the StatefulSet that rollout-operator will scale (i.e. the target StatefulSet). The number of replicas in the target StatefulSet will follow the replicas in the reference resource (from its scale subresource), while the reference resource's status subresource will be updated with the current number of replicas in the target StatefulSet.

This is similar to using grafana.com/rollout-downscale-leader, but the reference resource can be any kind of resource, not just a StatefulSet. Furthermore, grafana.com/min-time-between-zones-downscale is not respected when using scaling based on a reference resource.

This can be used in combination with a HorizontalPodAutoscaler when it is undesirable to set the number of replicas directly on the target StatefulSet, because we want to add custom logic to the scaledown (see next point). In that case, the HPA can update a separate "reference resource", and rollout-operator can "mirror" the number of replicas from the reference resource to the target StatefulSet.
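
As a sketch, the target StatefulSet would carry annotations along these lines (the resource name, kind and API version shown here are examples chosen to match the RBAC snippet below):

metadata:
  annotations:
    grafana.com/rollout-mirror-replicas-from-resource-name: ingester-zone-a-replicas
    grafana.com/rollout-mirror-replicas-from-resource-kind: ReplicaTemplate
    grafana.com/rollout-mirror-replicas-from-resource-api-version: rollout-operator.grafana.com/v1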

To support scaling based on a reference resource, rollout-operator needs to be allowed to execute the get and patch verbs on the status and scale subresources of the custom resource. For example, when using the custom resource replica-templates from the API group rollout-operator.grafana.com, you can add the following to the RBAC:

- apiGroups:
  - rollout-operator.grafana.com
  resources:
  - replica-templates/scale
  - replica-templates/status
  verbs:
  - get
  - patch

Delayed scaledown

When using "Scaling based on reference resource", rollout-operator can be configured to delay the actual scaledown, and ask individual pods to prepare for delayed-scaledown.

This is configured using the grafana.com/rollout-delayed-downscale and grafana.com/rollout-prepare-delayed-downscale-url annotations on the target StatefulSet. The first annotation specifies the minimum delay between the call to the "prepare-delayed-downscale-url" and the actual scaledown, while the second annotation specifies the URL that is called on each pod. (The URL is used as-is, but the host is replaced with the pod's fully qualified domain name.)
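
For example, a target StatefulSet could carry annotations like the following (a sketch; the delay and the URL path are purely illustrative, and the host part is replaced with each pod's FQDN anyway):

metadata:
  annotations:
    grafana.com/rollout-delayed-downscale: 8h
    grafana.com/rollout-prepare-delayed-downscale-url: http://pod/ingester/prepare-delayed-shutdown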

The rollout-operator has specific requirements for the configured endpoint:

  • The endpoint must support the POST and DELETE methods.
  • On POST, the pod is supposed to prepare for the delayed downscale. The endpoint must return 200 if the preparation succeeded, with a JSON body in the format {"timestamp": 123456789}, where timestamp is the Unix timestamp in seconds when the preparation was done.
  • Repeated POST calls should return the same timestamp, unless the preparation was done again, in which case a new waiting period must start.
  • On DELETE, the pod should cancel the preparation for the delayed downscale. If there's nothing to do, the pod should ignore such a DELETE request.

Rollout-operator does NOT remember any state of the "delayed scaledown" preparation. It relies on the timestamps returned from the pod endpoints on POST. When no delayed scaledown is taking place, rollout-operator still keeps calling the DELETE method regularly, to make sure that all pods have cancelled any previous "preparation of delayed scaledown".

How is this different from the grafana.com/prepare-downscale label used by the /admission/prepare-downscale webhook? That webhook calls the "prepare-downscale" endpoint just before the downscale is done, and pods are shut down right after. A delayed downscale, on the other hand, can take many hours. The delayed downscale and "prepare downscale" features can be used together.

Operations

HTTP endpoints

The operator runs an HTTP server, listening on the port configured by -server.port (8001 by default), exposing the following endpoints.

/ready

Readiness probe endpoint.

/metrics

Prometheus metrics endpoint.

HTTPS endpoints

/admission/no-downscale

Offers a ValidatingAdmissionWebhook that rejects requests that decrease the number of replicas in objects labeled as grafana.com/no-downscale: true. See the Webhooks section below.

/admission/prepare-downscale

Offers a MutatingAdmissionWebhook that calls a downscale preparation endpoint on the pods of objects labeled as grafana.com/prepare-downscale: true before the downscale is allowed. See the Webhooks section below.

RBAC

When running the rollout-operator as a pod, it needs a Role with at least the following privileges:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - get
  - watch
  - delete
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - list
  - get
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets/status
  verbs:
  - update

(Please see Webhooks section below for extra roles required when using the HTTPS server for webhooks.)
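
To bind the Role above to the ServiceAccount the rollout-operator runs as, a RoleBinding along these lines is also needed (the names here are examples and must match the names you give the Role and the ServiceAccount):

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rollout-operator-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rollout-operator-role
subjects:
  - kind: ServiceAccount
    name: rollout-operator
    namespace: default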

Webhooks

Dynamic Admission Control webhooks are offered on the HTTPS server of the rollout-operator. You can enable HTTPS by setting the flag -server-tls.enabled=true. The HTTPS server listens on the port configured by -server-tls.port (8443 by default) and exposes the following endpoints.

/admission/no-downscale

This webhook offers a ValidatingAdmissionWebhook that rejects the requests that decrease the number of replicas in objects labeled as grafana.com/no-downscale: true. An example webhook configuration would look like this:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  labels:
    grafana.com/inject-rollout-operator-ca: "true"
    grafana.com/namespace: default
  name: no-downscale-default
webhooks:
- name: no-downscale-default.grafana.com
  admissionReviewVersions: [v1]
  clientConfig:
    service:
      name: rollout-operator
      namespace: default
      path: /admission/no-downscale
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  rules:
  - apiGroups: [apps]
    apiVersions: [v1]
    operations: [UPDATE]
    resources:
    - statefulsets
    - deployments
    - replicasets
    - statefulsets/scale
    - deployments/scale
    - replicasets/scale
    scope: Namespaced
  sideEffects: None
  timeoutSeconds: 10

This webhook configuration should point to a Service that points to the rollout-operator's HTTPS server exposed on port -server-tls.port=8443. For example:

apiVersion: v1
kind: Service
metadata:
  name: rollout-operator
spec:
  selector:
    name: rollout-operator
  type: ClusterIP
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8443

No-downscale webhook details

Matching objects

Please note that the webhook will NOT receive requests for /scale operations if an objectSelector is provided. For this reason the webhook checks the grafana.com/no-downscale label on the object itself on every received request. When an object like a StatefulSet, Deployment or ReplicaSet is changed directly, the validation request includes the changed object and the webhook can check the label on it. When a /scale subresource is changed (for example by running kubectl scale ...) the request does not contain the changed object, and rollout-operator uses the Kubernetes API to retrieve the parent object and check the label on it.

You will see in TLS Certificates section below that this label is also used to inject the CA bundle into the webhook configuration.

Note: if you plan to run validations on Deployment or ReplicaSet objects, you need to make sure that the rollout-operator has the privileges to list and get those objects.

Matching namespaces

Since a ValidatingAdmissionWebhook is cluster-wide, it's a good idea to include a namespaceSelector in the webhook configuration to limit its scope to a specific namespace. To restrict the webhook to a specific namespace, match on the kubernetes.io/metadata.name label, which contains the namespace name.
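
For example, to restrict the webhook above to the default namespace, you could add the following to the webhook entry (a sketch):

  namespaceSelector:
    matchLabels:
      kubernetes.io/metadata.name: default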

Handling errors

The webhook is conservative and allows changes whenever an error occurs:

  • When the parent object can't be retrieved from the API.
  • When the validation request can't be decoded or includes an unsupported type.

Special cases

Changing the replicas number to null (or from null) is allowed.

/admission/prepare-downscale

This webhook offers a MutatingAdmissionWebhook that calls a downscale preparation endpoint on the pods for requests that decrease the number of replicas in objects labeled as grafana.com/prepare-downscale: true. An example webhook configuration would look like this:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  labels:
    grafana.com/inject-rollout-operator-ca: "true"
    grafana.com/namespace: default
  name: prepare-downscale-default
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: rollout-operator
      namespace: default
      path: /admission/prepare-downscale
      port: 443
  failurePolicy: Fail
  matchPolicy: Equivalent
  name: prepare-downscale-default.grafana.com
  rules:
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - UPDATE
    resources:
    - statefulsets
    - statefulsets/scale
    scope: Namespaced
  sideEffects: NoneOnDryRun
  timeoutSeconds: 10

This webhook configuration should point to a Service that points to the rollout-operator's HTTPS server exposed on port -server-tls.port=8443. For example:

apiVersion: v1
kind: Service
metadata:
  name: rollout-operator
spec:
  selector:
    name: rollout-operator
  type: ClusterIP
  ports:
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8443

Note that the Service created for the /admission/no-downscale webhook can be reused if already present.

Prepare-downscale webhook details

Upscaling requests or requests that don't change the number of replicas are approved. For downscaling requests, the following label has to be present on the object:

  • grafana.com/prepare-downscale

The following annotations also have to be present:

  • grafana.com/prepare-downscale-http-path
  • grafana.com/prepare-downscale-http-port

If the grafana.com/last-downscale annotation is present on any of the StatefulSets in the same rollout group, its value is checked against the current time. If the difference is less than the duration in the grafana.com/min-time-between-zones-downscale label (if present), the request is rejected; otherwise it is approved. This mechanism can be used to maintain a minimum time between downscales of the StatefulSets in a rollout group.

The endpoint built from grafana.com/prepare-downscale-http-path and grafana.com/prepare-downscale-http-port will be called for each of the pods that have to be downscaled. If any of these requests fails, the downscaling request is rejected.

The grafana.com/last-downscale annotation is then added to the StatefulSet mentioned in the validation request.

TLS Certificates

Note that both the ValidatingAdmissionWebhook and the MutatingAdmissionWebhook require a TLS connection, so the HTTPS server should either use a certificate signed by a well-known CA or a self-signed certificate. You can either issue a Certificate Signing Request or use an existing approach for issuing self-signed certificates, like cert-manager. You can set the options -server-tls.cert-file and -server-tls.key-file to point to the certificate and key files respectively.

For convenience, rollout-operator offers a self-signed certificates generator that is enabled by default. This generator will generate a self-signed certificate and store it in a secret specified by the flag -server-tls.self-signed-cert.secret-name. The certificate is stored in a secret in order to reuse it across restarts of the rollout-operator.

rollout-operator will list all the ValidatingWebhookConfiguration or MutatingWebhookConfiguration objects in the cluster that are labeled with grafana.com/inject-rollout-operator-ca: true and grafana.com/namespace: <value of -kubernetes-namespace> and will inject the CA certificate in the caBundle field of the webhook configuration. This mechanism can be disabled by setting -webhooks.update-ca-bundle=false.

This signing and injecting is performed at service startup once, so you would need to restart rollout-operator if you want to inject the CA certificate in a new ValidatingWebhookConfiguration object.

In order to perform the self-signed certificate generation, rollout-operator needs a Role that allows it to list and update the secrets, as well as a ClusterRole that allows listing and patching ValidatingWebhookConfigurations or MutatingWebhookConfigurations. An example of those could be:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rollout-operator-webhook-role
  namespace: default
rules:
  - apiGroups: [""]
    resources: [secrets]
    verbs: [create]
  - apiGroups: [""]
    resources: [secrets]
    resourceNames: [rollout-operator-self-signed-certificate]
    verbs: [update, get]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rollout-operator-webhook-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rollout-operator-webhook-role
subjects:
  - kind: ServiceAccount
    name: rollout-operator
    namespace: default

And:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: rollout-operator-webhook-default-clusterrole
rules:
- apiGroups: [admissionregistration.k8s.io]
  resources: [validatingwebhookconfigurations, mutatingwebhookconfigurations]
  verbs: [list, patch]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: rollout-operator-webhook-default-clusterrolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: rollout-operator-webhook-default-clusterrole
subjects:
  - kind: ServiceAccount
    name: rollout-operator
    namespace: default

Certificate expiration

Whenever the certificate expires, the rollout-operator will detect it and will restart, which will trigger the self-signed certificate generation again if it's configured. The default expiration for the self-signed certificate is 1 year and it can be changed by setting the flag -server-tls.self-signed-cert.expiration.
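
For reference, the relevant flags could be wired together in the operator's container args along these lines (a sketch; the values are examples, and the self-signed-cert flags are only needed when relying on the built-in generator):

args:
  - -kubernetes-namespace=default
  - -server.port=8001
  - -server-tls.enabled=true
  - -server-tls.port=8443
  - -server-tls.self-signed-cert.secret-name=rollout-operator-self-signed-certificate
  - -server-tls.self-signed-cert.expiration=8760h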

rollout-operator's People

Contributors

56quarters, aknuds1, andyasp, charleskorn, colega, dimitarvdimitrov, fusakla, jhalterman, johannaratliff, jordanrushing, krajorama, michelhollands, pr00se, pracucci, pstibrany, richih


rollout-operator's Issues

Allow setting a specific webhook to update the certificate

We don't expect a single rollout-operator to update multiple webhooks, especially since we'll be looking for one with grafana.com/namespace label.

So, instead of listing them, we can specify just one name, and restrict the ClusterRole to that specific webhook name.

Rollout operator with mimir-distributed helm chart not upgrading Pods

Reproduction steps:

Install mimir from grafana/helm-charts#1205 and enable, for example, store-gateway zone-aware replication via a custom values.yaml:

rollout_operator:
  enabled: true
store_gateway:
  zone_aware_replication:
    enabled: true

After installation, add a character to mimir.config, just to alter its checksum.

Expected (works without rollout op): store-gateway Pods are restarted to take in the new configuration.

Actual: nothing happens, Pods are not restarted.

Additional info:
The rollout operator prints "reconciled store-gateway statefulsets" messages.

Before the change to the config, the StatefulSet state is:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: 7fc741ee52baf2c3d69f77aed2cda62113c36f9c878b86742f5f02a4cfd1427a
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 2
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2897289"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: 7fc741ee52baf2c3d69f77aed2cda62113c36f9c878b86742f5f02a4cfd1427a
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6795c75577
  observedGeneration: 2
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-6795c75577

After the upgrade:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    checksum/config: 58576abe6a9fb051e078095d03be1c4c3906e36da1651cf7fdf6cd8c39c30171
    meta.helm.sh/release-name: krajo
    meta.helm.sh/release-namespace: dev
    rollout-max-unavailable: "10"
  creationTimestamp: "2022-04-28T16:16:58Z"
  generation: 3
  labels:
    app.kubernetes.io/component: store-gateway
    app.kubernetes.io/instance: krajo
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: mimir
    app.kubernetes.io/part-of: memberlist
    app.kubernetes.io/version: 2.0.0
    helm.sh/chart: mimir-distributed-2.0.9
    rollout-group: store-gateway
    zone: zone-a
  name: krajo-mimir-store-gateway-zone-a
  namespace: dev
  resourceVersion: "2902316"
  selfLink: /apis/apps/v1/namespaces/dev/statefulsets/krajo-mimir-store-gateway-zone-a
  uid: 88d76d77-f8e2-4023-8775-b460b078d4a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/component: store-gateway
      app.kubernetes.io/instance: krajo
      app.kubernetes.io/name: mimir
      rollout-group: store-gateway
      zone: zone-a
  serviceName: krajo-mimir-store-gateway-headless
  template:
    metadata:
      annotations:
        checksum/config: 58576abe6a9fb051e078095d03be1c4c3906e36da1651cf7fdf6cd8c39c30171
      creationTimestamp: null
      labels:
        app.kubernetes.io/component: store-gateway
        app.kubernetes.io/instance: krajo
        app.kubernetes.io/managed-by: Helm
        app.kubernetes.io/name: mimir
        app.kubernetes.io/part-of: memberlist
        app.kubernetes.io/version: 2.0.0
        helm.sh/chart: mimir-distributed-2.0.9
        rollout-group: store-gateway
        zone: zone-a
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: target
                operator: In
                values:
                - store-gateway
              - key: target
                operator: NotIn
                values:
                - store-gateway-zone-a
            topologyKey: kubernetes.io/hostname
      containers:
      - args:
        - -target=store-gateway
        - -config.file=/etc/mimir/mimir.yaml
        - -store-gateway.sharding-ring.instance-availability-zone=zone-a
        image: grafana/mimir:2.0.0
        imagePullPolicy: IfNotPresent
        name: store-gateway
        ports:
        - containerPort: 8080
          name: http-metrics
          protocol: TCP
        - containerPort: 9095
          name: grpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /ready
            port: http-metrics
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 100m
            memory: 512Mi
        securityContext:
          readOnlyRootFilesystem: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/mimir
          name: config
        - mountPath: /var/mimir
          name: runtime-config
        - mountPath: /data
          name: storage
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: krajo-mimir
      serviceAccountName: krajo-mimir
      terminationGracePeriodSeconds: 240
      volumes:
      - name: config
        secret:
          defaultMode: 420
          secretName: krajo-mimir-config
      - configMap:
          defaultMode: 420
          name: krajo-mimir-runtime
        name: runtime-config
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: storage
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi
      storageClassName: microk8s-hostpath
      volumeMode: Filesystem
    status:
      phase: Pending
status:
  availableReplicas: 1
  collisionCount: 0
  currentRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updateRevision: krajo-mimir-store-gateway-zone-a-6b64ccdc98

I've added the checksum as an annotation on the StatefulSet itself, but it didn't help.

Add `make format` and lint import ordering

Import ordering is not currently enforced which is causing noise. This repository should introduce formatting and linting for import ordering like the Mimir repository does.

Downscale webhook fails when currently upscaling

If a pod is in the process of being created for a statefulset, the downscale webhook will reject an attempt to change a resource that would cause pods to downscale:

level=error ts=2023-07-18T01:23:33.406992007Z name=ingester-zone-a resource=statefulsets namespace=mimir-dev-11 request_gvk="apps/v1, Kind=StatefulSet" old_replicas=225 new_replicas=5 msg="downscale not allowed due to error" err="Post "http://ingester-zone-a-218.ingester-zone-a.mimir-dev-11.svc.cluster.local:80/ingester/prepare-shutdown": dial tcp: lookup ingester-zone-a-218.ingester-zone-a.mimir-dev-11.svc.cluster.local on 10.188.0.10:53: no such host"

This was discovered when an HPA was scaling up too aggressively, and when trying to revert the change that caused that, the downscale webhook rejected the change since the statefulset was currently upscaling.

Add a mechanism to prevent certain resources from being downscaled

Context

There are some resources in Mimir that can't be downscaled directly: the ingesters. While we're developing an automated way to perform the downscaling operation (which involves calling an endpoint on the resources and waiting a certain amount of time), we would like to find a way to prevent them from being accidentally downscaled (manual operations on the cluster, gitops reverts, etc.)

It would also be great to find a way that would be extensible to the automated solution in the future.

Challenges

One of the challenges to keep in mind is distinguishing whether a resource is being deleted because it's being re-scheduled or updated (a rolling release, for instance), or because it's being downscaled.

Solution alternatives

Using kubernetes finalizers

A finalizer can be set on a Kubernetes resource to prevent it from being deleted (until the finalizer is removed). We considered using this mechanism and allowing only rollouts and re-schedulings. We would prevent the pods from being deleted if a downscale is detected, triggering an alert metric and requiring operator intervention.

However, we had to discard this approach because even though Kubernetes doesn't delete the pod with a finalizer, it does stop the containers on the deletion request.

Use a preStop hook and a huge terminationGracePeriodSeconds

Another alternative would be to use a preStop hook on the ingesters that would detect a downscale operation, combined with a huge terminationGracePeriodSeconds that would be used to avoid the downscale and alert the operator about it.

This approach has some downsides:

  • ingesters would need to know about their surroundings, which would mean talking to the Kubernetes API themselves; that doesn't seem great.
  • misuse of terminationGracePeriodSeconds could create different issues.
  • in case of a downscale, the operator would need to manually intervene on the pod (as it would already be in the process of deletion)

Using a dynamic admission controller

Kubernetes can be configured to call a webhook whenever a change is applied, and reject it if it doesn't meet certain requirements. We think that this solution fits our requirements best.

We would add a special label to the resources that we don't want to be downscaled, let's say grafana.com/no-downscale, and create a webhook configuration that would call a rollout-operator endpoint, like this:

kind: ValidatingWebhookConfiguration
metadata:
  name: no-downscale-default
webhooks:
- admissionReviewVersions:
  - v1
  clientConfig:
    service:
      name: rollout-operator
      namespace: default
      path: /admission/no-downscale
      port: 443
  failurePolicy: Fail
  name: no-downscale.default.namespace.local
  objectSelector:
    matchLabels:
      grafana.com/namespace: default
      grafana.com/no-downscale: "true"
  rules:
  - apiGroups:
    - apps
    apiVersions:
    - v1
    operations:
    - UPDATE
    resources:
    - statefulsets
    - deployments
    - replicasets
    scope: Namespaced
  sideEffects: None

Since webhooks are not namespaced, but we probably don't want to handle the entire cluster with just one rollout-operator, I'm proposing here to add a grafana.com/namespace label to the objects that we want to validate, and match them in the objectSelector. Unfortunately, even though there is a namespaceSelector property for webhooks, it doesn't seem to be able to match the namespace's name, only its labels, and it seems more reasonable to ask someone to add a label on the resources they want to validate rather than on the entire namespace. In any case this decision is only up to the webhook configuration and wouldn't change the webhook implementation in the rollout-operator.

There would be an /admission/no-downscale endpoint on the rollout-operator that would validate the admission requests from Kubernetes on the matched updates.

So far so good, but I found several challenges for this implementation.

Challenge: webhooks have to be served over https

Webhooks need to be served over https, so we need either to configure some certificates on the rollout-operator or to generate some self-signed ones and set them on the webhook configurations. Both approaches are non-trivial.

Configuring certificates on the rollout-operator

There are plenty of examples on the internet of how to generate certificates and provision them in a webhook server, either using a set of manual operations involving Certificate Signing Requests from Kubernetes, or using something like cert-manager to generate the certificate and automatically inject it into both the rollout-operator deployment and the webhook configuration.

Even though it's possible, I think it's too much hassle for someone deploying the rollout operator, and while this approach should be supported, there should be also an easier self-configurable way.

Self-signed certificates generation

It's possible to generate self-signed certificates for the webhook, configuring the CA Bundle on the webhook.

This presents two further challenges:

Persisting the self-signed certificate upon restarts

Given we have a webhook configured with a self-signed certificate, we should reuse the same certificate for the webhook server (served by our rollout-operator) after a restart. The solution we can use for this is a Kubernetes secret: by giving rollout-operator RBAC authorization to read, create and update a secret, it can check whether a secret already exists and use that certificate, or create one and persist it. The implementation of this is fairly simple.

There's a downside to this, which is the fact that the secret would stay in Kubernetes if the rollout-operator is removed, since rollout-operator doesn't have enough self-awareness to set the proper ownerReference on that secret. This is something we think we can live with.

Using the self-signed certificate on the webhook configuration

Once we have the self-signed certificate, we could just create the webhook in our Kubernetes cluster from the rollout-operator itself. This would however give rollout-operator the ability to create cluster-wide webhooks, which seems dangerous, and it would also have the downside mentioned before of leaving the webhook dangling whenever the rollout-operator is removed.

We decided to go another way, and inspired by the way cert-manager injects signed certificates on certain objects, we could have a special label on the webhooks configuration that need a self-signed certificate to be injected:

metadata:
  labels:
    grafana.com/inject-rollout-operator-ca: "true"
    grafana.com/namespace: default

This seems to be less intrusive and more compatible with alternatives like using cert-manager.

Challenge: matching scale subresource changes

After implementing a working solution, in the manual testing phase I realized that when the scale subresource is being modified (for example, by doing kubectl scale ...) the webhook does not match the resource, as the Scale subresource doesn't have any labels at all. I have created an issue on the Kubernetes project for this.

(Note also that matching the scale subresources requires declaring them as ["statefulsets/scale", "deployments/scale", "replicasets/scale"] in the resources section of the webhook rule.)

A possible workaround would be to match all scale operations in the namespace, retrieving the parent resource and manually matching its labels. In order to match all scale operations in the namespace, we would need to add a label to the namespace itself and use namespaceSelector, instead of the custom grafana.com/namespace approach mentioned above, since, again, the Scale sub-resource doesn't have any labels.

Questions

  • Can we do that?
    • Yes, it doesn't seem very complicated.
  • Should we do that?
    • It seems that this should be a responsibility of Kubernetes itself.
  • Will Kubernetes do this change?
    • Maybe. Or maybe not.
  • Should we wait until Kubernetes does this?
    • Probably not, because even if Kubernetes made the change, it would require rollout-operator to be run on a very new Kubernetes version, which is a big requirement.

Implementation

I have working code in #25 but it's still pending some tests and a solution to the scale issue.

Remove self-signed TLS cert integration test and replace with unit test

I haven't been able to get CI to pass on the first try with this project yet. It's nearly always an issue with the integration test for ensuring automatically renewing the self-signed TLS cert works. In order to improve the reliability and our confidence in rollout-operator tests, the test and code under test should be reworked so it can be tested without spinning up an entire Kubernetes cluster.

Guard against a concurrent downscaling and rollout

At present the rollout operator prevents concurrent rollout operations, but doesn’t prevent concurrent downscale + rollout operations. This can happen, for example, if the controller is currently rolling zone B or C, and then something downscales zone A.

In order to guard against this, I propose using an annotation on the resource being scaled to represent a “lock” across the scaling group. Ex:

grafana.com/scaling-lock: rollout

Creating a scaling-lock

When the rollout operator reconciles, it will choose a resource, if any, that already has a scaling lock to reconcile first, allowing reentrancy. If none exist, it will add a scaling-lock annotation to the sts when it begins rolling its pods. A scaling lock should not be needed though if the controller is just adjusting replicas up/down (since last-downscaled will already provide protection here), only when rolling pods.

Validating the scaling-lock

When the scaling-lock is present on any resource in the rollout-group, the downscale webhook will reject downscale attempts for resources that don't have the scaling lock. For example: if the scaling lock is present on the zone B statefulset, zone A's statefulset cannot be downscaled.

Removing the scaling-lock

When the rollout-operator reconciles a resource, if it sees no non-ready pods and has nothing to do for a resource, it will clear the scaling-lock for that resource if any is present.

Other alternatives

While this could also be accomplished by extending the last-updated annotation to be updated by the rollout-operator’s controller, that would block downscales from taking place for some time after rollouts are complete, which is not necessary. We just want to prevent concurrent scaling from occurring.

Guard against race conditions

While #79 described a situation that could happen commonly, there are some less likely but still possible race conditions that we might consider guarding against.

Concurrent downscale + rollout

If a downscale is in progress but hasn't updated the last-downscale annotation yet, it's possible that a rollout could start concurrently, bypassing the min-time-between-zones-downscale check in the controller.

Concurrent downscale + downscale

If downscales for two zones occur concurrently, they can both be accepted before either has updated the last-downscale annotation, bypassing the min-time-between-zones-downscale check in the prepare-downscale webhook.

Potential fix

One reliable way to serialize operations across different statefulsets is to lock on a common resource, such as with an annotation on a common CRD, using a resourceVersion on updates. Absent that, we could add a lock annotation to a statefulset and use a double check (check, lock, check) when adding it to guard against races, though I'm not positive this is completely safe.

Operator assumes all pods are ready when there are no pods at all

We're currently checking hasStatefulSetNotReadyPods before considering the zone reconciled:

func (c *RolloutController) hasStatefulSetNotReadyPods(sts *v1.StatefulSet) (bool, error) {
	// We can quickly check the number of ready replicas reported by the StatefulSet.
	// If they don't match the total number of replicas, then we're sure there are some
	// not ready pods.
	if sts.Status.Replicas != sts.Status.ReadyReplicas {
		return true, nil
	}

	// The number of ready replicas reported by the StatefulSet matches the total number of
	// replicas. However, there's still no guarantee that all pods are running. For example,
	// a terminating pod (which we don't consider "ready") may have not yet failed the
	// readiness probe for the consecutive number of times required to switch its status
	// to not-ready. For this reason, we list all StatefulSet pods and check them one-by-one.
	notReadyPods, err := c.listNotReadyPodsByStatefulSet(sts)
	if err != nil {
		return false, err
	}
	if len(notReadyPods) == 0 {
		return false, nil
	}

	// Log which pods have been detected as not-ready. This may be useful for debugging.
	level.Info(c.logger).Log(
		"msg", "StatefulSet status is reporting all pods ready, but the rollout operator has found some not-Ready pods",
		"statefulset", sts.Name,
		"not_ready_pods", strings.Join(util.PodNames(notReadyPods), " "))

	return true, nil
}

However, we've seen a case where all pods of a StatefulSet were deleted, but no new pods had been created yet. This caused the listNotReadyPodsByStatefulSet method to return no pods at all, as there were no pods in any state.

This caused an outage, as the next zone was terminated immediately, while no pods were available in the first one.

I think that the code should check that there are at least as many pods as replicas desired by the statefulset, and then it should check whether some of them are unready.

Remove the "name" label requirement on StatefulSet's pods

The rollout-operator currently requires that each pod controlled by a StatefulSet managed by rollout-operator has the name label and its value is the name of the StatefulSet controlling it:

mustNewLabelsRequirement("name", selection.Equals, []string{sts.Spec.Template.Labels["name"]}),

I think we can get rid of this requirement, looking for the StatefulSet controlling the pod in a different way. For example, when running kubectl describe pod ... it shows a "Controlled By" field with the name of the StatefulSet controlling it.

ARM Support for docker image

The current rollout-operator image does not have an arm64-compatible variant:

standard_init_linux.go:228: exec user process caused: exec format error

I just moved the build into the Dockerfile using the builder pattern and used the built-in support for multi-arch builds from docker buildx:

FROM --platform=$BUILDPLATFORM golang:1.17-alpine AS builder

ADD . /app
WORKDIR /app

ARG TARGETOS
ARG TARGETARCH

RUN CGO_ENABLED=0 GOOS=${TARGETOS} GOARCH=${TARGETARCH} \
    go build -ldflags="-s -w" ./cmd/rollout-operator

FROM alpine:3.14

COPY --from=builder /app/rollout-operator /bin/rollout-operator
ENTRYPOINT [ "/bin/rollout-operator" ]

# Create rollout-operator user to run as non-root.
RUN addgroup -g 10000 -S rollout-operator && \
    adduser  -u 10000 -S rollout-operator -G rollout-operator
USER rollout-operator:rollout-operator

ARG revision
LABEL org.opencontainers.image.title="rollout-operator" \
      org.opencontainers.image.source="https://github.com/grafana/rollout-operator" \
      org.opencontainers.image.revision="${revision}"

And to build, just specify the architectures (note: arm builds require --push, as multi-arch images aren't stored locally):

docker buildx build --platform linux/amd64,linux/arm64 --tag your-registry.tld/repo/grafana/rollout-operator:latest --push .

min-time-between-zones-downscale label should be an annotation

It's a form of configuration for the statefulsets and doesn't ever need to be used for selecting them. We should release a version of the rollout-operator that supports either a label or annotation as a migration path before only supporting use of an annotation.

Operator calling the wrong pod endpoint when trying to scale down ingesters

We are trying to scale down ingesters using the MutatingAdmissionWebhook as described here.

The required labels and annotations were added to the objects as shown below:

# labels
grafana.com/prepare-downscale=true

# annotations
grafana.com/prepare-downscale-http-path=ingester/prepare-shutdown
grafana.com/prepare-downscale-http-port=8080

When trying to scale down statefulset/mimir-ingester-zone-a, the operator is failing to resolve the pod when sending the HTTP POST request, as the FQDN is constructed as <pod_name>.<service_name>.<namespace>.svc.cluster.local.

mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local

These are the existing services:

mimir-ingester-headless                  ClusterIP   None             <none>        8080/TCP,9095/TCP   19d
mimir-ingester-zone-a                    ClusterIP   <ip_address>     <none>        8080/TCP,9095/TCP   19d
mimir-ingester-zone-b                    ClusterIP   <ip_address>    <none>        8080/TCP,9095/TCP   19d
mimir-ingester-zone-c                    ClusterIP   <ip_address>    <none>        8080/TCP,9095/TCP   19d

In order to resolve the pod, the headless service (mimir-ingester-headless) should be used instead.

Operator logs

level=error ts=2024-01-16T13:29:51.838387358Z name=mimir-ingester-zone-a resource=statefulsets namespace=mimir request_gvk="autoscaling/v1, Kind=Scale" old_replicas=2 new_replicas=1 url=mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local:443/ingester/prepare-shutdown index=1 msg="error sending HTTP post request" err="Post \"http://mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local:443/ingester/prepare-shutdown\": dial tcp: lookup mimir-ingester-zone-a-1.mimir-ingester-zone-a.mimir.svc.cluster.local on <name_server_ip>:53: no such host"

Automate release process

It'd be great if the release process was automated, so that images are built in a repeatable way in a controlled environment.

I'm imagining a GitHub Actions workflow that would respond to a release being created, and automatically build and push the images to Docker Hub.
