Comments (12)
Any option like "pods & sts are ready but quorum is lost"? Shouldn't we check the health of the cluster throughout its entire lifecycle? I mean, the etcd cluster controller will react to changes of the sts state and act accordingly since the controller reference is set, but if loss of quorum or anything else happens to the etcd app without affecting the pod/sts state, the operator will not be informed. Thinking out loud.
from etcd-operator.
@aobort
Thanks for your opinion, I really appreciate it :)
If I got it right, the case you mentioned is the first "if". We should check that the sts is ready AND check whether the cluster has quorum at all and there is the desired number of members. If not, update the cluster state accordingly.
In the future, the controller will perform actions where possible.
@Kirill-Garbar if I got things right, the goal at the moment is to implement such a check on etcd cluster creation:
- create sts and related stuff
- when sts becomes ready, check etcd state
Therefore, from my perspective, the workflow should be like the following:
- check conditions; if the conditions list length is 0, put all possible conditions with state "False" into the status
- check the sts and related stuff; if nothing is found, create it and set the "Initialized" condition to "True"
- update the status and exit
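The first pass above could be sketched roughly like this. This is just an illustration: `condition` is a minimal stand-in for `metav1.Condition`, and the concrete condition types are assumptions based on this discussion, not the operator's actual API:

```go
package main

import "fmt"

// condition is a minimal stand-in for metav1.Condition, enough to sketch
// the first reconcile pass.
type condition struct {
	Type   string
	Status string
}

// initConditions mirrors the proposed first step: when the status has no
// conditions yet, seed every known condition with status "False".
// Otherwise leave the existing conditions untouched.
func initConditions(existing []condition) []condition {
	if len(existing) > 0 {
		return existing // conditions already filled
	}
	out := make([]condition, 0, 2)
	for _, t := range []string{"Initialized", "Ready"} {
		out = append(out, condition{Type: t, Status: "False"})
	}
	return out
}

func main() {
	fmt.Println(initConditions(nil))
}
```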
Next reconciliation run should be triggered by the update of the sts state (NotReady -> Ready):
- check conditions (already filled)
- check sts and related stuff (already created) -> check sts state
- check etcd state
- update condition "Ready" to True
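Assuming the "Ready" condition is derived from the StatefulSet replica counters plus a separate etcd-level check, the second pass boils down to something like the helper below. The field names mirror `appsv1.StatefulSetStatus` (`readyReplicas` / `replicas`), but the helper itself is hypothetical:

```go
package main

import "fmt"

// readyStatus is a hypothetical helper deriving the "Ready" condition value:
// the StatefulSet must have all desired replicas ready AND the etcd cluster
// itself must pass its health check (quorum, member count, etc.).
func readyStatus(readyReplicas, desiredReplicas int32, etcdHealthy bool) string {
	if desiredReplicas > 0 && readyReplicas == desiredReplicas && etcdHealthy {
		return "True"
	}
	return "False"
}

func main() {
	fmt.Println(readyStatus(3, 3, true))  // sts fully ready, etcd healthy
	fmt.Println(readyStatus(3, 3, false)) // pods up, but etcd itself unhealthy
	fmt.Println(readyStatus(2, 3, true))  // sts not fully ready yet
}
```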
Continuous state tracking should definitely be implemented. But, from my perspective, it should be implemented separately from the controller's code, maybe as another service which implements the Runnable interface, so it could be added to the manager with an mgr.Add() call. It could just check the etcd cluster's health and, in case something happens, update an sts annotation, which would trigger the reconciliation of the etcd cluster object.
@aobort
You're definitely right about the workflow.
I'd like to make it more precise
Next reconciliation run should be triggered by the update of the sts state (NotReady -> Ready):
- check conditions (already filled)
We do not rely on what's written in the status to get the current state; the Reconcile function should be idempotent.
I agree with the point that continuous tracking should be implemented via the Runnable interface, as it is better for the architecture. For cluster bootstrap, it is enough to get an update when the StatefulSet changes somehow.
We do not rely on what's written in the status to get the current state; the Reconcile function should be idempotent.
Absolutely. I meant that any update of the sts object (it does not matter whether it is a regular status update, a manual one, or whatnot) will trigger reconciliation of the etcd cluster object, because the controller reference is applied to the sts.
Internally we agreed to implement just a single condition for now: Ready: True|False
- EtcdCluster should be considered Ready when the StatefulSet is Ready
- EtcdCluster should be considered NotReady when the StatefulSet is NotReady
Every pod in the StatefulSet should have healthchecks defined. We can borrow the logic from Kubeadm, for example. Here is the latest issue where they were updated:
kubernetes/kubeadm#2567
However, @Uburro noted that:
Since v3.5.12, two new endpoints /livez and /readyz are added.
the /livez endpoint reflects whether the process is alive or if it needs a restart.
the /readyz endpoint reflects whether the process is ready to serve traffic.
Any option like "pods & sts are ready but quorum is lost"? Shouldn't we check the health of the cluster throughout its entire lifecycle?
The operator is not a monitoring solution. Also, if you implemented it, the data in the CR would always be stale.
@sircthulhu I didn't get from the description or from the answer whether this status should be continuously updated or set just once. I see the etcd-operator (controller) as a "watchdog" that updates the status on events or regularly. I will try to dive deeper into the docs and code to get such answers, but currently it is not clear whether this logic should be implemented while working on this issue or not.
If some pods in the StatefulSet aren't ready but quorum is fine, set Ready to True, but mark the cluster as degraded in the Reason field.
This requires an additional condition, e.g. Quorum: True|False with reason ClusterHasQuorum.
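The quorum rule itself is simple: raft needs a strict majority of members to keep committing writes. A sketch of the check behind a hypothetical Quorum condition:

```go
package main

import "fmt"

// hasQuorum reports whether an etcd cluster with totalMembers members can
// still commit writes when only healthyMembers respond: raft requires a
// strict majority. This is the check a Quorum: True|False condition
// (reason ClusterHasQuorum) would be based on.
func hasQuorum(healthyMembers, totalMembers int) bool {
	return totalMembers > 0 && healthyMembers > totalMembers/2
}

func main() {
	fmt.Println(hasQuorum(2, 3)) // one pod down, quorum kept: Ready but degraded
	fmt.Println(hasQuorum(1, 3)) // quorum lost
}
```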
The operator is not a monitoring solution
Agree. However, according to the plans to implement maintenance of the cluster, monitoring its state becomes a mandatory capability.
Also, if you implemented it, the data in the CR would always be stale.
It depends. If the cluster health has not changed, there would be no reason for the watchdog to update the CR.
According to this KEP, etcd introduced full-fledged readiness and liveness probe checks. For us, /readyz reflects conditions like no raft leader, a raft loop deadlock, or data corruption. In such a case, the pod will disappear from the endpoints, and for our controller this will be the signal of whether everything is ok with the cluster or not. We also subscribe to events from the shared informer, so there is no need for the controller to iterate over the clusters and poll their status; Kubernetes itself will do this in the native way, using the kubelet. /livez will be a signal about a problem with the database (app) itself and a trigger for its restart. This way we avoid unnecessarily restarting etcd. This is also useful in cases where etcd is used outside of Kubernetes.
In this case, we have to start supporting etcd from versions 3.5.12 and 3.4.29.
etcd-io/etcd#16666 - we can think about --experimental-wait-cluster-ready-timeout with a startup probe in the future.