
etcd-operator's People

Contributors

alexgluck, andreykont, aobort, gecube, hiddenmarten, ilya-hontarau, kirill-garbar, kvaps, sergeyshevch, sircthulhu, uburro


etcd-operator's Issues

Make an option to create PDB for cluster

I think we should add an option to create a PDB for the cluster.

 ---
 apiVersion: etcd.aenix.io/v1alpha1
 kind: EtcdCluster
 metadata:
   name: test
   namespace: ns1
 spec:
   image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
   storage:
     persistence: true # default: true, immutable
     storageClass: local-path
     size: 10Gi
+  enablePDB: true
 status:
   conditions:
   - lastProbeTime: null
     lastTransitionTime: "2024-03-06T18:39:45Z"
     status: "True"
     type: Ready

By default it should be true, meaning that a PDB resource will be created with the maxUnavailable field set to the maximum number of members that can die without losing quorum.
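
For an N-member cluster, quorum requires floor(N/2)+1 alive members, so at most floor((N-1)/2) members can be disrupted at once. Below is a minimal Go sketch of how the operator could derive maxUnavailable and build the PDB; the label selector and naming are assumptions, not the agreed API.

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// quorumTolerance returns how many etcd members can be unavailable while the
// cluster still keeps quorum: (replicas - 1) / 2.
func quorumTolerance(replicas int32) int32 {
	if replicas <= 1 {
		return 0
	}
	return (replicas - 1) / 2
}

// pdbForCluster builds a PodDisruptionBudget for an EtcdCluster. The label
// key used in the selector is an illustrative assumption.
func pdbForCluster(name, namespace string, replicas int32) *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt32(quorumTolerance(replicas))
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app.kubernetes.io/instance": name},
			},
		},
	}
}

func main() {
	// A 3-member cluster tolerates 1 disrupted member, a 5-member cluster 2.
	fmt.Println(quorumTolerance(3), quorumTolerance(5)) // 1 2
	_ = pdbForCluster("test", "ns1", 3)
}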

Implement ".spec.image"

Internally we agreed to extend the spec like this:

 ---
 apiVersion: etcd.aenix.io/v1alpha1
 kind: EtcdCluster
 metadata:
   name: test
   namespace: ns1
 spec:
+  image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
 status:
   conditions:
   - lastProbeTime: null
     lastTransitionTime: "2024-03-06T18:39:45Z"
     status: "True"
     type: Ready

Helm chart

Helm chart for installing etcd-operator

Improve conditions processing

Fill basic project information

There is a need to create the following public documents:

  • README.md
  • LICENSE
  • Roadmap
  • Contributing Guide
  • Code of Conduct (CoC)
  • Adopters
  • Maintainers file

Additional storage parameters

From here #67 (comment)

The new spec looks good. The only question is: if we do this:

storage:
  volumeClaimTemplate:
    metadata:
      labels:
        env: prod
      annotations:
        example.com/annotation: "true"
    spec: # core.v1.PersistentVolumeClaimSpec Ready k8s type
      storageClassName: gp3
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
  emptyDir: {} # core.v1.EmptyDirVolumeSource Ready k8s type

then where will we add possible future options, such as:

  1. auto-deletion of PVC after deleting the EtcdCluster object
  2. options for resizing PVCs (resizeInUseVolumes, as CNPG and mariadb-operator do)
  3. option to manage storage migration
  4. any other options for storage in the future
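
One way to keep room for those options is to model storage as its own struct in the Go API, so future knobs become siblings of the volume sources instead of being squeezed into the PVC template. A hedged sketch follows; all field and type names are illustrative assumptions, not the agreed API.

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// StorageSpec sketches how the storage section could be modelled so that
// future options (PVC retention, resize behaviour, migration) can be added
// next to the volume sources.
type StorageSpec struct {
	// VolumeClaimTemplate is rendered into the StatefulSet's PVC template.
	VolumeClaimTemplate *EmbeddedPersistentVolumeClaim `json:"volumeClaimTemplate,omitempty"`
	// EmptyDir requests ephemeral storage instead of a PVC.
	EmptyDir *corev1.EmptyDirVolumeSource `json:"emptyDir,omitempty"`

	// Possible future options would live here, as siblings of the sources:
	// ReclaimPolicy      string `json:"reclaimPolicy,omitempty"`      // e.g. delete or retain PVCs with the cluster
	// ResizeInUseVolumes *bool  `json:"resizeInUseVolumes,omitempty"` // resize behaviour, as CNPG/mariadb-operator do
}

// EmbeddedPersistentVolumeClaim mirrors the metadata + core/v1 PVC spec pair
// from the YAML example above.
type EmbeddedPersistentVolumeClaim struct {
	EmbeddedObjectMetadata `json:"metadata,omitempty"`
	Spec corev1.PersistentVolumeClaimSpec `json:"spec,omitempty"`
}

// EmbeddedObjectMetadata carries only labels and annotations.
type EmbeddedObjectMetadata struct {
	Labels      map[string]string `json:"labels,omitempty"`
	Annotations map[string]string `json:"annotations,omitempty"`
}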

[Proposal] API Design for EtcdCluster resource

Here is a design proposal for the EtcdCluster resource that covers more real-world use cases. Requesting comments.
Inspired by https://docs.victoriametrics.com/operator/api/#vmstorage

Cover scope of #61

---
-apiVersion: etcd.aenix.io/v1alpha1
+apiVersion: etcd.aenix.io/v1alpha2      
 kind: EtcdCluster
 metadata:
   name: test
   namespace: default
 spec:
   image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
+  imagePullSecrets:  # core.v1.LocalObjectReference Ready k8s type
   - name: myregistrykey
+  serviceAccountName: default
+  podMetadata:
+    labels:
+      env: prod
+    annotations:
+      example.com/annotation: "true"
+  resources: # core.v1.ResourceRequirements Ready k8s type
+    requests:
+      cpu: 100m
+      memory: 100Mi
+    limits:
+      cpu: 200m
+      memory: 200Mi
+  affinity: {} # core.v1.Affinity Ready k8s type
+  nodeSelector: {} # map[string]string
+  tolerations: [] # core.v1.Toleration Ready k8s type
+  securityContext: {} # core.v1.PodSecurityContext Ready k8s type
+  priorityClassName: "low"
+  topologySpreadConstraints: [] # core.v1.TopologySpreadConstraint Ready k8s type
+  terminationGracePeriodSeconds: 30 # int64
+  schedulerName: "default-scheduler"
+  runtimeClassName: "legacy"
+  extraArgs: # map[string]string
+    arg1: "value1"
+    arg2: "value2"
+  extraEnvs: # []core.v1.EnvVar Ready k8s type
+    - name: MY_ENV
+      value: "my-value"
+  serviceSpec:
+    metadata:
+      labels:
+        env: prod
+      annotations:
+        example.com/annotation: "true"
+    spec: # core.v1.ServiceSpec Ready k8s type
+  podDisruptionBudget:
+    maxUnavailable: 1 # intstr.IntOrString
+    minAvailable: 2
+    selectorLabels: # If not set, the operator will use the labels from the EtcdCluster
+      env: prod
+  readinessGates: [] # core.v1.PodReadinessGate Ready k8s type
+  storage:
+    volumeClaimTemplate:
+      metadata:
+        labels:
+          env: prod
+        annotations:
+          example.com/annotation: "true"
+      spec: # core.v1.PersistentVolumeClaimSpec Ready k8s type
+        storageClassName: gp3
+        accessModes: [ "ReadWriteOnce" ]
+        resources:
+          requests:
+            storage: 10Gi
+    emptyDir: {} # core.v1.EmptyDirVolumeSource Ready k8s type
-  storage:
-    persistence: true # default: true, immutable
-    storageClass: local-path
-    size: 10Gi
 status:
   conditions:
   - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:45Z"
    status: "True"
    type: Ready

E2e tests

Please collect the cases and write an approach for running e2e tests.

Internally we agreed:

  • To use the built-in kubebuilder support for running e2e tests
  • Use kind to run a one-time Kubernetes cluster in a pipeline
  • @sergeyshevch suggested implementing smoke tests and running them as part of e2e

Pass all initial options as envs included with envFrom

I'll just copy&paste our chat with @sircthulhu


kvaps:
I've got an idea: what if we move all four of these flags into a ConfigMap included with envFrom?

  • --initial-advertise-peer-urls
  • --initial-cluster
  • --initial-cluster-state
  • --initial-cluster-token

It seems they are only relevant during initialization and might change over the lifetime of the cluster itself.

Kir:
We can remove them after the cluster is created :)

Except for the state.

kvaps:
Hold on, but what about adding new replicas and re-bootstrapping the old ones? How else will they know where to join?

With this approach, they will always have up-to-date information at startup.

Kir:
Overall, you're right :)

kvaps:
From the operator's side, we would only need to implement the deletion of old replicas.

Scaling up should handle itself.
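
A hedged Go sketch of what such a ConfigMap could contain: etcd reads each --initial-* flag from the corresponding ETCD_* environment variable, while the member naming and headless-service scheme below are assumptions for illustration.

package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// initialClusterConfigMap builds a ConfigMap with the bootstrap options as
// etcd environment variables, to be injected via envFrom.
func initialClusterConfigMap(cluster, namespace string, replicas int, state string) *corev1.ConfigMap {
	peers := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		member := fmt.Sprintf("%s-%d", cluster, i)
		peers = append(peers, fmt.Sprintf("%s=https://%s.%s.%s.svc:2380", member, member, cluster, namespace))
	}
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: cluster + "-cluster-state", Namespace: namespace},
		Data: map[string]string{
			"ETCD_INITIAL_CLUSTER":       strings.Join(peers, ","),
			"ETCD_INITIAL_CLUSTER_STATE": state, // "new" on bootstrap, "existing" afterwards
			"ETCD_INITIAL_CLUSTER_TOKEN": cluster,
			// ETCD_INITIAL_ADVERTISE_PEER_URLS depends on the pod name, so it
			// would likely be set per pod in the StatefulSet template instead.
		},
	}
}

func main() {
	cm := initialClusterConfigMap("test", "ns1", 3, "new")
	fmt.Println(cm.Data["ETCD_INITIAL_CLUSTER"])
}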

Make e2e more representative

I think we should write e2e tests for basic functionality:

  • create etcd cluster
  • wait for EtcdCluster CR to be ready
  • check that pods are ready
  • try to connect to the etcd cluster and check that the number of alive members equals the number of replicas
  • try to put and get some basic data
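
For the member-count and put/get checks, here is a hedged sketch using the official go.etcd.io/etcd/client/v3 package; the endpoint is an assumption (e.g. a port-forwarded client Service).

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is an assumption for the test environment.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Check that the number of alive members equals the expected replica count.
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("members: %d\n", len(members.Members))

	// Basic put/get round trip.
	if _, err := cli.Put(ctx, "e2e-key", "e2e-value"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "e2e-key")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("value: %s\n", resp.Kvs[0].Value)
}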

CRDs deployment approach

The discussion comes from this issue.

Preconditions:

  • Issue about upgrading CRDs in Helm when using the crds directory. link.
  • ArgoCD uses its own hooks instead of Helm hooks. link.

Considering these factors, we need to choose a method for deploying CRDs. The options are:

  • Utilize the template directory in helm-chart (as an example, see: link).
  • Integrate this functionality directly into the application, which would include deploying CRDs to the cluster as part of the application's startup process (as an example, see: link).
  • Develop a separate Helm chart exclusively for CRDs (as an example, see: link).

Use `go install` instead of `go-install-tool`

Consider using a plain go install instead of the go-install-tool function in the Makefile.

etcd-operator/Makefile

Lines 191 to 203 in 94c538d

# go-install-tool will 'go install' any package with custom target and name of binary, if it doesn't exist
# $1 - target path with name of binary (ideally with version)
# $2 - package url which can be installed
# $3 - specific version of package
define go-install-tool
@[ -f $(1) ] || { \
set -e; \
package=$(2)@$(3) ;\
echo "Downloading $${package}" ;\
GOBIN=$(LOCALBIN) go install $${package} ;\
mv "$$(echo "$(1)" | sed "s/-$(3)$$//")" $(1) ;\
}
endef

We once had a similar discussion for piraeus-operator, so we can borrow its logic:

Write instructions how to run controller locally

We need to write a section in the docs explaining how to run the project in a local kind cluster.
Currently blocked by #11.

Steps to run the project:

  • kind create cluster
  • switch your kubectl context to kind (important! otherwise you can damage a production cluster)
  • install cert-manager using Helm or kustomize (see the cert-manager documentation)
  • make docker-build to build the Docker image with the tag controller:latest
  • kind load docker-image controller:latest to load the image into the cluster
  • make install (this installs the CRDs)
  • make deploy (this deploys the controller, roles, and certificates)

After making changes in the code, run the following to redeploy:

  • make docker-build (this regenerates manifests and builds the image)
  • make deploy to change YAML manifests if necessary
  • kubectl rollout restart -n etcd-operator-system deploy/etcd-operator-controller-manager

Implement Ready status condition

In general

We need to check the quorum status of an initialized cluster and update the Ready status condition accordingly. After initializing the cluster and making sure the pods found each other and formed a cluster, the controller must update the cluster-state ConfigMap to set the cluster state to existing (from new).

Design proposal

When the cluster is initialized, we should check that:

  1. The StatefulSet is ready
  2. We can connect to any etcd member and the quorum is fine
  • If everything is fine, set the Ready status condition and update the value in the cluster-state ConfigMap to existing.
  • If some pods in the StatefulSet aren't ready but quorum is fine, set Ready, but mark the cluster as degraded in the Reason field.
  • If pods in the StatefulSet aren't ready and quorum is lost, mark the cluster as not Ready.

If the ConfigMap already has the existing state, do not change it anymore, as the cluster should already be initialized.
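
A hedged Go sketch of how these rules could map onto a condition update with the standard apimachinery helper; the reason and message strings are assumptions.

package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setReadyCondition updates the Ready condition according to the checks
// above. Reason strings are illustrative.
func setReadyCondition(conditions *[]metav1.Condition, stsReady, quorumOK bool, generation int64) {
	cond := metav1.Condition{
		Type:               "Ready",
		ObservedGeneration: generation,
	}
	switch {
	case stsReady && quorumOK:
		cond.Status = metav1.ConditionTrue
		cond.Reason = "ClusterReady"
		cond.Message = "all members are ready and quorum is healthy"
	case !stsReady && quorumOK:
		cond.Status = metav1.ConditionTrue
		cond.Reason = "ClusterDegraded"
		cond.Message = "quorum is healthy but some pods are not ready"
	default:
		cond.Status = metav1.ConditionFalse
		cond.Reason = "QuorumLost"
		cond.Message = "quorum is lost"
	}
	// SetStatusCondition only bumps LastTransitionTime when Status changes.
	meta.SetStatusCondition(conditions, cond)
}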

Design and Implement a Cluster Scale Up/Down Mechanism

We need to design a mechanism for scaling a cluster up and down.

When a user modifies spec.replicas, the cluster should scale to the required number of replicas accordingly. Currently, we are utilizing a StatefulSet, but we understand that we might have to move away from it in favor of a custom pod controller.

Scaling up should work out of the box, but scaling down might be more complex due to several considerations:

  • The need to check the quorum state
  • The process of removing a replica from the cluster
  • Whether we should disallow scaling down etcd at all

We're open to suggestions on how to address these challenges and implement an efficient and reliable scaling mechanism.
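
For the scale-down path specifically, here is a hedged sketch of the order of operations with the official etcd client: the member is removed from the cluster first, and only then is the replica count reduced. How member names map to pods is an assumption.

package controller

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// removeMemberByName removes the etcd member whose name matches the pod that
// is about to be scaled away. Only after this succeeds should the StatefulSet
// (or a custom pod controller) reduce the replica count.
func removeMemberByName(ctx context.Context, cli *clientv3.Client, name string) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return fmt.Errorf("list members: %w", err)
	}
	for _, m := range resp.Members {
		if m.Name == name {
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				return fmt.Errorf("remove member %s: %w", name, err)
			}
			return nil
		}
	}
	return fmt.Errorf("member %q not found", name)
}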

Add dependabot

Internally we agreed that it would be nice to run Dependabot or Renovate for our repository.
The team has to decide which tool is better and implement it.

Remove cert-manager dependency (webhooks)

Reasons why we need to remove the cert-manager dependency:

  • OpenShift doesn't use cert-manager. It has its own solution for certificate handling.
  • A runtime dependency on cert-manager (another product) seems too heavy for etcd-operator.

Options to replace cert-manager

Move project to Kubernetes-SIGs

Hey everyone,

We've established a community effort to develop a unified etcd-operator. This project is entirely community-driven and currently comprises mainly members of the Russian-speaking Kubernetes community. Although we've just begun, there's already significant activity, with approximately 10 active developers on board.

We want two things:

  1. Donate this project to the CNCF
  2. Get a place under kubernetes-sigs

Implement serviceAccountSpec

as part of #109

serviceAccount: # TBD. How to represent it? Do we need the ability to specify an existing service account?
  create: true
  metadata:
    labels:
      env: prod
    annotations:
      example.com/annotation: "true"

Configure cosign

We should configure keys for cosign and add a section to the README on verifying images.

After we published release v0.0.1, the pipeline successfully built the image, but cosign failed.

Problem with pod restart stability

The standard lifecycle implies that any member may be restarted.

How to reproduce:
Restart any pod

Log:
{"level":"fatal","ts":"2024-03-31T06:04:27.129434Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"member 21cdb8e5d2d72088 has already been bootstrapped","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}

Write basic etcd-operator logic

We agreed on the following logic:

  • Create a ConfigMap with the env ETCD_INITIAL_CLUSTER_STATE=new
  • Create a StatefulSet with the needed number of replicas, envFrom: <configmap>, and podManagementPolicy: Parallel
  • Wait until all replicas become ready
  • Update the ConfigMap to ETCD_INITIAL_CLUSTER_STATE=existing
  • Consider the cluster bootstrapped

Currently blocked by #9.
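
A hedged sketch of the "wait until ready, then flip the state" step with controller-runtime; the object names and the ConfigMap key layout are assumptions.

package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markClusterExisting flips ETCD_INITIAL_CLUSTER_STATE to "existing" once the
// bootstrap StatefulSet reports all replicas ready.
func markClusterExisting(ctx context.Context, c client.Client, namespace, name string) error {
	sts := &appsv1.StatefulSet{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, sts); err != nil {
		return err
	}
	if sts.Spec.Replicas == nil || sts.Status.ReadyReplicas != *sts.Spec.Replicas {
		return nil // not ready yet; requeue and try again later
	}

	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name + "-cluster-state"}, cm); err != nil {
		return err
	}
	if cm.Data["ETCD_INITIAL_CLUSTER_STATE"] == "existing" {
		return nil // already bootstrapped; never flip back
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["ETCD_INITIAL_CLUSTER_STATE"] = "existing"
	return c.Update(ctx, cm)
}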

E2e tests and pipeline

We discussed internally the need to implement end-to-end (E2E) tests and establish a pipeline for them on pull requests (PRs). To initiate the pipeline, we could use a specific keyword in the commit message or manually trigger it within GitHub.

Implement ".spec.storage"

Internally we agreed to extend the spec like this:

 ---
 apiVersion: etcd.aenix.io/v1alpha1
 kind: EtcdCluster
 metadata:
   name: test
   namespace: ns1
 spec:
   image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
+  storage:
+    persistence: true # default: true, immutable
+    storageClass: local-path
+    size: 10Gi
 status:
   conditions:
   - lastProbeTime: null
     lastTransitionTime: "2024-03-06T18:39:45Z"
     status: "True"
     type: Ready

Move EtcdCluster update call out of defer function

In case an error occurs during the EtcdCluster resource status update (https://github.com/aenix-io/etcd-operator/blob/main/internal/controller/etcdcluster_controller.go#L84-L88):

...
defer func() {
	if err := r.Status().Update(ctx, instance); err != nil && !errors.IsConflict(err) {
		logger.Error(err, "unable to update cluster")
	}
}()
...
return ctrl.Result{}, nil

it will not be returned and the request will not be requeued. Therefore, the update call should be moved out of the defer function.
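
A hedged sketch of the suggested shape (the reconciler and type names are placeholders): the status update happens explicitly at the end of reconciliation and its error is returned so controller-runtime requeues the request.

package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconciler is a placeholder standing in for EtcdClusterReconciler.
type reconciler struct {
	client.Client
}

func (r *reconciler) reconcile(ctx context.Context, instance client.Object) (ctrl.Result, error) {
	// ... reconciliation logic ...

	if err := r.Status().Update(ctx, instance); err != nil {
		if errors.IsConflict(err) {
			// A conflict just means the object changed; retry on the next pass.
			return ctrl.Result{Requeue: true}, nil
		}
		// Returning the error makes controller-runtime requeue the request.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}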

Steps to publish new release

I have published releases v0.0.1 and v0.0.2, and now we should write documentation on how to publish a release.

CreateOrUpdate cluster auxiliary entities

On cluster bootstrap we create two Services, a ConfigMap, and a StatefulSet. After they are created, they won't be updated unless they're deleted by the user. We should fix this and use the CreateOrUpdate pattern.
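
A hedged sketch of the pattern with controller-runtime's controllerutil.CreateOrUpdate, shown for the ConfigMap; the same approach applies to the Services and the StatefulSet, and the names here are assumptions.

package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureConfigMap creates the ConfigMap if it is missing and otherwise patches
// it back to the desired data, so user edits or deletions are reconciled away.
func ensureConfigMap(ctx context.Context, c client.Client, namespace, name string, data map[string]string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, c, cm, func() error {
		// The mutate function sets the desired state on both create and update.
		cm.Data = data
		return nil
	})
	return err
}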

Write basic etcd-operator spec

We agreed on the following spec:

---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: ns1
spec:
  replicas: 3
  storage:
    storageClass: local-path
    size: 10Gi
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:39Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:45Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:45Z"
    status: "True"
    type: Synchronized
