
etcd-operator's People

Contributors

alexgluck, andreykont, aobort, gecube, hiddenmarten, ilya-hontarau, kirill-garbar, kvaps, sergeyshevch, sircthulhu, uburro


etcd-operator's Issues

Make an option to create PDB for cluster

I think we should add an option to create a PDB for the cluster.

 ---
 apiVersion: etcd.aenix.io/v1alpha1
 kind: EtcdCluster
 metadata:
   name: test
   namespace: ns1
 spec:
   image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
   storage:
     persistence: true # default: true, immutable
     storageClass: local-path
     size: 10Gi
+  enablePDB: true
 status:
   conditions:
   - lastProbeTime: null
     lastTransitionTime: "2024-03-06T18:39:45Z"
     status: "True"
     type: Ready

By default it should be true, meaning that a PDB resource will be created with the maxUnavailable field set to the maximum number of members that can die without losing quorum.
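
For an N-member cluster, quorum requires floor(N/2)+1 alive members, so at most floor((N-1)/2) members can be disrupted at once. Below is a minimal Go sketch of how the operator could derive maxUnavailable and build the PDB; the label selector and naming are assumptions, not the agreed API.

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// quorumTolerance returns how many etcd members can be unavailable while the
// cluster still keeps quorum: (replicas - 1) / 2.
func quorumTolerance(replicas int32) int32 {
	if replicas <= 1 {
		return 0
	}
	return (replicas - 1) / 2
}

// pdbForCluster builds a PodDisruptionBudget for an EtcdCluster. The label
// key used in the selector is an illustrative assumption.
func pdbForCluster(name, namespace string, replicas int32) *policyv1.PodDisruptionBudget {
	maxUnavailable := intstr.FromInt32(quorumTolerance(replicas))
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app.kubernetes.io/instance": name},
			},
		},
	}
}

func main() {
	// A 3-member cluster tolerates 1 disrupted member, a 5-member cluster 2.
	fmt.Println(quorumTolerance(3), quorumTolerance(5)) // 1 2
	_ = pdbForCluster("test", "ns1", 3)
}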

Implement ".spec.image"

Internally we agreed to extend the spec like this:

 ---
 apiVersion: etcd.aenix.io/v1alpha1
 kind: EtcdCluster
 metadata:
   name: test
   namespace: ns1
 spec:
+  image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
 status:
   conditions:
   - lastProbeTime: null
     lastTransitionTime: "2024-03-06T18:39:45Z"
     status: "True"
     type: Ready

Helm chart

Helm chart for installing etcd-operator

Improve conditions processing

Fill basic project information

There is a need to create the following public documents:

  • README.md
  • LICENSE
  • Roadmap
  • Contributing Guide
  • Code of Conduct (CoC)
  • Adopters
  • Maintainers file

Additional storage parameters

From here #67 (comment)

The new spec looks good. The only question is: if we do this:

storage:
  volumeClaimTemplate:
    metadata:
      labels:
        env: prod
      annotations:
        example.com/annotation: "true"
    spec: # core.v1.PersistentVolumeClaimSpec Ready k8s type
      storageClassName: gp3
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
  emptyDir: {} # core.v1.EmptyDirVolumeSource Ready k8s type

then where will we add possible future options, such as:

  1. auto-deletion of PVC after deleting the EtcdCluster object
  2. options for resizing PVCs (resizeInUseVolumes, as CNPG and mariadb-operator do)
  3. option to manage storage migration
  4. any other options for storage in the future
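
One way to keep room for those options is to model storage as its own struct in the Go API, so future knobs become siblings of the volume sources instead of being squeezed into the PVC template. A hedged sketch follows; all field and type names are illustrative assumptions, not the agreed API.

package v1alpha1

import corev1 "k8s.io/api/core/v1"

// StorageSpec sketches how the storage section could be modelled so that
// future options (PVC retention, resize behaviour, migration) can be added
// next to the volume sources.
type StorageSpec struct {
	// VolumeClaimTemplate is rendered into the StatefulSet's PVC template.
	VolumeClaimTemplate *EmbeddedPersistentVolumeClaim `json:"volumeClaimTemplate,omitempty"`
	// EmptyDir requests ephemeral storage instead of a PVC.
	EmptyDir *corev1.EmptyDirVolumeSource `json:"emptyDir,omitempty"`

	// Possible future options would live here, as siblings of the sources:
	// ReclaimPolicy      string `json:"reclaimPolicy,omitempty"`      // e.g. delete or retain PVCs with the cluster
	// ResizeInUseVolumes *bool  `json:"resizeInUseVolumes,omitempty"` // resize behaviour, as CNPG/mariadb-operator do
}

// EmbeddedPersistentVolumeClaim mirrors the metadata + core/v1 PVC spec pair
// from the YAML example above.
type EmbeddedPersistentVolumeClaim struct {
	EmbeddedObjectMetadata `json:"metadata,omitempty"`
	Spec corev1.PersistentVolumeClaimSpec `json:"spec,omitempty"`
}

// EmbeddedObjectMetadata carries only labels and annotations.
type EmbeddedObjectMetadata struct {
	Labels      map[string]string `json:"labels,omitempty"`
	Annotations map[string]string `json:"annotations,omitempty"`
}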

[Proposal] API Design for EtcdCluster resource

Here is a design proposal for the EtcdCluster resource that covers more real-world use cases. Requesting comments.
Inspired by https://docs.victoriametrics.com/operator/api/#vmstorage

Cover scope of #61

---
-apiVersion: etcd.aenix.io/v1alpha1
+apiVersion: etcd.aenix.io/v1alpha2      
 kind: EtcdCluster
 metadata:
   name: test
   namespace: default
 spec:
   image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
+  imagePullSecrets:  # core.v1.LocalObjectReference Ready k8s type
   - name: myregistrykey
+  serviceAccountName: default
+  podMetadata:
+    labels:
+      env: prod
+    annotations:
+      example.com/annotation: "true"
+  resources: # core.v1.ResourceRequirements Ready k8s type
+    requests:
+      cpu: 100m
+      memory: 100Mi
+    limits:
+      cpu: 200m
+      memory: 200Mi
+  affinity: {} # core.v1.Affinity Ready k8s type
+  nodeSelector: {} # map[string]string
+  tolerations: [] # core.v1.Toleration Ready k8s type
+  securityContext: {} # core.v1.PodSecurityContext Ready k8s type
+  priorityClassName: "low"
+  topologySpreadConstraints: [] # core.v1.TopologySpreadConstraint Ready k8s type
+  terminationGracePeriodSeconds: 30 # int64
+  schedulerName: "default-scheduler"
+  runtimeClassName: "legacy"
+  extraArgs: # map[string]string
+    arg1: "value1"
+    arg2: "value2"
+  extraEnvs: # []core.v1.EnvVar Ready k8s type
+    - name: MY_ENV
+      value: "my-value"
+  serviceSpec:
+    metadata:
+      labels:
+        env: prod
+      annotations:
+        example.com/annotation: "true"
+    spec: # core.v1.ServiceSpec Ready k8s type
+  podDisruptionBudget:
+    maxUnavailable: 1 # intstr.IntOrString
+    minAvailable: 2
+    selectorLabels: # If not set, the operator will use the labels from the EtcdCluster
+      env: prod
+  readinessGates: [] # core.v1.PodReadinessGate Ready k8s type
+  storage:
+    volumeClaimTemplate:
+      metadata:
+        labels:
+          env: prod
+        annotations:
+          example.com/annotation: "true"
+      spec: # core.v1.PersistentVolumeClaimSpec Ready k8s type
+        storageClassName: gp3
+        accessModes: [ "ReadWriteOnce" ]
+        resources:
+          requests:
+            storage: 10Gi
+    emptyDir: {} # core.v1.EmptyDirVolumeSource Ready k8s type
-  storage:
-    persistence: true # default: true, immutable
-    storageClass: local-path
-    size: 10Gi
 status:
   conditions:
   - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:45Z"
    status: "True"
    type: Ready

E2e tests

Please collect the cases and write an approach for running e2e tests.

Internally we agreed:

  • To use the built-in kubebuilder support for running e2e tests
  • Use kind to run a one-time Kubernetes cluster in a pipeline
  • @sergeyshevch suggested implementing smoke tests and running them as part of e2e

Pass all initial options as envs included with envFrom

I'll just copy&paste our chat with @sircthulhu


kvaps:
I've got an idea: what if we move all four of these flags into a ConfigMap included with envFrom?

  • --initial-advertise-peer-urls
  • --initial-cluster
  • --initial-cluster-state
  • --initial-cluster-token

It seems they are only relevant during initialization and might change over the lifetime of the cluster itself.

Kir:
We can remove them after the cluster is created :)

Except for the state.

kvaps:
Hold on, but what about adding new replicas and re-bootstrapping the old ones? How else will they know where to join?

With this approach, they will always have up-to-date information at startup.

Kir:
Overall, you're right :)

kvaps:
From the operator's side, we would only need to implement the deletion of old replicas.

Scaling up should handle itself.
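
A hedged Go sketch of what such a ConfigMap could contain: etcd reads each --initial-* flag from the corresponding ETCD_* environment variable, while the member naming and headless-service scheme below are assumptions for illustration.

package main

import (
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// initialClusterConfigMap builds a ConfigMap with the bootstrap options as
// etcd environment variables, to be injected via envFrom.
func initialClusterConfigMap(cluster, namespace string, replicas int, state string) *corev1.ConfigMap {
	peers := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		member := fmt.Sprintf("%s-%d", cluster, i)
		peers = append(peers, fmt.Sprintf("%s=https://%s.%s.%s.svc:2380", member, member, cluster, namespace))
	}
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: cluster + "-cluster-state", Namespace: namespace},
		Data: map[string]string{
			"ETCD_INITIAL_CLUSTER":       strings.Join(peers, ","),
			"ETCD_INITIAL_CLUSTER_STATE": state, // "new" on bootstrap, "existing" afterwards
			"ETCD_INITIAL_CLUSTER_TOKEN": cluster,
			// ETCD_INITIAL_ADVERTISE_PEER_URLS depends on the pod name, so it
			// would likely be set per pod in the StatefulSet template instead.
		},
	}
}

func main() {
	cm := initialClusterConfigMap("test", "ns1", 3, "new")
	fmt.Println(cm.Data["ETCD_INITIAL_CLUSTER"])
}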

Make e2e more representative

I think we should write e2e tests for basic functionality:

  • create etcd cluster
  • wait for EtcdCluster CR to be ready
  • check that pods are ready
  • try to connect to the etcd cluster and check that the number of alive members equals the number of replicas
  • try to put and get some basic data
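
For the member-count and put/get checks, here is a hedged sketch using the official go.etcd.io/etcd/client/v3 package; the endpoint is an assumption (e.g. a port-forwarded client Service).

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Endpoint is an assumption for the test environment.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Check that the number of alive members equals the expected replica count.
	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("members: %d\n", len(members.Members))

	// Basic put/get round trip.
	if _, err := cli.Put(ctx, "e2e-key", "e2e-value"); err != nil {
		log.Fatal(err)
	}
	resp, err := cli.Get(ctx, "e2e-key")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("value: %s\n", resp.Kvs[0].Value)
}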

CRDs deployment approach

The discussion comes from this issue.

Preconditions:

  • Issue about upgrading CRDs in Helm when using the crds directory. link.
  • ArgoCD uses its own hooks instead of Helm hooks. link.

Considering these factors, we need to choose a method for deploying CRDs. The options are:

  • Utilize the template directory in helm-chart (as an example, see: link).
  • Integrate this functionality directly into the application, which would include deploying CRDs to the cluster as part of the application's startup process (as an example, see: link).
  • Develop a separate Helm chart exclusively for CRDs (as an example, see: link).

Use `go install` instead of `go-install-tool`

Consider using a plain go install instead of the go-install-tool function in the Makefile.

etcd-operator/Makefile

Lines 191 to 203 in 94c538d

# go-install-tool will 'go install' any package with custom target and name of binary, if it doesn't exist
# $1 - target path with name of binary (ideally with version)
# $2 - package url which can be installed
# $3 - specific version of package
define go-install-tool
@[ -f $(1) ] || { \
set -e; \
package=$(2)@$(3) ;\
echo "Downloading $${package}" ;\
GOBIN=$(LOCALBIN) go install $${package} ;\
mv "$$(echo "$(1)" | sed "s/-$(3)$$//")" $(1) ;\
}
endef

We once had a similar discussion for piraeus-operator, so we can borrow its logic:

Write instructions how to run controller locally

We need to write a section in the docs explaining how to run the project in a local kind cluster.
Currently blocked by #11.

Steps to run the project:

  • kind create cluster
  • switch your kubectl context to kind (important! otherwise you can damage a production cluster)
  • install cert-manager using Helm or kustomize (see the cert-manager documentation)
  • make docker-build to build the Docker image with the tag controller:latest
  • kind load docker-image controller:latest to load the image into the cluster
  • make install (this installs the CRDs)
  • make deploy (this deploys the controller, roles, and certificates)

After making changes in the code, run the following to redeploy:

  • make docker-build (this regenerates manifests and builds the image)
  • make deploy to change YAML manifests if necessary
  • kubectl rollout restart -n etcd-operator-system deploy/etcd-operator-controller-manager

Implement Ready status condition

In general

We need to check the quorum status of an initialized cluster and update the Ready status condition accordingly. After initializing the cluster and making sure the pods found each other and formed a cluster, the controller must update the cluster-state ConfigMap to set the cluster state to existing (from new).

Design proposal

When the cluster is initialized, we should check that:

  1. The StatefulSet is ready
  2. We can connect to any etcd member and the quorum is fine
  • If everything is fine, set the Ready status condition and update the value in the cluster-state ConfigMap to existing.
  • If some pods in the StatefulSet aren't ready but quorum is fine, set Ready, but mark the cluster as degraded in the Reason field.
  • If pods in the StatefulSet aren't ready and quorum is lost, mark the cluster as not Ready.

If the ConfigMap already has the existing state, do not change it anymore, as the cluster should already be initialized.
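
A hedged Go sketch of how these rules could map onto a condition update with the standard apimachinery helper; the reason and message strings are assumptions.

package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setReadyCondition updates the Ready condition according to the checks
// above. Reason strings are illustrative.
func setReadyCondition(conditions *[]metav1.Condition, stsReady, quorumOK bool, generation int64) {
	cond := metav1.Condition{
		Type:               "Ready",
		ObservedGeneration: generation,
	}
	switch {
	case stsReady && quorumOK:
		cond.Status = metav1.ConditionTrue
		cond.Reason = "ClusterReady"
		cond.Message = "all members are ready and quorum is healthy"
	case !stsReady && quorumOK:
		cond.Status = metav1.ConditionTrue
		cond.Reason = "ClusterDegraded"
		cond.Message = "quorum is healthy but some pods are not ready"
	default:
		cond.Status = metav1.ConditionFalse
		cond.Reason = "QuorumLost"
		cond.Message = "quorum is lost"
	}
	// SetStatusCondition only bumps LastTransitionTime when Status changes.
	meta.SetStatusCondition(conditions, cond)
}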

Design and Implement a Cluster Scale Up/Down Mechanism

We need to design a mechanism for scaling a cluster up and down.

When a user modifies spec.replicas, the cluster should scale to the required number of replicas accordingly. Currently, we are utilizing a StatefulSet, but we understand that we might have to move away from it in favor of a custom pod controller.

Scaling up should work out of the box, but scaling down might be more complex due to several considerations:

  • The need to check the quorum state
  • The process of removing a replica from the cluster
  • Whether we should disallow scaling down etcd at all

We're open to suggestions on how to address these challenges and implement an efficient and reliable scaling mechanism.
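
For the scale-down path specifically, here is a hedged sketch of the order of operations with the official etcd client: the member is removed from the cluster first, and only then is the replica count reduced. How member names map to pods is an assumption.

package controller

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// removeMemberByName removes the etcd member whose name matches the pod that
// is about to be scaled away. Only after this succeeds should the StatefulSet
// (or a custom pod controller) reduce the replica count.
func removeMemberByName(ctx context.Context, cli *clientv3.Client, name string) error {
	resp, err := cli.MemberList(ctx)
	if err != nil {
		return fmt.Errorf("list members: %w", err)
	}
	for _, m := range resp.Members {
		if m.Name == name {
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				return fmt.Errorf("remove member %s: %w", name, err)
			}
			return nil
		}
	}
	return fmt.Errorf("member %q not found", name)
}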

Add dependabot

Internally we agreed that it would be nice to run Dependabot or Renovate for our repository.
The team has to decide which tool is better and implement it.

Remove cert-manager dependency (webhooks)

Reasons why we need to remove the cert-manager dependency:

  • OpenShift doesn't use cert-manager. It has its own solution for certificate handling.
  • A runtime dependency on cert-manager (another product) seems too heavy for etcd-operator.

Options to replace cert-manager

Move project to Kubernetes-SIGs

Hey everyone,

We've established a community effort to develop a unified etcd-operator. This project is entirely community-driven and currently comprises mainly members of the Russian-speaking Kubernetes community. Although we've just begun, there's already significant activity, with approximately 10 active developers on board.

We want two things:

  1. Donate this project to the CNCF
  2. Get a place under kubernetes-sigs

Implement serviceAccountSpec

as part of #109

serviceAccount: # TBD. How to represent it? Do we need the ability to specify an existing service account?
  create: true
  metadata:
    labels:
      env: prod
    annotations:
      example.com/annotation: "true"

Configure cosign

We should configure keys for cosign and add a section to the README on verifying images.

After we published release v0.0.1, the pipeline successfully built the image, but cosign failed.

Problem with pod restart stability

The standard lifecycle implies that any member may be restarted.

How to reproduce:
Restart any pod

Log:
{"level":"fatal","ts":"2024-03-31T06:04:27.129434Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"member 21cdb8e5d2d72088 has already been bootstrapped","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}

Write basic etcd-operator logic

We agreed on the following logic:

  • Create a ConfigMap with the env ETCD_INITIAL_CLUSTER_STATE=new
  • Create a StatefulSet with the needed number of replicas, envFrom: <configmap>, and podManagementPolicy: Parallel
  • Wait until all replicas become ready
  • Update the ConfigMap to ETCD_INITIAL_CLUSTER_STATE=existing
  • Consider the cluster bootstrapped

Currently blocked by #9.
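
A hedged sketch of the "wait until ready, then flip the state" step with controller-runtime; the object names and the ConfigMap key layout are assumptions.

package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markClusterExisting flips ETCD_INITIAL_CLUSTER_STATE to "existing" once the
// bootstrap StatefulSet reports all replicas ready.
func markClusterExisting(ctx context.Context, c client.Client, namespace, name string) error {
	sts := &appsv1.StatefulSet{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, sts); err != nil {
		return err
	}
	if sts.Spec.Replicas == nil || sts.Status.ReadyReplicas != *sts.Spec.Replicas {
		return nil // not ready yet; requeue and try again later
	}

	cm := &corev1.ConfigMap{}
	if err := c.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name + "-cluster-state"}, cm); err != nil {
		return err
	}
	if cm.Data["ETCD_INITIAL_CLUSTER_STATE"] == "existing" {
		return nil // already bootstrapped; never flip back
	}
	if cm.Data == nil {
		cm.Data = map[string]string{}
	}
	cm.Data["ETCD_INITIAL_CLUSTER_STATE"] = "existing"
	return c.Update(ctx, cm)
}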

E2e tests and pipeline

We discussed internally the need to implement end-to-end (E2E) tests and establish a pipeline for them on pull requests (PRs). To initiate the pipeline, we could use a specific keyword in the commit message or manually trigger it within GitHub.

Implement ".spec.storage"

Internally we agreed to extend the spec like this:

 ---
 apiVersion: etcd.aenix.io/v1alpha1
 kind: EtcdCluster
 metadata:
   name: test
   namespace: ns1
 spec:
   image: "quay.io/coreos/etcd:v3.5.12"
   replicas: 3
+  storage:
+    persistence: true # default: true, immutable
+    storageClass: local-path
+    size: 10Gi
 status:
   conditions:
   - lastProbeTime: null
     lastTransitionTime: "2024-03-06T18:39:45Z"
     status: "True"
     type: Ready

Move EtcdCluster update call out of defer function

In case an error occurs during the EtcdCluster resource status update (https://github.com/aenix-io/etcd-operator/blob/main/internal/controller/etcdcluster_controller.go#L84-L88):

...
defer func() {
	if err := r.Status().Update(ctx, instance); err != nil && !errors.IsConflict(err) {
		logger.Error(err, "unable to update cluster")
	}
}()
...
return ctrl.Result{}, nil

it will not be returned and the request will not be requeued. Therefore, the update call should be moved out of the defer function.
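
A hedged sketch of the suggested shape (the reconciler and type names are placeholders): the status update happens explicitly at the end of reconciliation and its error is returned so controller-runtime requeues the request.

package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconciler is a placeholder standing in for EtcdClusterReconciler.
type reconciler struct {
	client.Client
}

func (r *reconciler) reconcile(ctx context.Context, instance client.Object) (ctrl.Result, error) {
	// ... reconciliation logic ...

	if err := r.Status().Update(ctx, instance); err != nil {
		if errors.IsConflict(err) {
			// A conflict just means the object changed; retry on the next pass.
			return ctrl.Result{Requeue: true}, nil
		}
		// Returning the error makes controller-runtime requeue the request.
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}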

Steps to publish new release

I have published releases v0.0.1 and v0.0.2, and now we should write documentation on how to publish a release.

CreateOrUpdate cluster auxiliary entities

On cluster bootstrap we create two Services, a ConfigMap, and a StatefulSet. After they are created, they won't be updated unless they're deleted by the user. We should fix this and use the CreateOrUpdate pattern.
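
A hedged sketch of the pattern with controller-runtime's controllerutil.CreateOrUpdate, shown for the ConfigMap; the same approach applies to the Services and the StatefulSet, and the names here are assumptions.

package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// ensureConfigMap creates the ConfigMap if it is missing and otherwise patches
// it back to the desired data, so user edits or deletions are reconciled away.
func ensureConfigMap(ctx context.Context, c client.Client, namespace, name string, data map[string]string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, c, cm, func() error {
		// The mutate function sets the desired state on both create and update.
		cm.Data = data
		return nil
	})
	return err
}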

Write basic etcd-operator spec

We agreed on the following spec:

---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: ns1
spec:
  replicas: 3
  storage:
    storageClass: local-path
    size: 10Gi
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:39Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:45Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-06T18:39:45Z"
    status: "True"
    type: Synchronized
