aenix-io / etcd-operator
New generation community-driven etcd-operator!
Home Page: https://etcd.aenix.io
License: Apache License 2.0
From here: #67 (comment)
The new spec looks good. The only question: if we do this:
storage:
  volumeClaimTemplate:
    metadata:
      labels:
        env: prod
      annotations:
        example.com/annotation: "true"
    spec: # core.v1.PersistentVolumeClaimSpec Ready k8s type
      storageClassName: gp3
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
  emptyDir: {} # core.v1.EmptyDirVolumeSource Ready k8s type
then where will we add possible options in the future?
@aobort in #84 (comment) found that resource updates work correctly without copying the ResourceVersion into the newly generated structure in k8s 1.29+.
We need to verify this for different k8s versions and remove the copying if it isn't required.
We agreed on the following spec:
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: ns1
spec:
  replicas: 3
  storage:
    storageClass: local-path
    size: 10Gi
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-03-06T18:39:39Z"
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2024-03-06T18:39:45Z"
      status: "True"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2024-03-06T18:39:45Z"
      status: "True"
      type: Synchronized
Reasons why we need to remove the cert-manager dependency:
Options to replace cert-manager
as part of #109
serviceAccount: # TBD. How to represent it? Do we need the ability to specify an existing service account?
  create: true
  metadata:
    labels:
      env: prod
    annotations:
      example.com/annotation: "true"
If an error occurs during an EtcdCluster resource update (https://github.com/aenix-io/etcd-operator/blob/main/internal/controller/etcdcluster_controller.go#L84-L88):
...
defer func() {
    if err := r.Status().Update(ctx, instance); err != nil && !errors.IsConflict(err) {
        logger.Error(err, "unable to update cluster")
    }
}()
...
return ctrl.Result{}, nil
the error will not be returned and the request will not be requeued. Therefore, the update call should be moved out of the deferred function.
I think we should write e2e tests for basic functionality
as part of #109
spec:
  podTemplate:
    spec:
      serviceAccountName: default
      imagePullSecrets:
      readinessGates: [] # core.v1.PodReadinessGate Ready k8s type
Please collect the cases and write an approach for running e2e tests.
Internally we agreed:
Line 44 in 94c538d
Here is a design proposal for the EtcdCluster resource that will cover more real-world usage cases. Requesting comments.
Inspired by https://docs.victoriametrics.com/operator/api/#vmstorage
Cover scope of #61
---
-apiVersion: etcd.aenix.io/v1alpha1
+apiVersion: etcd.aenix.io/v1alpha2
kind: EtcdCluster
metadata:
  name: test
  namespace: default
spec:
  image: "quay.io/coreos/etcd:v3.5.12"
  replicas: 3
+ imagePullSecrets: # core.v1.LocalObjectReference Ready k8s type
+ - name: myregistrykey
+ serviceAccountName: default
+ podMetadata:
+   labels:
+     env: prod
+   annotations:
+     example.com/annotation: "true"
+ resources: # core.v1.ResourceRequirements Ready k8s type
+   requests:
+     cpu: 100m
+     memory: 100Mi
+   limits:
+     cpu: 200m
+     memory: 200Mi
+ affinity: {} # core.v1.Affinity Ready k8s type
+ nodeSelector: {} # map[string]string
+ tolerations: [] # core.v1.Toleration Ready k8s type
+ securityContext: {} # core.v1.PodSecurityContext Ready k8s type
+ priorityClassName: "low"
+ topologySpreadConstraints: [] # core.v1.TopologySpreadConstraint Ready k8s type
+ terminationGracePeriodSeconds: 30 # int64
+ schedulerName: "default-scheduler"
+ runtimeClassName: "legacy"
+ extraArgs: # map[string]string
+   arg1: "value1"
+   arg2: "value2"
+ extraEnvs: # []core.v1.EnvVar Ready k8s type
+ - name: MY_ENV
+   value: "my-value"
+ serviceSpec:
+   metadata:
+     labels:
+       env: prod
+     annotations:
+       example.com/annotation: "true"
+   spec: # core.v1.ServiceSpec Ready k8s type
+ podDisruptionBudget:
+   maxUnavailable: 1 # intstr.IntOrString
+   minAvailable: 2
+   selectorLabels: # If not set, the operator will use the labels from the EtcdCluster
+     env: prod
+ readinessGates: [] # core.v1.PodReadinessGate Ready k8s type
+ storage:
+   volumeClaimTemplate:
+     metadata:
+       labels:
+         env: prod
+       annotations:
+         example.com/annotation: "true"
+     spec: # core.v1.PersistentVolumeClaimSpec Ready k8s type
+       storageClassName: gp3
+       accessModes: [ "ReadWriteOnce" ]
+       resources:
+         requests:
+           storage: 10Gi
+   emptyDir: {} # core.v1.EmptyDirVolumeSource Ready k8s type
- storage:
-   persistence: true # default: true, immutable
-   storageClass: local-path
-   size: 10Gi
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-03-06T18:39:45Z"
      status: "True"
      type: Ready
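One detail an implementation of the proposed extraArgs field would have to handle: it is a map, and Go map iteration order is randomized, so the flags should be rendered deterministically to avoid spurious StatefulSet diffs on every reconcile. A sketch (the function name is illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// buildArgs renders an extraArgs map into sorted --key=value flags.
// Sorting keys makes the output deterministic, so repeated reconciles
// produce identical pod specs.
func buildArgs(extraArgs map[string]string) []string {
	keys := make([]string, 0, len(extraArgs))
	for k := range extraArgs {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	args := make([]string, 0, len(keys))
	for _, k := range keys {
		args = append(args, fmt.Sprintf("--%s=%s", k, extraArgs[k]))
	}
	return args
}

func main() {
	fmt.Println(buildArgs(map[string]string{"arg2": "value2", "arg1": "value1"}))
	// [--arg1=value1 --arg2=value2]
}
```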
There is a need to create a public:
As we found out in PR #41, if any other label is added in addition to ok-to-test, autotests will be skipped. We should fix this.
Hey everyone,
We've established a community effort to develop a unified etcd-operator. This project is entirely community-driven and currently comprises mainly members of the Russian-speaking Kubernetes community. Although we've just begun, there's already significant activity, with approximately 10 active developers onboard.
We want two things:
Let's collect feedback from potential adopters:
And inform previous etcd-operators' developers:
Follow up for: #72 (comment)
We need to add a new field to podSpec, agree with the maintainers on the final spec, and implement it as the DoD for this issue.
Internally we agreed that it would be nice to run Dependabot or Renovate for our repository.
The team has to decide which tool is better and implement it.
Consider replacing defaulting webhook with native CRD validation:
etcd-operator/api/v1alpha1/etcdcluster_webhook.go
Lines 45 to 47 in 94c538d
I think we should add an option to create a PDB for the cluster.
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: ns1
spec:
  image: "quay.io/coreos/etcd:v3.5.12"
  replicas: 3
  storage:
    persistence: true # default: true, immutable
    storageClass: local-path
    size: 10Gi
+ enablePDB: true
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-03-06T18:39:45Z"
      status: "True"
      type: Ready
By default it should be true, meaning that a PDB resource will be created with the maxUnavailable field equal to the maximum number of members that can die without losing quorum.
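For an n-member etcd cluster, quorum is floor(n/2) + 1, so the number of members that can die without losing quorum is n minus quorum. A small sketch of the arithmetic (function names are illustrative):

```go
package main

import "fmt"

// quorum returns the minimum number of live members an etcd
// cluster of size n needs: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

// maxUnavailable is the default PDB value discussed above: the
// maximum number of members that can die without losing quorum.
func maxUnavailable(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("replicas=%d quorum=%d maxUnavailable=%d\n", n, quorum(n), maxUnavailable(n))
	}
	// replicas=1 quorum=1 maxUnavailable=0
	// replicas=3 quorum=2 maxUnavailable=1
	// replicas=5 quorum=3 maxUnavailable=2
}
```

Note that for a single-member cluster this yields maxUnavailable: 0, which blocks voluntary evictions entirely; the operator may want to skip PDB creation in that case.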
I just tried to publish tag v0.0.0 on PR #21 and got this error:
https://github.com/aenix-io/etcd-operator/actions/runs/8292778100/job/22694650154
Our project settings provide no way to enable write permissions via the UI, so it seems this should be solved in the pipeline's YAML.
@hiddenmarten could you please elaborate?
We need to implement checking the quorum status of an initialized cluster and update the Ready status condition accordingly. After initializing the cluster and making sure the pods have found each other and formed a cluster, the controller must update the cluster state configmap to set the cluster state to existing (from new).
When the cluster is initialized, we should check:
the Ready condition
the existing state
If the configmap already has the existing state, do not change it anymore, as the cluster should already be initialized.
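The state transition above can be sketched as a pure function (names are illustrative, not the controller's actual code): start with "new", switch to "existing" once the cluster has formed, and never flip back.

```go
package main

import "fmt"

// nextClusterState decides what to write into the cluster state
// configmap: "existing" is sticky, and "new" becomes "existing"
// only after the members have found each other and formed a cluster.
func nextClusterState(current string, clusterFormed bool) string {
	if current == "existing" {
		return "existing" // already initialized: do not change it anymore
	}
	if clusterFormed {
		return "existing"
	}
	return "new"
}

func main() {
	fmt.Println(nextClusterState("new", false))      // new
	fmt.Println(nextClusterState("new", true))       // existing
	fmt.Println(nextClusterState("existing", false)) // existing: sticky even if quorum is lost
}
```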
Right now the main button on our documentation website directs to a 404 page:
Implement podspec according to proposal #62
Please create a DEVELOPMENT.md file where we will collect all the information about our project development.
Initially it should contain information about how we make releases: just a note that we use the standard GitHub release feature.
Additionally, it can include a link to https://github.com/aenix-io/etcd-operator/blob/main/CONTRIBUTING.md
We need to design a mechanism for scaling a cluster up and down.
When a user modifies spec.replicas
, the cluster should scale to the required number of replicas accordingly. Currently, we are utilizing a StatefulSet, but we understand that we might have to move away from it in favor of a custom pod controller.
Scaling up should work out of the box, but scaling down might be more complex due to several considerations:
We're open to suggestions on how to address these challenges and implement an efficient and reliable scaling mechanism.
Internally we agreed to extend the spec like this:
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: ns1
spec:
+ image: "quay.io/coreos/etcd:v3.5.12"
  replicas: 3
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-03-06T18:39:45Z"
      status: "True"
      type: Ready
We should configure keys for cosign and add a section in the README for verifying images.
After we published release v0.0.1, the pipeline successfully built the image, but cosign failed.
Helm chart for installing etcd-operator
I have published releases v0.0.1 and v0.0.2, and now we should write documentation about how to publish a release.
On cluster bootstrap we create two Services, a ConfigMap, and a StatefulSet. After they are created, they won't be updated unless they're deleted by the user. We should fix this and use the CreateOrUpdate pattern.
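controller-runtime ships controllerutil.CreateOrUpdate for exactly this purpose. The toy below (an in-memory stand-in, not the real client) only illustrates its create-then-mutate semantics: later reconciles converge existing objects to the desired state instead of leaving them untouched.

```go
package main

import "fmt"

// store is an in-memory stand-in for the cluster's API objects.
type store map[string]string

// createOrUpdate mimics controllerutil.CreateOrUpdate semantics:
// create the object if absent, otherwise apply the mutate function
// and persist only when it changes something.
func createOrUpdate(s store, name string, mutate func(current string) string) string {
	current, ok := s[name]
	if !ok {
		s[name] = mutate("")
		return "created"
	}
	desired := mutate(current)
	if desired == current {
		return "unchanged"
	}
	s[name] = desired
	return "updated"
}

func main() {
	s := store{}
	fmt.Println(createOrUpdate(s, "configmap", func(string) string { return "v1" })) // created
	fmt.Println(createOrUpdate(s, "configmap", func(string) string { return "v1" })) // unchanged
	fmt.Println(createOrUpdate(s, "configmap", func(string) string { return "v2" })) // updated
}
```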
In the scope of this issue:
Readme.md, because we are updating it dynamically.
If someone doesn't run go mod tidy locally, then the pre-commit vet step hangs.
The discussion comes from this issue.
Preconditions:
crds directory (link)
Considering these factors, we need to choose a method for deploying CRDs. The options are:
template directory in the Helm chart (as an example, see: link)
Please share information about our Google group:
Consider using a simple go install instead of the go-install-tool function in the Makefile (lines 191 to 203 in 94c538d).
We once had a similar discussion for piraeus-operator, so we can borrow its logic:
Standard life cycle implies that any members may be restarted.
How to reproduce:
Restart any pod
Log:
{"level":"fatal","ts":"2024-03-31T06:04:27.129434Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"member 21cdb8e5d2d72088 has already been bootstrapped","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\tgo.etcd.io/etcd/server/v3/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\tgo.etcd.io/etcd/server/v3/etcdmain/main.go:40\nmain.main\n\tgo.etcd.io/etcd/server/v3/main.go:31\nruntime.main\n\truntime/proc.go:250"}
We need to write a section in the docs explaining how to run the project in a local kind cluster.
Currently blocked by #11
Steps to run the project:
kind create cluster
make docker-build to build the docker image with tag controller:latest
kind load docker-image controller:latest to load the image into the cluster
make install (this installs the CRDs)
make deploy (this deploys the controller, roles, certificates)
After making changes in the code, to redeploy them run:
make docker-build (this will regenerate manifests and build the image)
make deploy to change YAML manifests if necessary
kubectl rollout restart -n etcd-operator-system deploy/etcd-operator-controller-manager
Internally we agreed to extend spec like this:
---
apiVersion: etcd.aenix.io/v1alpha1
kind: EtcdCluster
metadata:
  name: test
  namespace: ns1
spec:
  image: "quay.io/coreos/etcd:v3.5.12"
  replicas: 3
+ storage:
+   persistence: true # default: true, immutable
+   storageClass: local-path
+   size: 10Gi
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: "2024-03-06T18:39:45Z"
      status: "True"
      type: Ready
Currently locked by #11
I'll just copy&paste our chat with @sircthulhu
kvaps:
I've got an idea: what if we move all four of these flags into a ConfigMap included via envFrom?
--initial-advertise-peer-urls
--initial-cluster
--initial-cluster-state
--initial-cluster-token
It seems they are only relevant during initialization and might change over the lifetime of the cluster itself.
Kir:
We can remove them after the cluster is created :)
Except for the state
.
kvaps:
Hold on, but what about adding new replicas and re-bootstrapping the old ones? How else will they know where to join?
With this approach, they will always have up-to-date information at startup.
Kir:
Overall, you're right :)
kvaps:
From the operator's side, we would only need to implement the deletion of old replicas.
Scaling up should handle itself.
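A sketch of what the shared ConfigMap from this discussion might look like (all names, hostnames, and ports are illustrative, not the operator's actual output):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: test-cluster-state # illustrative name
data:
  ETCD_INITIAL_CLUSTER: "test-0=http://test-0.test.ns1.svc:2380,test-1=http://test-1.test.ns1.svc:2380,test-2=http://test-2.test.ns1.svc:2380"
  ETCD_INITIAL_CLUSTER_STATE: "new" # flipped to "existing" once the cluster has formed
  ETCD_INITIAL_CLUSTER_TOKEN: "test"
```

Each member would pick this up via envFrom with a configMapRef in the pod template. Note that --initial-advertise-peer-urls is per-member, so in practice that value would likely be derived from the pod name at startup rather than shared through the ConfigMap.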
We need to add an image field to the spec and implement its propagation to the StatefulSet.
For now we have two options for field placement:
spec.image
spec.podSpec.image
We will decide by community vote in https://t.me/etcd_operator/4054 and comment the decision here.
We have to create a documentation website for our etcd operator
We discussed internally the need to implement end-to-end (E2E) tests and establish a pipeline for them on pull requests (PRs). To initiate the pipeline, we could use a specific keyword in the commit message or manually trigger it within GitHub.
Refactor the Reconcile method to improve code readability and make it easier to add new conditions (https://github.com/aenix-io/etcd-operator/blob/main/internal/controller/etcdcluster_controller.go#L102-L125). Update lastTransitionTime only if the status changed.
It should be:
Copyright 2024 The etcd-operator Authors.
in the boilerplate and all corresponding files:
etcd-operator/hack/boilerplate.go.txt
Line 2 in 94c538d
I have to configure routing of etcd.aenix.io to the documentation for this repo
Currently locked by #11
We agreed on the following logic:
On bootstrap: ETCD_INITIAL_CLUSTER_STATE=new, envFrom: <configmap>, and podManagementPolicy: Parallel
After the cluster is formed: ETCD_INITIAL_CLUSTER_STATE=existing
currently locked by #9