quentin-m / etcd-cloud-operator
Deploying and managing production-grade etcd clusters on cloud providers: failure recovery, disaster recovery, backups and resizing.
License: Apache License 2.0
Thanks for this project. I want to use etcd with Kubernetes. Do you see using the embedded server as a problem, given etcd's release cadence?
Great work!
I have had some issues bootstrapping my etcd cluster, which seems to be due to the etcd-member service being down.
$ systemctl status eco
msg="STATUS: Healthy + Not running -> Join"
msg="failed to join the cluster" error="failed to add ourselves as a member ..."
This resolves instantly after running systemctl start etcd-member. Is eco supposed to start etcd itself, or should etcd-member be enabled and run automatically on startup?
I've got 3 etcd nodes running behind a consul load balancer, so I set:
etcd:
  advertise-address: eco.service.consul
I can also confirm all the hosts are registered:
[root@i-033aa5b5aafb1f2b5 lbriggs]# host eco.service.consul
eco.service.consul has address <addr>
eco.service.consul has address <addr>
eco.service.consul has address <addr>
However, for some reason, the nodes never create a cluster. I have 3 nodes in 3 availability zones, and I would expect them to form a cluster but I end up with 3 distinct single node clusters.
Is there any magic I'm missing?
Error log:
level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: rpc error: code = Unknown desc = raft: stopped"
Right now the entire ignition config is
data "ignition_config" "main" {
  files = [
    "${data.ignition_file.eco-config.id}",
    "${data.ignition_file.eco-ca.id}",
    "${data.ignition_file.eco-crt.id}",
    "${data.ignition_file.eco-key.id}",
    "${data.ignition_file.eco-health.id}",
    "${data.ignition_file.e.id}"
  ]
  systemd = [
    "${data.ignition_systemd_unit.docker.id}",
    "${data.ignition_systemd_unit.locksmithd.id}",
    "${data.ignition_systemd_unit.update-engine.id}",
    "${data.ignition_systemd_unit.eco.id}",
    "${data.ignition_systemd_unit.eco-health.id}",
    "${data.ignition_systemd_unit.node-exporter.id}",
  ]
  users = ["${data.ignition_user.core.id}"]
}
terraform ignition supports the append block, which can be used to add arbitrary ignition configuration. This would allow us to pass in additional users, groups, or any other configuration without forking the repo.
If this sounds good, I can file a PR for this.
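For illustration, a minimal sketch of how the append block could be wired in (the variable name extra_ignition_url is hypothetical, not something the module exposes today):

```hcl
data "ignition_config" "main" {
  files   = ["${data.ignition_file.eco-config.id}"]
  systemd = ["${data.ignition_systemd_unit.eco.id}"]

  # Merge an arbitrary user-supplied Ignition config at provisioning time;
  # source must be a URL reachable from the instance at boot.
  append {
    source = "${var.extra_ignition_url}"
  }
}
```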
Add a Kubernetes provider that can be deployed using StatefulSets, and provide an example Helm chart for installation, plus docs.
My cluster is failing to seed.
Host: CentOS 7, Docker version 18.04.0-ce, build 3d479c0
eco.yaml:
eco:
  # The interval between each cluster verification by the operator.
  check-interval: 15s
  # The time after which an unhealthy member will be removed from the cluster.
  unhealthy-member-ttl: 30s
  # Defines whether the operator will attempt to seed a new cluster from a
  # snapshot after the managed cluster has lost quorum.
  auto-disaster-recovery: true
  # Configuration of the etcd instance.
  etcd:
    # The address that clients should use to connect to the etcd cluster (i.e.
    # load balancer public address).
    advertise-address: http://<aws LB dns>:2379
    # The directory where the etcd data is stored.
    data-dir: /var/lib/etcd
    # The TLS configuration for client communication.
    client-transport-security:
      auto-tls: false
      client-cert-auth: false
    # The TLS configuration for peer communication.
    peer-transport-security:
      auto-tls: false
      peer-client-cert-auth: false
    # Defines the maximum amount of data that etcd can store, in bytes,
    # before going into maintenance mode.
    backend-quota: 2147483648
  # Configuration of the auto-scaling group provider.
  asg:
    provider: aws
  # Configuration of the snapshot provider.
  snapshot:
    provider: s3
    # The interval between each snapshot.
    interval: 30m
    # The time after which a backup has to be deleted.
    ttl: 24h
    # The bucket where snapshots are stored when using the S3 provider.
    bucket: myorg.eco-test
I start all three processes, then I get the log output:
time="2018-04-18T22:04:10Z" level=info msg="STATUS: Unhealthy + Not running + All ready + Seeder status -> Seeding cluster"
time="2018-04-18T22:04:10Z" level=error msg="failed to seed the cluster" error="failed to start etcd: --advertise-client-urls is required when --listen-client-urls is set explicitly"
Any idea why? Do I need to specify the advertise-client-url in some fashion in the cfg?
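For reference, the error is etcd's own validation: when --listen-client-urls is set explicitly, etcd refuses to start unless --advertise-client-urls is also set. My understanding is that ECO derives --advertise-client-urls from the advertise-address value, so it is worth double-checking that the value parses as a full URL. The pairing etcd expects looks like this (URLs illustrative):

```shell
etcd --listen-client-urls http://0.0.0.0:2379 \
     --advertise-client-urls http://my-lb.example.com:2379
```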
Creating this issue to start a discussion.
--cacert on the client side (possible today, but not sure whether it can be supported).
In my testing, I've observed that the cold start time for a new 3 node cluster can range from 1 to 2 minutes depending on the timing of instance startup. This seems to result from the unconditional and fixed etcd health checking interval (worst case 30s given retries and timeouts) and the absence of any assumptions which would allow the program to distinguish between an unreachable cluster and one that never existed (i.e. not worth checking and assumed to be unhealthy). 2 minutes is the average I observe over repeated testing in a more real world environment (statefulset on Kubernetes with persistent volumes, etc.). The usual case is that each instance must incur multiple futile etcd health checks before the state machines converge to the startup conditions.
I've been thinking about how to optimize initial startup time but haven't thought of a good solution yet that doesn't introduce some kind of shared state or that doesn't come with some other poor tradeoff (e.g. reducing the etcd health retries/timeouts and increasing the possibility of false negative health check results).
I suspect that the delay is basically a tradeoff that comes with the stateless architecture, but I wonder if this is something you've thought about before. Very curious to hear what you think!
Hi,
Is there a reason for the implicit import at https://github.com/Quentin-M/etcd-cloud-operator/blob/master/pkg/etcd/server.go#L34?
Thanks.
Currently the S3 snapshot provider requires the presence of the EC2 metadata service, limiting its use to only instances deployed within AWS. Many cloud providers provide S3 compatible APIs, and it may be possible to allow the use of the S3 snapshot provider to send snapshot to GCS for example (https://cloud.google.com/storage/docs/interoperability).
Another use case is people running in hybrid cloud environments, where the etcd cluster runs on OpenStack or bare metal but an S3-compatible API is available for backups.
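As a sketch of what this could look like (the endpoint and force-path-style keys are hypothetical, not current ECO options):

```yaml
snapshot:
  provider: s3
  bucket: myorg.eco-test
  # Hypothetical: point the S3 client at any S3-compatible API,
  # e.g. GCS in interoperability mode or an on-prem object store,
  # instead of resolving the endpoint via the EC2 metadata service.
  endpoint: https://storage.googleapis.com
  force-path-style: true
```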
Is there support for backup and restore in this operator? I can see that backup is supported, but I don't see restore.
Open to any suggestions really; my documentation-fu is rusty.
It would be very nice if you could also build and publish images for different architectures, e.g. arm64 (aarch64).
I'm building a pilot ECO cluster. However, I'm getting "failed to create etcd cluster client - error=dial tcp :2379: getsockopt: connection refused".
Host: CentOS 7, Docker version 18.04.0-ce, build 3d479c0
eco.yaml:
eco:
  # The interval between each cluster verification by the operator.
  check-interval: 15s
  # The time after which an unhealthy member will be removed from the cluster.
  unhealthy-member-ttl: 30s
  # Defines whether the operator will attempt to seed a new cluster from a
  # snapshot after the managed cluster has lost quorum.
  auto-disaster-recovery: true
  # Configuration of the etcd instance.
  etcd:
    # The address that clients should use to connect to the etcd cluster (i.e.
    # load balancer public address).
    advertise-address: http://<aws nlb>:2379
    # The directory where the etcd data is stored.
    data-dir: /var/lib/etcd
    # The TLS configuration for client communication.
    client-transport-security:
      auto-tls: false
      cert-file: /crt/myorg.crt
      key-file: /crt/myorg.pem
      trusted-ca-file: /crt/ca.crt
      client-cert-auth: false
    # The TLS configuration for peer communication.
    peer-transport-security:
      auto-tls: true
      cert-file: /crt/myorg.crt
      key-file: /crt/myorg.key
      trusted-ca-file: /crt/ca.crt
      peer-client-cert-auth: false
    # Defines the maximum amount of data that etcd can store, in bytes,
    # before going into maintenance mode.
    backend-quota: 2147483648
  # Configuration of the auto-scaling group provider.
  asg:
    provider: aws
  # Configuration of the snapshot provider.
  snapshot:
    provider: s3
    # The interval between each snapshot.
    interval: 30m
    # The time after which a backup has to be deleted.
    ttl: 24h
    # The bucket where snapshots are stored when using the S3 provider.
    bucket: myorg.eco-test
Running with: docker run -e "AWS_ACCESS_KEY_ID=<access key>" -e "AWS_SECRET_ACCESS_KEY=<secret key>" etcd-cloud-operator
Getting this output:
time="2018-04-18T21:02:13Z" level=info msg="loaded configuration file /etc/eco/eco.yaml"
time="2018-04-18T21:02:19Z" level=warning msg="failed to create etcd cluster client" error="dial tcp <private IP of host>:2379: getsockopt: connection refused"
time="2018-04-18T21:02:19Z" level=warning msg="failed to query <host instance>'s ECO instance" error="Get http://<private IP of host>:2378/status: dial tcp <private IP of host>:2378: getsockopt: connection refused"
time="2018-04-18T21:02:19Z" level=warning msg="failed to query <other asg instance>'s ECO instance" error="Get http://<other asg instance private ip>:2378/status: dial tcp <other asg instance private ip>:2378: getsockopt: connection refused"
time="2018-04-18T21:02:19Z" level=warning msg="failed to query <other asg instance>'s ECO instance" error="Get http://<other asg instance private ip>:2378/status: dial tcp <other asg instance private ip>:2378: getsockopt: connection refused"
panic: runtime error: index out of range
goroutine 1 [running]:
github.com/quentin-m/etcd-cloud-operator/pkg/operator.fetchStatuses(0xc420151470, 0x0, 0xc4204d51c0, 0x3, 0x4, 0x1dd10a0, 0xc4204eed80, 0xc42025e140, 0x14)
/go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/misc.go:120 +0x494
github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).evaluate(0xc420068d00, 0x14668a8, 0xc420068d00)
/go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:134 +0x247
github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).Run(0xc420068d00)
/go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:103 +0x51
main.main()
/go/src/github.com/quentin-m/etcd-cloud-operator/cmd/operator/operator.go:60 +0x399
Any idea what gives? I tried passing log-level debug, but there is no more helpful info on why exactly the etcd cluster client failed to create. Also, I've verified that SELinux and iptables are disabled.
health check for peer 1db1c9009069daa could not connect: dial tcp 10.194.214.86:2380: i/o timeout
2019-11-04 19:15:41.141040 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2019-11-04 19:15:41.141076 W | etcdserver: not enough started members, rejecting member add {ID:9d2e3521595e019 RaftAttributes:{PeerURLs:[https://10.194.211.163:2380]} Attributes:{Name: ClientURLs:[]}}
A node terminated (the one with the timeout errors) and was replaced by the autoscaling group (the one that was rejected), but the replacement couldn't join. The only way I was able to fix it was by stopping the eco container on all the instances at the same time.
Is this something that should have been corrected automatically or was there another way to recover it?
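The rejection in the log follows directly from etcd's quorum arithmetic; a minimal sketch of the numbers above (this reproduces the check, it is not etcd's actual code):

```go
package main

import "fmt"

// quorum is the majority threshold of a cluster with n voting members.
func quorum(n int) int { return n/2 + 1 }

func main() {
	started := 2 // 3-member cluster with one instance terminated
	// Adding the replacement first grows membership to 4, whose quorum is 3;
	// with only 2 members started, etcd rejects the member-add, matching the
	// "started member (2) ... quorum number of the cluster (3)" log line.
	fmt.Println(quorum(4), quorum(4) > started)
}
```

In other words, until the dead member is removed (or a second member comes back), any member-add is refused to protect availability.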
Hi,
First of all, thanks for starting this project. Thumbs up for this great initiative!
I am trying to setup an etcd cluster in the region us-east-1 using terraform.
terraform plan -out=etcd-cluster.tfplan
is outputting the terraform plan file properly. But when I apply it, I get the following error:
Error: Error applying plan:
1 error(s) occurred:
provider.aws: Not a valid region:
I am unable to figure out what the issue is here. I have provided the right region and all the configuration. I don't want a route53 entry, I have set internal load balancer to true, and I don't want TLS for this cluster.
Can anyone please help me out with what needs to be done here?
Thanks
etcd-cloud-operator/terraform/modules/ignition/ignition.tf, lines 121 to 129 at 78eee7e
I was trying to think of ways to get the key pushed to the instance without writing it via user-data (AWS-only). The concern stems from people/tools having access to the DescribeInstanceAttribute API call which will gladly return this back.
The possible approach I was considering was:
For (2), another option is to add a secret-fetch systemd service which does this, but then figuring out how to get the AWS CLI running on the instance is another problem.
Thoughts?
Getting this issue: etcd-io/etcd#13119. Is there an upgrade happening? If not, what can I do to upgrade without breaking anything?
If you switch from Launch Configs to Launch Templates the operator fails with:
Time aws_private_ip SYSTEMD_UNIT MESSAGE
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.fetchStatuses(0xc00081c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0001f0280)
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/misc.go:120 +0x406
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:134 +0x313
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/cmd/operator/operator.go:61 +0x4ba
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:103 +0x51
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service main.main()
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).evaluate(0xc00038a180, 0x0, 0x0)
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).Run(0xc00038a180)
May 7, 2020 @ 20:26:25.246 xx.xx.xxx.116 eco.service panic: runtime error: index out of range [-1]
May 7, 2020 @ 20:26:25.246 xx.xx.xxx.116 eco.service goroutine 1 [running]:
May 7, 2020 @ 20:26:25.244 xx.xx.xxx.116 eco.service time="2020-05-07T18:26:25Z" level=warning msg="failed to create etcd cluster client" error="etcdclient: no available endpoints"
May 7, 2020 @ 20:26:25.112 xx.xx.xxx.116 eco.service 2020-05-07 18:26:25.111773 W | rafthttp: lost the TCP streaming connection with peer 508fa2d34d1b809d (stream Message writer)
May 7, 2020 @ 20:26:24.751 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.750870 W | rafthttp: lost the TCP streaming connection with peer 508fa2d34d1b809d (stream MsgApp v2 reader)
May 7, 2020 @ 20:26:24.751 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.751207 E | rafthttp: failed to read 508fa2d34d1b809d on stream MsgApp v2 (unexpected EOF)
May 7, 2020 @ 20:26:24.751 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.751696 W | rafthttp: lost the TCP streaming connection with peer 508fa2d34d1b809d (stream Message reader)
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).Run(0xc000001c80)
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:134 +0x313
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:103 +0x51
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/cmd/operator/operator.go:61 +0x4ba
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service main.main()
May 7, 2020 @ 20:26:24.727 xx.xx.xxx.62 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).evaluate(0xc000001c80, 0x0, 0x0)
May 7, 2020 @ 20:26:24.727 xx.xx.xxx.62 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.fetchStatuses(0xc0008c6900, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0002e4880)
May 7, 2020 @ 20:26:24.727 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/misc.go:120 +0x406
May 7, 2020 @ 20:26:24.726 xx.xx.xxx.62 eco.service panic: runtime error: index out of range [-1]
May 7, 2020 @ 20:26:24.726 xx.xx.xxx.62 eco.service goroutine 1 [running]:
May 7, 2020 @ 20:26:24.724 xx.xx.xxx.62 eco.service time="2020-05-07T18:26:24Z" level=warning msg="failed to create etcd cluster client" error="etcdclient: no available endpoints"
May 7, 2020 @ 20:26:24.511 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.509500 W | rafthttp: lost the TCP streaming connection with peer 859d080ca38f4cc1 (stream MsgApp v2 writer)
May 7, 2020 @ 20:26:21.861 xx.xx.xxx.116 eco.service 2020-05-07 18:26:21.861385 W | etcdserver: cannot get the version of member 859d080ca38f4cc1 (Get https://xx.xx.xxx.77:2380/version: dial tcp xx.xx.xxx.77:2380: connect: connection refused)
May 7, 2020 @ 20:26:21.861 xx.xx.xxx.116 eco.service 2020-05-07 18:26:21.861362 W | etcdserver: failed to reach the peerURL(https://xx.xx.xxx.77:2380) of member 859d080ca38f4cc1 (Get https://xx.xx.xxx.77:2380/version: dial tcp xx.xx.xxx.77:2380: connect: connection refused)
thank you very much for this NICE project!