quentin-m / etcd-cloud-operator
Deploying and managing production-grade etcd clusters on cloud providers: failure recovery, disaster recovery, backups and resizing.
License: Apache License 2.0
Thanks for this project. I want to use etcd with Kubernetes. Do you see using the embedded server as a problem, given etcd's release cadence?
Great work!
I have had some issues bootstrapping my etcd cluster, which seems to be due to the etcd-member service being down.
$ systemctl status eco
msg="STATUS: Healthy + Not running -> Join"
msg="failed to join the cluster" error="failed to add ourselves as a member ..."
This resolves instantly after running systemctl start etcd-member. Is eco supposed to start etcd itself, or should etcd-member be enabled and run automatically on startup?
I've got 3 etcd nodes running behind a consul load balancer, so I set:
etcd:
  advertise-address: eco.service.consul
I can also confirm all the hosts are registered:
[root@i-033aa5b5aafb1f2b5 lbriggs]# host eco.service.consul
eco.service.consul has address <addr>
eco.service.consul has address <addr>
eco.service.consul has address <addr>
However, for some reason, the nodes never create a cluster. I have 3 nodes in 3 availability zones, and I would expect them to form a cluster but I end up with 3 distinct single node clusters.
Is there any magic I'm missing?
Error log:
level=error msg="failed to join the cluster" error="failed to add ourselves as a member of the cluster: rpc error: code = Unknown desc = raft: stopped"
Right now the entire ignition config is
data "ignition_config" "main" {
  files = [
    "${data.ignition_file.eco-config.id}",
    "${data.ignition_file.eco-ca.id}",
    "${data.ignition_file.eco-crt.id}",
    "${data.ignition_file.eco-key.id}",
    "${data.ignition_file.eco-health.id}",
    "${data.ignition_file.e.id}"
  ]
  systemd = [
    "${data.ignition_systemd_unit.docker.id}",
    "${data.ignition_systemd_unit.locksmithd.id}",
    "${data.ignition_systemd_unit.update-engine.id}",
    "${data.ignition_systemd_unit.eco.id}",
    "${data.ignition_systemd_unit.eco-health.id}",
    "${data.ignition_systemd_unit.node-exporter.id}",
  ]
  users = ["${data.ignition_user.core.id}"]
}
terraform ignition supports the append block, which can be used to add arbitrary ignition configuration. This would allow us to pass in additional users, groups, or any other configuration without forking the repo.
If this sounds good, I can file a PR for this.
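For illustration, a minimal sketch of how the append block could be wired in (the variable name extra_ignition_url is hypothetical, not something the module exposes today):

```hcl
data "ignition_config" "main" {
  files   = ["${data.ignition_file.eco-config.id}"]
  systemd = ["${data.ignition_systemd_unit.eco.id}"]

  # Merge an arbitrary user-supplied Ignition config at provisioning time;
  # source must be a URL reachable from the instance at boot.
  append {
    source = "${var.extra_ignition_url}"
  }
}
```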
Add a Kubernetes provider that can be deployed using StatefulSets, and provide an example Helm chart for installation, plus docs.
My cluster is failing to seed.
Host: CentOS 7, Docker version 18.04.0-ce, build 3d479c0
eco.yaml:
eco:
  # The interval between each cluster verification by the operator.
  check-interval: 15s
  # The time after which an unhealthy member will be removed from the cluster.
  unhealthy-member-ttl: 30s
  # Defines whether the operator will attempt to seed a new cluster from a
  # snapshot after the managed cluster has lost quorum.
  auto-disaster-recovery: true
  # Configuration of the etcd instance.
  etcd:
    # The address that clients should use to connect to the etcd cluster (i.e.
    # load balancer public address).
    advertise-address: http://<aws LB dns>:2379
    # The directory where the etcd data is stored.
    data-dir: /var/lib/etcd
    # The TLS configuration for client communication.
    client-transport-security:
      auto-tls: false
      client-cert-auth: false
    # The TLS configuration for peer communication.
    peer-transport-security:
      auto-tls: false
      peer-client-cert-auth: false
    # Defines the maximum amount of data that etcd can store, in bytes,
    # before going into maintenance mode.
    backend-quota: 2147483648
  # Configuration of the auto-scaling group provider.
  asg:
    provider: aws
  # Configuration of the snapshot provider.
  snapshot:
    provider: s3
    # The interval between each snapshot.
    interval: 30m
    # The time after which a backup has to be deleted.
    ttl: 24h
    # The bucket where snapshots are stored when using the S3 provider.
    bucket: myorg.eco-test
I start all three processes, then I get the log output:
time="2018-04-18T22:04:10Z" level=info msg="STATUS: Unhealthy + Not running + All ready + Seeder status -> Seeding cluster"
time="2018-04-18T22:04:10Z" level=error msg="failed to seed the cluster" error="failed to start etcd: --advertise-client-urls is required when --listen-client-urls is set explicitly"
Any idea why? Do I need to specify the advertise-client-url in some fashion in the cfg?
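For reference, the error is etcd's own validation: when --listen-client-urls is set explicitly, etcd refuses to start unless --advertise-client-urls is also set. My understanding is that ECO derives --advertise-client-urls from the advertise-address value, so it is worth double-checking that the value parses as a full URL. The pairing etcd expects looks like this (URLs illustrative):

```shell
etcd --listen-client-urls http://0.0.0.0:2379 \
     --advertise-client-urls http://my-lb.example.com:2379
```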
Creating this issue to start a discussion.
--cacert on the client side (possible today, but not sure whether it can be supported).
In my testing, I've observed that the cold start time for a new 3 node cluster can range from 1 to 2 minutes depending on the timing of instance startup. This seems to result from the unconditional and fixed etcd health checking interval (worst case 30s given retries and timeouts) and the absence of any assumptions which would allow the program to distinguish between an unreachable cluster and one that never existed (i.e. not worth checking and assumed to be unhealthy). 2 minutes is the average I observe over repeated testing in a more real world environment (statefulset on Kubernetes with persistent volumes, etc.). The usual case is that each instance must incur multiple futile etcd health checks before the state machines converge to the startup conditions.
I've been thinking about how to optimize initial startup time but haven't thought of a good solution yet that doesn't introduce some kind of shared state or that doesn't come with some other poor tradeoff (e.g. reducing the etcd health retries/timeouts and increasing the possibility of false negative health check results).
I suspect that the delay is basically a tradeoff that comes with the stateless architecture, but I wonder if this is something you've thought about before. Very curious to hear what you think!
Hi,
Is there a reason for the implicit import at https://github.com/Quentin-M/etcd-cloud-operator/blob/master/pkg/etcd/server.go#L34?
Thanks.
Currently the S3 snapshot provider requires the presence of the EC2 metadata service, limiting its use to only instances deployed within AWS. Many cloud providers provide S3 compatible APIs, and it may be possible to allow the use of the S3 snapshot provider to send snapshot to GCS for example (https://cloud.google.com/storage/docs/interoperability).
Another use case is people running in hybrid cloud environments, where the etcd cluster runs on OpenStack or bare metal but an S3-compatible API is available for backups.
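As a sketch of what this could look like (the endpoint and force-path-style keys are hypothetical, not current ECO options):

```yaml
snapshot:
  provider: s3
  bucket: myorg.eco-test
  # Hypothetical: point the S3 client at any S3-compatible API,
  # e.g. GCS in interoperability mode or an on-prem object store,
  # instead of resolving the endpoint via the EC2 metadata service.
  endpoint: https://storage.googleapis.com
  force-path-style: true
```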
Is there support for backup and restore in this operator? I can see that backup is supported, but I don't see restore.
Open to any suggestions really; my documentation-fu is rusty.
It would be very nice if you could also build and publish images for different architectures, e.g. arm64 (aarch64).
I'm building a pilot ECO cluster. However, I'm getting "failed to create etcd cluster client - error=dial tcp :2379: getsockopt: connection refused".
Host: CentOS 7, Docker version 18.04.0-ce, build 3d479c0
eco.yaml:
eco:
  # The interval between each cluster verification by the operator.
  check-interval: 15s
  # The time after which an unhealthy member will be removed from the cluster.
  unhealthy-member-ttl: 30s
  # Defines whether the operator will attempt to seed a new cluster from a
  # snapshot after the managed cluster has lost quorum.
  auto-disaster-recovery: true
  # Configuration of the etcd instance.
  etcd:
    # The address that clients should use to connect to the etcd cluster (i.e.
    # load balancer public address).
    advertise-address: http://<aws nlb>:2379
    # The directory where the etcd data is stored.
    data-dir: /var/lib/etcd
    # The TLS configuration for client communication.
    client-transport-security:
      auto-tls: false
      cert-file: /crt/myorg.crt
      key-file: /crt/myorg.pem
      trusted-ca-file: /crt/ca.crt
      client-cert-auth: false
    # The TLS configuration for peer communication.
    peer-transport-security:
      auto-tls: true
      cert-file: /crt/myorg.crt
      key-file: /crt/myorg.key
      trusted-ca-file: /crt/ca.crt
      peer-client-cert-auth: false
    # Defines the maximum amount of data that etcd can store, in bytes,
    # before going into maintenance mode.
    backend-quota: 2147483648
  # Configuration of the auto-scaling group provider.
  asg:
    provider: aws
  # Configuration of the snapshot provider.
  snapshot:
    provider: s3
    # The interval between each snapshot.
    interval: 30m
    # The time after which a backup has to be deleted.
    ttl: 24h
    # The bucket where snapshots are stored when using the S3 provider.
    bucket: myorg.eco-test
Running with: docker run -e "AWS_ACCESS_KEY_ID=<access key>" -e "AWS_SECRET_ACCESS_KEY=<secret key>" etcd-cloud-operator
Getting this output:
time="2018-04-18T21:02:13Z" level=info msg="loaded configuration file /etc/eco/eco.yaml"
time="2018-04-18T21:02:19Z" level=warning msg="failed to create etcd cluster client" error="dial tcp <private IP of host>:2379: getsockopt: connection refused"
time="2018-04-18T21:02:19Z" level=warning msg="failed to query <host instance>'s ECO instance" error="Get http://<private IP of host>:2378/status: dial tcp <private IP of host>:2378: getsockopt: connection refused"
time="2018-04-18T21:02:19Z" level=warning msg="failed to query <other asg instance>'s ECO instance" error="Get http://<other asg instance private ip>:2378/status: dial tcp <other asg instance private ip>:2378: getsockopt: connection refused"
time="2018-04-18T21:02:19Z" level=warning msg="failed to query <other asg instance>'s ECO instance" error="Get http://<other asg instance private ip>:2378/status: dial tcp <other asg instance private ip>:2378: getsockopt: connection refused"
panic: runtime error: index out of range
goroutine 1 [running]:
github.com/quentin-m/etcd-cloud-operator/pkg/operator.fetchStatuses(0xc420151470, 0x0, 0xc4204d51c0, 0x3, 0x4, 0x1dd10a0, 0xc4204eed80, 0xc42025e140, 0x14)
/go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/misc.go:120 +0x494
github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).evaluate(0xc420068d00, 0x14668a8, 0xc420068d00)
/go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:134 +0x247
github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).Run(0xc420068d00)
/go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:103 +0x51
main.main()
/go/src/github.com/quentin-m/etcd-cloud-operator/cmd/operator/operator.go:60 +0x399
Any idea what gives? I tried passing log-level debug, but there is no more helpful info on why exactly the etcd cluster client failed to create. Also, I've verified that SELinux and iptables are disabled.
health check for peer 1db1c9009069daa could not connect: dial tcp 10.194.214.86:2380: i/o timeout
2019-11-04 19:15:41.141040 W | etcdserver/membership: Reject add member request: the number of started member (2) will be less than the quorum number of the cluster (3)
2019-11-04 19:15:41.141076 W | etcdserver: not enough started members, rejecting member add {ID:9d2e3521595e019 RaftAttributes:{PeerURLs:[https://10.194.211.163:2380]} Attributes:{Name: ClientURLs:[]}}
A node terminated (the one with the timeout errors) and was replaced by the autoscaling group (the one that was rejected), but the replacement couldn't join. The only way I was able to fix it was by stopping the eco container on all the instances at the same time.
Is this something that should have been corrected automatically or was there another way to recover it?
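The rejection in the log follows directly from etcd's quorum arithmetic; a minimal sketch of the numbers above (this reproduces the check, it is not etcd's actual code):

```go
package main

import "fmt"

// quorum is the majority threshold of a cluster with n voting members.
func quorum(n int) int { return n/2 + 1 }

func main() {
	started := 2 // 3-member cluster with one instance terminated
	// Adding the replacement first grows membership to 4, whose quorum is 3;
	// with only 2 members started, etcd rejects the member-add, matching the
	// "started member (2) ... quorum number of the cluster (3)" log line.
	fmt.Println(quorum(4), quorum(4) > started)
}
```

In other words, until the dead member is removed (or a second member comes back), any member-add is refused to protect availability.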
Hi,
First of all, thanks for starting this project. Thumbs up for this great initiative!
I am trying to setup an etcd cluster in the region us-east-1 using terraform.
terraform plan -out=etcd-cluster.tfplan
is outputting the terraform plan file properly. But when I apply it, I get the following error:
Error: Error applying plan:
1 error(s) occurred:
provider.aws: Not a valid region:
I am unable to figure out what the issue is here. I have provided the right region and all the configuration. I don't want a route53 entry, I have set internal load balancer to true, and I don't want TLS for this cluster.
Can anyone please help me out with what needs to be done here?
Thanks
etcd-cloud-operator/terraform/modules/ignition/ignition.tf, lines 121 to 129 at 78eee7e
I was trying to think of ways to get the key pushed to the instance without writing it via user-data (AWS-only). The concern stems from people/tools having access to the DescribeInstanceAttribute API call which will gladly return this back.
The possible approach I was considering was:
For (2), another option is to add a secret-fetch systemd service which does this, but then figuring out how to get the AWS CLI running on the instance is another problem.
Thoughts?
Getting this issue: etcd-io/etcd#13119. Is there an upgrade happening? If not, what can I do to upgrade without breaking anything?
If you switch from Launch Configs to Launch Templates the operator fails with:
Time aws_private_ip SYSTEMD_UNIT MESSAGE
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.fetchStatuses(0xc00081c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0001f0280)
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/misc.go:120 +0x406
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:134 +0x313
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/cmd/operator/operator.go:61 +0x4ba
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:103 +0x51
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service main.main()
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).evaluate(0xc00038a180, 0x0, 0x0)
May 7, 2020 @ 20:26:25.247 xx.xx.xxx.116 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).Run(0xc00038a180)
May 7, 2020 @ 20:26:25.246 xx.xx.xxx.116 eco.service panic: runtime error: index out of range [-1]
May 7, 2020 @ 20:26:25.246 xx.xx.xxx.116 eco.service goroutine 1 [running]:
May 7, 2020 @ 20:26:25.244 xx.xx.xxx.116 eco.service time="2020-05-07T18:26:25Z" level=warning msg="failed to create etcd cluster client" error="etcdclient: no available endpoints"
May 7, 2020 @ 20:26:25.112 xx.xx.xxx.116 eco.service 2020-05-07 18:26:25.111773 W | rafthttp: lost the TCP streaming connection with peer 508fa2d34d1b809d (stream Message writer)
May 7, 2020 @ 20:26:24.751 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.750870 W | rafthttp: lost the TCP streaming connection with peer 508fa2d34d1b809d (stream MsgApp v2 reader)
May 7, 2020 @ 20:26:24.751 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.751207 E | rafthttp: failed to read 508fa2d34d1b809d on stream MsgApp v2 (unexpected EOF)
May 7, 2020 @ 20:26:24.751 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.751696 W | rafthttp: lost the TCP streaming connection with peer 508fa2d34d1b809d (stream Message reader)
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).Run(0xc000001c80)
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:134 +0x313
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/operator.go:103 +0x51
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/cmd/operator/operator.go:61 +0x4ba
May 7, 2020 @ 20:26:24.728 xx.xx.xxx.62 eco.service main.main()
May 7, 2020 @ 20:26:24.727 xx.xx.xxx.62 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.(*Operator).evaluate(0xc000001c80, 0x0, 0x0)
May 7, 2020 @ 20:26:24.727 xx.xx.xxx.62 eco.service github.com/quentin-m/etcd-cloud-operator/pkg/operator.fetchStatuses(0xc0008c6900, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0xc0002e4880)
May 7, 2020 @ 20:26:24.727 xx.xx.xxx.62 eco.service /go/src/github.com/quentin-m/etcd-cloud-operator/pkg/operator/misc.go:120 +0x406
May 7, 2020 @ 20:26:24.726 xx.xx.xxx.62 eco.service panic: runtime error: index out of range [-1]
May 7, 2020 @ 20:26:24.726 xx.xx.xxx.62 eco.service goroutine 1 [running]:
May 7, 2020 @ 20:26:24.724 xx.xx.xxx.62 eco.service time="2020-05-07T18:26:24Z" level=warning msg="failed to create etcd cluster client" error="etcdclient: no available endpoints"
May 7, 2020 @ 20:26:24.511 xx.xx.xxx.116 eco.service 2020-05-07 18:26:24.509500 W | rafthttp: lost the TCP streaming connection with peer 859d080ca38f4cc1 (stream MsgApp v2 writer)
May 7, 2020 @ 20:26:21.861 xx.xx.xxx.116 eco.service 2020-05-07 18:26:21.861385 W | etcdserver: cannot get the version of member 859d080ca38f4cc1 (Get https://xx.xx.xxx.77:2380/version: dial tcp xx.xx.xxx.77:2380: connect: connection refused)
May 7, 2020 @ 20:26:21.861 xx.xx.xxx.116 eco.service 2020-05-07 18:26:21.861362 W | etcdserver: failed to reach the peerURL(https://xx.xx.xxx.77:2380) of member 859d080ca38f4cc1 (Get https://xx.xx.xxx.77:2380/version: dial tcp xx.xx.xxx.77:2380: connect: connection refused)
thank you very much for this NICE project!