
etcd-mesos's People

Contributors

dallasmarlow, jdef, ozdanborne, pires, spacejam, sttts

etcd-mesos's Issues

add support for mesos persistent volumes for optional recovery

Optionally support Mesos persistent volumes for etcd storage. When they are in use:

  • the etcd-mesos scheduler checks ZK to see whether persistent volumes have been used previously
  • if so, only start an initial etcd node if it can reuse a previous persistent volume, performing the same --force-new-cluster setup used by the reseed codepath to clear stale cluster metadata
  • if not, create persistent volumes when starting nodes

other required changes:

  • on total cluster loss, don't give up: wait indefinitely for a volume to become available again
  • when a majority of the previous volumes are available and no nodes are alive, start a new cluster using all of them - raft should recover on its own
  • when only a minority are available, this gets trickier: compare raft state in a safe way without exposing ports to clients, pick the volume with the longest log to act as the reseed source, and destroy the other volumes
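
The decision table above could be sketched as follows. This is a hypothetical sketch, not the scheduler's actual state machine; all names (`planRecovery`, the `RecoveryAction` constants) are made up for illustration:

```go
package main

import "fmt"

// RecoveryAction is what the scheduler does on startup when persistent
// volumes are enabled. All names here are illustrative only.
type RecoveryAction int

const (
	WaitForVolumes       RecoveryAction = iota // total loss: block until a volume returns
	RestartFromVolumes                         // majority of volumes back: raft recovers
	ReseedFromBestVolume                       // minority available: pick the longest log
)

// planRecovery encodes the decision table: clusterSize is the desired
// cluster size, volumesAvailable the previously-used volumes we can reach.
func planRecovery(clusterSize, volumesAvailable int) RecoveryAction {
	switch {
	case volumesAvailable == 0:
		return WaitForVolumes
	case volumesAvailable > clusterSize/2:
		return RestartFromVolumes
	default:
		return ReseedFromBestVolume
	}
}

func main() {
	fmt.Println(planRecovery(3, 0), planRecovery(3, 2), planRecovery(3, 1)) // prints "0 1 2"
}
```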

simple http stats endpoint

desirable stats:

  1. raft index for each etcd peer
  2. is the last health check passing?
  3. counter of reseed events since scheduler started
  4. counter of etcd tasks launched
  5. current etcd term for cluster
  6. aggregated etcd stats (reads, writes, latency)
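
A minimal shape for such an endpoint might look like the sketch below. The struct fields and JSON keys are illustrative, not a settled API; counters would be bumped atomically from the scheduler's event loop and served from the existing admin port:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync/atomic"
)

// Stats mirrors the wishlist above; field names are illustrative only.
type Stats struct {
	RaftIndexByPeer map[string]uint64 `json:"raft_index_by_peer"`
	HealthCheckOK   bool              `json:"last_health_check_ok"`
	ReseedEvents    int64             `json:"reseed_events"`
	TasksLaunched   int64             `json:"tasks_launched"`
	ClusterTerm     uint64            `json:"cluster_term"`
}

type statsServer struct{ reseeds, launches int64 }

// snapshot assembles a consistent view of the counters.
func (s *statsServer) snapshot() Stats {
	return Stats{
		ReseedEvents:  atomic.LoadInt64(&s.reseeds),
		TasksLaunched: atomic.LoadInt64(&s.launches),
	}
}

func main() {
	s := &statsServer{}
	atomic.AddInt64(&s.launches, 3)
	atomic.AddInt64(&s.reseeds, 1)
	// In the scheduler this would back an admin-port handler, roughly:
	//   http.HandleFunc("/stats", func(w http.ResponseWriter, r *http.Request) {
	//       json.NewEncoder(w).Encode(s.snapshot())
	//   })
	b, _ := json.Marshal(s.snapshot())
	fmt.Println(string(b))
}
```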

nuke PumpTheBrakes

PumpTheBrakes was added as a development stopgap before proper state verification was performed prior to launching new tasks. It's probably safe to remove now, but we should run fault injection without it first to build confidence.

support manual cluster reseed

situation:

  1. a previous etcd-mesos cluster got into an unrecoverable state
  2. the operator performed a manual backup of an etcd server

Currently, the only way to seed a new cluster from such a backup is to manually traverse and duplicate keys in an etcd server. We should support an operator supplying a restore argument containing a compressed etcd storage directory, which will be used to initialize a new cluster.

configurable static port

Some environments simply can't rely on SRV records or mesos state.json. We should provide an option to only accept offers that include a specific port, with the caveat that this makes it harder to schedule on an arbitrary node.
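
The offer-side check is simple: the required port must fall in one of the offer's port ranges. A sketch with a stand-in range type (the real one lives in mesosproto) and a hypothetical flag name:

```go
package main

import "fmt"

// portRange mirrors the begin/end pairs carried in a mesos offer's "ports"
// resource; the real type lives in mesosproto, this is a stand-in.
type portRange struct{ begin, end uint64 }

// offersPort reports whether a fixed, operator-configured port (a
// hypothetical --static-port style flag) is contained in the offer, so the
// scheduler can decline offers that can't satisfy it.
func offersPort(ranges []portRange, port uint64) bool {
	for _, r := range ranges {
		if port >= r.begin && port <= r.end {
			return true
		}
	}
	return false
}

func main() {
	ranges := []portRange{{31000, 32000}}
	fmt.Println(offersPort(ranges, 31234), offersPort(ranges, 4001)) // prints "true false"
}
```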

automatic backups to s3/hdfs/nfs/gcs

Every backup-interval seconds, an etcd executor should check whether the etcd server running under it is the elected leader. If it is, it should back up, compress, and upload a copy of the storage directory to s3/hdfs/nfs/gcs.

Accept a flag to set framework name

The framework name is used to build Mesos-DNS A records. By default Mesos sets this to the FrameworkID, which dependent services cannot predict.

etcd deployment fails with DCOS if framework found in Zookeeper

I set up etcd on my cluster using the DCOS CLI a first time and it worked. I then uninstalled it. A couple of days later I decided to reinstall, but since then every installation has failed.
It seems the reason is that the framework is found in Zookeeper but fails to restore. Here is the failure trace from the stderr file in mesos (I only changed the IPs to x.x.x.x (agent) and y.y.y.y (mesos master)):

+ /work/bin/etcd-mesos-scheduler -alsologtostderr=true -framework-name=etcd -cluster-size=3 -master=zk://master.mesos:2181/mesos -zk-framework-persist=zk://master.mesos:2181/etcd -v=1 -auto-reseed=true -reseed-timeout=240 -sandbox-disk-limit=4096 -sandbox-cpu-limit=1 -sandbox-mem-limit=2048 -admin-port=3356 -driver-port=3357 -artifact-port=3358 -framework-weburi=http://etcd.marathon.mesos:3356/stats
I0222 04:14:30.573426       7 app.go:218] Found stored framework ID in Zookeeper, attempting to re-use: b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003
I0222 04:14:30.575267       7 scheduler.go:209] found failover_timeout = 168h0m0s
I0222 04:14:30.575363       7 scheduler.go:323] Initializing mesos scheduler driver
I0222 04:14:30.575473       7 scheduler.go:792] Starting the scheduler driver...
I0222 04:14:30.575552       7 http_transporter.go:407] listening on x.x.x.x port 3357
I0222 04:14:30.575588       7 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@10.32.0.4:3357
I0222 04:14:30.575625       7 scheduler.go:821] starting master detector *zoo.MasterDetector: &{client:<nil> leaderNode: bootstrapLock:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} bootstrapFunc:0x7991c0 ignoreInstalled:0 minDetectorCyclePeriod:1000000000 done:0xc2080548a0 cancel:0x7991b0}
I0222 04:14:30.575746       7 scheduler.go:999] Scheduler driver running.  Waiting to be stopped.
I0222 04:14:30.575776       7 scheduler.go:663] running instances: 0 desired: 3 offers: 0
I0222 04:14:30.575799       7 scheduler.go:671] PeriodicLaunchRequestor skipping due to Immutable scheduler state.
I0222 04:14:30.575811       7 scheduler.go:1033] Admin HTTP interface Listening on port 3356
I0222 04:14:30.607180       7 scheduler.go:374] New master master@y.y.y.y:5050 detected
I0222 04:14:30.607306       7 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
I0222 04:14:30.607466       7 scheduler.go:922] Reregistering with master: master@y.y.y.y:5050
I0222 04:14:30.607656       7 scheduler.go:881] will retry registration in 1.254807398s if necessary
I0222 04:14:30.610527       7 scheduler.go:769] Handling framework error event.
I0222 04:14:30.610636       7 scheduler.go:1081] Aborting framework [&FrameworkID{Value:*b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003,XXX_unrecognized:[],}]
I0222 04:14:30.610890       7 scheduler.go:1062] stopping messenger
I0222 04:14:30.610985       7 messenger.go:269] stopping messenger..
I0222 04:14:30.611076       7 http_transporter.go:476] stopping HTTP transport
I0222 04:14:30.611168       7 scheduler.go:1065] Stop() complete with status DRIVER_ABORTED error <nil>
I0222 04:14:30.611262       7 scheduler.go:1051] Sending error via withScheduler: Framework has been removed
I0222 04:14:30.611366       7 scheduler.go:298] stopping scheduler event queue..
I0222 04:14:30.611504       7 http_transporter.go:450] HTTP server stopped because of shutdown
I0222 04:14:30.611598       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611687       7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611779       7 scheduler.go:250] finished processing scheduler events

Any suggestions on how to fix the deployment?

investigate patching + upstreaming proxy to bail out or requery when reconnected to a different clusterid

The current etcd proxy, when using discovery-srv to find peers from SRV records, never requeries the SRV record. As a result, when it loses connectivity to the cluster's servers, it blocks forever trying to reconnect to the last known nodes. If a different etcd cluster comes up and happens to reuse the same ports on one of the hosts, the proxy will start using it as the source of truth. Note that reseeding also generates a new cluster ID. Can we have the proxy bail out if it detects that it has ever talked to nodes with two different cluster IDs?
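
The guard being proposed is small: remember the first cluster ID seen and refuse to continue on any mismatch, so a supervisor can restart the proxy and re-resolve the SRV record. A sketch (etcd's proxy has no such hook today; `clusterGuard` is hypothetical):

```go
package main

import (
	"errors"
	"fmt"
)

// clusterGuard remembers the first cluster ID the proxy talks to and errors
// out on any mismatch. Hypothetical sketch of the upstream patch.
type clusterGuard struct{ id string }

var errClusterChanged = errors.New("saw responses from two different cluster IDs; bailing out to requery SRV")

// check is called with the cluster ID carried in each backend response.
func (g *clusterGuard) check(clusterID string) error {
	if g.id == "" {
		g.id = clusterID // first contact pins the cluster identity
		return nil
	}
	if g.id != clusterID {
		return errClusterChanged // a reseed or port reuse swapped clusters
	}
	return nil
}

func main() {
	g := &clusterGuard{}
	fmt.Println(g.check("36e76486cf49fa77"))        // prints "<nil>"
	fmt.Println(g.check("deadbeefdeadbeef") != nil) // prints "true": reseed minted a new ID
}
```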

determine healthy disk size

during load testing, some etcd nodes crashed during log compaction with the following error:

2015/08/17 17:24:41 etcdserver: raft save state and entries error: write etcd_data/member/wal/000000000000001f-00000000007582d2.wal: no space left on device

As can be seen in etcd-io/etcd#3300, etcd can use double the required disk space to complete a compaction. We need to determine a sandbox disk size that is safe for most workloads.

release channels for 0.22 and 0.24+0.23

We should create branches for different release channels. My thinking is to have branches corresponding to mesos versions: each pulls in version-agnostic changes from master and applies its own version-specific logic (no persistent volume support in the 0.22 branch, etc.).

tune down backoffs

In some cases, progress cannot be made until an RPC to an unhealthy etcd instance times out. Investigate turning down some of the backoff timeouts.

bail out if reregistered version != registered version

Because the mesos-go bindings are version-specific, and we'd like to support mesos 0.22 through 0.24, if we reregister and the master version has increased we should bail out so that a wrapper script can restart us with the proper mesos-go version. There is a race here - the master may be upgraded after the detector wrapper runs but before registration completes - but the window is small.
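
The bail-out check reduces to a dotted-version comparison between the version recorded at registration and the one seen at reregistration. A sketch (the function name and the exact exit mechanism are assumptions):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// versionLess compares dotted mesos version strings, e.g. "0.22.1" < "0.24.0".
// Non-numeric components compare as zero; good enough for mesos releases.
func versionLess(a, b string) bool {
	as, bs := strings.Split(a, "."), strings.Split(b, ".")
	for i := 0; i < len(as) && i < len(bs); i++ {
		ai, _ := strconv.Atoi(as[i])
		bi, _ := strconv.Atoi(bs[i])
		if ai != bi {
			return ai < bi
		}
	}
	return len(as) < len(bs)
}

func main() {
	registered, reregistered := "0.22.1", "0.24.0"
	if versionLess(registered, reregistered) {
		// In the scheduler this would be an os.Exit so the wrapper script
		// can relaunch the binary built against the newer mesos-go.
		fmt.Println("master upgraded; exiting so the wrapper can relaunch")
	}
}
```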

automatic re-seeding sounds dangerous

Hey, I just stumbled across the project, and I'm concerned about the automatic re-seeding. If you can't access a majority of an etcd/raft cluster, you don't know if you've lost committed data. Rolling with it breaks the promise that etcd/raft makes to its clients, so automatically trying to recover seems like it could cause a lot of trouble. Making this the default behavior is even worse.

Do you have evidence that automating this is even necessary in practice? I like automated systems too, but I'd rather get human approval any time I'm admitting possible data loss.

cc @philips

mesos version detector script

Create a script that detects the mesos version and launches the appropriate etcd-mesos binary compiled for that version of mesos-go.

missing latest mesos-go in Godeps?

$ make
rm bin/etcd-*
rm -f /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos/Godeps/_workspace/src/""github.com/mesosphere"/etcd-mesos"
mkdir -p /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos/Godeps/_workspace/src/"github.com/mesosphere"
ln -s /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos/Godeps/_workspace/src/""github.com/mesosphere"/etcd-mesos"
go build -o bin/etcd-mesos-executor cmd/etcd-mesos-executor/app.go
go build -o bin/etcd-mesos-scheduler cmd/etcd-mesos-scheduler/app.go
# github.com/mesosphere/etcd-mesos/scheduler
Godeps/_workspace/src/github.com/mesosphere/etcd-mesos/scheduler/scheduler.go:964: unknown mesosproto.TaskInfo field 'Discovery' in struct literal
make: *** [bin/etcd-mesos-scheduler] Error 2

add docs

  • architecture
  • differences between etcd and etcd-mesos
  • configuration
  • deployment & administration guide

claim zk ephemeral znode before registering

This allows multiple instances of the scheduler to run in HA mode if desired, and prevents multiple nodes from alternately kicking each other off the mesos master when they register.

use mesos DiscoveryInfo to cut down on the number of returned ports in mesos-dns

Kubernetes-mesos is currently unable to use etcd-mesos, as the etcd proxy that will be local to the k8s components fails to parse the mesos-dns response if it is larger than 512 bytes (presumed 512; 768 failed but 112 worked). We need to cut down on the number of ports returned, which is currently 6 per etcd instance (udp and tcp for each of the client, peer, and reseed listeners).

add tests for reseed logic

tests should cover:

  • ensuring that the node with the highest raft index is chosen
  • ensuring that when the first node fails to come online, the next one is attempted
  • ensuring that no nodes are killed unless a node has been successfully established as the new seed
  • ensuring that when a new seed is chosen and the scheduler dies before killing the old nodes, it knows on restart to kill the unhealthy stale cluster (it should perform a new reseed, but pick the same node as before, since it has the higher raft index)
  • ensuring that the above logic works when the scheduler starts while nodes are in a livelocked state
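
The first two bullets boil down to candidate ordering, which is easy to cover with a table-driven check like the sketch below (`pickSeed` is a stand-in name; the real selection lives in the scheduler's reseed path):

```go
package main

import (
	"fmt"
	"sort"
)

// pickSeed returns peer names ordered by descending raft index: the order in
// which reseed should be attempted. Stand-in for the scheduler's selection.
func pickSeed(index map[string]uint64) []string {
	peers := make([]string, 0, len(index))
	for p := range index {
		peers = append(peers, p)
	}
	sort.Slice(peers, func(i, j int) bool { return index[peers[i]] > index[peers[j]] })
	return peers
}

func main() {
	// Table-driven cases of the kind the issue asks for.
	cases := []struct {
		index map[string]uint64
		want  string // expected first reseed candidate
	}{
		{map[string]uint64{"etcd-1": 100, "etcd-2": 250, "etcd-3": 180}, "etcd-2"},
		{map[string]uint64{"etcd-1": 7}, "etcd-1"},
	}
	for _, c := range cases {
		if got := pickSeed(c.index)[0]; got != c.want {
			panic(fmt.Sprintf("got %s, want %s", got, c.want))
		}
	}
	fmt.Println("ok")
}
```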

bad peer created on master restart

A bad peer is brought up when the master reconnects. Log from the failing member:

2015-09-28 18:39:19.540160 I | etcdmain: etcd Version: 2.2.0
2015-09-28 18:39:40.070822 I | etcdmain: Git SHA: e4561dd
2015-09-28 18:39:15.417100 I | etcdmain: Go Version: go1.5
2015-09-28 18:39:15.417108 I | etcdmain: Go OS/Arch: linux/amd64
2015-09-28 18:39:15.417120 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2015-09-28 18:39:15.417148 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2015-09-28 18:39:15.417490 I | etcdmain: listening for peers on http://localhost.localdomain:31003
2015-09-28 18:39:15.417604 I | etcdmain: listening for client requests on http://localhost.localdomain:31004
2015-09-28 18:39:15.437030 I | netutil: resolving localhost.localdomain:31000 to 127.0.0.1:31000
2015-09-28 18:39:15.437101 I | netutil: resolving localhost.localdomain:31003 to 127.0.0.1:31003
2015-09-28 18:39:15.437187 I | etcdmain: stopping listening for client requests on http://localhost.localdomain:31004
2015-09-28 18:39:15.437204 I | etcdmain: stopping listening for peers on http://localhost.localdomain:31003
2015-09-28 18:39:15.437220 C | etcdmain: error validating peerURLs {ClusterID:36e76486cf49fa77 Members:[&{ID:6429236700d6f390 RaftAttributes:{PeerURLs:[http://localhost.localdomain:32000]} Attributes:{Name:etcd-1443480720 ClientURLs:[http://localhost.localdomain:32001]}} &{ID:aaccb2ea791d9ae1 RaftAttributes:{PeerURLs:[http://localhost.localdomain:31000]} Attributes:{Name:etcd-1443480719 ClientURLs:[http://localhost.localdomain:31001]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs

This is possibly the result of clearing peers on disconnect/reconnect and then syncing with the master, which itself has an incomplete view. The quick fix is to hold off on reconciling long enough for slaves to check in. A better fix may be to persist known tasks in ZK.

fix logging and log rotation

executor logging is currently borked. we need to log:

  • fetcher info
  • executor output
  • etcd output

with rotation enabled for the etcd log, and possibly the executor log.

cluster re-seed support

When a cluster experiences livelock for reseed-timeout seconds, the scheduler should:

  1. determine the liveness of each etcd server
  2. determine the raft index of each live etcd server
  3. for each server, from the highest raft index to the lowest, try to reseed a cluster using that node. If it succeeds, try to create at least cluster-size / 2 NEW slave instances, and kill the other previous members if that succeeds.
  4. if none of the reseeds succeed, do NOTHING - an operator needs to perform a manual backup and restore, and we don't want to kill our tasks in the meantime, which would cause their mesos sandboxes to be rm'd.
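
Steps 3 and 4 can be sketched as the attempt loop below. The injected functions are stand-ins for the real work (the --force-new-cluster restart, task launches, task kills), and candidates are assumed to arrive already sorted by descending raft index:

```go
package main

import "fmt"

// reseed walks candidates (sorted by descending raft index) and attempts
// --force-new-cluster on each via tryReseed. On success it launches fresh
// members and only then kills the stale ones. If every attempt fails it does
// nothing, leaving sandboxes intact for a manual backup. All parameters here
// are illustrative stand-ins.
func reseed(candidates []string, tryReseed func(string) bool, launchNew func() int, killOld func(seed string)) bool {
	for _, node := range candidates {
		if !tryReseed(node) {
			continue // fall through to the next-highest raft index
		}
		// Only kill the previous members once enough fresh members joined.
		if launchNew() >= len(candidates)/2 {
			killOld(node)
		}
		return true
	}
	return false // operator intervention required; leave tasks untouched
}

func main() {
	killed := false
	ok := reseed([]string{"etcd-2", "etcd-1"},
		func(n string) bool { return n == "etcd-1" }, // etcd-2 fails to come online
		func() int { return 1 },                      // one fresh member launched
		func(seed string) { killed = true })
	fmt.Println(ok, killed) // prints "true true"
}
```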
