mesosphere-backup / etcd-mesos
self-healing etcd on mesos!
License: Apache License 2.0
Optionally provide the ability to use Mesos persistent volumes for etcd storage. When persistent volumes are in use:
Other required changes:
This becomes a bug with Mesos 0.26, but rather than changing state.json -> state, we should just use the JSON that is persisted in ZK.
In mesosphere-backup/multiverse#12 some feedback was given that needs to be addressed. Another ticket, #66, exists for the bug that came up during the initial install.
Support a framework principal and a path to a secret file in order to support basic Mesos framework authentication.
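A minimal sketch of how the scheduler could build the credential, assuming the mesosproto Credential type vendored with mesos-go (the secret-file handling and the Secret field type are assumptions); the result would then be handed to the driver config:

// Sketch only: build an optional framework credential from a principal and a
// path to a secret file. Field names assume the mesosproto bindings vendored
// with mesos-go; the Secret type is assumed to be a byte slice here.
package auth

import (
	"bytes"
	"io/ioutil"

	mesos "github.com/mesos/mesos-go/mesosproto"
)

// credentialFromFlags returns nil when no principal is configured, so the
// scheduler keeps registering without authentication by default.
func credentialFromFlags(principal, secretFile string) (*mesos.Credential, error) {
	if principal == "" {
		return nil, nil
	}
	p := principal
	cred := &mesos.Credential{Principal: &p}
	if secretFile != "" {
		secret, err := ioutil.ReadFile(secretFile)
		if err != nil {
			return nil, err
		}
		cred.Secret = bytes.TrimSpace(secret) // assumption: Secret is []byte here
	}
	return cred, nil
}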
desirable stats:
PumpTheBrakes was added as a development stopgap before proper state verification was performed ahead of launching new tasks. It's probably safe to remove now, but we need to run fault injection without it to increase our confidence.
situation:
Currently, there's no way to manually seed a new cluster except by traversing and duplicating keys in an etcd server. We need to support an operator supplying a restore argument that points to a compressed etcd storage directory, which will be used to initialize a new cluster.
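For example, a minimal sketch of the unpack step, assuming the restore argument points to a gzipped tarball of an etcd data directory (the flag name and archive layout are assumptions):

// Sketch: unpack a .tar.gz snapshot of an etcd storage directory into the
// data dir before the etcd server is started.
package restore

import (
	"archive/tar"
	"compress/gzip"
	"io"
	"os"
	"path/filepath"
)

// Unpack extracts the archive into dataDir, preserving file modes.
func Unpack(archive, dataDir string) error {
	f, err := os.Open(archive)
	if err != nil {
		return err
	}
	defer f.Close()

	gz, err := gzip.NewReader(f)
	if err != nil {
		return err
	}
	defer gz.Close()

	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		dest := filepath.Join(dataDir, filepath.Clean(hdr.Name))
		switch hdr.Typeflag {
		case tar.TypeDir:
			if err := os.MkdirAll(dest, 0755); err != nil {
				return err
			}
		case tar.TypeReg:
			if err := os.MkdirAll(filepath.Dir(dest), 0755); err != nil {
				return err
			}
			out, err := os.OpenFile(dest, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, os.FileMode(hdr.Mode))
			if err != nil {
				return err
			}
			if _, err := io.Copy(out, tr); err != nil {
				out.Close()
				return err
			}
			out.Close()
		}
	}
}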
etcd 2.2.1 was released about a week ago:
https://github.com/coreos/etcd/releases/tag/v2.2.1
The currently vendored version of mesos-go will, in some situations, spin in the message decoder loop, consuming a CPU core without doing any work. Current mesos-go master fixes this; we should upgrade to it.
Some environments simply can't rely on SRV records or the Mesos state.json. For those cases we should provide the option to only accept offers that include a specific port, with the caveat that it will not be as easy to schedule on an arbitrary node.
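A sketch of the offer-side check this would need, assuming the mesosproto range types that ship with mesos-go:

// Sketch: reject offers that don't include a specific required port.
package offers

import mesos "github.com/mesos/mesos-go/mesosproto"

// offerHasPort reports whether the offer's "ports" resource ranges contain port.
func offerHasPort(offer *mesos.Offer, port uint64) bool {
	for _, res := range offer.GetResources() {
		if res.GetName() != "ports" {
			continue
		}
		for _, r := range res.GetRanges().GetRange() {
			if port >= r.GetBegin() && port <= r.GetEnd() {
				return true
			}
		}
	}
	return false
}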
Every backup-interval seconds, an etcd executor should check whether the etcd server running under it is the elected leader. If it is, it should back up, compress, and upload a copy of the storage to s3/hdfs/nfs/gcs.
Currently, etcd-mesos cannot survive a total cluster loss. We should provide optional automatic backups, plus restoration that is either operator-triggered or fully automatic.
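A rough sketch of the executor-side loop, assuming the etcd v2 /v2/stats/self endpoint for leader detection and shelling out to etcdctl backup; the actual upload to s3/hdfs/nfs/gcs is elided:

// Sketch: periodically check leadership of the local etcd member and, when it
// leads, take and compress a backup of its data directory.
package backup

import (
	"encoding/json"
	"net/http"
	"os/exec"
	"time"
)

type selfStats struct {
	State string `json:"state"` // "StateLeader" when this member leads
}

func isLeader(clientURL string) bool {
	resp, err := http.Get(clientURL + "/v2/stats/self")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	var s selfStats
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		return false
	}
	return s.State == "StateLeader"
}

// Run performs a backup every interval while this member is the leader.
func Run(clientURL, dataDir, backupDir string, interval time.Duration) {
	for range time.Tick(interval) {
		if !isLeader(clientURL) {
			continue
		}
		// etcdctl backup copies a consistent snapshot of the data dir.
		if err := exec.Command("etcdctl", "backup",
			"--data-dir", dataDir, "--backup-dir", backupDir).Run(); err != nil {
			continue
		}
		// Compress for upload; the upload step itself is left out of this sketch.
		exec.Command("tar", "-czf", backupDir+".tar.gz", "-C", backupDir, ".").Run()
	}
}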
The framework name is used to build Mesos-DNS A records. By default Mesos will set this to the FrameworkID, which is impossible for dependent services to predict.
I set up etcd on my cluster using the DCOS CLI a first time and it worked. I then uninstalled it. A couple of days later I decided to reinstall, but since then every installation has failed.
It seems the reason for this is that the framework is found in ZooKeeper but fails at restoring. Here is the failure trace I got from the stderr file in Mesos (I just replaced the IPs with x.x.x.x for the agent and y.y.y.y for the Mesos master):
+ /work/bin/etcd-mesos-scheduler -alsologtostderr=true -framework-name=etcd -cluster-size=3 -master=zk://master.mesos:2181/mesos -zk-framework-persist=zk://master.mesos:2181/etcd -v=1 -auto-reseed=true -reseed-timeout=240 -sandbox-disk-limit=4096 -sandbox-cpu-limit=1 -sandbox-mem-limit=2048 -admin-port=3356 -driver-port=3357 -artifact-port=3358 -framework-weburi=http://etcd.marathon.mesos:3356/stats
I0222 04:14:30.573426 7 app.go:218] Found stored framework ID in Zookeeper, attempting to re-use: b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003
I0222 04:14:30.575267 7 scheduler.go:209] found failover_timeout = 168h0m0s
I0222 04:14:30.575363 7 scheduler.go:323] Initializing mesos scheduler driver
I0222 04:14:30.575473 7 scheduler.go:792] Starting the scheduler driver...
I0222 04:14:30.575552 7 http_transporter.go:407] listening on x.x.x.x port 3357
I0222 04:14:30.575588 7 scheduler.go:809] Mesos scheduler driver started with PID=scheduler(1)@10.32.0.4:3357
I0222 04:14:30.575625 7 scheduler.go:821] starting master detector *zoo.MasterDetector: &{client:<nil> leaderNode: bootstrapLock:{w:{state:0 sema:0} writerSem:0 readerSem:0 readerCount:0 readerWait:0} bootstrapFunc:0x7991c0 ignoreInstalled:0 minDetectorCyclePeriod:1000000000 done:0xc2080548a0 cancel:0x7991b0}
I0222 04:14:30.575746 7 scheduler.go:999] Scheduler driver running. Waiting to be stopped.
I0222 04:14:30.575776 7 scheduler.go:663] running instances: 0 desired: 3 offers: 0
I0222 04:14:30.575799 7 scheduler.go:671] PeriodicLaunchRequestor skipping due to Immutable scheduler state.
I0222 04:14:30.575811 7 scheduler.go:1033] Admin HTTP interface Listening on port 3356
I0222 04:14:30.607180 7 scheduler.go:374] New master master@y.y.y.y:5050 detected
I0222 04:14:30.607306 7 scheduler.go:435] No credentials were provided. Attempting to register scheduler without authentication.
I0222 04:14:30.607466 7 scheduler.go:922] Reregistering with master: master@y.y.y.y:5050
I0222 04:14:30.607656 7 scheduler.go:881] will retry registration in 1.254807398s if necessary
I0222 04:14:30.610527 7 scheduler.go:769] Handling framework error event.
I0222 04:14:30.610636 7 scheduler.go:1081] Aborting framework [&FrameworkID{Value:*b9ff885a-c67e-4ec5-89cc-3b9d8fc0ef54-0003,XXX_unrecognized:[],}]
I0222 04:14:30.610890 7 scheduler.go:1062] stopping messenger
I0222 04:14:30.610985 7 messenger.go:269] stopping messenger..
I0222 04:14:30.611076 7 http_transporter.go:476] stopping HTTP transport
I0222 04:14:30.611168 7 scheduler.go:1065] Stop() complete with status DRIVER_ABORTED error <nil>
I0222 04:14:30.611262 7 scheduler.go:1051] Sending error via withScheduler: Framework has been removed
I0222 04:14:30.611366 7 scheduler.go:298] stopping scheduler event queue..
I0222 04:14:30.611504 7 http_transporter.go:450] HTTP server stopped because of shutdown
I0222 04:14:30.611598 7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611687 7 scheduler.go:444] Scheduler received error: Framework has been removed
I0222 04:14:30.611779 7 scheduler.go:250] finished processing scheduler events
Any suggestions on how to fix the deployment?
The current etcd proxy, when using discovery-srv to find peers from SRV records, will never requery the SRV record. As a result, when it loses connectivity to the cluster's servers, it will block forever trying to reconnect to the last known nodes. If a different etcd cluster comes up and happens to reuse the same ports on one of the hosts, the proxy will start using it as the source of truth. Note that reseeding will also generate a new cluster ID. Can we have the proxy bail out if it detects that it has ever talked to nodes with two different cluster IDs?
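A hedged sketch of the bail-out check, assuming the X-Etcd-Cluster-Id response header that etcd v2 attaches to API responses:

// Sketch: exit if the backend's cluster ID ever changes, so a supervisor can
// restart the proxy and re-resolve the SRV records.
package proxyguard

import (
	"log"
	"net/http"
	"os"
)

var knownClusterID string

func checkClusterID(clientURL string) {
	resp, err := http.Get(clientURL + "/v2/members")
	if err != nil {
		return
	}
	resp.Body.Close()
	id := resp.Header.Get("X-Etcd-Cluster-Id")
	if id == "" {
		return
	}
	if knownClusterID == "" {
		knownClusterID = id
		return
	}
	if id != knownClusterID {
		log.Printf("cluster ID changed from %s to %s; exiting", knownClusterID, id)
		os.Exit(1)
	}
}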
During load testing, some etcd nodes crashed while compacting their logs, with the following error:
2015/08/17 17:24:41 etcdserver: raft save state and entries error: write etcd_data/member/wal/000000000000001f-00000000007582d2.wal: no space left on device
As can be seen at etcd-io/etcd#3300, etcd can use double the required disk space to complete a compaction. We need to determine a sandbox disk size that is safe for most workloads.
We should create branches for different release channels. My thinking is that we could have branches corresponding to Mesos versions; they would pull in version-agnostic changes from master while applying version-specific logic in each branch (no persistent-volume support in the 0.22 branch, etc.).
The lock at https://github.com/mesosphere/etcd-mesos/blob/master/scheduler/scheduler.go#L342 should only be taken if a previous node (cluster?) was able to instantiate first; otherwise the scheduler needlessly becomes unavailable.
In some cases, progress cannot be made until an RPC to an unhealthy etcd instance times out. Investigate turning down some of the backoff timeouts.
Because mesos-go bindings are version-specific, and we'd like to support Mesos 0.22 through 0.24, if we reregister and the master version has increased then we should bail out so that a wrapper script can restart us with the proper mesos-go version. This is a race condition, since the master may have been upgraded after the detector wrapper ran but before the registered call completed, but the window is small.
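A sketch of the bail-out, assuming the master's state.json exposes a version field (the target version constant is illustrative):

// Sketch: compare the master's reported version against the Mesos version this
// binary's mesos-go targets, and exit so a wrapper can relaunch the right build.
package versioncheck

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"strings"
)

const builtForMesos = "0.23" // major.minor this binary's mesos-go targets (illustrative)

func exitIfMasterUpgraded(masterHostPort string) {
	resp, err := http.Get("http://" + masterHostPort + "/master/state.json")
	if err != nil {
		return
	}
	defer resp.Body.Close()
	var state struct {
		Version string `json:"version"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		return
	}
	if !strings.HasPrefix(state.Version, builtForMesos) {
		log.Printf("master runs %s, built for %s; exiting so the wrapper can restart us",
			state.Version, builtForMesos)
		os.Exit(2)
	}
}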
Hey, I just stumbled across the project, and I'm concerned about the automatic re-seeding. If you can't access a majority of an etcd/raft cluster, you don't know if you've lost committed data. Rolling with it breaks the promise that etcd/raft makes to its clients, so automatically trying to recover seems like it could cause a lot of trouble. Making this the default behavior is even worse.
Do you have evidence that automating this is even necessary in practice? I like automated systems too, but I'd rather get human approval any time I'm admitting possible data loss.
cc @philips
When etcd was launched on DCOS, it triggered an etcd panic, and the scheduler then failed to rectify the situation.
etcd panic:
https://gist.github.com/spacejam/d56a34798174a95a30b3
Scheduler failure to deconfigure panicked etcd instance:
https://gist.github.com/spacejam/efbaa5479e770b98f8b5
run /reseed soak test
needs to use a port from an offer
Create a script that detects the mesos version and launches the appropriate etcd-mesos binary compiled for that version of mesos-go.
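A sketch of such a wrapper in Go, assuming the master's state.json version field and a per-version binary naming scheme (both are assumptions):

// Sketch: detect the Mesos version from the master and exec the etcd-mesos
// scheduler binary compiled against the matching mesos-go.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"strings"
	"syscall"
)

func main() {
	master := os.Getenv("MESOS_MASTER") // e.g. "leader.mesos:5050"; assumed env var
	resp, err := http.Get("http://" + master + "/master/state.json")
	if err != nil {
		log.Fatalf("could not query master: %v", err)
	}
	defer resp.Body.Close()
	var state struct {
		Version string `json:"version"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatalf("could not parse state.json: %v", err)
	}
	parts := strings.SplitN(state.Version, ".", 3)
	if len(parts) < 2 {
		log.Fatalf("unexpected master version %q", state.Version)
	}
	binary := "./bin/etcd-mesos-scheduler-" + parts[0] + "." + parts[1]
	// Replace this process with the version-matched scheduler, preserving args.
	if err := syscall.Exec(binary, append([]string{binary}, os.Args[1:]...), os.Environ()); err != nil {
		log.Fatalf("exec %s: %v", binary, err)
	}
}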
$ make
rm bin/etcd-*
rm -f /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos/Godeps/_workspace/src/""github.com/mesosphere"/etcd-mesos"
mkdir -p /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos/Godeps/_workspace/src/"github.com/mesosphere"
ln -s /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos /home/vagrant/etcd-mesos-workspace/src/github.com/mesosphere/etcd-mesos/Godeps/_workspace/src/""github.com/mesosphere"/etcd-mesos"
go build -o bin/etcd-mesos-executor cmd/etcd-mesos-executor/app.go
go build -o bin/etcd-mesos-scheduler cmd/etcd-mesos-scheduler/app.go
# github.com/mesosphere/etcd-mesos/scheduler
Godeps/_workspace/src/github.com/mesosphere/etcd-mesos/scheduler/scheduler.go:964: unknown mesosproto.TaskInfo field 'Discovery' in struct literal
make: *** [bin/etcd-mesos-scheduler] Error 2
Due to mesos-go handling a single version of Mesos, we need different releases for each Mesos version. For our initial alpha open-source release, we need to create a branch based on each version of mesos-go.
blockers:
Version 2.2.0 will include the fix for restoration that is important for proper backup/restore functionality.
When testing the web UI it was discovered that schedulers fail to die when a new master is elected and a new scheduler is brought up by marathon.
... and then the custom executor should bind to that port when setting up the driver instance.
This allows multiple instances of the scheduler to be run in HA mode if desired, and will prevent multiple nodes alternately kicking each other off the Mesos master when they register.
Kubernetes-mesos is currently unable to use etcd-mesos, because the etcd proxy that is local to the k8s components fails to parse the mesos-dns response if it is larger than 512 bytes (presumed 512; 768 failed but 112 worked). We need to cut down on the number of ports returned, which is currently 6 per etcd instance (UDP and TCP records for each of the client, peer, and reseed listeners).
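For reference, a small sketch (using the miekg/dns package) to measure how many SRV records and bytes the answer actually contains; the record name and resolver address are illustrative assumptions:

// Sketch: size up the SRV answer to see whether it fits in a plain 512-byte
// UDP DNS response (no EDNS0).
package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	m := new(dns.Msg)
	m.SetQuestion("_etcd-server._tcp.etcd.mesos.", dns.TypeSRV) // assumed record name
	c := new(dns.Client)
	resp, _, err := c.Exchange(m, "master.mesos:53") // assumed resolver address
	if err != nil {
		log.Fatalf("SRV query failed: %v", err)
	}
	fmt.Printf("%d SRV records, %d bytes on the wire (plain UDP limit is 512)\n",
		len(resp.Answer), resp.Len())
}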
This can facilitate a dcos etcdctl command.
This is the web URI the user arrives at from DCOS.
tests should cover:
Universe PR: mesosphere/universe#215
Bad peer brought up when master reconnects. Steps:
2015-09-28 18:39:19.540160 I | etcdmain: etcd Version: 2.2.0
2015-09-28 18:39:40.070822 I | etcdmain: Git SHA: e4561dd
2015-09-28 18:39:15.417100 I | etcdmain: Go Version: go1.5
2015-09-28 18:39:15.417108 I | etcdmain: Go OS/Arch: linux/amd64
2015-09-28 18:39:15.417120 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2015-09-28 18:39:15.417148 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2015-09-28 18:39:15.417490 I | etcdmain: listening for peers on http://localhost.localdomain:31003
2015-09-28 18:39:15.417604 I | etcdmain: listening for client requests on http://localhost.localdomain:31004
2015-09-28 18:39:15.437030 I | netutil: resolving localhost.localdomain:31000 to 127.0.0.1:31000
2015-09-28 18:39:15.437101 I | netutil: resolving localhost.localdomain:31003 to 127.0.0.1:31003
2015-09-28 18:39:15.437187 I | etcdmain: stopping listening for client requests on http://localhost.localdomain:31004
2015-09-28 18:39:15.437204 I | etcdmain: stopping listening for peers on http://localhost.localdomain:31003
2015-09-28 18:39:15.437220 C | etcdmain: error validating peerURLs {ClusterID:36e76486cf49fa77 Members:[&{ID:6429236700d6f390 RaftAttributes:{PeerURLs:[http://localhost.localdomain:32000]} Attributes:{Name:etcd-1443480720 ClientURLs:[http://localhost.localdomain:32001]}} &{ID:aaccb2ea791d9ae1 RaftAttributes:{PeerURLs:[http://localhost.localdomain:31000]} Attributes:{Name:etcd-1443480719 ClientURLs:[http://localhost.localdomain:31001]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs
This is possibly the result of clearing peers on disconnect/reconnect, and then syncing with master which itself has an incomplete perspective. The quick fix is to hold off before reconciling to allow slaves to check in. A better fix may be to persist known tasks in ZK.
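A sketch of the quick fix, i.e. a grace period after (re)registration before the scheduler trusts its reconciled view enough to launch or kill tasks:

// Sketch: gate launches/kills on a grace period after each master
// (re)registration so slaves have time to check in first.
package reconcile

import (
	"sync"
	"time"
)

type Gate struct {
	mu           sync.Mutex
	registeredAt time.Time
	grace        time.Duration
}

func NewGate(grace time.Duration) *Gate {
	return &Gate{grace: grace}
}

// Reset is called from the Registered/Reregistered callbacks.
func (g *Gate) Reset() {
	g.mu.Lock()
	g.registeredAt = time.Now()
	g.mu.Unlock()
}

// Ready reports whether enough time has passed to safely act on cluster state.
func (g *Gate) Ready() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return !g.registeredAt.IsZero() && time.Since(g.registeredAt) >= g.grace
}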
Executor logging is currently broken. We need to log:
When a cluster experiences livelock for reseed-timeout seconds, the scheduler should: launch cluster-size / 2 NEW slave instances, and kill the other previous members if that is successful.
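A sketch of the livelock detection itself, assuming the etcd v2 /health endpoint; the reseed trigger is left as a callback:

// Sketch: watch cluster health and trigger the reseed path once no member has
// reported a healthy cluster for reseed-timeout.
package reseed

import (
	"encoding/json"
	"net/http"
	"time"
)

func healthy(clientURL string) bool {
	resp, err := http.Get(clientURL + "/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	var h struct {
		Health string `json:"health"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&h); err != nil {
		return false
	}
	return h.Health == "true"
}

// Watch calls triggerReseed once the whole cluster has been unhealthy for
// reseedTimeout, then resets the timer and keeps watching.
func Watch(clientURLs []string, reseedTimeout time.Duration, triggerReseed func()) {
	lastHealthy := time.Now()
	for range time.Tick(5 * time.Second) {
		for _, u := range clientURLs {
			if healthy(u) {
				lastHealthy = time.Now()
				break
			}
		}
		if time.Since(lastHealthy) >= reseedTimeout {
			triggerReseed()
			lastHealthy = time.Now()
		}
	}
}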