hseeberger / constructr
Coordinated (etcd, ...) cluster construction for dynamic (cloud, containers) environments
License: Apache License 2.0
After fixing #59, a review of the locking stage in the Consul coordination is needed. See if it's necessary to update the full TTL after a retry.
In the case of a graceful removal of a member node, ConstructR needs to either remove the respective entry from the KV store or shut down the system. I think I prefer the latter, because a system using ConstructR is intended as a cluster member, right?
This depends on merging akka/akka#18729 and a new Akka release (probably 2.4.1).
When trying to run the tests, I usually get:
[info] * de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpec
[JVM-5] Error saving host to store: remove /home/aengelen/.docker/machine/machines/default/config.json: no such file or directory
[JVM-5] *** RUN ABORTED ***
[JVM-5] java.lang.ExceptionInInitializerError:
[JVM-5] at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpec.<init>(MultiNodeConsulConstructrSpec.scala:67)
[JVM-5] at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpecMultiJvmNode5.<init>(MultiNodeConsulConstructrSpec.scala:65)
[JVM-5] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[JVM-5] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[JVM-5] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[JVM-5] at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
[JVM-5] at java.lang.Class.newInstance(Class.java:442)
[JVM-5] at org.scalatest.tools.Runner$.genSuiteConfig(Runner.scala:2644)
[JVM-5] at org.scalatest.tools.Runner$$anonfun$37.apply(Runner.scala:2461)
[JVM-5] at org.scalatest.tools.Runner$$anonfun$37.apply(Runner.scala:2460)
[JVM-5] ...
[JVM-5] Cause: java.lang.RuntimeException: Nonzero exit value: 1
[JVM-5] at scala.sys.package$.error(package.scala:27)
[JVM-5] at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.slurp(ProcessBuilderImpl.scala:132)
[JVM-5] at scala.sys.process.ProcessBuilderImpl$AbstractBuilder.$bang$bang(ProcessBuilderImpl.scala:102)
[JVM-5] at de.heikoseeberger.constructr.akka.ConsulConstructrMultiNodeConfig$.<init>(MultiNodeConsulConstructrSpec.scala:40)
[JVM-5] at de.heikoseeberger.constructr.akka.ConsulConstructrMultiNodeConfig$.<clinit>(MultiNodeConsulConstructrSpec.scala)
[JVM-5] at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpec.<init>(MultiNodeConsulConstructrSpec.scala:67)
[JVM-5] at de.heikoseeberger.constructr.akka.MultiNodeConsulConstructrSpecMultiJvmNode5.<init>(MultiNodeConsulConstructrSpec.scala:65)
[JVM-5] at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
[JVM-5] at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
[JVM-5] at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
[JVM-5] ...
... which suggests 'docker-machine ip default' fails. The strange thing is it sometimes works for the Etcd test.
Running 'docker-machine ip default' on the command-line works fine.
I'm running on Linux.
When hard-coding the docker-machine ip the tests run without further problems.
putKeyWithSession returns Success(false) when the PUT call to Consul returns status code 200 but the body of the response indicates that the update has not taken place (https://consul.io/docs/agent/http/kv.html).
In that case, the guard if result in ConsulCoordination.addSelf causes the result to be a Failure(_) with the generic error message "java.util.NoSuchElementException: Future.filter predicate is not satisfied", and there is no obvious clue in the stack trace either.
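The behaviour described above can be reproduced with the Scala standard library alone. The following is a minimal sketch, not the actual ConsulCoordination code: putKeyWithSession here is a hypothetical stand-in that always reports "not updated", and KeyNotUpdated is an assumed exception type introduced only for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.Try

object AddSelfFailureSketch {
  // Hypothetical stand-in: the PUT returned 200, but the body says
  // that the update did not take place.
  def putKeyWithSession(): Future[Boolean] = Future.successful(false)

  // A guard in a for-comprehension desugars to Future.withFilter, so a
  // false result fails the future with a generic NoSuchElementException.
  def addSelfWithGuard(): Future[Unit] =
    for (result <- putKeyWithSession() if result) yield ()

  // Assumed alternative: fail with an error that names the actual cause.
  final case class KeyNotUpdated(message: String) extends RuntimeException(message)

  def addSelfExplicit(): Future[Unit] =
    putKeyWithSession().flatMap {
      case true  => Future.successful(())
      case false => Future.failed(KeyNotUpdated("Consul returned 200, but the key was not updated"))
    }

  // Helper for inspecting the outcome synchronously.
  def outcome(f: Future[Unit]): Try[Unit] = Try(Await.result(f, 1.second))
}
```

Both variants fail, but only the second one carries a message that points at the rejected update.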
This should be configurable. Maybe use negative values for "no limit".
When an akka cluster node gracefully leaves the cluster, wouldn't it make sense to remove it from the KV store (instead of waiting until it times out)?
When using consul, ConstructR currently uses the consul KV store both for storing the lock that determines whether there is already a seed node initializing, and for storing the addresses of seed nodes.
Wouldn't it make sense to use the consul service catalog to keep track of the addresses of seed nodes?
In some cases the refresh fails with a permanent error. Instead of entering the retry logic, we could directly stop the ConstructrMachine FSM.
As an example: in the consul-coordinator implementation renewing the session can fail with an unrecoverable error in which retrying is useless. See also: Tecsisa/constructr-consul#5.
If akka.cluster.seed-nodes is defined, akka-constructr should keep still and just log at info level.
It currently imports the etcd configuration object instead of the Consul one. Minor change.
E.g. in AddressSerialization and selfAddress.
When starting multiple nodes at the same time I sometimes see this.
Currently ConstructR is pretty aggressive when it comes to failure: in many cases the system is terminated. Furthermore, it's not easy to customize failure handling.
My current thoughts are:
Else constructr-coordination depends on Akka 2.3.
Any plans to add support for zookeeper?
Nice project!
In Constructr.scala the Connection is initialized once and the resolved IPs are cached, so the same one is used in every flow materialization. As a result, if that Consul node goes down, ConstructR won't try another IP.
val connection = Http()(context.system).outgoingConnection(host, port)
Coordination("akka", context.system.name, context.system.settings.config)
(connection, ActorMaterializer())
This behaviour might be observed during DNS resolution of coordination cluster nodes.
Possible related info:
akka/akka#19419
https://gitter.im/akka/akka?at=569552acee13050b38a32091
If a node is restarted quickly, addSelf produces a 200 status code which should be considered valid.
As described in #22, etcd doesn't really care about refreshing an outdated (TTLed) entry; it simply creates a new one. Consul instead returns NotFound when trying to refresh an outdated session.
Therefore we need to change the general behavior to transition back to AddingSelf in case of such an issue with refreshing. In etcd we can simply add prevExist=true to the PUT to also get a NotFound.
A RefreshResult needs to be introduced with the existing Refreshed and a new SelfNotFound as subtypes.
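A minimal sketch of the proposed result type, assuming a simplified two-state view of the machine. RefreshResult, Refreshed, SelfNotFound and AddingSelf are the names from the proposal; RefreshScheduled is used here only as an illustrative stand-in for "keep refreshing".

```scala
object RefreshSketch {
  // Result of a refresh: the existing Refreshed plus the new SelfNotFound.
  sealed trait RefreshResult
  case object Refreshed extends RefreshResult
  case object SelfNotFound extends RefreshResult

  // Simplified stand-ins for the machine states involved.
  sealed trait State
  case object AddingSelf extends State
  case object RefreshScheduled extends State

  // If the own entry is gone, go back to adding it instead of failing.
  def nextState(result: RefreshResult): State = result match {
    case Refreshed    => RefreshScheduled
    case SelfNotFound => AddingSelf
  }
}
```

Because RefreshResult is sealed, the compiler can warn when a coordination implementation adds a case that the machine does not handle.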
It would be nice to have access to the Akka logger in every particular coordination implementation.
I have an application that is intended to run clustered, but e.g. for local development it might be convenient to be able to run a single, isolated instance without connecting etcd/consul.
What would be a convenient way to achieve that? Perhaps we could introduce a 'dummy' coordination backend and I could introduce an option to select that one in my application?
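One way such a 'dummy' backend could look is sketched below. SimpleCoordination is a hypothetical, heavily simplified interface invented for this illustration; the real coordination trait in ConstructR has a different signature, so this only shows the idea of a backend that never contacts etcd or Consul.

```scala
import scala.concurrent.Future

object NoopCoordinationSketch {
  // Hypothetical, simplified coordination interface (not the ConstructR one).
  trait SimpleCoordination {
    def getNodes(): Future[Set[String]]
    def lock(self: String): Future[Boolean]
    def addSelf(self: String): Future[Unit]
    def refresh(self: String): Future[Unit]
  }

  // Reports no existing nodes and always grants the lock, so a single
  // node would end up forming a cluster with itself.
  object NoopCoordination extends SimpleCoordination {
    def getNodes(): Future[Set[String]]      = Future.successful(Set.empty)
    def lock(self: String): Future[Boolean]  = Future.successful(true)
    def addSelf(self: String): Future[Unit]  = Future.successful(())
    def refresh(self: String): Future[Unit]  = Future.successful(())
  }
}
```

The application could then pick this backend via its own configuration option when running locally.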
When Constructr is terminated (see #92) and the node is a member of the cluster, instead of just terminating the system, the node should leave the cluster so that it is not marked as unreachable.
Excellent work! Thanks for sharing this. Do you have any plans to support Consul in addition to etcd in the near future? Would you accept PRs in this regard?
The README still mentions: If something goes wrong, e.g. a timeout (after configurable retries are exhausted) when interacting with the coordination service, ConstructR by default terminates its ActorSystem. At least for constructr-akka this can be changed by providing a custom SupervisorStrategy to the manually started Constructr actor, but be sure you know what you are doing.
The latter is no longer the case, right? de.heikoseeberger.constructr.akka.Constructr is final, fixes its supervisor strategy to SupervisorStrategy.stoppingStrategy and terminates its ActorSystem when coordination terminates.
I'd like to have some more control over what to do when coordination fails.
Currently the name of the actor system is used, see https://github.com/hseeberger/constructr/blob/master/constructr-cassandra/src/main/scala/de/heikoseeberger/constructr/cassandra/Constructr.scala#L69.
The etcd based impl remains in this project, the Consul one is recreated by @juanjovazquez and others in their own repository.
If a node is restarted quickly, addSelf doesn't behave correctly, since a previous session may already have acquired the key so that a new session cannot obtain the lock. The former session will then expire after the TTL and the key will disappear.
https://github.com/Tecsisa/constructr-consul
Of high quality and uses Akka HTTP to talk to Consul.
I have learned that it's not a real-life scenario to run Cassandra on top of an elastic topology. Therefore I want to simplify ConstructR by removing constructr-cassandra and folding constructr-akka into constructr-machine.
The README (and reference.conf) mention:
ttl-factor // Must be greater than 1 + (coordination-timeout * (1 + coordination-retries) / refresh-interval)!
However, ConstructrMachineSettings checks:
require(
ttlFactor > 1 + coordinationTimeout / refreshInterval,
s"ttl-factor must be greater than one plus coordination-timeout divided by refresh-interval, but was $ttlFactor!"
)
In other words: the automated check does not take into account the allowed number of coordination retries.
Do we want to make those consistent?
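To make the gap concrete, here is a small sketch that computes both bounds. The object and method names are illustrative and not part of ConstructrMachineSettings, and the values in the usage note are chosen only for the example.

```scala
import scala.concurrent.duration._

object TtlFactorSketch {
  // Bound as documented in the README and reference.conf:
  // 1 + (coordination-timeout * (1 + coordination-retries) / refresh-interval)
  def documentedBound(timeout: FiniteDuration, retries: Int, refreshInterval: FiniteDuration): Double =
    1 + timeout.toMillis.toDouble * (1 + retries) / refreshInterval.toMillis

  // Bound as currently checked by the require, which ignores retries:
  // 1 + coordination-timeout / refresh-interval
  def checkedBound(timeout: FiniteDuration, refreshInterval: FiniteDuration): Double =
    1 + timeout.toMillis.toDouble / refreshInterval.toMillis
}
```

With a timeout of 3 seconds, 2 retries and a refresh interval of 30 seconds, a ttl-factor of 1.2 passes the current check but lies below the documented bound.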
ConstructrMachine
is confusingFiniteDuration
Adding
resolvers += Resolver.bintrayRepo("hseeberger", "maven")
libraryDependencies ++= Vector(
"de.heikoseeberger" %% "constructr-cassandra" % "0.13.2",
...
)
in my build.sbt results in an unresolved dependency.
Note: Unresolved dependencies path:
[warn] de.heikoseeberger:constructr-cassandra_2.11:0.13.2 (/Users/elio/progetti/iptm/build.sbt#L29-70)
[warn] +- eu.sia.innhub:iptm_2.11:1.0
sbt.ResolveException: unresolved dependency: de.heikoseeberger#constructr-cassandra_2.11;0.13.2: not found
The library seems not to be present on bintray (http://dl.bintray.com/hseeberger/maven/de/heikoseeberger/)
To take advantage of an etcd high-availability cluster, it would be nice to be able to define more than one endpoint. It should not be that hard to do.
@hseeberger If you agree, I would like to work on it and create a PR?
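One possible shape for this, sketched with the standard library only: try the configured endpoints in order and fall back to the next one on failure. The helper name firstSuccessful and the endpoint URLs in the usage note are assumptions for illustration, not existing ConstructR configuration or API.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object EndpointFailoverSketch {
  // Tries each endpoint in order and returns the first successful result.
  // If every endpoint fails, the future fails with the last error.
  def firstSuccessful[A](endpoints: List[String])(call: String => Future[A])(
      implicit ec: ExecutionContext): Future[A] =
    endpoints match {
      case Nil          => Future.failed(new IllegalArgumentException("no endpoints configured"))
      case head :: Nil  => call(head)
      case head :: tail => call(head).recoverWith { case NonFatal(_) => firstSuccessful(tail)(call) }
    }
}
```

For example, with List("http://etcd-1:2379", "http://etcd-2:2379") a failing call against the first endpoint would be followed by a call against the second.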
Currently we use hseeberger/cassandra, which is based on a PR against the official cassandra image. Yet the PR won't be accepted and hseeberger/cassandra will be dropped.
Using RUN we can rewrite docker-entrypoint.sh to use the seed-provider from ConstructR.
A feature-complete beta of etcd 3.0 has been released: https://github.com/coreos/etcd/releases/tag/v3.0.0-beta.0. There seem to be significant changes in the API that probably affect ConstructR, e.g. "No key-value hierarchy / directories" (we use hierarchies), "Leases" (instead of TTL) and "Lock: mutex on top of etcd".
While README.md and the comments in reference.conf say that "ttl-factor must be greater than one", that's never actually checked.
It is possible that Akka cluster MemberJoined and MemberUp events are sent while the ConstructrMachine is in state AddingSelf. In this state the events are unhandled and therefore the following messages are written as warnings to the log:
2016-09-01T13:38:46Z MacBook-Pro-6.local WARN ConstructrMachine [sourceThread=conductr-akka.actor.default-dispatcher-2, akkaTimestamp=13:38:46.609UTC, akkaSource=akka.tcp://[email protected]:9024/user/reaper/constructr/constructr-machine, sourceActorSystem=conductr] - unhandled event MemberUp(Member(address = akka.tcp://[email protected]:9044, status = Up)) in state AddingSelf
2016-09-01T13:38:46Z MacBook-Pro-6.local WARN ConstructrMachine [sourceThread=conductr-akka.actor.default-dispatcher-30, akkaSource=akka.tcp://[email protected]:9034/user/reaper/constructr/constructr-machine, sourceActorSystem=conductr, akkaTimestamp=13:38:46.141UTC] - unhandled event MemberJoined(Member(address = akka.tcp://[email protected]:9054, status = Joining)) in state AddingSelf
The current code first unsubscribes from the Akka cluster events before moving into the AddingSelf state. However, there might be a race condition in which other MemberUp or MemberJoined events are received while Akka still hasn't processed the unsubscription.
Currently coordination timeouts are handled via uniform retries. But as most coordination operations aren't idempotent (yet), this is a source of trouble. Also, dealing with TTLs gets nasty in the face of retries. I suggest the following changes:
GettingNodes: transition to BeforeGettingNodes, i.e. effectively unlimited retries
Locking: transition to BeforeGettingNodes, i.e. effectively unlimited retries; also make lock idempotent by first reading
AddingSelf: use explicit retries, fail after exhausted
Refreshing: unlimited retries
In all cases a warning or even an error should be logged.
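The per-state handling suggested above could be sketched as a pure decision function. This is an illustration under assumptions, not the ConstructrMachine implementation: the Decision type and onTimeout are hypothetical names, the BeforeGettingNodes case is filled in only to keep the match exhaustive, and logging is left out.

```scala
object TimeoutPolicySketch {
  sealed trait State
  case object BeforeGettingNodes extends State
  case object GettingNodes extends State
  case object Locking extends State
  case object AddingSelf extends State
  case object Refreshing extends State

  // Hypothetical outcome of a coordination timeout.
  sealed trait Decision
  final case class GoTo(state: State) extends Decision
  final case class Retry(state: State, retriesLeft: Int) extends Decision
  case object Fail extends Decision

  def onTimeout(state: State, retriesLeft: Int): Decision = state match {
    // Starting over is safe, so these are effectively retried without limit.
    case GettingNodes | Locking        => GoTo(BeforeGettingNodes)
    // Explicit, bounded retries; fail once they are exhausted.
    case AddingSelf if retriesLeft > 0 => Retry(AddingSelf, retriesLeft - 1)
    case AddingSelf                    => Fail
    // Unlimited retries: the counter is not decremented.
    case Refreshing                    => Retry(Refreshing, retriesLeft)
    // Assumption: no coordination call happens here, so just stay.
    case BeforeGettingNodes            => GoTo(BeforeGettingNodes)
  }
}
```

Keeping the decision separate from the FSM makes each state's retry behaviour easy to test in isolation.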
When a refresh doesn't happen in time for some reason (GC pause or other delays), the entry in the etcd store has already disappeared and hence the refresh results in a 201 Created instead of a 200 OK. This is perfectly fine, since the system is still up and running.
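Expressed as code, the rule amounts to accepting both status codes. This is a minimal sketch with a hypothetical helper name; it is not taken from the etcd coordination implementation.

```scala
object RefreshStatusSketch {
  // 200 OK: the entry still existed and its TTL was renewed.
  // 201 Created: the entry had expired and was created anew,
  // which is fine because the node is still alive.
  def isSuccessfulRefresh(status: Int): Boolean =
    status == 200 || status == 201
}
```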
git.useGitDescribe := true